How to Configure Eclipse for Spark Application in the Cluster

Spark provides several ways for developers and data scientists to load, aggregate, and compute data and return a result. Many Java or Scala developers prefer to write their own application code (the driver program) rather than type commands into the built-in Spark shell or the Python interface. The steps below show how to quickly configure your IDE to write Spark application code.

Download Eclipse and Install your IDE

A web search will show that IntelliJ IDEA is a very popular tool for Scala programming, while Eclipse with the Scala plug-in is also a good choice. You can download Scala IDE from www.scala-ide.org, extract it to a local folder, and you are good to go. Version 4.3.0 supports both Scala and Java.

Understanding the Application Submission Flow

Let’s take a look at the Spark application submission flow to understand how a request is processed. The flow follows the official cluster overview diagram from Spark. Every piece of application code or logic is submitted to the Spark cluster via SparkContext.

The sequence is:

1. When you submit your request with Scala, Python, or Java, the SparkContext connects to one of several types of cluster managers (Spark’s own standalone cluster manager, Mesos, or YARN), which allocate resources across applications.

2. Once connected, Spark acquires executors on nodes in the cluster. Executors are processes that run computations and store data for your application.

3. It sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Then SparkContext sends tasks to the executors to run, and the results are returned.
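
The choice of cluster manager in step 1 is expressed through the master URL handed to SparkContext. The following is a minimal sketch of the common URL forms, assuming Spark 1.x naming; the types `ClusterManager`, `Standalone`, `Mesos`, `YarnClient`, `Local`, and the helper `MasterUrl.of` are hypothetical names introduced only for illustration, and the host name is a placeholder.

```scala
// Hypothetical model of the cluster managers named in step 1.
sealed trait ClusterManager
case class Standalone(host: String, port: Int = 7077) extends ClusterManager // 7077: default standalone master port
case class Mesos(host: String, port: Int = 5050) extends ClusterManager      // 5050: default Mesos master port
case object YarnClient extends ClusterManager
case class Local(threads: Int) extends ClusterManager                        // no cluster, runs inside the driver JVM

object MasterUrl {
  // Builds the string that would be passed to SparkConf.setMaster.
  def of(manager: ClusterManager): String = manager match {
    case Standalone(host, port) => s"spark://$host:$port"
    case Mesos(host, port)      => s"mesos://$host:$port"
    case YarnClient             => "yarn-client" // Spark 1.x form; Spark 2.x and later use "yarn"
    case Local(threads)         => s"local[$threads]"
  }
}
```

For example, `MasterUrl.of(Standalone("master-host"))` yields `spark://master-host:7077`, which has the same shape as the URL shown in a standalone master's admin console.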

As noted, the Spark engine can run on its own standalone cluster manager, on Mesos, or on YARN. Our example uses standalone mode, which is supported by the in-memory Spark core framework. You could install Scala IDE on the master node, write the application on that machine, and submit it to the cluster from there, but in most cases there are multiple developers, and each of them wants to write code on their own Windows machine.

Configure Development Environment

The following section will show you the basic configuration steps in Eclipse.

1. Create a new project in Eclipse named SparkAppTest.

2. Add the required spark-assembly-*.jar to your build path. This JAR file can be found in your downloaded Spark package. Alternatively, you can download the Spark source code and compile it yourself.

3. The driver program sets three parameters. The parameter for setMaster must exactly match the Spark master URL that appears in your master admin console. The parameter “spark.driver.host” is your local IP address. The parameter for setJars is the packaged JAR file that will be distributed to the data nodes and run from there. The rest of the class is the core implementation of the Spark app.

4. You can manually compile this Scala class and package it into a JAR file whose path must match the path passed to setJars above. My approach is to add an Ant XML file that automates the compilation and JAR packaging.

5. Execute your class as a Scala application. Your driver program then submits the code to the cluster and triggers tasks on the nodes. If you monitor the console, you will see Spark’s log output as the job runs.
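
Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration rather than the original listing: the master URL, the driver IP address, and the JAR path are placeholder assumptions to replace with your own values, and the summing job merely stands in for the core implementation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkAppTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkAppTest")
      // Assumed URL: copy the exact spark://... string from your master admin console.
      .setMaster("spark://master-host:7077")
      // Assumed address: the IP of your development machine, so executors can reach the driver.
      .set("spark.driver.host", "192.168.1.10")
      // Assumed path: the packaged application JAR that Spark distributes to the executors.
      .setJars(Seq("C:/workspace/SparkAppTest/dist/SparkAppTest.jar"))

    val sc = new SparkContext(conf)
    try {
      // Placeholder core implementation: sum the numbers 1 to 100 on the cluster.
      val total = sc.parallelize(1 to 100).reduce(_ + _)
      println(s"Sum computed on the cluster: $total")
    } finally {
      sc.stop() // release the executors held by this application
    }
  }
}
```

Running the object as a Scala application from Eclipse starts the driver locally. The closure passed to `reduce` is compiled into the JAR named in `setJars`, which is why that JAR has to be rebuilt before each run.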

Once your Eclipse is set up, you are ready to write more Scala (or Java) examples that operate on RDDs to perform aggregations, reductions, and transformations.
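
As one such example, here is a hedged word-count sketch that chains transformations (`flatMap`, `filter`, `map`) with an aggregation (`reduceByKey`). The object name `WordCount` and its `count` method are hypothetical, and the code assumes Spark 1.3 or later, where the pair-RDD operations are available without an extra import.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object WordCount {
  // Counts how often each word occurs in the given lines, using an existing SparkContext.
  def count(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    val words: RDD[String] = sc.parallelize(lines)
      .flatMap(_.split("\\W+")) // transformation: split each line into words
      .filter(_.nonEmpty)       // transformation: drop empty tokens

    words
      .map(word => (word, 1))   // transformation: pair each word with a count of 1
      .reduceByKey(_ + _)       // aggregation: add up the counts per word
      .collect()                // action: bring the results back to the driver
      .toMap
  }
}
```

Calling `WordCount.count(sc, Seq("a b a"))` with a live SparkContext should return a map in which `a` has the count 2 and `b` has the count 1.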

Thoughts on “How to Configure Eclipse for Spark Application in the Cluster”

  1. Hi Kent,

    This is a very good article. I tried this, but it gives an error, as below:

    [INFO] Compiling 2 source files to /home/cloudera/workspace/sparkWorld/target/classes at 1437725529015
    [ERROR] /home/cloudera/workspace/sparkWorld/src/main/scala/org/onespark/sparkWorld/HelloWorldApp.scala:3: error: value apache is not a member of package org
    [INFO] import org.apache.spark._
    [INFO] ^
    [ERROR] /home/cloudera/workspace/sparkWorld/src/main/scala/org/onespark/sparkWorld/HelloWorldApp.scala:13: error: value split is not a member of Char
    [INFO] val wfm = "First Spark Program A beautiful Spark program".flatMap(_.split("\\W+"))
    [INFO] ^

    It looks like it is not recognizing my Spark environment.

    Any help would be great.

    Thanks
    Siva

  2. Hi Siva,
    Thanks for your question. Most likely you are not using a Spark RDD. You can refer to my example to create a SparkContext in your code.
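
The two compile errors quoted in the comment above can be read separately. "value apache is not a member of package org" usually means the Spark JAR is not on the compile classpath. "value split is not a member of Char" arises because `flatMap` on a plain String visits each character. The following plain-Scala sketch, which needs no Spark, illustrates the second point; the object name `SplitWords` is hypothetical.

```scala
object SplitWords {
  val text = "First Spark Program A beautiful Spark program"

  // text.flatMap(_.split("\\W+")) does not compile: flatMap on a String
  // visits each Char, and Char has no split method.

  // Wrapping the String in a collection makes flatMap visit whole lines instead.
  val words: Seq[String] = Seq(text).flatMap(_.split("\\W+"))

  def main(args: Array[String]): Unit = println(words.mkString(", "))
}
```

The same shape carries over to an RDD: `sc.parallelize(Seq(text)).flatMap(_.split("\\W+"))` splits lines into words once a SparkContext named `sc` exists.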

Kent Jiang

I currently work at Perficient China GDC in Hangzhou as a Lead Technical Consultant. I have 8 years of experience in the IT industry across Java, CRM, and BI technologies. My technical interests include business analytics, project planning, MDM, and quality assurance.
