Spark provides several ways for developers and data scientists to load, aggregate, and compute data and return a result. Many Java or Scala developers prefer to write their own application code (known as a driver program) instead of typing commands into the built-in Spark shell or Python interface. The following steps show how to quickly configure your IDE so you can write Spark application code.
Download and Install the Eclipse IDE
A web search will show that IntelliJ IDEA is a very popular tool for Scala programming, and Eclipse with the Scala plug-in is also a good choice. You can download the Scala IDE from www.scala-ide.org, extract it to a local folder, and you are good to go. Version 4.3.0 supports both the Scala and Java languages.
Understanding the Application Submission Flow
Let’s take a look at the Spark application submission flow to understand how a request is processed. The following is the official diagram from Spark. Every piece of application code or logic is submitted via the SparkContext to the Spark cluster.
The sequence is:
1. When you submit your request with Scala, Python, or Java, the SparkContext connects to one of several types of cluster managers (either Spark’s own standalone cluster manager or Mesos/YARN), which allocate resources across applications.
2. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.
3. Next, Spark sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run and returns the results.
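The three steps above can be sketched as a minimal driver program. This is an illustrative sketch, not code from the original article; the object name and master URL are assumptions you would replace with your own values:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver: creating the SparkContext connects to the cluster
// manager (step 1), which assigns executors on the worker nodes (step 2);
// the parallelize/reduce calls ship tasks to those executors and return
// the result to the driver (step 3).
object SubmissionFlowDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SubmissionFlowDemo")
      .setMaster("spark://master-host:7077") // hypothetical standalone master URL
    val sc = new SparkContext(conf)

    val sum = sc.parallelize(1 to 100).reduce(_ + _) // tasks run on executors
    println(s"sum = $sum")

    sc.stop()
  }
}
```

For local experimentation you can swap the master URL for `local[2]`, which runs the executors inside the driver JVM.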
As noted, the Spark engine can run under its own standalone cluster manager, Mesos, or YARN. In our example we use the standalone mode, which is supported by the in-memory Spark core framework. You can certainly install the Scala IDE on the master node, write the application on that machine, and submit it to the cluster from there, but in most cases there are multiple developers, each of whom wants to write code on their own Windows machine.
Configure Development Environment
The following section will show you the basic configuration steps in Eclipse.
1. Create a new project in Eclipse named SparkAppTest;
2. Add the needed spark-assembly-*.jar to your build path. This jar file can be found in your downloaded Spark package. Of course, you can also download the full Spark source code and compile it yourself;
3. The following code snippet demonstrates how to set the parameters. The parameter for setMaster must exactly match the Spark master URL shown in your master admin console. The “spark.driver.host” parameter is your local IP address. The parameter for setJars is the packed jar file that will be distributed across the worker nodes and run from there. The remainder is the core implementation of the Spark app.
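Since the original snippet appeared as a screenshot, here is a hedged reconstruction. The master URL, driver IP, and jar path are placeholders that must be replaced with your own values, and the counting job at the end stands in for whatever core logic your app implements:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkAppTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkAppTest")
      // Must match the master URL shown in your master admin console:
      .setMaster("spark://your-master-host:7077")
      // Your local (Windows) IP, so executors can call back to the driver:
      .set("spark.driver.host", "192.168.1.100")
      // The packed jar that Spark distributes to the worker nodes:
      .setJars(Seq("C:\\workspace\\SparkAppTest\\target\\spark-app-test.jar"))

    val sc = new SparkContext(conf)

    // Core app logic (placeholder): count the even numbers in a range.
    val count = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
    println(s"even numbers: $count")

    sc.stop()
  }
}
```

Setting spark.driver.host matters when the driver runs on a developer workstation: the executors open connections back to the driver, and without it they may try to resolve an unreachable hostname.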
4. You can manually compile this Scala class and build it into a jar file whose path matches the path given in the setJars call above. My preference is to add an Ant XML file to automate the compilation and jar packaging.
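A minimal Ant build file for this step might look like the sketch below. The directory layout, Scala home path, and jar names are assumptions; adjust them to your own installation (the Scala distribution ships the scalac Ant task inside scala-compiler.jar):

```xml
<project name="SparkAppTest" default="jar">
  <!-- Assumed paths; point these at your own Scala install and project layout -->
  <property name="scala.home" value="C:/scala"/>

  <taskdef resource="scala/tools/ant/antlib.xml">
    <classpath>
      <pathelement location="${scala.home}/lib/scala-compiler.jar"/>
      <pathelement location="${scala.home}/lib/scala-library.jar"/>
      <pathelement location="${scala.home}/lib/scala-reflect.jar"/>
    </classpath>
  </taskdef>

  <target name="compile">
    <mkdir dir="classes"/>
    <scalac srcdir="src" destdir="classes">
      <classpath>
        <!-- The spark-assembly jar from your Spark download -->
        <fileset dir="lib" includes="spark-assembly-*.jar"/>
      </classpath>
    </scalac>
  </target>

  <!-- Pack the compiled classes into the jar referenced by setJars -->
  <target name="jar" depends="compile">
    <jar destfile="target/spark-app-test.jar" basedir="classes"/>
  </target>
</project>
```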
5. Execute your class as a Scala application; your driver program then submits the code to the cluster and triggers tasks on the nodes. If you monitor the console, you will see the job and task progress in the log output.
Once your Eclipse environment is prepared, you are ready to write more Scala (or Java) examples that operate over RDDs to perform aggregations, reductions, and transformations.
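As a starting point, here is a classic word-count sketch combining the transformations and aggregation mentioned above. The sample input lines are invented for illustration; flatMap and map are transformations, reduceByKey performs the aggregation, and collect is the action that returns results to the driver:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[2] runs the job inside the driver JVM, handy for IDE testing
    val sc = new SparkContext(
      new SparkConf().setAppName("WordCount").setMaster("local[2]"))

    val lines = sc.parallelize(Seq(
      "spark makes big data simple",
      "big data needs spark"))

    val counts = lines
      .flatMap(_.split(" "))       // transformation: lines -> words
      .map(word => (word, 1))      // transformation: word -> (word, 1)
      .reduceByKey(_ + _)          // aggregation per key
      .collect()                   // action: bring results to the driver

    counts.foreach { case (word, n) => println(s"$word: $n") }
    sc.stop()
  }
}
```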