
How to Set Up a Local Standalone Spark Node

As my previous post suggested, Spark is a big data technology that is becoming increasingly popular and powerful and is used by many organizations and individuals. The Spark project is written in Scala, a language that combines object-oriented and functional programming. So what can a Java developer do if he or she wants to learn Spark?

Here I share a quick way of installing Apache Spark on a local machine.

Prepare Linux Box

As stated, the Spark components can be installed and deployed on Windows, but I believe more developers are fond of Linux operating systems such as Ubuntu, Linux Mint, and CentOS. I am using 64-bit CentOS 6.5 as a virtual machine in VMware Player. Make sure you have root privileges and Internet access.
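Before moving on, a couple of quick checks can save time later. This is only a hedged sketch, and the host pinged is just an example:

# Confirm you are root (or plan to prefix the later commands with sudo)
whoami

# Confirm the VM can reach the Internet (any reachable host will do)
ping -c 1 apache.org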

Install Java 6 or Above

1. Access the Oracle site and download the appropriate JDK version for Linux.

2. Unpack it and place it in a folder such as /usr/java.

3. Set the JAVA_HOME, JRE_HOME, PATH, and CLASSPATH environment variables. For example:

export JAVA_HOME=/usr/java/jdk1.7.0

4. These exports can be added to /etc/profile so you don't have to set them again every time; see the sketch after this list.
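A minimal sketch of steps 2 to 4, assuming the JDK tarball was downloaded to the current directory; the archive name, the unpacked folder name, and the update number will differ on your machine:

# Unpack the JDK under /usr/java (example archive name; adjust to your download)
mkdir -p /usr/java
tar -xzf jdk-7-linux-x64.tar.gz -C /usr/java

# Append the environment variables to /etc/profile so new shells pick them up
cat >> /etc/profile <<'EOF'
export JAVA_HOME=/usr/java/jdk1.7.0
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib
EOF

# Load the new variables into the current shell
source /etc/profile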

Install Apache Spark

Access the Apache Spark download site. At the time this post was published, Spark 1.3.1 was available, so for the release type select the latest version, and for the package type select the pre-built for Hadoop 2.6 or later package. Note that Spark can be installed with or without Hadoop. This example is for learning purposes, so we simply deploy Spark in standalone mode.
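If you prefer the command line over the browser, the same package can be fetched and unpacked roughly like this; the URL points at the Apache release archive and is only an example, so adjust it to the mirror and version you actually pick:

# Download the pre-built package (example URL from the Apache archive)
wget https://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz

# Unpack it into a folder such as /spark
mkdir -p /spark
tar -xzf spark-1.3.1-bin-hadoop2.6.tgz -C /spark --strip-components=1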

Download and unpack it into a specific folder such as /spark (as sketched above). The following are the most frequently accessed subfolders:

Folder        Usage
/spark/bin    Scripts for the interactive shells (Python, Scala) and for submitting applications
/spark/conf   All needed configuration files
/spark/data   Sample data, e.g. for the machine learning library
/spark/lib    Required jar files
/spark/logs   Startup, shutdown, and error logs
/spark/sbin   Scripts to start and stop the master and slave nodes

 

Start Spark Node

1. Go to the sbin folder and run the script to start the master node:

./start-master.sh

2. Check the web console at http://localhost:8080 to ensure the master node is working properly. The page should show a master URL such as “spark://localhost.localdomain:7077”.

3. Go to the bin folder and run the script to start a worker node, pointing it at the master URL from the previous step:

./spark-class org.apache.spark.deploy.worker.Worker spark://localhost.localdomain:7077

Now you should see an active worker node in the list on the master's web console:

[Screenshot: the master web console at http://localhost:8080 listing the registered worker]
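To confirm the standalone cluster actually accepts work, a quick smoke test is to submit the bundled SparkPi example to the master. This is a hedged sketch that assumes the examples jar shipped with the pre-built package under /spark/lib; the exact jar name depends on your download:

# Run from the Spark installation folder; the jar name is an example
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost.localdomain:7077 \
  lib/spark-examples-1.3.1-hadoop2.6.0.jar 10

# The driver output should contain a line like "Pi is roughly 3.14..."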

To perform most Spark operations and computations, you should be familiar with the Spark interactive shell in Python, Scala, or R. For Python developers, go into your Spark directory and type:

bin/pyspark

To open the Scala version of the shell, type:

bin/spark-shell

Once this simplest local Spark environment is ready (no multi-machine cluster, no HDFS), it is actually a good starting point for learning the Resilient Distributed Dataset (RDD), through which you can implement various types of computation. Later you can turn this local mode into cluster mode and scale up or down by adding or removing machines.
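As a small taste of the RDD API, here is a minimal word-count sketch you could type into bin/spark-shell (Scala); sc is the SparkContext the shell creates for you, and the input text is made up for the example:

// Inside bin/spark-shell; sc (the SparkContext) is already provided
val lines = sc.parallelize(Seq("to be or not to be", "that is the question"))

// Classic word count: split into words, pair each with 1, sum the counts per word
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// Trigger the computation and print the results on the driver
counts.collect().foreach(println)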


Kent Jiang

I am currently working at Perficient China GDC, located in Hangzhou, as a Lead Technical Consultant. I have eight years of experience in the IT industry across Java, CRM, and BI technologies. My areas of interest include business analytics, project planning, MDM, and quality assurance.
