As mentioned in my previous post, Spark is a big data technology that is becoming popular and powerful and is used by many organizations and individuals. The Spark project is written in Scala, which is both an object-oriented and a functional language. So, what can a Java developer do if he or she wants to learn about Spark?
Here, I share my quick way of installing Apache Spark on a local machine.
Prepare Linux Box
As stated before, Spark can also be installed and deployed on Windows, but I believe more developers are fond of Linux operating systems such as Ubuntu, Linux Mint, CentOS, etc. I am using CentOS 6.5 64-bit as a virtual machine in VMware Player. Make sure you have root privileges and Internet access.
Install Java 6 or Above
1. Go to the Oracle site and download the appropriate JDK version for Linux;
2. Extract it and place it in a folder such as /usr/java;
3. Set the JAVA_HOME, JRE_HOME, PATH and CLASSPATH environment variables. For example:
export JAVA_HOME=/usr/java/jdk1.7.0
4. These export statements can be added to /etc/profile so you don't have to set them again every time you log in; see the sketch below.
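A minimal sketch of what the additions to /etc/profile could look like (the JDK folder name is only an example and should match the version you actually extracted):
export JAVA_HOME=/usr/java/jdk1.7.0
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib
After editing the file, run source /etc/profile (or log in again) and check the result with java -version.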
Install Apache Spark
Access the Apache Spark download site. At the time this post was published, Spark 1.3.1 was available, so for the release type select the latest version, and for the package type select the "Pre-built for Hadoop 2.6 or later" package. Note that Spark can be installed with or without Hadoop. This example is for learning purposes, so we simply deploy Spark in standalone mode.
Download the archive and extract it to a specific folder such as /spark (a minimal command-line sketch of this step follows the table below). The most frequently accessed subfolders are:
| Folder | Usage |
| --- | --- |
| /spark/bin | Spark scripts and the interactive shells for Python and Scala |
| /spark/conf | All needed configuration files |
| /spark/data | Data files for the machine learning library |
| /spark/lib | Required jar files |
| /spark/logs | Startup, shutdown and error logs |
| /spark/sbin | Scripts to start and stop the master and worker nodes |
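As a rough sketch of the download and extraction step (the mirror URL and archive name below are only illustrative; use whatever link the download page actually gives you):
wget https://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
tar -xzf spark-1.3.1-bin-hadoop2.6.tgz
mv spark-1.3.1-bin-hadoop2.6 /spark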
Start Spark Node
1. Go to the sbin folder and run the script to start the master node:
./start-master.sh
2. Check the web console at http://localhost:8080 to ensure the master node is running properly. The page should show a master URL similar to "spark://localhost.localdomain:7077".
3. Go to the bin folder and run the script to start a worker node, passing it the master URL from the previous step:
./spark-class org.apache.spark.deploy.worker.Worker spark://localhost.localdomain:7077
Refresh the console and you should now see an active worker node in the Workers list.
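As a quick sanity check (assuming the examples jar shipped with the pre-built package sits under /spark/lib; its exact file name varies with the package version), you can submit the bundled SparkPi example to the new master:
cd /spark
./bin/spark-submit --master spark://localhost.localdomain:7077 --class org.apache.spark.examples.SparkPi lib/spark-examples-*.jar 10
If everything is wired up correctly, the job runs on the worker and prints a rough approximation of Pi.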
To perform most Spark operations and computations, you should be familiar with the Spark interactive shell in Python, Scala or R. For Python developers, go into your Spark directory and type:
bin/pyspark
To open the Scala version of the shell, type:
bin/spark-shell
Once you have this simplest local Spark environment ready (no multi-machine cluster, no HDFS), it is actually a good starting point for learning the Resilient Distributed Dataset (RDD), through which you can implement various types of computation; a small taste is sketched below. You can also turn this local mode into cluster mode later and scale up or down by adding or removing machines.
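As a tiny illustrative sketch (typed at the spark-shell prompt, where the SparkContext is already available as sc; the numbers are arbitrary):
scala> val nums = sc.parallelize(1 to 100)   // build an RDD from a local collection
scala> val evens = nums.filter(_ % 2 == 0)   // transformations such as filter are lazy
scala> evens.count()                         // an action triggers the computation and returns 50
scala> evens.map(_ * 2).take(5)              // returns Array(4, 8, 12, 16, 20)
The same operations are available in pyspark with very similar syntax.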