As mentioned in my previous post, Spark is a big data technology that is becoming popular and powerful and is used by many organizations and individuals. The Spark project is written in Scala, which is both an object-oriented and a functional language. So, what can a Java developer do if he or she wants to learn about Spark?
Here, I share my quick way of installing Apache Spark on a local machine.
Prepare Linux Box
As stated before, Spark can also be installed and deployed on Windows, but I believe more developers are fond of Linux operating systems such as Ubuntu, Linux Mint, CentOS, etc. I am using CentOS 6.5 64-bit as a virtual machine in VMware Player. Make sure you have root privileges and Internet access.
Install Java 6 or Above
1. Go to the Oracle site and download the appropriate JDK version for Linux;
2. Extract it and place it in a folder such as /usr/java;
3. Set the JAVA_HOME, JRE_HOME, PATH and CLASSPATH environment variables. For example:
export JAVA_HOME=/usr/java/jdk1.7.0
4. These export statements can be added to /etc/profile so you don't have to set them again every time you log in; see the sketch below.
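A minimal sketch of what the additions to /etc/profile could look like (the JDK folder name is only an example and should match the version you actually extracted):
export JAVA_HOME=/usr/java/jdk1.7.0
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib
After editing the file, run source /etc/profile (or log in again) and check the result with java -version.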
Install Apache Spark
Access the Apache Spark download site. At the time this post was published, Spark 1.3.1 was available, so for the release type select the latest version, and for the package type select the "Pre-built for Hadoop 2.6 or later" package. Note that Spark can be installed with or without Hadoop. This example is for learning purposes, so we simply deploy Spark in standalone mode.
Download the archive and extract it to a specific folder such as /spark (a minimal command-line sketch of this step follows the table below). The most frequently accessed subfolders are:
| Folder | Usage |
| --- | --- |
| /spark/bin | Spark scripts and the interactive shells for Python and Scala |
| /spark/conf | All needed configuration files |
| /spark/data | Data files for the machine learning library |
| /spark/lib | Required jar files |
| /spark/logs | Startup, shutdown and error logs |
| /spark/sbin | Scripts to start and stop the master and worker nodes |
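As a rough sketch of the download and extraction step (the mirror URL and archive name below are only illustrative; use whatever link the download page actually gives you):
wget https://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
tar -xzf spark-1.3.1-bin-hadoop2.6.tgz
mv spark-1.3.1-bin-hadoop2.6 /spark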
Start Spark Node
1. Go to the sbin folder and run the script to start the master node:
./start-master.sh
2. Check the web console at http://localhost:8080 to ensure the master node is running properly. The page should show a master URL similar to "spark://localhost.localdomain:7077".
3. Go to the bin folder and run the script to start a worker node, passing it the master URL from the previous step:
./spark-class org.apache.spark.deploy.worker.Worker spark://localhost.localdomain:7077
Refresh the console and you should now see an active worker node in the Workers list.
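As a quick sanity check (assuming the examples jar shipped with the pre-built package sits under /spark/lib; its exact file name varies with the package version), you can submit the bundled SparkPi example to the new master:
cd /spark
./bin/spark-submit --master spark://localhost.localdomain:7077 --class org.apache.spark.examples.SparkPi lib/spark-examples-*.jar 10
If everything is wired up correctly, the job runs on the worker and prints a rough approximation of Pi.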
To perform most Spark operations and computations, you should be familiar with the Spark interactive shell in Python, Scala or R. For Python developers, go into your Spark directory and type:
bin/pyspark
To open the Scala version of the shell, type:
bin/spark-shell
Once you have this simplest local Spark environment ready (no multi-machine cluster, no HDFS), it is actually a good starting point for learning the Resilient Distributed Dataset (RDD), through which you can implement various types of computation; a small taste is sketched below. You can also turn this local mode into cluster mode later and scale up or down by adding or removing machines.
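As a tiny illustrative sketch (typed at the spark-shell prompt, where the SparkContext is already available as sc; the numbers are arbitrary):
scala> val nums = sc.parallelize(1 to 100)   // build an RDD from a local collection
scala> val evens = nums.filter(_ % 2 == 0)   // transformations such as filter are lazy
scala> evens.count()                         // an action triggers the computation and returns 50
scala> evens.map(_ * 2).take(5)              // returns Array(4, 8, 12, 16, 20)
The same operations are available in pyspark with very similar syntax.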