Among the mathematical and statistical languages and tools such as SAS, SPSS, and Matlab, the R language is a very good option: it provides an environment and essential packages for statistical computing and graphics, it is free, and its open ecosystem lets users develop custom packages.
In addition to R, there is RStudio, a powerful IDE for R. It is also free (the advanced options and support require a license), open source, and works well on Windows, Mac, and Linux. RStudio comes in a desktop version and a server version; the server version provides a web-based user interface.
If we aim to perform data mining and machine learning on big data in a Hadoop environment, Mahout is a strong option, offering core algorithms for clustering, classification, and collaborative filtering. Compared with Mahout, R has its own advantages in algorithm coverage and computation speed. Furthermore, the Mahout community has announced that it will stop accepting new MapReduce algorithm implementations beginning in May of this year.
Recently, GDC China's big data lab team set up R and RStudio in a Hadoop cluster environment, and we gained some experience and lessons from putting these components together. The steps are as follows.
We are using CentOS 6.5 (64-bit), and the cluster machines have Hortonworks Data Platform 2.1 installed, which is a YARN-based MapReduce framework. Make sure your machines can access the internet.
Install R
su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm'
sudo yum update
sudo yum install R
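Once the install finishes, it is worth confirming that R starts and checking which version you got before moving on (the exact version string depends on what EPEL currently provides). A quick interactive check:

$ R
> R.version.string   # prints the installed R version
> q()                # quit the session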
Install RStudio Server
$ wget http://download2.rstudio.org/rstudio-server-0.98.507-x86_64.rpm
$ sudo yum install --nogpgcheck rstudio-server-0.98.507-x86_64.rpm
To start the service, run: rstudio-server start
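Once the service is running, RStudio Server serves its web UI on port 8787 by default, so a quick way to confirm the installation is to open it in a browser (substitute your own server host name; regular non-root Linux accounts on the machine can log in):

http://your-server-host:8787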
Install RHadoop
Download the *.tar.gz files from https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads and save them to a folder on the CentOS machine:
plyrmr-0.3.0, rmr-3.1.2, rhdfs-1.0.8, rhbase-1.2.1
Execute the command with root privileges:
R CMD INSTALL "/{..}/rhadoop/rmr2_3.1.1.tar.gz"
Tips:
a. You should run the installation on each data node of the cluster so that the RHadoop packages are available on every node.
b. RStudio also provides a way to install packages from its UI, but root is not allowed to log into the UI, so running the command line is recommended.
c. All packages should be installed under /usr/lib64/R, otherwise the R job will fail on the MapReduce nodes.
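The RHadoop packages also depend on a number of CRAN packages, and installing those up front (from an R session started as root, so they land under /usr/lib64/R as well) avoids failures during R CMD INSTALL. A minimal sketch; the exact dependency list can vary with the RHadoop versions you downloaded:

# CRAN dependencies commonly needed by rmr2 and rhdfs
install.packages(c("Rcpp", "RJSONIO", "digest", "functional",
                   "reshape2", "stringr", "plyr", "caTools", "rJava"),
                 repos = "http://cran.r-project.org")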
Set Environment Variables
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")
Sys.getenv("HADOOP_CMD")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming-2.4.0.2.1.2.0-402.jar")
Sys.getenv("HADOOP_STREAMING")
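With the environment variables in place, a quick way to confirm that R can reach HDFS before submitting any jobs is to initialize rhdfs and list a directory (the path "/" here is just an example):

library(rhdfs)
hdfs.init()    # uses HADOOP_CMD to locate the Hadoop binaries
hdfs.ls("/")   # should return the contents of the HDFS root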
Test an R Program in RStudio
Write a simple R program to test things out. Basically, you implement the map and reduce functions in R, then pass them to a mapreduce() call, something like:
mapreduce(input = "/user/hadoop/hdp/in",
          input.format = make.input.format("csv", sep = ","),
          output = "/user/hadoop/hdp/out",
          output.format = "csv",
          map = mapper,
          reduce = reducer)
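For reference, the mapper and reducer in the call above are ordinary R functions that emit key/value pairs via rmr2's keyval(). A minimal word-count-style sketch, assuming the first column of the CSV input holds text (the column choice and logic are purely illustrative):

library(rmr2)
# map: split each record's first field into words and emit (word, 1)
mapper <- function(key, value) {
  words <- unlist(strsplit(as.character(value[[1]]), " "))
  keyval(words, 1)
}
# reduce: sum the counts collected for each word
reducer <- function(key, counts) {
  keyval(key, sum(counts))
}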
R is a new language for most Java and BI consultants, so stepping into the data mining and machine learning field opens up a whole new world. Knowledge of linear algebra, probability and statistics, data visualization, Hadoop, and Java is all helpful when working through an RHadoop project.