Get R Running over YARN-based MapReduce

Among mathematical and statistical languages and tools such as SAS, SPSS, and Matlab, R is a strong choice: it provides the environment and essential packages for statistical computing and graphics. It is free and open, and it lets users develop custom packages.

Alongside R there is RStudio, a powerful IDE for R. It is also free (the advanced options and support require a license), open source, and works well on Windows, Mac, and Linux. RStudio comes in a desktop version and a server version; the server version provides a web-based user interface.

If we aim to perform data mining and machine learning on big data in a Hadoop environment, Mahout is a strong option: it offers core algorithms for clustering, classification, and collaborative filtering. Compared to Mahout, R has its own advantages in algorithms and computation speed. Furthermore, the Mahout community has announced that it will reject new MapReduce algorithm implementations beginning in May of this year.

Recently GDC China's big data lab team set up R and RStudio in a Hadoop cluster environment. We gained some experience and learned some lessons in putting these components together. The steps are as follows.

We use CentOS 6.5 (64-bit), and the cluster machines run Hortonworks Data Platform 2.1, which provides YARN-based MapReduce. Make sure your machines can access the internet.

Install R Library

su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm'

sudo yum update

sudo yum install R

 

Install RStudio Server

$ wget http://download2.rstudio.org/rstudio-server-0.98.507-x86_64.rpm

$ sudo yum install --nogpgcheck rstudio-server-0.98.507-x86_64.rpm

To start the service, run with root privileges: rstudio-server start

Install RHadoop

Download the *.tar.gz files from https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads and save them to a single folder on the CentOS machine. The packages are:

plyrmr-0.3.0, rmr2-3.1.2, rhdfs-1.0.8, rhbase-1.2.1

Run the install command with root privileges for each package, substituting the file name you downloaded, for example:

R CMD INSTALL "/{..}/rhadoop/rmr2_3.1.1.tar.gz"

Tips:

a. Run the installation on every data node of the cluster, so that the RHadoop packages are available wherever tasks execute.

b. RStudio can also install packages from its UI, but root is not allowed to log in to the UI, so running the command above is recommended.

c. All packages should be installed under /usr/lib64/R; otherwise the R job will fail on the MapReduce nodes.
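The installation step and tips above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not the procedure we actually ran: the host names, the `NODES`, `R_LIB` and `DRY_RUN` variables, and the `run` and `install_rhadoop` helpers are hypothetical. It also assumes passwordless root SSH to each node, that the tarballs already sit at the same path on every node, and that the library folder under /usr/lib64/R is /usr/lib64/R/library.

```shell
#!/usr/bin/env bash
# Sketch only: install RHadoop tarballs on every node over SSH.
# NODES, R_LIB, DRY_RUN, run and install_rhadoop are hypothetical names.

NODES="${NODES:-datanode1 datanode2}"     # placeholder host names
R_LIB="${R_LIB:-/usr/lib64/R/library}"    # assumed library folder under /usr/lib64/R
DRY_RUN="${DRY_RUN:-1}"                   # 1 = print commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: install_rhadoop TARBALL...
# Tarballs are installed in the order given, so list a package
# before any other package that depends on it.
install_rhadoop() {
  local node pkg
  for node in $NODES; do
    for pkg in "$@"; do
      run ssh "root@$node" R CMD INSTALL -l "$R_LIB" "$pkg"
    done
  done
}
```

With the default `DRY_RUN=1` the function only prints one `ssh ... R CMD INSTALL` line per node and package, which makes it easy to review the plan before setting `DRY_RUN=0`.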

Set Environment Variables in R

Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")

Sys.getenv("HADOOP_CMD")

Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming-2.4.0.2.1.2.0-402.jar")

Sys.getenv("HADOOP_STREAMING")
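The streaming jar's file name carries the Hadoop version, so the hard-coded path above changes between releases. As a hedged alternative, the same two variables could be exported from the shell before R is launched. The `find_streaming_jar` helper and the `STREAMING_DIR` variable below are hypothetical names introduced for illustration, and the directory is simply the one used in the path above.

```shell
#!/usr/bin/env bash
# Sketch only: derive HADOOP_STREAMING from whichever jar version is installed.
# find_streaming_jar and STREAMING_DIR are hypothetical, not part of Hadoop or rmr2.

# Print the first hadoop-streaming-*.jar found in the given directory.
find_streaming_jar() {
  local dir="$1" jar
  for jar in "$dir"/hadoop-streaming-*.jar; do
    if [ -f "$jar" ]; then
      echo "$jar"
      return 0
    fi
  done
  return 1   # nothing matched
}

# Directory taken from the path used in this article; adjust as needed.
STREAMING_DIR="${STREAMING_DIR:-/usr/lib/hadoop-mapreduce}"

export HADOOP_CMD="${HADOOP_CMD:-/usr/bin/hadoop}"
if jar="$(find_streaming_jar "$STREAMING_DIR")"; then
  export HADOOP_STREAMING="$jar"
  echo "HADOOP_STREAMING=$HADOOP_STREAMING"
else
  echo "no hadoop-streaming jar found under $STREAMING_DIR" >&2
fi
```

An R process started from the same shell inherits these values and sees them through Sys.getenv. Sessions started by RStudio Server may not inherit them, in which case the Sys.setenv calls remain the simplest option.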

Test an R Program in RStudio

Write a simple R program to test the setup. You implement the map and reduce functions in R and then pass them to mapreduce, something like:

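library(rmr2)  # provides mapreduce() and make.input.format()

# mapper and reducer are R functions that must be defined before this call
mapreduce(input="/user/hadoop/hdp/in",
          input.format=make.input.format("csv", sep = ","),
          output="/user/hadoop/hdp/out",
          output.format="csv",
          map=mapper,
          reduce=reducer
)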
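Before running the job, the input path has to exist in HDFS, and Hadoop jobs normally refuse to write into an output directory that already exists. The preparation could be sketched as below. The `prepare_hdfs` function, the `HADOOP` variable, and the local file name `data.csv` are assumptions for illustration; only the two HDFS paths come from the call above.

```shell
#!/usr/bin/env bash
# Sketch only: prepare the HDFS paths used by the mapreduce() call.
# HADOOP and prepare_hdfs are hypothetical names.

HADOOP="${HADOOP:-hadoop}"    # set HADOOP=echo to preview the commands

# Usage: prepare_hdfs LOCAL_CSV HDFS_INPUT_DIR HDFS_OUTPUT_DIR
prepare_hdfs() {
  local local_csv="$1" in_dir="$2" out_dir="$3"
  "$HADOOP" fs -mkdir -p "$in_dir"            # create the input folder if missing
  "$HADOOP" fs -put "$local_csv" "$in_dir"/   # upload the CSV file
  "$HADOOP" fs -rm -r -f "$out_dir"           # clear output left by an earlier run
}

# Example call (data.csv is a placeholder file name):
#   prepare_hdfs data.csv /user/hadoop/hdp/in /user/hadoop/hdp/out
# After the R job finishes, the result could be inspected with:
#   hadoop fs -cat /user/hadoop/hdp/out/part-*
```

Setting `HADOOP=echo` prints the three commands without touching the cluster, which is a cheap way to check the paths first.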

R is a new language to most Java and BI consultants, so stepping into data mining and machine learning opens up a new world. Knowledge of linear algebra, probability and statistics, data visualization, Hadoop, and Java is helpful when working through an RHadoop project.

Kent Jiang

I currently work at Perficient China GDC in Hangzhou as a Lead Technical Consultant. I have 8 years of experience in the IT industry across Java, CRM, and BI technologies. My areas of technical interest include business analytics, project planning, MDM, and quality assurance.