SparkR for Data Scientists

Although the title Data Scientist is not mentioned as often as other IT job titles, the role has been around for a while and is becoming more important with the growth of the Internet and eCommerce. What kind of skills should a data scientist have? The list could be long, but I think a data scientist at least needs to be good at:

  • Storytelling about a subject
  • Understanding of structured and unstructured data
  • Modeling data in a traditional way such as an EDW, or a modern way like Hadoop
  • Statistical analysis and math knowledge
  • Broad knowledge of modern data platforms like Hadoop, Spark and NoSQL
  • Open-mindedness and a curiosity about new things

Also, to some extent, it is mandatory for this role to know programming languages such as Java, Perl, Python or Scala. In general, the role combines core skills in analysis, computation, math and engineering.

Languages like MATLAB and R are very good choices for statistical engineering. R is dedicated to statistical computing and graphics and comes with many built-in and third-party packages. It operates on vectors, data frames and arrays, and provides many mathematical and statistical functions.
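As a quick illustration, here is a minimal sketch in plain R showing vectors, a data frame and a couple of built-in statistical functions (the sample values are made up for this example):

# Vectors: basic building blocks of R
heights <- c(1.68, 1.75, 1.82, 1.59)
weights <- c(62, 78, 85, 54)

# A data frame built from the two vectors
people <- data.frame(height = heights, weight = weights)

# Built-in statistical functions
mean(people$height)                 # average height
cor(people$height, people$weight)   # correlation between the two columns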

In a previous post I talked about RHadoop, a standalone project that brings R and Hadoop together, enabling people to make use of a distributed computing engine from within the R environment.

The Spark community released SparkR in version 1.4, a good initiative that gives data scientists more flexibility to produce value out of data lakes with a powerful computation engine. We recently upgraded our Spark nodes to version 1.4, and the R API is designed similarly to the existing Java, Scala and Python APIs.

The following are quick steps to take a first bite of SparkR:

  1. Sign in to the master node of your cluster with root privileges.
  2. Install the R environment; by default, there is no R runtime in your cluster.
  3. Go to the folder /spark-1.4.0-bin-hadoop2.6/bin.
  4. Execute the shell script ./sparkR.
  5. This opens a standalone R interface where you can write R code.
  6. Test with the example below; refer to the API docs for further reference.

# Initialize SparkContext and SQLContext. These two variables are already
# initialized for you when you launch the sparkR shell script.
# sc <- sparkR.init(appName="SparkR-DataFrame-example")
# sqlContext <- sparkRSQL.init(sc)

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))

# Convert the local data frame to a SparkR DataFrame
df <- createDataFrame(sqlContext, localDF)

# Print the data frame
head(df)

   name age
1  John  19
2 Smith  23
3 Sarah  18

The local R functionality is inherited, and many packages can be used directly. The main thing to focus on is converting data from an external data source into a (distributed) DataFrame. External sources include traditional databases via JDBC, Hive, JSON, Parquet files, etc.
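To make that concrete, here is a minimal sketch of loading external sources as distributed DataFrames with SparkR's read.df; the file paths are hypothetical placeholders, and sqlContext is the variable created by the sparkR shell as shown above:

# Load a JSON file and a Parquet file as distributed DataFrames
peopleJson <- read.df(sqlContext, "/data/people.json", source = "json")
eventsParquet <- read.df(sqlContext, "/data/events.parquet", source = "parquet")

# They behave like the DataFrame created from the local data.frame above
head(peopleJson)
printSchema(eventsParquet)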


