

SparkR for Data Scientists

Although the title Data Scientist is not mentioned as often as other IT job titles, it has been around for a while and is becoming more important with the growth of the Internet and eCommerce. What skills should a data scientist have? The list could be long, but I think a data scientist at least needs to be good at:

  • Storytelling around a subject
  • Understanding structured and unstructured data
  • Modeling data in a traditional way such as an EDW, or a modern way such as Hadoop
  • Statistical analysis and mathematics
  • Broad knowledge of modern data platforms such as Hadoop, Spark, and NoSQL
  • Open-mindedness and curiosity about new things

Also, to some extent, it is mandatory for this role to know some programming languages such as Java, Perl, Python, and Scala. In general, the role is a combination of core skills in analysis, computation, math, and engineering.

MATLAB and R are both very good choices for statistical work. R is dedicated to statistical computing and graphics and offers many built-in and contributed packages. It operates on vectors, data frames, and arrays, and provides many mathematical and statistical functions.
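As a quick illustration of the base-R workflow that SparkR builds on, the snippet below (plain local R, no Spark involved) builds a small data frame and applies a few built-in statistical functions:

```r
# A local data.frame: one row per person
people <- data.frame(name = c("John", "Smith", "Sarah"),
                     age  = c(19, 23, 18))

# Built-in statistical functions operate directly on vectors
mean(people$age)   # average age
sd(people$age)     # standard deviation
summary(people)    # per-column summary statistics
```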

In a previous post I talked about RHadoop, a standalone project that brings R and Hadoop together, enabling people to make use of a distributed computing engine from the R environment.

The Spark community released SparkR in version 1.4, a good initiative that lets data scientists be more flexible in producing value out of data lakes with a powerful computation engine. We recently upgraded our Spark nodes to version 1.4, and the R API is designed similarly to the existing Java, Scala, and Python APIs.

The following are quick steps to get a first taste of SparkR:

  1. Sign in to a master node of your cluster with root privileges.
  2. Install the R environment; by default there is no R runtime in the cluster.
  3. Go to the folder /spark-1.4.0-bin-hadoop2.6/bin.
  4. Execute the shell script ./sparkR.
  5. This drops you into a standalone R interface where you can write R code.
  6. Test with the example below; refer to the SparkR API docs for further reference.
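The steps above can be sketched as a shell session; the package manager and install path are assumptions, so adjust them for your distribution and wherever your Spark 1.4 distribution is unpacked:

```shell
# Install an R runtime on the master node (yum shown here; use apt-get on Debian/Ubuntu)
sudo yum install -y R

# Go to the bin directory of the Spark 1.4 distribution
cd /spark-1.4.0-bin-hadoop2.6/bin

# Launch the SparkR shell; it drops you into a standalone R prompt
./sparkR
```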

# Initialize the SparkContext and SQLContext. These two variables are already
# initialized for you when you launch the sparkR shell.
# sc <- sparkR.init(appName = "SparkR-DataFrame-example")
# sqlContext <- sparkRSQL.init(sc)

# Create a simple local data.frame
localDF <- data.frame(name = c("John", "Smith", "Sarah"), age = c(19, 23, 18))

# Convert the local data frame to a SparkR DataFrame
df <- createDataFrame(sqlContext, localDF)

# Print the first rows of the DataFrame
head(df)
#    name age
# 1  John  19
# 2 Smith  23
# 3 Sarah  18

All local R functionality is inherited, and many packages can be used directly. The main thing to focus on is converting data from external data sources into (distributed) DataFrames. External sources include traditional databases via JDBC, Hive, JSON, Parquet files, etc.
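As a sketch of that conversion step, the SparkR 1.4 API provides read functions for several sources; the file paths and table names below are hypothetical:

```r
# Read a JSON file into a distributed DataFrame
# ("people.json" is a hypothetical path on HDFS or the local filesystem)
peopleJson <- jsonFile(sqlContext, "people.json")

# Read a Parquet file the same way
peopleParquet <- parquetFile(sqlContext, "people.parquet")

# The generic entry point: read.df(sqlContext, path, source)
logs <- read.df(sqlContext, "logs.json", "json")

# With a HiveContext, existing Hive tables can be queried with SQL
# hiveContext <- sparkRHive.init(sc)
# users <- sql(hiveContext, "SELECT name, age FROM users")
```

All of these return distributed DataFrames, so the same DataFrame operations shown earlier apply regardless of where the data came from.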


Kent Jiang

I currently work at the Perficient China GDC in Hangzhou as a Lead Technical Consultant. I have 8 years of experience in the IT industry across Java, CRM, and BI technologies. My areas of interest include business analytics, project planning, MDM, quality assurance, etc.
