Although the title "Data Scientist" is not mentioned as often as other IT job titles, it has been around in the IT world for a while and is becoming more important with the popularity of the Internet and eCommerce. What kind of skills should a data scientist have? It could be a long list, but I think a data scientist at least needs to be good at:
- Storytelling about the subject at hand
- Understanding of structured and unstructured data
- Modeling data in a traditional way such as an EDW, or a modern way like Hadoop
- Statistical analysis and math knowledge
- Broad knowledge of modern data platforms like Hadoop, Spark, and NoSQL
- Open-mindedness and curiosity about new things
Also, to some extent, it is mandatory for this role to know some programming languages such as Java, Perl, Python, and Scala. In general, the role is a combination of core skills in analysis, computation, math, and engineering.
MATLAB and R are both very good choices for the statistically minded engineer. R is dedicated to statistical computing and graphics and comes with many built-in and third-party packages. It operates on vectors, data frames, and arrays and provides many mathematical and statistical functions.
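As a minimal illustration of those building blocks (the values here are made up for the example), plain R handles vectors, data frames, and descriptive statistics out of the box:

```r
# A numeric vector and some built-in statistical functions
ages <- c(19, 23, 18, 25, 31)
mean(ages)   # arithmetic mean -> 23.2
sd(ages)     # sample standard deviation

# A data frame: the tabular structure most analysis revolves around
df <- data.frame(name = c("John", "Smith", "Sarah"),
                 age  = c(19, 23, 18))
summary(df$age)   # quick descriptive statistics for a column
```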
In a previous post I talked about RHadoop, a standalone project that brings R and Hadoop together. It enables people to make use of a distributed computing engine from within the R environment.
The Spark community released SparkR in version 1.4, a good initiative that gives data scientists more flexibility to produce value out of data lakes with a powerful computation engine. We recently upgraded our Spark nodes to version 1.4, and the R feature is designed similarly to the existing Java, Scala, and Python APIs.
The following are quick steps to get a first taste of SparkR:
- Sign in to the cluster's master node with root privileges.
- Install the R environment; by default, there is no R runtime in the cluster.
- Go to the folder /spark-1.4.0-bin-hadoop2.6/bin.
- Execute the shell script ./sparkR.
- This drops you into a standalone R interface where you can write R code.
- Test with the example below; refer to the API docs for further details.
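On a typical master node, the setup steps above might look like the following sketch (assuming a yum-based distribution and the binary Spark 1.4 tarball unpacked at the path shown; adjust both for your environment):

```shell
# Install the R runtime (package name varies by distribution)
sudo yum install -y R

# Go to the bin directory of the Spark 1.4 distribution
cd /spark-1.4.0-bin-hadoop2.6/bin

# Launch the interactive SparkR shell
./sparkR
```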
# Initialize the SparkContext and SQLContext; these two variables are
# already initialized for you when you start the sparkR script
# sc <- sparkR.init(appName="SparkR-DataFrame-example")
#sqlContext <- sparkRSQL.init(sc)
# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkR DataFrame
df <- createDataFrame(sqlContext, localDF)
# Print the first rows of the data frame
head(df)
#    name age
# 1  John  19
# 2 Smith  23
# 3 Sarah  18
The local R functionality is inherited, and many packages can be used directly. The main task left to us is converting data from external data sources into (distributed) DataFrames. External sources include traditional databases via JDBC, Hive, JSON, Parquet files, etc.
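As a sketch of that conversion step, assuming a running SparkR 1.4 shell (so `sc` and `sqlContext` already exist) and purely illustrative file paths and table names:

```r
# JSON file -> distributed DataFrame
people <- read.df(sqlContext, "/data/people.json", source = "json")

# Parquet file -> distributed DataFrame
events <- read.df(sqlContext, "/data/events.parquet", source = "parquet")

# Hive table -> distributed DataFrame (requires a HiveContext)
# hiveContext <- sparkRHive.init(sc)
# sales <- sql(hiveContext, "SELECT * FROM sales")

# Pull a (small) distributed DataFrame back into a local R data.frame
# for plotting or base-R statistics
localPeople <- collect(people)
```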