From Spark 1.4 onward, SparkR provides several ways to load external data sources such as RDBMS tables, JSON, Parquet, and Hive files. Before diving into SparkR, it helps to know a little about R. The local data frame is a central concept and data structure in R, and you can use local data frames (DFs) in SparkR as well. When external data is loaded into SparkR, it can be assigned to a DataFrame, and the user can then perform DataFrame operations such as grouping and aggregation.
R Vectors and Data Frames
An R vector is a sequence of data elements of the same basic type. For example, the vector below consists of five members, all of them strings.
c("aa", "bb", "cc", "dd", "ee")
returns: "aa" "bb" "cc" "dd" "ee"
An R data frame is an important data structure, and the same concept was also introduced in Spark. It is a list of vectors of equal length. In the example below, the data frame consists of three vectors of different types: numeric, character (string), and Boolean (logical).
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
To construct the data frame:
df = data.frame(n, s, b)
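Putting the pieces above together, a short base-R sketch shows that the resulting data frame is a table whose columns are the original vectors and can be accessed by name:

```r
# Build the three vectors and combine them into a data frame
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)

# Each column is one of the original vectors
print(df$n)      # the numeric column: 2 3 5
print(nrow(df))  # 3 rows
print(ncol(df))  # 3 columns
```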
Example: Loading Oracle Tables into SparkR
An Oracle database installation ships with sample schemas and data. In the following steps I will import the HR.EMPLOYEES table into a SparkR DataFrame.
- Install the Oracle Driver on Spark
By default, no database (JDBC) driver is installed in the Spark directory. You can download one from the Oracle site, such as ojdbc6.jar, and upload it to the Spark master node. To include this extra library, add a line to the "spark-env.sh" file. In my testing, it was sufficient to stage the Oracle driver on the master node rather than distributing it to the slaves.
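As a sketch of that spark-env.sh line, assuming the jar was staged at /opt/spark/lib (a hypothetical path; use wherever you uploaded ojdbc6.jar), the driver classpath can be extended like this; the jar can also be passed at launch time instead of editing the file:

```shell
# In conf/spark-env.sh: append the Oracle JDBC jar to the classpath
# (/opt/spark/lib is an assumed staging path, not a Spark default)
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/spark/lib/ojdbc6.jar

# Alternatively, supply the jar when starting the SparkR shell:
# sparkR --driver-class-path /opt/spark/lib/ojdbc6.jar
```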
- Test the Connection and Load Data
To test the setup, load a table from the Oracle instance by running the following script in the SparkR shell. Each row in the Oracle source table is converted into a row of the DataFrame.
df <- loadDF(sqlContext, source="jdbc", url="jdbc:oracle:thin:user/password@*.*.*.*:1521:orcl", dbtable="hr.employees")
If you want to load Oracle tables outside of SparkR but still within the Spark environment, you can run the equivalent command in the Scala Spark shell.
val df = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:user/password@*.*.*.*:1521:orcl", "dbtable" -> "hr.employees"))
Furthermore, once the DataFrame is initialized, you can run further operations on it, such as count, sum, min, max, and average.
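As a minimal sketch of such operations in the SparkR 1.4 API (assuming df is the DataFrame loaded above, and using DEPARTMENT_ID and SALARY, which are standard columns of the HR.EMPLOYEES sample table):

```r
# Total number of rows in the DataFrame
count(df)

# Group by department and aggregate the salary column
agg_df <- agg(groupBy(df, df$DEPARTMENT_ID),
              min_sal = min(df$SALARY),
              max_sal = max(df$SALARY),
              avg_sal = avg(df$SALARY))

# Bring the aggregated result back as a local R data frame
head(collect(agg_df))
```

Note that count, min, max, and avg here are the SparkR DataFrame functions, which run distributed on the cluster, not the base-R functions of the same names.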