From Spark 1.4 onward, SparkR provides several ways to load external data sources such as RDBMS tables, JSON, Parquet, and Hive files. Before diving into SparkR, it helps to know a little about R. The local data frame is a central concept and data structure in R, and you can use local data frames (DFs) in SparkR as well. When external data is loaded into SparkR, it can be assigned to a DataFrame, and the user can then perform DataFrame operations such as grouping and aggregation.
R Vectors and Data Frames
An R vector is a sequence of data elements of the same basic type. For example, the vector below consists of five members, all of them strings.
c("aa", "bb", "cc", "dd", "ee")
returns: "aa" "bb" "cc" "dd" "ee"
An R data frame is an important data structure, and the same concept was also introduced in Spark. It is a list of vectors of equal length. In the example below, the data frame consists of three vectors of different types: numeric, character (string), and Boolean (logical).
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
To construct the data frame:
df = data.frame(n, s, b)
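Putting the pieces above together, a short base-R sketch shows that the resulting data frame is a table whose columns are the original vectors and can be accessed by name:

```r
# Build the three vectors and combine them into a data frame
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)

# Each column is one of the original vectors
print(df$n)      # the numeric column: 2 3 5
print(nrow(df))  # 3 rows
print(ncol(df))  # 3 columns
```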
Example: Loading Oracle Tables into SparkR
An Oracle database installation ships with sample schemas and data. In the following steps I will import the HR.EMPLOYEES table into a SparkR DataFrame.
- Install the Oracle Driver on Spark
By default, no database (JDBC) driver is installed in the Spark directory. You can download one from the Oracle site, such as ojdbc6.jar, and upload it to the Spark master node. To include this extra library, add a line to the "spark-env.sh" file. In my testing, it was sufficient to stage the Oracle driver on the master node rather than distributing it to the slaves.
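As a sketch of that spark-env.sh line, assuming the jar was staged at /opt/spark/lib (a hypothetical path; use wherever you uploaded ojdbc6.jar), the driver classpath can be extended like this; the jar can also be passed at launch time instead of editing the file:

```shell
# In conf/spark-env.sh: append the Oracle JDBC jar to the classpath
# (/opt/spark/lib is an assumed staging path, not a Spark default)
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/spark/lib/ojdbc6.jar

# Alternatively, supply the jar when starting the SparkR shell:
# sparkR --driver-class-path /opt/spark/lib/ojdbc6.jar
```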
- Test the Connection and Load Data
To test the setup, load a table from the Oracle instance by running the following script in the SparkR shell. Each row in the Oracle source table is converted into a row of the DataFrame.
df <- loadDF(sqlContext, source="jdbc", url="jdbc:oracle:thin:user/password@*.*.*.*:1521:orcl", dbtable="hr.employees")
If you want to load Oracle tables outside of SparkR but still within the Spark environment, you can run the equivalent command in the Scala Spark shell.
val df = sqlContext.load("jdbc", Map("url" -> "jdbc:oracle:thin:user/password@*.*.*.*:1521:orcl", "dbtable" -> "hr.employees"))
Furthermore, once the DataFrame is initialized, you can run further operations on it, such as count, sum, min, max, and average.
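As a minimal sketch of such operations in the SparkR 1.4 API (assuming df is the DataFrame loaded above, and using DEPARTMENT_ID and SALARY, which are standard columns of the HR.EMPLOYEES sample table):

```r
# Total number of rows in the DataFrame
count(df)

# Group by department and aggregate the salary column
agg_df <- agg(groupBy(df, df$DEPARTMENT_ID),
              min_sal = min(df$SALARY),
              max_sal = max(df$SALARY),
              avg_sal = avg(df$SALARY))

# Bring the aggregated result back as a local R data frame
head(collect(agg_df))
```

Note that count, min, max, and avg here are the SparkR DataFrame functions, which run distributed on the cluster, not the base-R functions of the same names.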