
Spark: DataFrame Basic Methods


A DataFrame is a key abstraction in Spark that represents structured data and allows for easy manipulation and analysis. In this blog post, we'll explore the basic DataFrame methods available in Spark and how they can be used for data-processing tasks, with examples.

Dataset:

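The examples below assume a small country/capital DataFrame along these lines (the population values are placeholders); the imports shown here are reused by the later snippets.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// On Databricks the SparkSession is already available as `spark`
val spark = SparkSession.builder().appName("DataFrameBasics").getOrCreate()
import spark.implicits._

// Assumed sample data: country, capital, and population (kept as strings; values are placeholders)
val df = Seq(
  ("India", "New Delhi", "1400000000"),
  ("USA", "Washington", "330000000"),
  ("France", "Paris", "68000000")
).toDF("country", "capital", "population")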

There are many DataFrame methods, which are subclassified into transformations and actions based upon the operation performed; you can learn more about these in this blog – Spark RDD Operations (perficient.com). Let's look at some of the basic, regularly used ones. Most of them need to be imported from Spark packages. In the examples, the import statements can be found at the top, and most of the methods we'll see below are SQL functions applied to a DataFrame.

Viewing the Data within the DataFrame:

show:


.show() is used to view the data in the DataFrame. It can be used from a notebook as well as when running a job.
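
A quick example with the sample df defined above:

df.show()          // prints the first 20 rows in tabular form
df.show(5, false)  // first 5 rows, without truncating long values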

display:


display() can be used when we need to view the data in a notebook; it can't be used when running a job.
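
For instance, in a Databricks notebook cell:

display(df)   // renders the DataFrame as an interactive table in the notebook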

Selecting and Filtering from the DataFrame:

select:


The .select() method is used to pick the columns we need from the DataFrame; for example, df.select("country") returns a single column. We can select multiple columns by adding them to the select statement:

df.select("country","capital")

head:


The .head() method can be used to get the first row of the DataFrame, which is handy when we need the maximum or minimum value of a particular column after ordering by it.
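
For example, ordering first and then calling head() returns a single extreme row (using the assumed population column from the sample df):

val firstRow = df.head()                                               // first Row of the DataFrame
val largest  = df.orderBy(col("population").cast("long").desc).head() // row with the highest population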

take:


The .take(n) method returns the specified number of rows from the top of the DataFrame.
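
For example:

val firstTwo = df.take(2)   // Array[Row] containing the first two rows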

tail:


The .tail(n) method returns the specified number of rows from the bottom of the DataFrame.
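
For example (tail is available from Spark 3.0 onwards):

val lastTwo = df.tail(2)    // Array[Row] containing the last two rows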

filter:


The .filter() method is used to filter rows based on a condition on a column. The condition can also be provided in SQL-style syntax within the filter method.
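
Both the Column-expression form and the SQL-style string form work, for example:

df.filter(col("country") === "India")   // Column expression
df.filter("country = 'India'")          // SQL-style condition string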

drop:


The .drop() method can be used to drop a particular column or a set of columns from the DataFrame. The syntax for dropping multiple columns is:

df.drop("column1","column2")

DataFrame Count:

count:


.count() gives us the number of rows in the DataFrame. We can also combine the select and distinct methods to find the distinct count of a particular column with this syntax:

df.select("column_name").distinct.count

Data Manipulation on DataFrame:

withColumn:


.withColumn() is used to add a new column or modify an existing one with a transformation. If the column already exists, the transformation is applied to that column in place. If the column is not present, a new column with the transformed value is added at the end of the DataFrame.
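
A small sketch of both cases, using the sample df:

val added       = df.withColumn("country_upper", upper(col("country"))) // new column appended
val overwritten = df.withColumn("capital", upper(col("capital")))       // existing column transformed in place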

withColumnRenamed:


.withColumnRenamed() renames an existing column: the current column name is given first and the desired new name last.
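
For example (capital_city is just an illustrative new name):

val renamed = df.withColumnRenamed("capital", "capital_city")   // old name first, new name second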

upper:


upper() changes the values of a string column to uppercase.
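
For example:

df.select(upper(col("country")).alias("country_upper")).show()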

lower:


lower() changes the values of a string column to lowercase.
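
For example:

df.select(lower(col("capital")).alias("capital_lower")).show()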

lit:


lit() can be used to add or modify a column with a hardcoded (literal) value.
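
For example, adding a hardcoded column (continent is a hypothetical column name):

df.withColumn("continent", lit("Unknown")).show()   // the same literal value for every row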

cast:


The cast() method changes the datatype of a column from one type to another.
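
For example, casting the assumed string population column to a numeric type:

val typed = df.withColumn("population", col("population").cast("long"))
typed.printSchema()   // population is now of type long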

na.fill:


.na.fill("") is used to fill null values in columns with a default value.
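
For example:

df.na.fill("")                      // replace nulls in all string columns with an empty string
df.na.fill("N/A", Seq("capital"))   // restrict the fill to specific columns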

substring:


substring() is used to extract part of a string column, given a start position and a length.
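
For example, taking the first three characters of the country name:

df.select(substring(col("country"), 1, 3).alias("country_prefix")).show()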

length:


length() determines the length of each value in a string column.
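
For example:

df.select(col("country"), length(col("country")).alias("name_length")).show()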

concat:


concat() concatenates two or more strings (columns or literals) together.
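
For example, combining two columns with a literal separator:

df.select(concat(col("country"), lit(" - "), col("capital")).alias("country_capital")).show()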

trim:


trim() can be used to remove leading and trailing spaces in a column. If only one side needs to be trimmed, rtrim and ltrim can be used.
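
For example:

df.withColumn("country", trim(col("country")))   // removes spaces on both sides
// ltrim(col("country")) and rtrim(col("country")) trim only one side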

Distinct and Union on DataFrame:

union:


union() on a DataFrame gives you all the rows from both DataFrames, unlike SQL UNION, which returns only distinct rows. To get the distinct rows, we can chain the distinct() function.
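
A quick sketch, assuming df2 is another DataFrame with the same schema as df:

val allRows      = df.union(df2)              // keeps duplicates, like SQL UNION ALL
val distinctRows = df.union(df2).distinct()   // mimics SQL UNION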

distinct:


distinct() returns the distinct rows from the DataFrame.
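
For example:

df.distinct().show()                     // distinct rows of the whole DataFrame
df.select("country").distinct().show()   // distinct values of a single column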

Sorting and Ordering on DataFrame:

orderBy:


orderBy can be used on a DataFrame to sort it by a particular column. By default, Spark sorts in ascending order; we can explicitly call .desc on the column to sort in descending order.
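
For example:

df.orderBy(col("country")).show()        // ascending by default
df.orderBy(col("country").desc).show()   // descending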

Optimization and Performance:

repartition:


You can use repartition to increase or decrease the number of partitions of a DataFrame. For more information, see the blog titled Spark Partition: An Overview / Blogs / Perficient.
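
For example:

val repartitioned = df.repartition(8)              // target 8 partitions
val byColumn      = df.repartition(col("country")) // repartition by a column
println(repartitioned.rdd.getNumPartitions)        // check the resulting partition count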

There are further optimization methods, like persist and cache, which you can read more about in this blog: Spark: Persistence Storage Levels / Blogs / Perficient.

Joins:

DataFrames can be joined with one another using the syntax below:

df1.join(df2, df1("col1") === df2("col2"), "inner")

Find more about DataFrame joins in this blog post titled: Spark: Dataframe joins / Blogs / Perficient

References:

Official Spark Documentation: https://spark.apache.org/docs/2.3.0/sql-programming-guide.html

Databricks DataFrame Guide: https://www.databricks.com/spark/getting-started-with-apache-spark/dataframes

In conclusion, this blog covered some basic DataFrame methods along with their syntax and examples.

