PySpark - Coding Standards & Best Practices / Blogs / Perficient

Purpose:

The primary objective for this document is to provide awareness and establish clear understanding of coding standards and best practices to adhere while developing PySpark components. Best Practices are any procedure that is accepted as being the most effective either by consensus or prescription. Practices can range from stylistic to in-depth design methodologies.

In an attempt to document the best practices for Databricks, it becomes essential to cover some of the key Python best practices. So, here let’s start with general Python standards and deep dive into PySpark eventually.

Code Layout:

Indentation

Use 4 spaces for each indentation level. Continuation lines should align the wrapped elements vertically as hanging indent. While using hanging indent, there should be no arguments in the first line and further indentation should be used to clearly distinguish it as a continuation line.

# CORRECT:

# Aligned with opening delimiter

foo = function_name(var_one, var_two,

var_three, var_four)

# Add 4 spaces and an extra level of indentation to distinguish the arguments with the rest

deffunction_name(

var_one, var_two,

var_three, var_four):

print(var_one)

# Hanging indent should add a level

foo = function_name(

var_one, var_two,

var_three, var_four)

# WRONG:

# Arguments on first line forbidden when not using vertical alignment.

foo = long_function_name(var_one, var_two,

var_three, var_four)

# Further indentation required as indentation is not distinguishable.

deflong_function_name(

var_one, var_two, var_three,

var_four):

print(var_one)

# CORRECT: # Aligned with opening delimiter foo = function_name(var_one, var_two, var_three, var_four) # Add 4 spaces and an extra level of indentation to distinguish the arguments with the rest def function_name( var_one, var_two, var_three, var_four): print(var_one) # Hanging indent should add a level foo = function_name( var_one, var_two, var_three, var_four) # WRONG: # Arguments on first line forbidden when not using vertical alignment. foo = long_function_name(var_one, var_two, var_three, var_four) # Further indentation required as indentation is not distinguishable. def long_function_name( var_one, var_two, var_three, var_four): print(var_one)

# CORRECT:

# Aligned with opening delimiter
foo = function_name(var_one, var_two,
                    var_three, var_four)

# Add 4 spaces and an extra level of indentation to distinguish the arguments with the rest
def function_name(
        var_one, var_two,
        var_three, var_four):
    print(var_one)

# Hanging indent should add a level
foo = function_name(
    var_one, var_two,
    var_three, var_four)

# WRONG:

# Arguments on first line forbidden when not using vertical alignment.
foo = long_function_name(var_one, var_two,
    var_three, var_four)

# Further indentation required as indentation is not distinguishable.
def long_function_name(
    var_one, var_two, var_three,
    var_four):
    print(var_one)

Adding two blank lines before functions and classes
- Top level function and classes are separated with two blank lines
- Method definitions inside class should be separated with one blank line
- Extra blank lines may be used sparingly to separate a group of related functions
- Use blank lines in functions sparingly, to indicate logical sections

# Top level function and class are separated with two blank lines

# Function definitions within a class are separated with single blank line

class MyParentClass:

deffunction_one(var_one, var_two):

print(var_one)

print(var_two)

deffunction_two(var_one, var_two):

print(var_one + var_two)

# Top level function and class are separated with two blank lines # Function definitions within a class are separated with single blank line class MyParentClass: def function_one(var_one, var_two): print(var_one) print(var_two) def function_two(var_one, var_two): print(var_one + var_two)

# Top level function and class are separated with two blank lines
# Function definitions within a class are separated with single blank line
class MyParentClass:

    def function_one(var_one, var_two):
        print(var_one)
        print(var_two)

    def function_two(var_one, var_two):
        print(var_one + var_two)

Limiting Line Lengths
- Limit all lines to a maximum of 79 characters
- For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be restricted to 72 characters
Imports
- Imports should usually be in separate lines
- Wildcard imports should be avoided, as it will not give a clear picture of what names present in namespace

# Correct:

from pyspark.sql import SparkSession

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

import sys

import os

# Wrong:

import sys, os

from pyspark.sql import *

# Correct: from pyspark.sql import SparkSession from pyspark.sql.types import StructType, StructField, IntegerType, StringType import sys import os # Wrong: import sys, os from pyspark.sql import *

# Correct:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import sys
import os

# Wrong:
import sys, os
from pyspark.sql import *

Naming Conventions

Do not use ‘l’, ‘O’, ‘I’ as a single variable name. As all these characters appear similar to numeric 1 and 0 in some fonts, it may cause confusion.
Generally, it is recommended to use short names. In some cases, underscores can be used for better readability
Prefixing a single underscore (_) has support for protecting module variables
Prefixing a double underscore (__) to an instance or method makes it private to its class

Type	Naming Convention	Example
Function	Use lower case variables separated with underscores	myfunction, my_function
Variable	Use lowercase letters or word, or word separated with underscores	x, my_variable
Class	Use Pascal casing. Keep first letter of each sentence in uppercase. Do not separate words with underscores	MyClass, ProcessCustomerData
Method	Use lower case words separated with underscores	get_customer_data
Constant	Use upper case letter, word or words separated with underscore	X, LOCATION, LOCATION_COUNT
Module	Use short lower-case words separated with underscores	support_module.py
Package	Use short lower-case words without underscores	supportmodule,mypackage

General Recommendations

Comparing singletons
- Use ‘is’ while comparing singletons
- Use ‘is not’ instead of ‘not … is’

# Correct:

if foo isnotNone:

do_something()

# Wrong:

if foo == None:

do_something()

# Wrong:

ifnot foo isNone:

do_something()

# Correct: if foo is not None: do_something() # Wrong: if foo == None: do_something() # Wrong: if not foo is None: do_something()

# Correct:
if foo is not None:
    do_something()

# Wrong:
if foo == None:
    do_something()

# Wrong:
if not foo is None:
    do_something()

Avoid lambda expression

Always use ‘def’ statement instead of an assignment statement for anonymous (lambda) expressions.

# Correct:

defmultiply(x):

return2*x

# Wrong:

f = lambda x: 2*x

# Correct: def multiply(x): return 2*x # Wrong: f = lambda x: 2*x

# Correct:
def multiply(x):
    return 2*x

# Wrong:
f = lambda x: 2*x

Deriving Exceptions
- Derive exceptions from ‘Exception’ instead of ‘BaseException’
- Use explicit exception catching as much as possible. Avoid implicit exception catching.
- Keep ‘try’ section logic as simple as possible

# Correct

try:

import platform_specific_module

except ImportError:

platform_specific_module = None

else:

do_something()

# Wrong:

try:

import platform_specific_module

do_something()

except ImportError:

platform_specific_module = None

# Correct try: import platform_specific_module except ImportError: platform_specific_module = None else: do_something() # Wrong: try: import platform_specific_module do_something() except ImportError: platform_specific_module = None

# Correct
try:
    import platform_specific_module

except ImportError:
    platform_specific_module = None

else:
    do_something()


# Wrong:
try:
    import platform_specific_module
    do_something()
except ImportError:
    platform_specific_module = None

Boolean Comparison
- Booleans are already Booleans. They don’t need comparisons.
- For sequences (e.g., Lists), use the fact that the empty sequences represent false

# Correct:

if is_active_customer:

do_something()

# Wrong:

if is_active_customer == True:

do_something()

# Wrong:

if is_active_customer is True:

do_something()

# Wrong: If the list is empty, it represents FALSE. So, no need to check the length of the list

iflen(customer_list) != 0:

do_something()

# Correct: if is_active_customer: do_something() # Wrong: if is_active_customer == True: do_something() # Wrong: if is_active_customer is True: do_something() # Wrong: If the list is empty, it represents FALSE. So, no need to check the length of the list if len(customer_list) != 0: do_something()

# Correct:
if is_active_customer:
    do_something()

# Wrong:
if is_active_customer == True:
    do_something()

# Wrong:
if is_active_customer is True:
    do_something()

# Wrong: If the list is empty, it represents FALSE. So, no need to check the length of the list
if len(customer_list) != 0:
    do_something()

Databricks – Best Practices

Avoid print statements. Use logging module.
Ensure reusability of code modules throughout the file. Use existing reusable components instead of creating new functions redundantly.
While using recursive function with Spark, make sure it has appropriate break statement. Otherwise, it will lead of over-utilization of resources
Usernames, Passwords, Hostnames should not be maintained in direct python file or notebook. Sensitive information should be managed in secure vault, and it should be referenced in python file or notebook using keys.
While using SQL statements, it is recommended to assign the SQL statement to a variable. And use the variable in Spark SQL API.

# Correct:

sql_query = 'SELECT col_1, col_2, col_3 FROM table'

df_data = spark.sql(sql_query)

# Bad:

df_data = spark.sql('SELECT col_1, col_2, col_3 FROM table')

# Correct: sql_query = 'SELECT col_1, col_2, col_3 FROM table' df_data = spark.sql(sql_query) # Bad: df_data = spark.sql('SELECT col_1, col_2, col_3 FROM table')

# Correct:
sql_query = 'SELECT col_1, col_2, col_3 FROM table'
df_data = spark.sql(sql_query)

# Bad:
df_data = spark.sql('SELECT col_1, col_2, col_3 FROM table')

Refactor complex chaining of expressions. It’s recommended to apply multi-line expressions with different types, especially if they have different behaviours and context. Example is, mixing column creation or joining with selecting and filtering.

# Bad:

df = (

.select('a', 'b', 'c', 'key')

.filter(F.col('a') == 'truthiness')

.withColumn('boverc', F.col('b') / F.col('c'))

.join(df2, 'key', how='inner')

.join(df3, 'key', how='left')

.drop('c')

)

# Better (separating into steps):

# Step 1: we select and trim down the data that we need

# Step 2: we create the columns that we need to have

# Step 3: joining with other dataframes

df = (

.select('a', 'b', 'c', 'key')

.filter(F.col('a') == 'truthiness')

)

df = df.withColumn('boverc', F.col('b') / F.col('c'))

df = (

.join(df2, 'key', how='inner')

.join(df3, 'key', how='left')

.drop('c')

)

# Bad: df = ( df .select('a', 'b', 'c', 'key') .filter(F.col('a') == 'truthiness') .withColumn('boverc', F.col('b') / F.col('c')) .join(df2, 'key', how='inner') .join(df3, 'key', how='left') .drop('c') ) # Better (separating into steps): # Step 1: we select and trim down the data that we need # Step 2: we create the columns that we need to have # Step 3: joining with other dataframes df = ( df .select('a', 'b', 'c', 'key') .filter(F.col('a') == 'truthiness') ) df = df.withColumn('boverc', F.col('b') / F.col('c')) df = ( df .join(df2, 'key', how='inner') .join(df3, 'key', how='left') .drop('c') )

# Bad:
df = (
    df
    .select('a', 'b', 'c', 'key')
    .filter(F.col('a') == 'truthiness')
    .withColumn('boverc', F.col('b') / F.col('c'))
    .join(df2, 'key', how='inner')
    .join(df3, 'key', how='left')
    .drop('c')
)

# Better (separating into steps):
# Step 1: we select and trim down the data that we need
# Step 2: we create the columns that we need to have
# Step 3: joining with other dataframes
df = (
    df
    .select('a', 'b', 'c', 'key')
    .filter(F.col('a') == 'truthiness')
)
df = df.withColumn('boverc', F.col('b') / F.col('c'))

df = (
    df
    .join(df2, 'key', how='inner')
    .join(df3, 'key', how='left')
    .drop('c')
)

Use select statement to specify a schema contract. Doing a select at the beginning of the transform, or before returning, is considered a good practice. Any select should be seen as a cleaning operation that is preparing the dataframe for consumption by the next step in the transform.

Keep select statements as simple as possible. Due to common SQL idioms, allow only one function from spark.sql.function to be used per selected column, plus an optional .alias() to give it a meaningful name.

Expressions involving more than one dataframe, or conditional operations like .when() are discouraged to be used in a select, unless required for a performance reasons.

# Good:

aircraft = aircraft.select(

'aircraft_id',

'aircraft_msn',

'aircraft_type',

'operator_code',

F.col('aircraft_registration').alias('registration'),

F.col('number_of_economy_seats').cast('long'),

F.col('number_of_business_seats').cast('long'),

F.avg('staleness').alias('avg_staleness'),

F.avg('flight_hours').alias('avg_flight_hours'),

)

# Bad:

aircraft = aircraft.select(

'aircraft_id',

'aircraft_msn',

F.col('aircraft_registration').alias('registration'),

'aircraft_type',

F.avg('staleness').alias('avg_staleness'),

F.col('number_of_economy_seats').cast('long'),

F.avg('flight_hours').alias('avg_flight_hours'),

'operator_code',

F.col('number_of_business_seats').cast('long'),

)

# Good: aircraft = aircraft.select( 'aircraft_id', 'aircraft_msn', 'aircraft_type', 'operator_code', F.col('aircraft_registration').alias('registration'), F.col('number_of_economy_seats').cast('long'), F.col('number_of_business_seats').cast('long'), F.avg('staleness').alias('avg_staleness'), F.avg('flight_hours').alias('avg_flight_hours'), ) # Bad: aircraft = aircraft.select( 'aircraft_id', 'aircraft_msn', F.col('aircraft_registration').alias('registration'), 'aircraft_type', F.avg('staleness').alias('avg_staleness'), F.col('number_of_economy_seats').cast('long'), F.avg('flight_hours').alias('avg_flight_hours'), 'operator_code', F.col('number_of_business_seats').cast('long'), )

# Good:   
aircraft = aircraft.select(
    'aircraft_id',
    'aircraft_msn',
    'aircraft_type',
    'operator_code',
    F.col('aircraft_registration').alias('registration'),
    F.col('number_of_economy_seats').cast('long'),
    F.col('number_of_business_seats').cast('long'),
    F.avg('staleness').alias('avg_staleness'),
    F.avg('flight_hours').alias('avg_flight_hours'),
)

# Bad:
aircraft = aircraft.select(
    'aircraft_id',
    'aircraft_msn',
    F.col('aircraft_registration').alias('registration'),
    'aircraft_type',
    F.avg('staleness').alias('avg_staleness'),
    F.col('number_of_economy_seats').cast('long'),
    F.avg('flight_hours').alias('avg_flight_hours'),
    'operator_code',
    F.col('number_of_business_seats').cast('long'),
)

Instead of using withColumnRenamed(), use aliases. Also, instead of using withColumn() to redefine type, cast it in the select.

# Good:

df.select('key', F.col('comments').alias('num_comments'))

# Good:

df.select(F.col('comments').cast('double'))

# Bad:

df.select('key', 'comments').withColumnRenamed('comments', 'num_comments')

# Bad:

df.select('comments').withColumn('comments', F.col('comments').cast('double'))

# Good: df.select('key', F.col('comments').alias('num_comments')) # Good: df.select(F.col('comments').cast('double')) # Bad: df.select('key', 'comments').withColumnRenamed('comments', 'num_comments') # Bad: df.select('comments').withColumn('comments', F.col('comments').cast('double'))

# Good:
df.select('key', F.col('comments').alias('num_comments'))

# Good:
df.select(F.col('comments').cast('double'))

# Bad:
df.select('key', 'comments').withColumnRenamed('comments', 'num_comments')

# Bad:
df.select('comments').withColumn('comments', F.col('comments').cast('double'))

In places where an empty column is required to be added to satisfy the schema, always use F.lit(None) function for populating empty column. Never use an empty string or other value that represent empty like “NA”, “N/A”, “Nil”.

Though it is semantically right to use so, the primary reason for recommending to use F.lit(None) is to preserve the ability to use utilities like isNull, instead of verifying empty string, “NA”, “N/A”, “Nil”.

# Good:

df = df.withColumn('foo', F.lit(None))

# Bad:

df = df.withColumn('foo', F.lit(''))

# Bad:

df = df.withColumn('foo', F.lit('NA'))

# Good: df = df.withColumn('foo', F.lit(None)) # Bad: df = df.withColumn('foo', F.lit('')) # Bad: df = df.withColumn('foo', F.lit('NA'))

# Good:
df = df.withColumn('foo', F.lit(None))

# Bad:
df = df.withColumn('foo', F.lit(''))

# Bad:
df = df.withColumn('foo', F.lit('NA'))

User-Defined Functions (UDF)

It is highly recommended to avoid UDFs in all situations, as it is less performant than native PySpark. In most cases, logic that necessitate a UDF can be refactored to get of UDF and use only native PySpark API.

Joins

It is recommended to be more cautious when using joins. For instance, when we perform a left join and the right-side table has multiple matches for a key, then the row will be duplicated as many times as there are matches. This will heavily impact the output of transformation job.

Always specify the how keyword explicitly, even if you are doing default inner join.

# Good:

telemetry = telemetry.join(sensor_data, 'vehicle_id', how='inner')

# Bad:

telemetry = telemetry.join(sensor_data, 'vehicle_id')

# Bad:

telemetry = telemetry.join(sensor_data, 'vehicle_id', 'inner')

# Good: telemetry = telemetry.join(sensor_data, 'vehicle_id', how='inner') # Bad: telemetry = telemetry.join(sensor_data, 'vehicle_id') # Bad: telemetry = telemetry.join(sensor_data, 'vehicle_id', 'inner')

# Good:
telemetry = telemetry.join(sensor_data, 'vehicle_id', how='inner')

# Bad:
telemetry = telemetry.join(sensor_data, 'vehicle_id')

# Bad:
telemetry = telemetry.join(sensor_data, 'vehicle_id', 'inner')

Revolutionize Your Business With Generative AI

From product design and software development to virtual agents, content creation, and reporting, GenAI is transforming business. Our AI experts help you unlock GenAI’s full potential and drive growth.

Let’s Get Started

Avoid using right-join. If you are about to use right-join, change the order of the dataframes in such a way to use a left-join instead. It is more intuitive since the dataframe you are doing operation on, is the one that you are performing your join around.

# Good:

telemetry = telemetry.join(sensors, on='vehicle_id', how='left')

# Bad:

sensors = sensors.join(telemetry, on='vehicle_id', how='right')

# Good: telemetry = telemetry.join(sensors, on='vehicle_id', how='left') # Bad: sensors = sensors.join(telemetry, on='vehicle_id', how='right')

# Good:
telemetry = telemetry.join(sensors, on='vehicle_id', how='left')

# Bad:
sensors = sensors.join(telemetry, on='vehicle_id', how='right')

Cache table/dataframe for re-usable tables

cache() is an Apache Spark transformation that can be used on RDD, Dataframe or Dataset when you perform more than one operation with this entity (RDD/Dataframe/Dataset). As cache() is a transformation operation, the caching operation takes place only when a Spark action (count, show, take or write) is also performed on the same dataframe, dataset or RDD in a single action,

df1 = spark.read.csv(input_path_1)

df2 = spark.read.csv(input_path_2)

df1.cache()# Transformation - Cache Dataframe df1

joined_df = df1.join(df2, df1.id==df2.id, how='inner')# Join Dataframe df1 & df2

filtered_df = joined_df.filter("id == 'ID100'")# Filter the joined Dataframe df1 for id 'ID100'

df1.count()# Call count on the cached Dataframe df1

filtered_df.show()# Show data out of filtered Dataframe filtered_df

df1 = spark.read.csv(input_path_1) df2 = spark.read.csv(input_path_2) df1.cache() # Transformation - Cache Dataframe df1 joined_df = df1.join(df2, df1.id==df2.id, how='inner') # Join Dataframe df1 & df2 filtered_df = joined_df.filter("id == 'ID100'") # Filter the joined Dataframe df1 for id 'ID100' df1.count() # Call count on the cached Dataframe df1 filtered_df.show() # Show data out of filtered Dataframe filtered_df

df1 = spark.read.csv(input_path_1)
df2 = spark.read.csv(input_path_2)
df1.cache()                                                 # Transformation - Cache Dataframe df1

joined_df = df1.join(df2, df1.id==df2.id, how='inner')      # Join Dataframe df1 & df2
filtered_df = joined_df.filter("id == 'ID100'")             # Filter the joined Dataframe df1 for id 'ID100'
df1.count()                                                 # Call count on the cached Dataframe df1
filtered_df.show()                                          # Show data out of filtered Dataframe filtered_df

In the above snippet, Dataframe df1 will be cached into memory only when df1.count() action is executed. df1.cache() does not initiate the caching operation on Dataframe df1.

df = spark.read.csv(input_file_path)

df.cache.take(10)# Calling take(10) on the dataframe, while caching it

df.count()# Call count() on the cached dataframe df

df = spark.read.csv(input_file_path) df.cache.take(10) # Calling take(10) on the dataframe, while caching it df.count() # Call count() on the cached dataframe df

df = spark.read.csv(input_file_path)
df.cache.take(10)                       # Calling take(10) on the dataframe, while caching it
df.count()                              # Call count() on the cached dataframe df

In the above snippet, Dataframe df will be cached into memory when take(10) action is executed. However, there is a catch in this action that only one partition will be cached. That is, take(10) process only 10 records and the partition associated to those 10 records, will only be cached and other partitions will not be cached. As a result, the next statement df.count() will create the dataframe df again. Rather df.cache.count() will cache the records from all partitions. Hence, it is recommended to use df.cache.count() wherever the use cases require to cache all data.

# Recommended

df = spark.table('input_table_data')

df.cache.count()# cache dataframe df

df.count()# call count return result from the cached dataframe

# Recommended df = spark.table('input_table_data') df.cache.count() # cache dataframe df df.count() # call count return result from the cached dataframe

# Recommended
df = spark.table('input_table_data')
df.cache.count()            # cache dataframe df
df.count()                  # call count return result from the cached dataframe

String Comparison

Databricks string comparison is case sensitive and cannot compare strings with different casing.

sql_query = "SELECT 'DAGSequence'='dagsequence' AS WithoutLowerUpper, \

LOWER('DAGSequence')='dagsequence' AS WithLowerCase, \

UPPER('DAGSequence')='DAGSEQUENCE' AS WithUpperCase"

df = spark.sql(sql_query)

df.show()

sql_query = "SELECT 'DAGSequence'='dagsequence' AS WithoutLowerUpper, \ LOWER('DAGSequence')='dagsequence' AS WithLowerCase, \ UPPER('DAGSequence')='DAGSEQUENCE' AS WithUpperCase" df = spark.sql(sql_query) df.show()

sql_query = "SELECT 'DAGSequence'='dagsequence' AS WithoutLowerUpper, \
    LOWER('DAGSequence')='dagsequence' AS WithLowerCase, \
    UPPER('DAGSequence')='DAGSEQUENCE' AS WithUpperCase"
df = spark.sql(sql_query)
df.show()

Result:

+———————+—————-+—————-+

|WithoutLowerUpper|WithLowerCase|WithUpperCase|

+———————+—————-+—————-+

| false| true| true|

+———————+—————-+—————-+

Table Partitioning

Delta tables in Databricks support partitioning which enhance performance. You can partition by a column if you expect data in that partition to be at least 1GB. If column cardinality is high, do not use that column for partitioning. For example, you partition by user ID and there are 1M distinct user IDs, partitioning would increase table load time. Example:

CREATE TABLE weather(

weather_date DATE,

location STRING,

location_type STRING,

temperature DOUBLE

) USING delta PARTITIONED BY(location)

(or)

CREATE TABLE weather(

location_type STRING,

temperature DOUBLE

) PARTITIONED BY(weather_date DATE,

location STRING)

CREATE TABLE weather( weather_date DATE, location STRING, location_type STRING, temperature DOUBLE ) USING delta PARTITIONED BY (location) (or) CREATE TABLE weather( location_type STRING, temperature DOUBLE ) PARTITIONED BY (weather_date DATE, location STRING)

CREATE TABLE weather(
    weather_date DATE,
    location STRING,
    location_type STRING,
    temperature DOUBLE
) USING delta PARTITIONED BY (location)

(or)

CREATE TABLE weather(
    location_type STRING,
    temperature DOUBLE
) PARTITIONED BY (weather_date DATE, 
                  location STRING)

Delta Lake performance using OPTIMIZE with ZORDER

Z-Ordering is an approach to collocate related information in the same set of files. The technique of co-locality is automatically applied by data-skipping algorithms in Delta Lake on Databricks, to greatly reduce the amount of data to be read. To Z-Order data, specify the columns to be ordered on in the ZORDER BY clause.

OPTIMIZE events

WHERE date >current_timestamp() - INTERVAL 1 day

ZORDER BY(event_type)

OPTIMIZE events WHERE date > current_timestamp() - INTERVAL 1 day ZORDER BY (event_type)

OPTIMIZE events
WHERE date > current_timestamp() - INTERVAL 1 day
ZORDER BY (event_type)

Use of Repartition hints for balancing partitions

Below are the different partitioning hint types,

COALESCE	Reduce the number of partitions to the specified number of partitions. It takes a partition number as a parameter
REPARTITION	Repartition to the specified number of partitions using the specified partitioning expressions. It takes partition number, column names, or both as parameters
REPARTITION_BY_RANGE	Repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters
REBALANCE	REBALANCE hint can be used to rebalance the query result output partitions, so that every partition is of a reasonable size. It can take column names as parameters. This hint will be useful when we need to write the result of the query to a table, to avoid too small/big files. This hint is ignored if AQE (Adaptive Query Execution) is not enabled.

Delete temporary table after notebook execution

Delete temporary tables that were created as intermediate tables during notebook execution. Deleting tables saves storage, especially if the notebook is scheduled daily.

spark.catalog.dropTempView('temp_view_name')

spark.sql('drop view temp_view_name')

spark.catalog.dropTempView('temp_view_name') spark.sql('drop view temp_view_name')

spark.catalog.dropTempView('temp_view_name')

spark.sql('drop view temp_view_name')

Use views when creating intermediate tables

If we need to create intermediate tables, use views to minimize storage usage and save costs. Views are session-oriented and will automatically remove tables from storage after the query execution. For optimal query performance, do not use joins or subqueries in views

df.createOrReplaceTempView('tmp_research_data')# this view will be available until active user session

df.createOrReplaceGlobalTempView('gv_research_data')# this view will be available until active spark session (cluster)

df.createOrReplaceTempView('tmp_research_data') # this view will be available until active user session df.createOrReplaceGlobalTempView('gv_research_data') # this view will be available until active spark session (cluster)

df.createOrReplaceTempView('tmp_research_data')         # this view will be available until active user session

df.createOrReplaceGlobalTempView('gv_research_data')    # this view will be available until active spark session (cluster)

PySpark – Coding Standards & Best Practices

by Saravanan Ponnaiah on September 30th, 2022 | ~ minute read

Purpose:

Code Layout:

Naming Conventions

General Recommendations

Databricks – Best Practices

Revolutionize Your Business With Generative AI

Tags

Leave a Reply

Saravanan Ponnaiah

Categories

Follow Us