Posts Tagged ‘Spark’

Apache Spark: Merging and Renaming Files

Apache Spark: Merging Files using Databricks

In data engineering and analytics workflows, merging files is a common task when a large dataset is spread across multiple files. Databricks, a platform for processing big data, supports Scala among other languages. In this blog post, we’ll look at how to merge files efficiently using Scala on Databricks. Introduction: Merging files entails combining the […]

Spark: RDD vs DataFrame vs Dataset

In Apache Spark, RDD, DataFrame, and Dataset are different abstractions for working with distributed data, including structured and semi-structured data. Here’s a brief definition of each: RDD (Resilient Distributed Dataset): RDD is the basic abstraction in Spark. It represents an immutable, distributed collection of objects that can be processed in parallel across a cluster. RDDs […]

Platforms and Technology, Technology Partners

Spark DataFrame: Writing into Files

This blog post explores how to write a Spark DataFrame to various file formats, saving data to external storage for further analysis or sharing. Before diving in, have a look at my other blog posts on creating a DataFrame, manipulating a DataFrame, and writing a DataFrame into tables and views. […]

Cloud, Databricks, Platforms and Technology, Technology Partners

Spark SQL Properties

The spark.sql.* properties are a set of configuration options specific to Spark SQL, the module within Apache Spark for processing structured data using SQL queries, the DataFrame API, and Datasets. These properties allow users to customize various aspects of Spark SQL’s behavior, optimization strategies, and execution environment. Here’s a brief introduction to some common spark.sql.* […]

Databricks, Platforms and Technology

Scala: mutable data structure

Scala, a programming language that combines object-oriented and functional programming paradigms, provides a variety of mutable data structures. Mutable collections such as ArrayBuffer and HashMap allow in-place modification, making them well suited to situations that demand high performance. They serve as mutable counterparts to their immutable equivalents. All the mutable Scala collections […]

Platforms and Technology

Scala: Immutable data structure

Scala, a programming language that combines object-oriented and functional programming paradigms, provides a variety of immutable data structures. Immutable data structures are those that cannot be modified after they are created, which can be beneficial for ensuring safety and simplicity in concurrent or parallel programming. Here are some commonly used immutable data structures in Scala: […]

Platforms and Technology

Date and Timestamp in Spark SQL

Spark SQL offers a set of built-in standard functions for handling dates and timestamps within the DataFrame API. These functions are valuable for performing operations involving date and time data. They accept inputs in various formats, including Date type, Timestamp type, or String. If the input is provided as a String, it must be in […]

Databricks, Platforms and Technology, Technology Partners

Spark DataFrame: Writing to Tables and Creating Views

In this blog post, we will look at methods for writing a Spark DataFrame into tables and creating views, both essential tasks for data processing and analysis. Before diving in, have a look at my other blog posts on creating and manipulating a DataFrame. Creating DataFrame: https://blogs.perficient.com/2024/01/10/spark-scala-approaches-toward-creating-dataframe/ Manipulating DataFrame: https://blogs.perficient.com/2024/02/15/spark-dataframe-basic-methods/ Dataset: The […]

Analytics, Databricks, Platforms and Technology

DBFS (Databricks File System) in Apache Spark

In the world of big data processing, efficient and scalable file systems play a crucial role. One such file system that has gained popularity in the Apache Spark ecosystem is DBFS, which stands for Databricks File System. In this blog post, we’ll explore what DBFS is and how it works, and provide examples to illustrate […]

Databricks, Platforms and Technology, Technology Partners

Spark: DataFrame Basic Methods

The DataFrame is a key abstraction in Spark that represents structured data and allows for easy manipulation and analysis. In this blog post, we’ll explore the basic DataFrame methods available in Spark and show, with examples, how they can be used for data processing tasks. Dataset: There are many DataFrame methods, which are classified into Transformation […]

Databricks, Platforms and Technology, Technology Partners

Spark: Dataframe joins

In Apache Spark, DataFrame joins are operations that allow you to combine two DataFrames based on a common column or set of columns. Join operations are fundamental for data analysis and manipulation, particularly when dealing with distributed and large-scale datasets. Spark provides a rich set of APIs for performing various types of DataFrame joins. Import […]

Platforms and Technology, Technology Partners

Spark: Parser Modes

Apache Spark is a powerful open-source distributed computing system widely used for big data processing and analytics. When working with structured data, one common challenge is dealing with parsing errors—malformed or corrupted records that can hinder data processing. Spark provides flexibility in handling these issues through parser modes, allowing users to choose the behavior that […]

Platforms and Technology, Technology Partners