Apache Spark for .Net - Kick Starter (Perficient Blogs)

Apache Spark:

Apache Spark is a general-purpose distributed processing engine for analytics over large data sets, typically terabytes or petabytes of data. It can be used for batch processing, real-time stream processing, machine learning and ad-hoc queries.

Processing tasks are distributed over a cluster of nodes, and data is cached in memory to reduce computation time.

.Net for Apache Spark:

.Net for Apache Spark is the free, open-source and cross-platform .Net support for the popular open-source big data analytics framework. With it, we can add the capabilities of Apache Spark to big data applications using languages that we already know, so developers with .Net experience can contribute to and benefit from big data analytics. .Net for Apache Spark provides high-performance APIs for using Spark from C# and F#. With C# and F#, we can access:

DataFrame and Spark SQL for working with structured data
Spark Structured Streaming for working with streaming content
Spark SQL for writing queries with familiar SQL syntax
Machine Learning integration for faster training and prediction
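
As a rough illustration of the first and third items, the sketch below runs the same aggregation once through the DataFrame API and once through Spark SQL. It assumes the development environment described later in this post is already set up, and that the input CSV has a header row with a "Status" column (borrowed from the sample query later in this post). The class name StatusSummary, the app name and the PackageCount alias are hypothetical names chosen for this example.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace DemoSparkConsole
{
    // Hypothetical class name, used only for this sketch.
    class StatusSummary
    {
        static void Main(string[] args)
        {
            if (args.Length < 1)
            {
                Console.Error.WriteLine("Usage: StatusSummary <csv_path>");
                return;
            }

            SparkSession spark = SparkSession.Builder().AppName("status_summary").GetOrCreate();

            // Assumption: the CSV has a header row that includes a "Status" column.
            DataFrame packages = spark.Read().Option("header", "true").Csv(args[0]);

            // DataFrame API: count rows per status, most frequent first.
            DataFrame byStatus = packages
                .GroupBy("Status")
                .Count()
                .OrderBy(Col("count").Desc());
            byStatus.Show();

            // Spark SQL: the same aggregation, written against a temporary view.
            packages.CreateOrReplaceTempView("Packages");
            spark.Sql(
                "SELECT Status, COUNT(*) AS PackageCount " +
                "FROM Packages GROUP BY Status ORDER BY PackageCount DESC").Show();

            spark.Stop();
        }
    }
}
```

Both forms should produce the same grouping; which one to use is largely a matter of taste, with the DataFrame API giving compile-time checking of method names and Spark SQL being easier to read for people coming from a SQL background.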

.Net for Apache Spark runs on Windows, Linux and macOS using .Net Core, which is itself a cross-platform framework. We can deploy our applications to all major cloud providers, including Azure HDInsight, Amazon EMR Spark, Azure Databricks and Databricks on AWS. For convenience, the samples in this post are written in C#, which should suit most readers.

Set up the development environment by following the instructions in the link below:

Get started with .NET for Apache Spark | Microsoft Docs

Once Spark and .Net for Apache Spark are installed on Windows, run the command below to confirm that Spark starts correctly:

Command prompt (administrator)> spark-shell

An active Spark session should start and be ready to use.

Create a console application targeting the .Net Core 3.1 framework.

Add the “Microsoft.Spark” NuGet package to the project.

Once the package is added, the jar libraries are listed in the Visual Studio project.

In the Program.cs class file, write the code below and save:

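using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace DemoSparkConsole
{
    class Program
    {
        static void Main(string[] args)
        {
            // The path of the CSV file is expected as the first command-line argument
            if (args.Length < 1)
            {
                Console.Error.WriteLine("Usage: DemoSparkConsole <input_file_name>");
                return;
            }
            string fileName = args[0];

            // Create (or reuse) the Spark session
            SparkSession spark = SparkSession.Builder().AppName("package_tracker").GetOrCreate();

            // Load CSV data (first row is the header) and register it as a temporary view
            spark.Read().Option("header", "true").Csv(fileName).CreateOrReplaceTempView("Packages");

            // Query the view with Spark SQL
            DataFrame sqlDf = spark.Sql("SELECT PackageId, SenderName, OriginCity, DestinationCity, Status, DeliveryDate FROM Packages");
            sqlDf.Show();         // Print the first rows to the console
            spark.Stop();         // Stop spark session
        }
    }
}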
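
The same projection can also be written with the DataFrame API instead of a SQL string. The sketch below is one possible equivalent, not part of the original sample: the class name ProgramDataFrameApi is hypothetical, and the filter value "Delivered" is an assumed example value for the Status column, since the contents of packages.csv are not shown here.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace DemoSparkConsole
{
    // Hypothetical class name, used only for this sketch.
    class ProgramDataFrameApi
    {
        static void Main(string[] args)
        {
            if (args.Length < 1)
            {
                Console.Error.WriteLine("Usage: DemoSparkConsole <input_file_name>");
                return;
            }

            SparkSession spark = SparkSession.Builder().AppName("package_tracker").GetOrCreate();

            // Read the CSV; with only the "header" option set, every column is read as a string.
            DataFrame packages = spark.Read().Option("header", "true").Csv(args[0]);

            // Equivalent of the SELECT in the SQL version: keep the same six columns.
            DataFrame selected = packages.Select(
                "PackageId", "SenderName", "OriginCity", "DestinationCity", "Status", "DeliveryDate");
            selected.Show();

            // Assumed example value: keep only rows whose Status equals "Delivered".
            selected.Filter(Col("Status").EqualTo("Delivered")).Show();

            spark.Stop();
        }
    }
}
```

A project can have only one entry point, so to try this, replace the body of Main in Program.cs rather than adding a second class with its own Main.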

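Build the project and navigate to the Debug output path. Locate the project DLL file and the microsoft-spark jar (for example, microsoft-spark-3-1_2.12-2.1.0.jar) that is responsible for running the Spark program. The package ships one jar per supported Spark version; use the one that matches your installed Spark version in the commands below.

Make sure that the Microsoft.Spark version matches the version of the Microsoft.Spark.Worker extracted to the local directory.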

Spark command to submit the job:

Syntax:

spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-3-0_2.12-<Version_of_.Net_Worker>.jar dotnet <ApplicationName>.dll <input_file_name>

Navigate to the console app's build output folder in PowerShell (Admin mode) and execute the command below:

spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-3-0_2.12-2.1.0.jar dotnet DemoSparkConsole.dll packages.csv

Input file to read: packages.csv

Output DataFrame:

by Saravanan Ponnaiah on June 21st, 2022