Sushma Kulkarni Inspires Client Success with Passion, Integrity, and Leadership Excellence
https://blogs.perficient.com/2025/09/23/sushma-kulkarni-inspires-client-success-with-passion-integrity-and-leadership-excellence/ | Tue, 23 Sep 2025

Perficient unites the brightest minds to help the world’s most admired brands embrace AI-driven innovation with trust and care. Sushma Kulkarni, a delivery lead and solutions architect in Perficient’s Data Solutions practice, brings a mindset of continuous learning and strategic problem solving to every client engagement. Over the past year and a half, she has built trust as a leader through empathy and transparency, effectively bridging the gap between business stakeholders and technical teams. Continue reading to discover how her diverse industry experience and keen interest in emerging technologies have shaped her leadership approach.

What is your role? Describe a typical day in your life. 

As a delivery lead and solutions architect, I wear multiple hats. I am a project manager and a scrum master, ensuring that we deliver on our promises to clients. My job is well-aligned with what I like to do. My managers and leaders support my professional growth, and I’ve been pursuing certifications consistently for the past 20 months since joining Perficient. When managing projects, I dive into whatever technology is needed to stay ahead of the curve. This lets me relate to my team, appreciate their work, and see the challenges from their point of view. As a project manager, I believe my role goes beyond asking for updates. It’s about understanding my team’s obstacles and making their jobs easier and fairer.  

How does your finance background apply to your current role?

As you grow, you become a well-rounded professional. It helps to look at problems from various perspectives. My experience spans roles as an accountant, auditor, and SOX auditor, managing a wide range of responsibilities throughout my career. I started as a chartered accountant and became a certified public accountant (CPA) after moving to the U.S. Along the way, I earned several certifications, including the Certified Information Systems Auditor (CISA).

Eventually, I transitioned into the IT industry as a business analyst because I was working closely with IT and business teams. I became a middleman who could translate business requirements into technical language. For this role, I learned many new technologies through training and certifications. This background gives me a unique perspective because I can understand the challenges of both business users and IT teams, and I translate between them well. 

Whether big or small, how do you make a difference for our clients, colleagues, communities, or teams? 

I make a difference by understanding our clients’ needs and exceeding expectations. A key practice I follow is discussing work internally with my team before taking anything to the client. This review helps catch most issues early. It’s always good to have a different set of eyes review, as quality assurance (QA) helps here. When I encourage my colleagues to review each other’s work, I ask them if someone else has already checked it. Often, the second reviewer points out something that was missing. 

By the time we meet with the client, all work is thoroughly reviewed. Clients often say their job is easier because we’ve already done the legwork. We provide them with screenshots of our internal testing as evidence. Coming from an auditing background, I believe that doing the work is not enough. We should also be able to prove it. Showing the clients what we did, including the defects we found and fixed, builds confidence and reassures clients that nothing was overlooked. 

READ MORE: Perficient’s Quality Assurance Services

What are your proudest accomplishments, personally and professionally? Any milestone moments at Perficient?  

The first data transformation project I worked on at Perficient involved Databricks, which I initially knew very little about. Before my client interview, I spent an entire week preparing and took some training. The interview went well, and it was a great moment for me. They asked me a question I didn’t know the answer to, and I was honest in my response. Since I was applying for a project manager role, the interviewers didn’t expect me to know the intricacies of the technology. They appreciated my honesty and offered me the role.  

As a takeaway, I decided to learn Databricks. Within two months, I earned my first Databricks certification. Since then, I have completed four certifications. Passing the interview and delivering the project successfully were proud moments for me.  

What has your experience at Perficient taught you? 

My managers and leaders encourage me to take training during office hours and are super supportive of my professional development. I find this unique about Perficient because I enjoy learning and growing in my role. My managers motivate me, and I also encourage training and knowledge sharing among my teammates. This enthusiasm is contagious—everyone wants to participate, which is rewarding because we all benefit from it.  

I’m also an early riser and like to start my day by connecting with my colleagues personally. What sets Perficient apart is our appreciation for our colleagues’ time; no one is expected to work beyond certain hours. Our people and culture truly respect our global colleagues, no matter their location.  

Why did you choose Perficient? What keeps you here?

I joined Perficient because of my previous manager, who valued my work, so I felt at home right from the start. Everyone I’ve met is professional and yet personal. I can openly discuss problems with my teammates. I’m adaptive and flexible with them, making sure that no one feels pressured by their workload. When needed, we step in to help and offload tasks within the team. We are committed to timely delivery but can still accommodate our people.  

How have you grown your career at Perficient?  

I’ve earned several Databricks certifications—including Data Analyst Associate, Data Engineer Associate, Gen AI Engineer Associate, and Machine Learning Engineer Associate—to gain a broad understanding of upcoming technology. My recent project involved Azure, so I did the Microsoft Azure Fundamentals certification. Last year, I also completed the Certified SAFe® Agilist. Recently, I completed the Registered Product Owner certification. 

Beyond certifications, I regularly train to stay aligned with projects. This helps me communicate effectively with my team, understand their work, and contribute meaningfully. I have grown through the different projects I’m involved in, and working in our Data Solutions team has encouraged me to learn beyond any single technology through collaboration. My latest growth area has been in AI and machine learning, which is quite exciting. 

READ MORE: Perficient’s Award-Winning Culture Prioritizes Growth for Everyone 

How have your AI skills contributed to your growth?

AI has helped me save time on simple tasks, like uploading meeting transcripts and asking Scarlett or Microsoft Copilot to create a summary or draft an email for me. If I’m creating a presentation, I’ll ask AI to draft slides for me. I’ve also used AI to create project tasks and support our project management work. It’s not perfect, but it saves a lot of time—so why not use the technology? There are so many things it can do. 

With our leadership embracing an AI-first approach, I want to be using these technologies. My goal is to be equipped with AI and use it more often. That way, we are not just working hard but also working smart.  

LEARN MORE: Revolutionizing Work With Microsoft Copilot: A Game-Changer in AI Integration 

If you had to define yourself using one Perficient value, which would it be and why?  

I’d pick integrity and collaboration. Collaboration isn’t just with our teammates—it includes clients too; when I say, “my team,” I mean both Perficient and the client. Integrity means I’m upfront about what’s possible, and I deliver honestly.

What are you passionate about outside of work?  

I’m very fond of pets and animals. I like to go on long walks and visit animal sanctuaries. Along the way, I enjoy the beauty of nature—watching the trees, leaves, and flowers. I’m always happy to see the animals, whether they’re pets or wildlife.  

I also love music, mainly North Indian classical music, but I enjoy other genres too. We’re on screens almost all the time, so stepping away to listen to music is calming.   

SEE MORE PEOPLE OF PERFICIENT  

It’s no secret our success is because of our people. No matter the technology or time zone, our colleagues are committed to delivering innovative, end-to-end digital solutions for the world’s biggest brands, and we bring a collaborative spirit to every interaction. We’re always seeking the best and brightest to work with us. Join our team and experience a culture that challenges, champions, and celebrates our people.  

Learn more about what it’s like to work at Perficient at our Careers page. See open jobs or join our talent community for career tips, job openings, company updates, and more!  

Go inside Life at Perficient and connect with us on LinkedIn, YouTube, X, Facebook, and Instagram. 

Mastering Databricks Jobs API: Build and Orchestrate Complex Data Pipelines
https://blogs.perficient.com/2025/06/06/mastering-databricks-jobs-api-build-and-orchestrate-complex-data-pipelines/ | Fri, 06 Jun 2025

In this post, we’ll dive into orchestrating data pipelines with the Databricks Jobs API, empowering you to automate, monitor, and scale workflows seamlessly within the Databricks platform.

Why Orchestrate with Databricks Jobs API?

When data pipelines become complex, involving multiple steps like running notebooks, updating Delta tables, or training machine learning models, you need a reliable way to automate and manage them with ease. The Databricks Jobs API offers a flexible and efficient way to automate your jobs and workflows directly within Databricks or from external systems (for example, AWS Lambda or Azure Functions) using its API endpoints.

Unlike external orchestrators such as Apache Airflow or Dagster, which require separate infrastructure and integration, the Jobs API is built natively into the Databricks platform. And the best part? It doesn’t cost anything extra. The Databricks Jobs API allows you to fully manage the lifecycle of your jobs and workflows using simple HTTP requests.

Below is the list of API endpoints for the CRUD operations on the workflows:

  • Create: Set up new jobs with defined tasks and configurations via the POST /api/2.1/jobs/create endpoint. Define single- or multi-task jobs, specifying the tasks to be executed (e.g., notebooks, JARs, Python scripts), their dependencies, and the compute resources.
  • Retrieve: Access job details, check statuses, and review run logs using GET /api/2.1/jobs/get or GET /api/2.1/jobs/list.
  • Update: Change job settings such as parameters, task sequences, or cluster details through POST /api/2.1/jobs/update and /api/2.1/jobs/reset.
  • Delete: Remove jobs that are no longer required using POST /api/2.1/jobs/delete.

These full CRUD capabilities make the Jobs API a powerful tool to automate job management completely, from creation and monitoring to modification and deletion—eliminating the need for manual handling.
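
For instance, the retrieve operations can be driven with a few lines of Python. This is a minimal sketch rather than a full client: the workspace URL and token are placeholders, and the response fields used here (jobs, job_id, settings) follow the Jobs API 2.1 response format.

import requests

HOST = "https://<databricks-instance>.cloud.databricks.com"  # placeholder workspace URL
HEADERS = {"Authorization": "Bearer <Your-PAT>"}  # placeholder personal access token

# List the jobs in the workspace (only the first page of results is handled here)
jobs = requests.get(f"{HOST}/api/2.1/jobs/list", headers=HEADERS).json().get("jobs", [])
for job in jobs:
    print(job["job_id"], job["settings"]["name"])

# Fetch the full configuration of the first job returned above
if jobs:
    detail = requests.get(
        f"{HOST}/api/2.1/jobs/get", headers=HEADERS, params={"job_id": jobs[0]["job_id"]}
    ).json()
    print(detail["settings"])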

Key components of a Databricks Job

  • Tasks: Individual units of work within a job, such as running a notebook, JAR, Python script, or dbt task. Jobs can have multiple tasks with defined dependencies and conditional execution.
  • Dependencies: Relationships between tasks that determine the order of execution, allowing you to build complex workflows with sequential or parallel steps.
  • Clusters: The compute resources on which tasks run. These can be ephemeral job clusters created specifically for the job or existing all-purpose clusters shared across jobs.
  • Retries: Configuration to automatically retry failed tasks to improve job reliability (illustrated in the sketch after this list).
  • Scheduling: Options to run jobs on cron-based schedules, triggered events, or on demand.
  • Notifications: Alerts for job start, success, or failure to keep teams informed.
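
To show how a few of these components map onto a job payload, here is a minimal sketch. The job name, notebook path, and email address are made-up placeholders; the retry, timeout, schedule, and notification field names are drawn from the Jobs API 2.1 job settings.

# Python dict mirroring the JSON accepted by POST /api/2.1/jobs/create
job_spec = {
    "name": "Nightly-Refresh",  # hypothetical job name
    "tasks": [
        {
            "task_key": "refresh",
            "notebook_task": {"notebook_path": "/Users/name@email.com/refresh_notebook"},
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "max_retries": 2,  # retry a failed task up to two times
            "min_retry_interval_millis": 60000,  # wait one minute between attempts
            "timeout_seconds": 3600,  # fail the task if it runs longer than an hour
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "UTC",
    },
    "email_notifications": {"on_failure": ["name@email.com"]},
}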

Getting started with the Databricks Jobs API

Before leveraging the Databricks Jobs API for orchestration, ensure you have access to a Databricks workspace, a valid Personal Access Token (PAT), and sufficient privileges to manage compute resources and job configurations. This guide will walk through key CRUD operations and relevant Jobs API endpoints for robust workflow automation.

1. Creating a New Job/Workflow:

To create a job, you send a POST request to the /api/2.1/jobs/create endpoint with a JSON payload defining the job configuration.

{
  "name": "Ingest-Sales-Data",
  "tasks": [
    {
      "task_key": "Ingest-CSV-Data",
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 30 9 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  },
  "email_notifications": {
    "on_failure": [
      "name@email.com"
    ]
  }
}

This JSON payload defines a Databricks job that executes a notebook-based task on a newly provisioned cluster, scheduled to run daily at 9:30 AM UTC. The components of the payload are explained below:

  • name: The name of your job.
  • tasks: An array of tasks to be executed. A job can have one or more tasks.
    • task_key: A unique identifier for the task within the job. Used for defining dependencies.
    • notebook_task: Specifies a notebook task. Other task types include spark_jar_task, spark_python_task, spark_submit_task, pipeline_task, etc.
      • notebook_path: The path to the notebook in your Databricks workspace.
      • source: The source of the notebook (e.g., WORKSPACE, GIT).
    • new_cluster: Defines the configuration for a new cluster that will be created for this job run. You can also use existing_cluster_id to use an existing all-purpose cluster (though new job clusters are recommended).
      • spark_version, node_type_id, num_workers: Standard cluster configuration options.
  • schedule: Defines the job schedule using a cron expression and timezone.
  • email_notifications: Configures email notifications for job events.

To create a Databricks workflow, the above JSON payload can be included in the body of a POST request sent to the Jobs API’s create endpoint—either using curl or programmatically via the Python requests library as shown below:

Using Curl:

curl -X POST \
  https://<databricks-instance>.cloud.databricks.com/api/2.1/jobs/create \
  -H "Authorization: Bearer <Your-PAT>" \
  -H "Content-Type: application/json" \
  -d '@workflow_config.json' #Place the above payload in workflow_config.json

Using Python requests library:

import requests
import json

host = "https://<databricks-instance>.cloud.databricks.com"  # placeholder workspace URL
token = "<Your-PAT>"  # placeholder personal access token
your_json_payload = json.load(open("workflow_config.json"))  # the job configuration shown above

create_response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=your_json_payload,
)
if create_response.status_code == 200:
    job_id = create_response.json()["job_id"]
    print(f"Job created with id: {job_id}")
else:
    print(f"Job creation failed with status code: {create_response.status_code}")
    print(create_response.text)

The above example demonstrated a basic single-task workflow. However, the full potential of the Jobs API lies in orchestrating multi-task workflows with dependencies. The tasks array in the job payload allows you to configure multiple dependent tasks.
For example, the following workflow defines three tasks that execute sequentially: Ingest-CSV-Data → Transform-Sales-Data → Write-to-Delta.

{
  "name": "Ingest-Sales-Data-Pipeline",
  "tasks": [
    {
      "task_key": "Ingest-CSV-Data",
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    },
    {
      "task_key": "Transform-Sales-Data",
      "depends_on": [
        {
          "task_key": "Ingest-CSV-Data"
        }
      ],
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/transform_sales_data",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    },
    {
      "task_key": "Write-to-Delta",
      "depends_on": [
        {
          "task_key": "Transform-Sales-Data"
        }
      ],
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/write_to_delta_notebook",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 30 9 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  },
  "email_notifications": {
    "on_failure": [
      "name@email.com"
    ]
  }
}

 



2. Updating Existing Workflows:

For modifying existing workflows, we have two endpoints: the update endpoint /api/2.1/jobs/update and the reset endpoint /api/2.1/jobs/reset. The update endpoint applies a partial update to your job: you can tweak parts of the job, like adding a new task or changing a cluster spec, without redefining the entire workflow. The reset endpoint, by contrast, completely overwrites the job configuration. When resetting a job, you must therefore provide the entire desired configuration, including any settings you wish to keep unchanged; anything you omit is removed. Let us go over a few examples to understand these endpoints better.
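
As a small sketch of that difference (placeholder host, token, and job id; the Python requests pattern mirrors the create example above), a partial rename sent to update leaves every other setting intact, while reset must be given the complete configuration:

import requests
import json

HOST = "https://<databricks-instance>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <Your-PAT>"}  # placeholder

# Partial update: only the name changes; tasks, clusters, and schedule are untouched
requests.post(
    f"{HOST}/api/2.1/jobs/update",
    headers=HEADERS,
    json={"job_id": 947766456503851, "new_settings": {"name": "Sales-Workflow-End-to-End"}},
)

# Reset: the payload must carry the entire desired configuration (e.g., the JSON shown in
# section 2.1 below), because anything omitted is removed from the job
full_settings = json.load(open("full_job_settings.json"))  # hypothetical file with the complete settings
requests.post(
    f"{HOST}/api/2.1/jobs/reset",
    headers=HEADERS,
    json={"job_id": 947766456503851, "new_settings": full_settings},
)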

2.1. Update Workflow Name & Add New Task:

Let us modify the above workflow by renaming it from Ingest-Sales-Data-Pipeline to Sales-Workflow-End-to-End, adding an input parameter source_location to the Ingest-CSV-Data task, and introducing a new task Write-to-Postgres, which runs after the successful completion of Transform-Sales-Data.

{
  "job_id": 947766456503851,
  "new_settings": {
    "name": "Sales-Workflow-End-to-End",
    "tasks": [
      {
        "task_key": "Ingest-CSV-Data",
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
          "base_parameters": {
            "source_location": "s3://<bucket>/<key>"
          },
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      },
      {
        "task_key": "Transform-Sales-Data",
        "depends_on": [
          {
            "task_key": "Ingest-CSV-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/transform_sales_data",
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      },
      {
        "task_key": "Write-to-Delta",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/write_to_delta_notebook",
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      },
      {
        "task_key": "Write-to-Postgres",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_postgres_notebook",
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      }
    ],
    "schedule": {
      "quartz_cron_expression": "0 30 9 * * ?",
      "timezone_id": "UTC",
      "pause_status": "UNPAUSED"
    },
    "email_notifications": {
      "on_failure": [
        "name@email.com"
      ]
    }
  }
}


2.2. Update Cluster Configuration:

Cluster startup can take several minutes, especially for larger, more complex clusters. Sharing the same cluster allows subsequent tasks to start immediately after previous ones complete, speeding up the entire workflow. Parallel tasks can also run concurrently, sharing the same cluster resources efficiently. Let’s update the above workflow to share a single job cluster across all tasks.

{
  "job_id": 947766456503851,
  "new_settings": {
    "name": "Sales-Workflow-End-to-End",
    "job_clusters": [
      {
        "job_cluster_key": "shared-cluster",
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      }
    ],
    "tasks": [
      {
        "task_key": "Ingest-CSV-Data",
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
          "base_parameters": {
            "source_location": "s3://<bucket>/<key>"
          },
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Transform-Sales-Data",
        "depends_on": [
          {
            "task_key": "Ingest-CSV-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/transform_sales_data",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Delta",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/write_to_delta_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Postgres",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_postgres_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      }
    ],
    "schedule": {
      "quartz_cron_expression": "0 30 9 * * ?",
      "timezone_id": "UTC",
      "pause_status": "UNPAUSED"
    },
    "email_notifications": {
      "on_failure": [
        "name@email.com"
      ]
    }
  }
}


2.3. Update Task Dependencies:

Let’s add a new task named Enrich-Sales-Data and update the dependencies as shown below:
Ingest-CSV-Data → Enrich-Sales-Data → Transform-Sales-Data → [Write-to-Delta, Write-to-Postgres]
Since we are updating the dependencies of existing tasks, we need to use the reset endpoint /api/2.1/jobs/reset.

{
  "job_id": 947766456503851,
  "new_settings": {
    "name": "Sales-Workflow-End-to-End",
    "job_clusters": [
      {
        "job_cluster_key": "shared-cluster",
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      }
    ],
    "tasks": [
      {
        "task_key": "Ingest-CSV-Data",
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/ingest_csv_notebook",
          "base_parameters": {
            "source_location": "s3://<bucket>/<key>"
          },
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Enrich-Sales-Data",
        "depends_on": [
          {
            "task_key": "Ingest-CSV-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/enrich_sales_data",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Transform-Sales-Data",
        "depends_on": [
          {
            "task_key": "Enrich-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/transform_sales_data",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Delta",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_delta_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Postgres",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_postgres_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      }
    ],
    "schedule": {
      "quartz_cron_expression": "0 30 9 * * ?",
      "timezone_id": "UTC",
      "pause_status": "UNPAUSED"
    },
    "email_notifications": {
      "on_failure": [
        "name@email.com"
      ]
    }
  }
}


The update endpoint is useful for minor modifications such as renaming the workflow, changing a notebook path or task input parameters, adjusting the job schedule, or tweaking cluster configurations (e.g., node count). The reset endpoint should be used for structural changes such as deleting existing tasks, redefining task dependencies, or renaming tasks.
The update endpoint does not delete tasks or settings you omit (tasks not mentioned in the request remain unchanged), while the reset endpoint removes any fields or tasks not included in the request.

3. Trigger an Existing Job/Workflow:

Use the /api/2.1/jobs/run-now endpoint to trigger a job run on demand. Pass the input parameters to your notebook tasks using the notebook_params field.

curl -X POST https://<databricks-instance>/api/2.1/jobs/run-now \
  -H "Authorization: Bearer <DATABRICKS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": 947766456503851,
    "notebook_params": {
      "source_location": "s3://<bucket>/<key>"
    }
  }'
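
The same trigger can be issued from Python; the sketch below uses placeholder host, token, and job id. The run_id in the response is what the status checks in the next section operate on.

import requests

HOST = "https://<databricks-instance>.cloud.databricks.com"  # placeholder
HEADERS = {"Authorization": "Bearer <DATABRICKS_TOKEN>"}  # placeholder

response = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers=HEADERS,
    json={
        "job_id": 947766456503851,
        "notebook_params": {"source_location": "s3://<bucket>/<key>"},
    },
)
run_id = response.json()["run_id"]  # needed to poll the run's status later
print(f"Triggered run {run_id}")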

4. Get Job Status:

To check the status of a specific job run, use the /api/2.1/jobs/runs/get endpoint with the run_id. The response includes details about the run, including its life cycle state (e.g., PENDING, RUNNING, TERMINATED) and, once it finishes, a result state (e.g., SUCCESS, FAILED).

curl -X GET \
  https://<databricks-instance>.cloud.databricks.com/api/2.1/jobs/runs/get?run_id=<your-run-id> \
  -H "Authorization: Bearer <Your-PAT>"

5. Delete Job:

To remove an existing Databricks workflow, call the POST /api/2.1/jobs/delete endpoint. This allows you to programmatically clean up outdated or unnecessary jobs as part of your pipeline management strategy.

curl -X POST https://<databricks-instance>/api/2.1/jobs/delete \
  -H "Authorization: Bearer <DATABRICKS_PERSONAL_ACCESS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "job_id": 947766456503851 }'

 

Conclusion:

The Databricks Jobs API empowers data engineers to orchestrate complex workflows natively, without relying on external scheduling tools. Whether you’re automating notebook runs, chaining multi-step pipelines, or integrating with CI/CD systems, the API offers fine-grained control and flexibility. By mastering this API, you’re not just building workflows—you’re building scalable, production-grade data pipelines that are easier to manage, monitor, and evolve.

Databricks on Azure versus AWS
https://blogs.perficient.com/2025/01/31/databricks-on-azure-versus-aws/ | Fri, 31 Jan 2025

As a Databricks Champion working for Perficient’s Data Solutions team, I spend most of my time installing and managing Databricks on Azure and AWS. The decision on which cloud provider to use is typically outside my scope since the organization has already made it. However, there are occasions when the client uses both hyperscalers or has not yet moved to the cloud. It is helpful in those situations to advise the client on the advantages and disadvantages of one platform over another from a Databricks perspective. I’m aware that I am skipping over the Google Cloud Platform, but I want to focus on the questions I am actually asked rather than questions that could be asked. I am also not advocating for one cloud provider over another. I am limiting myself to the question of AWS versus Azure from a Databricks perspective.

Advantages of Databricks on Azure

Databricks is a first-party service on Azure, which means it enjoys deep integration with the Microsoft ecosystem. Identity management in Databricks is integrated with Azure Active Directory (AAD) authentication, which can save time and effort in an area I have found difficult in large, regulated organizations. The same applies to deep integration with networking, private links, and Azure compliance frameworks. The value of this integration is amplified if the client also uses some combination of Azure Data Lake Storage (ADLS), Azure Synapse Analytics, or Power BI. The Databricks integration with these products on Azure is seamless. FinOps gets a boost in Azure for companies with an Azure Consumption Commitment (MACC), as Databricks’ costs can be applied against that number. Regarding cost management, Azure spot VMs can be used in some situations to reduce costs. Azure Databricks and ADLS Gen2/Blob Storage are optimized for high throughput, which reduces latency and improves I/O performance.

Disadvantages of Databricks in Azure

Databricks and Azure are tightly integrated within the Microsoft ecosystem. Azure Databricks uses Azure AD, role-based access control (RBAC), and network security groups (NSGs). These dependencies will require additional and sometimes complex configurations if you want to use a hybrid or multi-cloud approach. Some advanced networking configurations require enterprise licensing or additional manual configurations in the Azure Marketplace.

Advantages of Databricks on AWS

Azure is focused on seamless integration with Databricks, assuming the organization is a committed Microsoft shop. AWS takes the approach of providing more dials to tune in exchange for greater flexibility.  Additionally, AWS offers a broad selection of EC2 instance types, Spot Instance options, and scalable S3 storage, which can result in better cost and performance optimization. Finally, AWS has more instance types than Azure, including more options for GPU and memory-optimized workloads. AWS has a more flexible spot pricing model than Azure. VPC Peering, Transit Gateway, and more granular IAM security controls than Azure make AWS a stronger choice for organizations with advanced security requirements and/or organizations committed to multi-cloud or hybrid Databricks deployments. Many advanced features are released in AWS before Azure. Photon is a good example.

Disadvantages of Databricks in AWS

AWS charges for cross-region data transfers, and S3 read/write operations can become costly, especially for data-intensive workloads. This can result in higher networking costs. AWS also has weaker native BI Integration when you compare Tableau on AWS versus PowerBI on Azure.

Conclusion

Databricks is a strong data platform on all the major cloud providers. If your organization has already committed to a particular cloud provider, Databricks will work. However, I have been asked about the differences between AWS and Azure often enough that I wanted to get all my thoughts down in one place. Also, I recommend a multi-cloud strategy for most of our client organizations for Disaster Recovery and Business Continuity purposes.

Contact us to discuss the pros and cons of your planned or proposed Databricks implementation. We can help you navigate the technical complexities that affect security, cost, and BI integrations.

Omnichannel Analytics Simplified – Optimizely Acquires Netspring
https://blogs.perficient.com/2024/10/09/omnichannel-analytics-optimizely-netspring/ | Wed, 09 Oct 2024

Recently, the news broke that Optimizely acquired Netspring, a warehouse-native analytics platform.

I’ll admit, I hadn’t heard of Netspring before, but after taking a closer look at their website and capabilities, it became clear why Optimizely made this strategic move.

Simplifying Omnichannel Analytics for Real Digital Impact

Netspring is not just another analytics platform. It is focused on making warehouse-native analytics accessible to organizations of all sizes. As businesses gather more data than ever before from multiple sources – CRM, ERP, commerce, marketing automation, offline/retail – managing and analyzing that data in a cohesive way is a major challenge. Netspring simplifies this by enabling businesses to conduct meaningful analytics directly from their data warehouse, eliminating data duplication and ensuring a single source of truth.

By bringing Netspring into the fold, Optimizely has future-proofed its ability to leverage big data for experimentation, personalization, and analytics reporting across the entire Optimizely One platform.

Why Optimizely Acquired Netspring

Netspring brings significant capabilities that make it a best-in-class tool for warehouse-native analytics.

With Netspring, businesses can:

  • Run Product Analytics: Understand how users engage with specific products.
  • Analyze Customer Journeys: Dive deep into the entire customer journey, across all touchpoints.
  • Access Business Intelligence: Easily query key business metrics without needing advanced technical expertise or risking data inconsistency.

This acquisition means that data teams can now query and analyze information directly in the data warehouse, ensuring there’s no need for data duplication or exporting data to third-party platforms. This is especially valuable for large organizations that require data consistency and accuracy.


 


Ready to capitalize on these new features? Contact Perficient for a complimentary assessment!


The Growing Importance of Omnichannel Analytics

It’s no secret that businesses today are moving away from single analytics platforms. Instead, they are combining data from a wide range of sources to get a holistic view of their performance. It’s not uncommon to see businesses using a combination of tools like Snowflake, Google BigQuery, Salesforce, Microsoft Dynamics, Qualtrics, Google Analytics, and Adobe Analytics.
How?

These tools allow organizations to consolidate and analyze performance metrics across their entire omnichannel ecosystem. The need to clearly measure customer journeys, marketing campaigns, and sales outcomes across both online and offline channels has never been greater. This is where warehouse-native analytics, like Netspring, come into play.

Why You Need an Omnichannel Approach to Analytics & Reporting

Today’s businesses are increasingly reliant on omnichannel analytics to drive insights. Some common tools and approaches include:

  • Customer Data Platforms (CDPs): These platforms collect and unify customer data from multiple sources, providing businesses with a comprehensive view of customer interactions across all touchpoints.
  • Marketing Analytics Tools: These tools help companies measure the effectiveness of their marketing campaigns across digital, social, and offline channels. They ensure you have a real-time view of campaign performance, enabling better decision-making.
  • ETL Tools (Extract, Transform, Load): ETL tools are critical for moving data from various systems into a data warehouse, where it can be analyzed as a single, cohesive dataset.

The combination of these tools allows businesses to pull all relevant data into a central location, giving marketing and data teams a 360-degree view of customer behavior. This not only maximizes the return on investment (ROI) of marketing efforts but also provides greater insights for decision-making.

Navigating the Challenges of Omnichannel Analytics

While access to vast amounts of data is a powerful asset, it can be overwhelming. Too much data can lead to confusion, inconsistency, and difficulties in deriving actionable insights. This is where Netspring shines – its ability to work within an organization’s existing data warehouse provides a clear, simplified way for teams to view and analyze data in one place, without needing to be data experts. By centralizing data, businesses can more easily comply with data governance policies, security standards, and privacy regulations, ensuring they meet internal and external data handling requirements.

AI’s Role in Omnichannel Analytics

Artificial intelligence (AI) plays a pivotal role in this vision. AI can help uncover trends, patterns, and customer segmentation opportunities that might otherwise go unnoticed. By understanding omnichannel analytics across websites, mobile apps, sales teams, customer service interactions, and even offline retail stores, AI offers deeper insights into customer behavior and preferences.

This level of advanced reporting enables organizations to accurately measure the impact of their marketing, sales, and product development efforts without relying on complex SQL queries or data teams. It simplifies the process, making data-driven decisions more accessible.

Additionally, we’re looking forward to learning how Optimizely plans to leverage Opal, their smart AI assistant, in conjunction with the Netspring integration. With Opal’s capabilities, there’s potential to further enhance data analysis, providing even more powerful insights across the entire Optimizely platform.

What’s Next for Netspring and Optimizely?

Right now, Netspring’s analytics and reporting capabilities are primarily available for Optimizely’s experimentation and personalization tools. However, it’s easy to envision these features expanding to include content analytics, commerce insights, and deeper customer segmentation capabilities. As these tools evolve, companies will have even more ways to leverage the power of big data.

A Very Smart Move by Optimizely

Incorporating Netspring into the Optimizely One platform is a clear signal that Optimizely is committed to building a future-proof analytics and optimization platform. With this acquisition, they are well-positioned to help companies leverage omnichannel analytics to drive business results.

At Perficient, an Optimizely Premier Platinum Partner, we’re already working with many organizations to develop these types of advanced analytics strategies. We specialize in big data analytics, data science, business intelligence, and artificial intelligence (AI), and we see firsthand the value that comprehensive data solutions provide. Netspring’s capabilities align perfectly with the needs of organizations looking to drive growth and gain deeper insights through a single source of truth.

Ready to leverage omnichannel analytics with Optimizely?

Start with a complimentary assessment to receive tailored insights from our experienced professionals.

Connect with a Perficient expert today!
Contact Us

Smart Manufacturing, QA, Big Data, and More at The International Manufacturing Technology Show
https://blogs.perficient.com/2024/09/19/smart-manufacturing-qa-big-data-and-more-at-the-international-manufacturing-technology-show/ | Thu, 19 Sep 2024

For my first time attending the International Manufacturing Technology Show (IMTS), I must say it did not disappoint. This incredible event in Chicago happens every two years and is massive in size, taking up every main hall in McCormick Place. It was a combination of technology showcases, featuring everything from robotics to AI and smart manufacturing.

As a Digital Strategy Director at Perficient, I was excited to see the latest advancements on display representing many of the solutions that our company promotes and implements at the leading manufacturers around the globe. Not to mention, IMTS was the perfect opportunity to network with industry influencers as well as technology partners.

Oh, the People You Will Meet and Things You Will See at IMTS

Whenever you go to a show of this magnitude, you’re bound to run into someone you know. I was fortunate to experience the show with several colleagues, with a few of us getting to meet our Amazon Web Services (AWS) account leaders as well as Google and Microsoft.

The expertise of the engineers at each demonstration was truly amazing, specifically at one Robotic QA display. This robotic display was taking a series of pictures of automobile doors with the purpose of looking for defects. The data collected would go into their proprietary software for analysis and results. We found this particularly intriguing because we had been presented with similar use cases by some of our customers. We were so engrossed in talking with the engineers that our half-hour-long conversation felt like only a minute or two before we had to move on.

After briefly stopping to grab a pint—excuse me, picture—of the robotic bartender, we made our way to the Smart Manufacturing live presentation on the main stage. The ultra-tech companies presented their visions of the future with Manufacturing 5.0 and digital twins, featuring big data as a core component. It was reassuring to hear this, considering that big data is a strength of ours, reinforcing the belief that we need to continue focusing on these types of use cases. Along with big data, we should stay the course with trends shaping the industry like Smart Manufacturing, which at its roots is a combination of operations management, cloud, AI, and technology.

Smart Manufacturing Presentation at IMTS

Goodbye IMTS, Hello Future Opportunities with Robotics, AI, and Smart Manufacturing

Overall, IMTS was certainly a worthwhile investment. It provided a platform to connect with potential partners, learn about industry trends, and strengthen our relationships with technology partners. As we look ahead to future events, I believe that a focused approach, leveraging our existing partnerships and adapting to the evolving needs of the manufacturing industry, will be key to maximizing our participation.

If you’d like to discuss these takeaways from IMTS Chicago 2024 at greater depth, please be sure to connect with our manufacturing experts.
Hadoop Ecosystem Components
https://blogs.perficient.com/2022/08/10/hadoop-ecosystem-components/ | Wed, 10 Aug 2022

The Hadoop Ecosystem

The Hadoop ecosystem is a platform, or suite, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. The four major elements of Hadoop are HDFS, MapReduce, YARN, and Hadoop Common. Hadoop is a framework that enables the processing of large data sets that reside on clusters of machines. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.

Components that collectively form a Hadoop ecosystem:

  1. HDFS: Hadoop Distributed File System
  2. YARN: Yet Another Resource Negotiator
  3. MapReduce: Programming-based Data Processing
  4. Spark: In-Memory data processing
  5. PIG, HIVE: Query-based processing of data services
  6. HBase: NoSQL Database
  7. Mahout, Spark MLLib: Machine Learning algorithm libraries
  8. Zookeeper: Managing cluster
  9. Oozie: Job Scheduling


What is Hadoop?

Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data management platform with scale-out storage and distributed processing.

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

HDFS: Hadoop Distributed File System

  • HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework.
  • HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
  • HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
  • HDFS provides high throughput access to application data and is suitable for applications that have large data sets
  • HDFS consists of two core components :
    • Name Node
    • Data Node

Name Node:

  • Name Node, a master server, manages the file system namespace and regulates access to files by clients.
  • Maintains and manages the blocks which are present on the data nodes.
  • The Name Node is the primary node and contains the metadata.
  • Meta-data in Memory
    – The entire metadata is in the main memory
  • Types of Metadata
    – List of files
    – List of Blocks for each file
    – List of Data Nodes for each block
    – File attributes, example: creation time, replication
  • A Transaction Log
    – Records file creations, file deletions, etc.

Data Node:

  • Data Nodes, one per node in the cluster, manage storage attached to the nodes that they run on.
  • Data Nodes store the actual data; they are typically commodity hardware in the distributed environment (see the short HDFS command sketch after this list).
  • A Block Server
    • Stores data in the local file system
    • Stores meta-data of a block
    • Serves data and meta-data to Clients
    • Block Report
    • Periodically sends a report of all existing blocks to the Name Node
  • Facilitates Pipelining of Data
    • Forwards data to other specified Data Nodes
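
To see blocks and replicas in practice, the HDFS shell can be driven from Python. This is only a sketch: it assumes a configured Hadoop client is available on the PATH, and the file and directory names are made up.

import subprocess

def hdfs(*args):
    # Run an hdfs shell command and return its standard output
    return subprocess.run(["hdfs", *args], capture_output=True, text=True, check=True).stdout

# Copy a local file into HDFS, then list the target directory
hdfs("dfs", "-put", "sales.csv", "/data/sales.csv")
print(hdfs("dfs", "-ls", "/data"))

# fsck reports how the file was split into blocks and where the replicas live
print(hdfs("fsck", "/data/sales.csv", "-files", "-blocks", "-locations"))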


YARN: Yet Another Resource Negotiator

  • Apache YARN is Hadoop’s cluster resource management system.
  • YARN was introduced in Hadoop 2.0 to improve cluster resource utilization by decoupling resource management from MapReduce.
  • It handles the cluster of nodes and acts as Hadoop’s resource management unit. YARN allocates RAM, memory, and other resources to different applications.


YARN has two components :

Resource Manager

  • Global resource scheduler
  • Runs on the master node
  • Manages other Nodes
    • Tracks heartbeats from Node Manager
  • Manages Containers
    • Handles AM requests for resources
    • De-allocates containers when they expire, or the application completes
  • Manages Application Master
    • Creates a container from AM and tracks heartbeats
  • Manages Security

Node Manager

  • Runs on slave node
  • Communicates with RM
    • Registers and provides info on Node resources
    • Sends heartbeats and container status
  • Manages processes and container status
    • Launches AM on request from RM
    • Launches application process on request from AM
    • Monitors resource usage by containers.
  • Provides logging services to applications
    • Aggregates logs for an application and saves them to HDFS


MapReduce: Programming-based Data Processing

  • HDFS handles the distributed file system layer.
  • MapReduce is a programming model for data processing (see the word-count sketch after this list).
  • MapReduce:
    – Framework for parallel computing
    – Programmers get a simple API
    – Don’t have to worry about handling parallelization, data distribution, load balancing, or fault tolerance
  • Allows one to process huge amounts of data (terabytes and petabytes) on thousands of processors
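
To make the model concrete, here is a minimal word-count sketch written in the Hadoop Streaming style, where the mapper emits (word, 1) pairs and the reducer sums them. The sample input is made up, and the local shuffle-and-sort shown here only simulates what Hadoop performs across the cluster; submitting it to a real cluster would additionally require the hadoop-streaming jar, which is not shown.

from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: sum the counts for each word (input must be grouped by key)
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local stand-in for Hadoop's distributed shuffle-and-sort between map and reduce
    sample = ["big data on hadoop", "hadoop processes big data"]
    for word, count in reducer(mapper(sample)):
        print(word, count)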

Map Reduce Concepts (Hadoop-1.0)


Job Tracker

The Job Tracker is responsible for accepting jobs from clients, dividing those jobs into tasks, and assigning those tasks to be executed by worker nodes.

Task Tracker

Task-Tracker is a process that manages the execution of the tasks currently assigned to that node. Each Task Tracker has a fixed number of slots for executing tasks (two maps and two reduces by default).

Hadoop 2.0 Cluster Components

    • Split up the two major functions of Job Tracker
      • Cluster resource management
      • Application lifecycle management
    • Resource Manager
      • Global resource scheduler
      • Hierarchical queues
    • Node Manager
      • Per-machine agent
      • Manages the life cycle of the container
      • Container resource monitoring
    • Application Master
      • Per-application
      • Manages application scheduling and task execution
      • e.g., MapReduce Application Master

Hadoop as a Next-Gen Platform


Spark: In-Memory data processing

  • Spark is an open-source distributed processing system.
  • It is a cluster computing platform designed to be fast.
  • In-memory computation (RAM) increases the processing speed of applications.

Combines different processing types like

  • Batch processing
  • Streaming Data
  • Machine learning
  • Structured data
  • GraphX (graph data)

Batch Processing

It is the processing of big data at rest. You can filter, aggregate, and prepare very large datasets using long-running jobs in parallel. Batch jobs typically process data at a particular frequency or on a schedule.

Streaming data

Streaming, or real-time, data is data in motion. It can be processed as it arrives to provide useful information immediately, without waiting for a batch job.

Machine learning

Spark’s library for machine learning is called MLlib (Machine Learning library). It is heavily based on scikit-learn’s ideas of pipelines. The basic concepts used to create an ML model with this library are:

DataFrame: This ML API uses the DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types.

Structured data

It is data that has a schema, that is, a known set of fields. When there is no separation between the schema and the data, the data is said to be semi-structured.

RDD is an immutable data structure that distributes the data in partitions across the nodes in the cluster.
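
A short PySpark sketch ties these ideas together: an immutable RDD partitioned across the cluster and an in-memory DataFrame aggregation over structured data. It assumes a local Spark installation, and the sample records are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

# RDD: an immutable, partitioned collection; transformations produce a new RDD
rdd = spark.sparkContext.parallelize([("sales", 100), ("sales", 250), ("returns", 40)], 2)
totals = rdd.reduceByKey(lambda a, b: a + b)  # lazy transformation
print(totals.collect())  # the action triggers the in-memory computation

# DataFrame: structured data with a known schema
df = spark.createDataFrame([("sales", 100), ("sales", 250), ("returns", 40)], ["type", "amount"])
df.groupBy("type").sum("amount").show()

spark.stop()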


PIG, HIVE: Query-based processing of data services

PIG

Pig was developed by Yahoo to perform many data administration operations on Hadoop. It is a query-based tool that uses the Pig Latin language on top of Hadoop.

  • It is a platform for structuring the data flow, and processing and analyzing huge data sets.
  • Pig executes the commands while, in the background, all the MapReduce activity is taken care of. After processing, Pig stores the result in HDFS.
  • The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the way Java runs on the JVM.

Features of PIG

  • Provides support for data types – long, float, char array, schemas, and functions
  • Is extensible and supports User Defined Functions
  • Provides common operations like JOIN, GROUP, FILTER, SORT

HIVE

Most data warehouse applications are implemented on relational databases that use SQL as the query language. Hive is a data warehousing package built on top of Hadoop that lowers the barrier to moving these applications to Hadoop.

  • Hive processes structured and semi-structured data.
  • Internally, a Hive query is executed as a series of automatically generated MapReduce jobs.
  • The structured data it manages is used for data analysis.

HBase: NoSQL Database

  • Apache HBase is an open-source, distributed, versioned, fault-tolerant, scalable, column-oriented store modeled after Google’s Bigtable, with random real-time read/write access to data.
  • It’s a NoSQL database that runs on top of Hadoop as a distributed and scalable big data store.
  • It combines the scalability of Hadoop by running on the Hadoop Distributed File System (HDFS), with real-time data access as a key/value store and deep analytic capabilities of Map Reduce.

Mahout, Spark MLLib: Machine Learning algorithm libraries

  • Mahout provides an environment for creating machine learning applications that are scalable.
  • Mahout allows Machine Learnability to a system or application.
  • MLlib, Spark’s open-source distributed machine learning library.
  • MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.
  • It allows invoking algorithms as per our need with the help of its own libraries.

Zookeeper: Managing cluster

  • Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of various services in a Hadoop Ecosystem.
  • Apache Zookeeper coordinates with various services in a distributed environment.
  • It is an open-source, distributed, and centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services across the cluster.
  • Before Zookeeper, coordinating and synchronizing the resources and components of the Hadoop ecosystem was a huge management challenge; Zookeeper addresses this problem.

Oozie: Job Scheduling

  • Apache Oozie is a clock and alarm service inside the Hadoop ecosystem.
  • Oozie acts as a task scheduler: it combines multiple jobs sequentially into one logical unit of work.
  • Oozie is a workflow scheduler system that allows users to link jobs written on various platforms like MapReduce, Hive, and Pig, schedule jobs in advance, and create pipelines of individual jobs that are executed sequentially or in parallel to achieve a bigger task.

There are two kinds of Oozie jobs:

Oozie Workflow

Oozie workflow is a sequential set of actions to be executed.

Oozie Coordinator

These are Oozie jobs that are triggered by time and data availability.

 

EXPLORE TIME TRAVEL IN SNOWFLAKE
https://blogs.perficient.com/2022/07/25/explore-time-travel-in-snowflake/ | Mon, 25 Jul 2022

Have you ever wondered if it would ever be possible to time travel like in the old movies with a time machine? If we could go back in time and see the world, would we? If you asked me, I would have said yes! But not in the way that Hollywood portrays it in science fiction films. Today, we’ll look at one such feature. Get ready to time travel in the data world of Snowflake.

 

Introduction to Time Travel

 

“Snowflake Time Travel enables accessing historical data (i.e., data that has been changed or deleted) at any point within a defined period.” – Snowflake

 

Time Travel is one of the cool features that Snowflake provides to its users. It allows us to recover data that has been changed or deleted at any point within a specified time frame.

We can do some amazing things with this powerful feature, such as:

  • We can recover deleted objects such as tables, schemas, and databases. So there’s no need to worry about new employees accidentally deleting data.
  • Duplicating and backing up data from key points in the past was never as simple as it is now.
  • Examine data usage and manipulation over specified time periods.

 

NOTE:

Snowflake Time Travel is defined for databases, schemas, and tables. The data retention period parameter specifies how long we can view an object’s historical data. In all Snowflake editions, it is set to 1 day by default for all objects.

This parameter can be extended to 90 days for Enterprise and Business-Critical editions.

The parameter DATA_RETENTION_TIME_IN_DAYS controls an object’s Time Travel capability.

Once the Time Travel duration is exceeded, the object enters the Fail-safe region. If you need to retrieve an object while it is in Fail-safe, you must contact Snowflake itself.

[Image: Snowflake Time Travel feature]
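As a quick illustration of how the retention parameter is set in practice, here is a minimal sketch; the table name DEMO_DB.PUBLIC.EMPLOYEE is an assumption used only for illustration (the same table is created later in the example):

-- Check the retention period currently configured for the table (assumed name)
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN TABLE DEMO_DB.PUBLIC.EMPLOYEE;

-- Extend the Time Travel window (up to 90 days on Enterprise and Business-Critical editions)
ALTER TABLE DEMO_DB.PUBLIC.EMPLOYEE SET DATA_RETENTION_TIME_IN_DAYS = 30;

-- Setting the parameter to 0 effectively disables Time Travel for the object
ALTER TABLE DEMO_DB.PUBLIC.EMPLOYEE SET DATA_RETENTION_TIME_IN_DAYS = 0;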

The following SQL extensions have been implemented to support Time Travel:

  • The AT | BEFORE clause, which can be used in SELECT statements and CREATE… CLONE commands (immediately after the object name).

To pinpoint the exact historical data you want to access, the clause uses one of the following parameters:

  • TIMESTAMP
  • OFFSET (time difference in seconds from the present time)
  • STATEMENT (identifier for statement, e.g. query ID)

 

-- Select the data as of a specific query ID executed at a particular point in time
SELECT * FROM OUR_FIRST_DB.public.test BEFORE (STATEMENT => '01a58f86-3200-7cb6-0001-25ce0002d232');  -- query ID

-- Select the data as it was a number of seconds (minutes, hours) ago using an offset
SELECT * FROM OUR_FIRST_DB.public.test BEFORE (OFFSET => -300);  -- offset is expressed in seconds

-- Select the data as of a specified date and time
SELECT * FROM OUR_FIRST_DB.public.test AT (TIMESTAMP => '2022-12-07 00:57:35.967'::timestamp);  -- timestamp
  • UNDROP command for tables, schemas, and databases.

 

-- Restore a dropped table
UNDROP TABLE TABLENAME;

-- Restore a dropped schema
UNDROP SCHEMA SCHEMANAME;

-- Restore a dropped database
UNDROP DATABASE DATABASENAME;

 

Let us illustrate this with an example.

EXAMPLE OF TIME TRAVEL

 

  1. Create an employee table with a 4-day data retention period. Note that I am using the DEMO_DB database and PUBLIC schema.
create or replace table EMPLOYEE (empid int ,emp_name varchar(20) ) data_retention_time_in_days=4;
insert into EMPLOYEE values(1,'Shubham');
insert into EMPLOYEE values(2,'Chandan');
insert into EMPLOYEE values(3,'Simran');
insert into EMPLOYEE values(4,'Nikita');
insert into EMPLOYEE values(5,'Achal');
insert into EMPLOYEE values(6,'Aditi');
select * from EMPLOYEE;

[Screenshot: EMPLOYEE table after the initial inserts]

 

  2. After 5 minutes, I inserted another row with EMPID 7 as follows:
insert into EMPLOYEE values(7,'Shobit'); 
select * from EMPLOYEE;

[Screenshot: EMPLOYEE table after inserting EMPID 7]

 

  3. The table now has 7 rows, but let’s go back 5 minutes and see how the table looked.
select * from EMPLOYEE at(offset=>-60*5);

[Screenshot: Time Travel query result]

 

In this way, you can check the data the table holds in the past.

 

 

Final Reflections

This brings us to the conclusion of Snowflake Time Travel. This article has taught us what Time Travel is and how to use it in Snowflake. Additionally, I have demonstrated how to customize Snowflake retention settings at the table level. I hope you gained an overview of one of Snowflake’s most significant features.

Please share your thoughts and suggestions in the space below, and I’ll do my best to respond to all of them as time allows.

Refer to the official Snowflake documentation here if you want to learn more.

For more such blogs, click here.

]]>
https://blogs.perficient.com/2022/07/25/explore-time-travel-in-snowflake/feed/ 26 314421
Top 5 take-aways from Databricks Data – AI Summit 2022 https://blogs.perficient.com/2022/07/11/top-5-take-aways-from-databricks-data-ai-summit-2022/ https://blogs.perficient.com/2022/07/11/top-5-take-aways-from-databricks-data-ai-summit-2022/#comments Mon, 11 Jul 2022 05:48:40 +0000 https://blogs.perficient.com/?p=312871

The Data and AI Summit 2022 had enormous announcements for the Databricks Lakehouse platform. Among these were several exhilarating enhancements to Databricks Workflows, the fully managed orchestration service that is deeply integrated with the Databricks Lakehouse Platform, and to Delta Live Tables. With these new capabilities, Workflows enables data engineers, data scientists, and analysts to build reliable data, analytics, and ML workflows on any cloud without needing to manage complex infrastructure.

Following are the five most exciting and important announcements:

    1. Build Reliable Production Data and ML Pipelines with Git Support:

    We use Git to version control all of our code. With Git support in Databricks Workflows, you can use a remote Git reference as the source for tasks that make up a Databricks Workflow. This eliminates the risk of accidental edits to production code, removes the overhead of maintaining a production copy of the code in Databricks and keeping it updated, and improves reproducibility as each job run is linked to a commit hash. Git support for Workflows is available in Public Preview and works with a wide range of Databricks supported Git providers including GitHub, Gitlab, Bitbucket, Azure DevOps and AWS CodeCommit.

    [Image: Git support in Databricks Workflows]

     

    2. Orchestrate even more of the lakehouse with SQL tasks:

    Real-world data and ML pipelines consist of many different types of tasks working together. With the addition of SQL task type in Jobs, you can now orchestrate even more of the lakehouse. For example, we can trigger a notebook to ingest data, run a Delta Live Table Pipeline to transform the data, and then use the SQL task type to schedule a query and refresh a dashboard.

    3. Save Time and Money on Data and ML Workflows With “Repair and Rerun”:

    To support real-world data and machine learning use cases, organizations create sophisticated workflows with numerous tasks and dependencies, ranging from data ingestion and ETL to ML model training and serving. Each of these tasks must be completed in the correct order. However, when an important task in a workflow fails, it affects all downstream tasks. The new “Repair and Rerun” capability in Jobs addresses this issue by allowing you to run only failed tasks, saving you time and money.

    4. Easily share context between tasks:

    A task may sometimes be dependent on the results of a task upstream. Previously, in order to access data from an upstream task, it was necessary to store it somewhere other than the context of the job, like a Delta table.  The Task Values API now allows tasks to set values that can be retrieved by subsequent tasks. To facilitate debugging, the Jobs UI displays values specified by tasks.

    5. Delta Live Tables Announces New Capabilities and Performance Optimizations:

    DLT announces it is developing Enzyme, a performance optimization purpose-built for ETL workloads, and launches several new capabilities including Enhanced Autoscaling. DLT enables analysts and data engineers to quickly create production-ready streaming or batch ETL pipelines in SQL and Python. DLT simplifies ETL development by allowing you to define your data processing pipeline declaratively. DLT comprehends your pipeline’s dependencies and automates nearly all operational complexities.

    • UX improvements – The UI has been extended to make managing DLT pipelines easier, view errors, and provide access to team members with rich pipeline ACLs. An observability UI has also been added to show data quality metrics in a single view and to make it easier to schedule pipelines directly from the UI.
    • Schedule Pipeline button – DLT lets you run ETL pipelines continuously or in triggered mode. Continuous pipelines process new data as it arrives and are useful in scenarios where data latency is critical. However, many customers choose to run DLT pipelines in triggered mode to control pipeline execution and costs more closely. To make it easy to trigger DLT pipelines on a recurring schedule with Databricks Jobs, a ‘Schedule’ button has been added to the DLT UI so users can set up a recurring schedule in only a few clicks without leaving the DLT UI.
    • Change Data Capture (CDC) – With DLT, data engineers can easily implement CDC with a new declarative APPLY CHANGES INTO API, in either SQL or Python (a hedged SQL sketch follows this list). This new capability lets ETL pipelines easily detect source data changes and apply them to data sets throughout the lakehouse. DLT processes data changes into the Delta Lake incrementally, flagging records to insert, update, or delete when handling CDC events.
    • CDC Slowly Changing Dimensions (Type 2) – When dealing with changing data (CDC), we often need to update records to keep track of the most recent data. SCD Type 2 is a way to apply updates to a target so that the original data is preserved. For example, if a user entity in the database changes their phone number, we can store all previous phone numbers for that user. DLT supports SCD Type 2 for organizations that require maintaining an audit trail of changes. SCD Type 2 retains a full history of values: when the value of an attribute changes, the current record is closed, a new record is created with the changed data values, and this new record becomes the current record.
    • Enhanced Autoscaling (preview) – Sizing clusters manually for optimal performance given changing, unpredictable data volumes (as with streaming workloads) can be challenging and lead to overprovisioning. DLT employs an enhanced autoscaling algorithm purpose-built for streaming. DLT’s Enhanced Autoscaling optimizes cluster utilization while ensuring that overall end-to-end latency is minimized. It does this by detecting fluctuations in streaming workloads, including data waiting to be ingested, and provisioning the right number of resources needed (up to a user-specified limit). In addition, Enhanced Autoscaling will gracefully shut down clusters whenever utilization is low while guaranteeing the evacuation of all tasks to avoid impacting the pipeline. As a result, workloads using Enhanced Autoscaling save on costs because fewer infrastructure resources are used.
    • Automated Upgrade & Release Channels – Delta Live Tables (DLT) clusters use a DLT runtime based on the Databricks Runtime (DBR). Databricks automatically upgrades the DLT runtime about every one to two months, without requiring end-user intervention, and monitors pipeline health after the upgrade.
    • Announcing Enzyme, a new optimization layer designed specifically to speed up ETL – Transforming data to prepare it for downstream analysis is a prerequisite for most other workloads on the Databricks platform. While SQL and DataFrames make it relatively easy for users to express their transformations, the input data constantly changes, which requires re-computation of the tables produced by ETL. Enzyme efficiently keeps up to date a materialization of the results of a given query stored in a Delta table. It uses a cost model to choose between various techniques, including techniques used in traditional materialized views, delta-to-delta streaming, and manual ETL patterns commonly used by data engineers.
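    To make the APPLY CHANGES INTO API above more concrete, here is a minimal, hedged DLT SQL sketch; the table and column names (customers_silver, customers_cdc_feed, customer_id, operation, event_timestamp) are illustrative assumptions, not taken from the announcement:

    -- Target streaming table maintained by DLT (illustrative name)
    CREATE OR REFRESH STREAMING LIVE TABLE customers_silver;

    -- Apply CDC events from an assumed raw feed, keeping full history as SCD Type 2
    APPLY CHANGES INTO live.customers_silver
    FROM stream(live.customers_cdc_feed)
    KEYS (customer_id)                            -- key used to match source and target records
    APPLY AS DELETE WHEN operation = 'DELETE'     -- treat these change events as deletes
    SEQUENCE BY event_timestamp                   -- ordering column for late or out-of-order events
    COLUMNS * EXCEPT (operation, event_timestamp)
    STORED AS SCD TYPE 2;                         -- retain the history of changes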

     

    Learn more on –

    https://databricks.com/blog/2022/06/29/top-5-workflows-announcements-at-data-ai-summit.html

    New Delta Live Tables Capabilities and Performance Optimizations – The Databricks Blog

]]>
https://blogs.perficient.com/2022/07/11/top-5-take-aways-from-databricks-data-ai-summit-2022/feed/ 3 312871
How to implement incremental loading in Snowflake using Stream and Merge https://blogs.perficient.com/2022/07/06/how-to-implement-incremental-loading-in-snowflake-using-stream-and-merge/ https://blogs.perficient.com/2022/07/06/how-to-implement-incremental-loading-in-snowflake-using-stream-and-merge/#comments Wed, 06 Jul 2022 08:29:58 +0000 https://blogs.perficient.com/?p=312426

Snowflake is a cloud-hosted relational database used to create a data warehouse on demand. Data in the data warehouse can be loaded as a full load or an incremental load. A full load deletes all existing data and reloads it again; full loads are time- and resource-consuming compared to incremental loads, which load only the small amount of new or updated data instead of the full data set every time. We can achieve incremental loading in Snowflake by implementing change data capture (CDC) using Stream and Merge objects. A stream object captures change data, which includes inserts, updates, and deletes, as well as metadata about each change so that actions can be taken using the changed data. The data captured by the stream is then merged into the target table using matched and not-matched conditions.

 

What are Stream and Merge?

Merge

Merge is a command used to perform alterations on a table: update existing records, delete old/inactive records, or add new rows from another table. A minimal skeleton is sketched after the clause list below.

Snowflake offers two clauses to perform Merge:

  1. Matched Clause – performs the update and delete operations on the target table when rows satisfy the matching condition.
  2. Not Matched Clause – performs the insert operation when the matching condition is not satisfied; rows from the source table that are not matched with the target table will be inserted.
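As a rough, generic illustration of the two clauses (the table and column names below are placeholders, not taken from the walkthrough that follows):

-- Generic MERGE skeleton: rows from SOURCE_TABLE update or extend TARGET_TABLE
MERGE INTO TARGET_TABLE t
USING SOURCE_TABLE s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.name = s.name, t.city = s.city               -- Matched Clause
WHEN NOT MATCHED THEN
  INSERT (id, name, city) VALUES (s.id, s.name, s.city);    -- Not Matched Clause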

Stream

A stream is an object created on top of a source table to capture change data; it tracks the changes made to the source table’s rows.

The created stream object only holds an offset from which change data capture can be tracked; the main data in the source remains unaltered.

Three additional metadata columns are exposed by a stream on top of the source table’s columns:

  • METADATA$ACTION – may have only two values, INSERT or DELETE.
  • METADATA$ISUPDATE – flagged as TRUE if the change record is part of an update.
  • METADATA$ROW_ID – a unique hash key that is tracked against each change.

Now that we know what stream and merge are, let’s see how to use them to load the data.

Step 1 –

Connect to the Snowflake DB and create sample source and target tables.

[Screenshot: creating the source and target tables]
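The original screenshot is not reproduced here; a minimal sketch of what the two tables might look like follows. The table and column names (SOURCE_PERSON, TARGET_PERSON, personal_id, name, city) are assumptions based on the columns referenced later in the article:

-- Assumed source table that receives incoming person records
CREATE OR REPLACE TABLE SOURCE_PERSON (
    personal_id INT,
    name        VARCHAR(50),
    city        VARCHAR(50)
);

-- Assumed target table with the same structure
CREATE OR REPLACE TABLE TARGET_PERSON (
    personal_id INT,
    name        VARCHAR(50),
    city        VARCHAR(50)
);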

Step 2 –

Create a stream on the source table using the below query:

 

[Screenshot: CREATE STREAM on the source table]
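A hedged sketch of the stream creation, assuming the table name from the sketch above:

-- Create a stream that tracks DML changes on the assumed source table
CREATE OR REPLACE STREAM SOURCE_PERSON_STREAM ON TABLE SOURCE_PERSON;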

Step 3 –

Let’s insert some dummy data into the source table:

[Screenshot: inserting sample rows into the source table]
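For illustration, the inserts might look like the following; the values are invented placeholders except the city Nagpur, which is referenced later in the article:

-- Insert a few sample rows into the assumed source table
INSERT INTO SOURCE_PERSON VALUES (1, 'Amit', 'Nagpur');
INSERT INTO SOURCE_PERSON VALUES (2, 'Priya', 'Pune');
INSERT INTO SOURCE_PERSON VALUES (3, 'Rahul', 'Delhi');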

After inserting data into the source, let’s check the data captured in the stream:

[Screenshot: change records captured in the stream]
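Querying the stream (a sketch, using the assumed names above) shows the captured rows together with the metadata columns:

-- Inspect the change records captured by the stream
SELECT * FROM SOURCE_PERSON_STREAM;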

Since this is the first time we inserted data into the source, the newly inserted rows are flagged as INSERT in the METADATA$ACTION column, with METADATA$ISUPDATE set to FALSE in the stream.

 

Step 4 –

Insert data into the target using the stream and merge with the below query:

[Screenshot: MERGE from the stream into the target table]
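A hedged sketch of that merge, using the metadata columns described earlier; the object and column names are the assumptions introduced above, not the author’s actual query:

-- Merge the changes captured in the stream into the assumed target table
MERGE INTO TARGET_PERSON t
USING SOURCE_PERSON_STREAM s
  ON t.personal_id = s.personal_id
WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' AND s.METADATA$ISUPDATE = FALSE THEN
  DELETE                                               -- row deleted at the source
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' AND s.METADATA$ISUPDATE = TRUE THEN
  UPDATE SET t.name = s.name, t.city = s.city          -- row updated at the source
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
  INSERT (personal_id, name, city) VALUES (s.personal_id, s.name, s.city);   -- brand-new row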

Since we are inserting data for the first time, there will not be any matching personal_id in the target table, and because the METADATA$ACTION flag is INSERT, the merge command inserts all the data into the target table as is.

 

[Screenshot: target table after the initial load]

 

Step 5 –

Let’s update a few source rows and load them into the target again:

[Screenshot: updating rows in the source table]
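For example, the update behind this screenshot might look like the following, assuming the names above (the cities Nagpur and Mumbai come from the article itself):

-- Change a person's city from Nagpur to Mumbai in the assumed source table
UPDATE SOURCE_PERSON SET city = 'Mumbai' WHERE city = 'Nagpur';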

 

As soon as we update the source table, the stream will capture these changes and update the stream data.

 

[Screenshot: stream contents after the source update]

The updated row is marked as INSERT, and the older row that we updated is marked as DELETE in the METADATA$ACTION column, so when we load the updated data from source to target, the older row with the city Nagpur is deleted and the updated row with the city Mumbai is inserted.

Run the same merge command we used earlier to load only the updated data into the target; the updated target data will look like this:

[Screenshot: target table after the incremental load]

 

Here you have successfully achieved incremental loading using Snowflake.

To automate this load process, we can create a task that runs at a specified time interval and loads data into the target if there are any source changes. A hedged sketch of such a task is shown below.
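A minimal sketch of such a task, assuming the object names used above and a hypothetical warehouse named LOAD_WH:

-- Run the merge every hour, but only when the stream actually holds new change data
CREATE OR REPLACE TASK LOAD_PERSON_CHANGES
  WAREHOUSE = LOAD_WH
  SCHEDULE  = '60 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('SOURCE_PERSON_STREAM')
AS
  MERGE INTO TARGET_PERSON t
  USING SOURCE_PERSON_STREAM s
    ON t.personal_id = s.personal_id
  WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' AND s.METADATA$ISUPDATE = FALSE THEN DELETE
  WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' AND s.METADATA$ISUPDATE = TRUE THEN
    UPDATE SET t.name = s.name, t.city = s.city
  WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    INSERT (personal_id, name, city) VALUES (s.personal_id, s.name, s.city);

-- Tasks are created in a suspended state and must be resumed before they start running
ALTER TASK LOAD_PERSON_CHANGES RESUME;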

Happy Reading!

 

 

 

]]>
https://blogs.perficient.com/2022/07/06/how-to-implement-incremental-loading-in-snowflake-using-stream-and-merge/feed/ 5 312426
Nearshore Software Development in Action: What Delivery Success Looks Like https://blogs.perficient.com/2021/01/21/nearshore-software-development-in-action-what-delivery-success-looks-like/ https://blogs.perficient.com/2021/01/21/nearshore-software-development-in-action-what-delivery-success-looks-like/#respond Thu, 21 Jan 2021 18:24:55 +0000 https://blogs.perficient.com/?p=286529

The stakes to evolve your business and adapt to new realities have never been higher. How do you accelerate and scale your transformation cost-effectively? The answer – optimized global delivery. Follow this series and learn more about our nearshore software development capabilities from members of our global consulting team in Colombia.


When coordinating software development projects, you must consider multiple variables at any given time. Tack on virtual collaboration and delivery, and the project is further complicated.

[Photo: Gustavo Arroyave, Perficient Global Delivery]

Fortunately, our nearshore software development team, also known as Perficient Latin America, has more than 15 years’ experience with successful virtual project delivery. Throughout numerous client projects, we recognize that our deep-rooted philosophy and culture establishes a secure foundation for delivering projects that exceed our clients’ expectations.

In a conversation with Gustavo Arroyave, a technical delivery leader with Perficient Latin America, he shares more about the unique culture that elevates our team in comparison to other outsourcing software vendors. He also highlights successful outcomes with our long-term clients and the ways we’ve exceeded their expectations.

Shining a Light on Our Nearshore Software Development Team

What does collaboration look like within the context of how we deliver projects?

Gustavo: Our culture and philosophy are what make a difference [compared to] other vendors. The pillars for building successful client relationships include:

  1. Commitment to transparency with clients
  2. Constant communication with our peers on the client team
  3. Effective client participation

We don’t see our clients as different from us. We’re all part of a unique team.

For example, when we began working with one of our long-term customers, a product manager told me, “With other offshore delivery teams, it felt like they were either working for us, or we were working for them. But with Perficient, we’re collaborating on the same level – part of a single team.”

Part of our [employee] culture is empowering team members and making sure they’re comfortable communicating with customers to ask questions or provide constructive feedback. This makes a difference in our ability to collaborate and develop a partnership with them.

How do we build a delivery team that aligns our talent with clients’ goals?

Gustavo: The composition of every delivery team depends upon the project and goals of our customers.

Most teams have a similar composition, such as full-stack developers who can handle front-end to back-end development, or front-end and back-end specialists. These delivery roles are key because they involve building the product.

Delivery teams also include testers to verify what the development team builds, while requirement managers work with our developers to clarify any questions related to our customers’ business.

In some cases, delivery teams will also have UX or UI designers, DevOps to support infrastructure needs, and/or Scrum Masters.

However, it’s a discovery process when building a team. We need to understand our customers’ business needs and expectations. That way, our delivery is more transversal to ensure these needs are addressed, and the team will perform as expected.

Once we understand the goals and a specific target date, we have technical people participate in initial client discussions to understand or identify potential challenges. Then, we start defining a small team to begin building the product – usually around six people. From there, we can assess whether or not to increase the team size up to 10 or 12 within the project. It just depends on the client’s needs.

Nearshore agile teams can perform incredibly well if they have competent leaders who understand how to manage a virtual team. Learn more about successful techniques for leading nearshore agile teams.

Delivery in Action from Our Nearshore Development Team

Developing a Modern, Big Data Marketplace Platform

One of our established clients is a marketing technology company that delivers seamless data-driven marketing solutions to its customers. To remain competitive, the company must quickly develop and deploy new technologies and sought help to build a marketplace application.

After earlier attempts to outsource development, our client faced several hurdles to develop a minimum viable product (MVP).

Solution

We served as a true partner, implementing an Agile approach to development and successfully delivering the new marketplace application. The solution features a robust, big data backend that interacts with other parts of our client’s larger platform.

Results

Customers that use the marketplace, including agencies that work with advertisers, can easily access updated data and efficiently communicate with media planners and buyers.

Why did our client choose Perficient as its nearshore partner?

Gustavo: Finding a solid nearshore development partner was one of the key reasons that our client initially connected with us. The company had a very thorough selection process to find the right partner.

The marketplace application was our first project, and it’s still ongoing. This product is very technical and specific to the [client’s] business. Because we built the solution from scratch, we had the opportunity to demonstrate the strength of all our capabilities – not only the technical expertise but also with Agile.

Our delivery team showed how we live and breathe the Agile approach. We built the marketplace application incrementally and through iterations. Our client previously worked through dependencies in the process. For example, for some teams to build the front-end [of the application], they needed to have the backend ready. So then, the backend becomes a dependency for the front-end developers.

Since our client wanted an automated solution, the teams needed the frontend finished for the testers to implement automation. It could take up to three sprints for our client to build a feature.

By establishing an Agile process, our delivery team reduced the time required to build the feature. We started working on frontend, backend, and automation in parallel and within the same sprint. This shows how we brought innovation to the development process. And, this is part of what became standard for the rest of our client’s teams.

How did the client respond to our delivery approach?

Gustavo: The reduced development time impressed our client because we built front-end and back-end parts in parallel and automated them in one sprint.

The company also values our commitment to transparency. As mentioned earlier, this is key within our culture and way of working – to speak up even when things are not working as expected.

For a team to succeed, every person on the team – both our delivery and client team – needs to be committed and prepared to accept the challenge. If we see anyone who is not prepared, then we communicate that to our clients.

This is part of the feedback that we provide. We inform our clients of challenges with the projects, which may include people within client teams who aren’t leading in the way we expect. Similarly, we recognize and speak up if our team is under-performing and not delivering as expected. Then, we introduce actions to help the team improve, or we make changes to our delivery team so that we align the right person to the right challenge.

Paving the Way for Reliable, Safe Transportation

A fleet management company, which grew through several acquisitions, has a vast collection of safety products it develops, manufactures, and sells to public transportation providers. Our client has maintained its outsourcing partnership with Perficient Latin America since 2010.

Our delivery teams support the development of various products for the business, supplying expertise in machine learning, automation, user interface (UI), DevOps, and more.

Solutions

Among the products currently in development is a real-time alert and image recognition system for buses. Using external cameras, the system assesses the speed of passing vehicles near the bus and identifies lane infractions and other nearby cars. Based on data captured and analyzed by the system, an alert notifies drivers of these external risks, so they do not deploy the stop arm.

By applying our expertise from the previous use case, we’re using machine learning algorithms and data science to automate a surveillance system for illegally parked vehicles. In the future, this system will use recognition software that captures the vehicle’s information and automatically sends the evidence to a platform that will fine the violator.

Results

We anticipate that customers (transit providers and cities/municipalities) can improve safety within their communities and simplify their operations.

How has our partnership helped the client’s business?

Gustavo: Considering our client’s growth through mergers and acquisitions, we’ve been working alongside their teams, gaining a deeper knowledge of the business throughout these events.

Over the course of our partnership, some delivery teams are focused on supporting and improving existing products. We’ve removed redundancies and built efficient platforms, which has ultimately reduced operational costs for our client.

Meanwhile, our other delivery teams support the company’s new vision by developing the innovative solutions mentioned earlier. Building these innovations comes with challenges at times. However, our delivery approach and commitment to constant communication – not only with the product team but also with executives – are among the many reasons this client values our partnership and continues to bring more work to us.

If you’re evaluating nearshore partners for software development…

Our global delivery teams within Perficient Latin America are committed to a culture that emphasizes excellence, honesty, transparency, innovation, and the concept of failing forward. To facilitate successful virtual work with a nearshore partner, these characteristics are instrumental. Now more than ever, delivery teams must be set up and managed without being in the same room together.

As your nearshore development partner, we make this possible because of our culture that values fluid communication and collaboration.


Our delivery teams have proven experience working with US-based clients on complex, cloud-native product development. Learn more about outsourcing software development and finding the right fit with a nearshore development partner.

]]>
https://blogs.perficient.com/2021/01/21/nearshore-software-development-in-action-what-delivery-success-looks-like/feed/ 0 286529
[Podcast] Financial Services Trends and Data https://blogs.perficient.com/2021/01/15/financial-services-trends-and-data/ https://blogs.perficient.com/2021/01/15/financial-services-trends-and-data/#comments Fri, 15 Jan 2021 11:00:42 +0000 https://blogs.perficient.com/?p=286093

COVID-19 has undoubtedly affected financial services trends in 2020 and will continue to do so into 2021. Since the pandemic began, financial services organizations have been responding to the crisis with continuity plans to address everything from bankruptcies to people losing their jobs and the ability to pay their bills on time. Now more than ever, customers need to be supported with trust, transparency, and data-based decision making.

In season 1 episode 1 of the Intelligent Data Podcast, host Arvind Murali and his guest Scott Albahary, Perficient’s Chief Strategist of Financial Services, discuss financial services trends, how data is influencing change in the industry, and what you need to think about as recovery from the pandemic begins.

Listening Guide

Financial Services Trends and Data

  • Data and AI-based disruptions [2:14]
  • Customer intelligence and the “Universal Banker” [4:03]
  • Master data management and data governance adoption [6:10]
  • How is big data being used? [11:12]
  • How industry leaders are structuring their digital ecosystem [15:10]
  • Robo-advisors and wealth management [17:40]
  • Cybersecurity and data protection [22:59]
  • Post-COVID advice for personalization and digital interactions [23:50]

Get This Episode Where You Listen

And don’t forget to subscribe, rate and review!
Apple | Google | Spotify | Amazon | Stitcher | Pocket Casts

Connect with the Host and Guest

Arvind Murali, Perficient Principal and Chief Strategist

LinkedIn | Perficient

 

 

Scott Albahary, Perficient Chief Strategist, Financial Services

LinkedIn | Perficient

 

 

Learn More About Our FinServ Solutions

If you are interested in learning more about Perficient’s financial services capabilities or would like to contact us, click here.

]]>
https://blogs.perficient.com/2021/01/15/financial-services-trends-and-data/feed/ 1 286093
Introducing Intelligent Data, a podcast from Perficient https://blogs.perficient.com/2020/12/14/introducing-intelligent-data-a-podcast-from-perficient/ https://blogs.perficient.com/2020/12/14/introducing-intelligent-data-a-podcast-from-perficient/#respond Mon, 14 Dec 2020 12:00:44 +0000 https://blogs.perficient.com/?p=284841

The COVID-19 pandemic has done a great job revealing trouble spots and gaps in many companies’ technology strategies this year. If you’ve discovered that your data strategy and technology solutions need improvement, then this podcast is for you. Join host Arvind Murali, Principal and Chief Strategist of Data at Perficient, for the first season of Intelligent Data. Arvind and thought leaders will discuss the value of data within key industries and explore ways to keep your business moving forward.

The first episodes are coming later this month, but you can check out the trailer in the meantime. Subscribe to Intelligent Data on Apple, Google, Spotify, Amazon, or wherever you listen to podcasts.

What to Expect in Season 1 of Intelligent Data

Season one will include episodes around:

  • Data and customer experience trends in financial services
  • Data and analytics, AI, and data privacy in healthcare
  • The value of data in artificial intelligence
  • Artificial intelligence and machine learning trends
  • Business intelligence (BI) trends
  • Big data support for making business decisions
  • The influence of BI solutions on analytics and decision making
  • Interoperability, data compliance, and data governance in healthcare
  • The value of data in ecommerce, supply chain, and order management
  • Data’s influence on customer experience and design
  • The importance of collaboration between UX designers and data engineers

Subscribe to The Podcast

What are you waiting for? Subscribe now!

Apple | Google | Spotify | Amazon | Stitcher | Pocket Casts

]]>
https://blogs.perficient.com/2020/12/14/introducing-intelligent-data-a-podcast-from-perficient/feed/ 0 284841