Dancing with Data
It was but a fortnight into 2024 AC (After Conquest) when the great council gathered to decide who would succeed Perficient’s 2023 Data & AI Summit attendees. Many claims were heard, but only a few were considered. The council was assembled to prevent a war from being fought over the succession, for […]
Posts Tagged ‘Databricks’
Einstein Personalization and Salesforce Connections 2024: AI Integration at the Forefront
Attending Salesforce Connections 2024 at McCormick Place in Chicago was an energizing experience, highlighting the forefront of AI integration in business operations. The event gathered industry leaders to explore the latest advancements in artificial intelligence, data integration, and commerce, with a clear focus on how these technologies are reshaping the business landscape. One of the […]
Apache Spark: Merging Files using Databricks
In data engineering and analytics workflows, merging files is a common task when managing large datasets distributed across multiple files. Databricks, a powerful platform for processing big data, offers first-class support for Scala. In this blog post, we’ll delve into how to merge files efficiently using Scala on Databricks. Introduction: Merging files entails combining the […]
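The excerpt above describes merging on Databricks with Scala; stripped of Spark, the core idea is simple enough to sketch in plain Python with the standard library. This is a minimal illustration of the concept, not the post's actual approach: combine several CSV parts that share a header into one output, keeping the header only once.

```python
import csv
import io


def merge_csv(parts: list[str]) -> str:
    """Concatenate CSV contents, keeping the header from the first part only."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    for text in parts:
        rows = list(csv.reader(io.StringIO(text)))
        if not header_written:
            writer.writerow(rows[0])   # header once, from the first part
            header_written = True
        writer.writerows(rows[1:])     # data rows from every part
    return out.getvalue()


merged = merge_csv(["id,name\n1,a\n", "id,name\n2,b\n"])
```

On Databricks the same pattern scales out: read the parts into a single DataFrame and write the result back as one dataset, letting Spark handle the distribution.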
Introduction to Star and Snowflake schema
In the world of data warehousing and business intelligence, two key concepts are fundamental: Snowflake and Star Schema. These concepts play a pivotal role in designing effective data models for analyzing large volumes of data efficiently. Let’s delve into what Snowflake and Star Schema are and how they are used in the realm of data […]
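To make the contrast concrete, here is a hypothetical product dimension for a sales warehouse modeled both ways. In a star schema the dimension is wide and denormalized; in a snowflake schema the same attributes are normalized into related sub-dimension tables. The table and column names are illustrative, not taken from the post.

```python
# Star schema: one denormalized dimension joined directly to the fact table.
star_product_dim = """
CREATE TABLE dim_product (
    product_key   INT,
    product_name  STRING,
    category_name STRING,  -- category folded into the dimension
    brand_name    STRING   -- brand folded into the dimension
)
"""

# Snowflake schema: the same dimension normalized into sub-dimensions,
# so queries need extra joins through the foreign keys.
snowflake_product_dim = """
CREATE TABLE dim_product (
    product_key  INT,
    product_name STRING,
    category_key INT,  -- foreign key to dim_category
    brand_key    INT   -- foreign key to dim_brand
)
"""
```

The trade-off follows directly: the star design favors fewer joins and faster reads, while the snowflake design reduces redundancy at the cost of join depth.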
Databricks strengthens MosaicAI with Lilac
Databricks has acquired LilacAI as it continues to strengthen its end-to-end data intelligence platform. The 2023 acquisition of MosaicML gave Databricks significant capabilities in the Generative AI space, with the ability to train and deploy Large Language Models (LLMs) at scale. Next, Databricks purchased Arcion to provide native real-time data ingestion into their […]
Using Snowflake and Databricks Together
This is not another comparison between Databricks and Snowflake; they’re not hard to find. This is a practical guide about using Databricks and Snowflake together in your organization. Many companies have both products implemented. Sometimes, there is a discrepancy between the two as far as the data being stored, creating new data silos. The Databricks […]
Writing Testable Python Objects in Databricks
I’ve been writing about Test-Driven Development in Databricks and some of the interesting issues that you can run into with Python objects. It’s always been my opinion that code that is not testable is detestable. Admittedly, it’s been very difficult getting to where I wanted to be with Databricks and TDD. Unfortunately, it’s hard to […]
Understanding the role of Py4J in Databricks
I mentioned that my attempt to implement TDD with Databricks was not totally successful. Setting up the local environment was not a problem, and getting a service ID for the CI/CD component was more of an administrative problem than a technical one. Using mocks to test Python objects that are serialized to Spark is the actual issue. […]
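The serialization problem the excerpt hints at can be demonstrated without a cluster. Spark ships Python closures and their captured objects to executors via pickle, and `unittest.mock` objects are not picklable, so a mocked dependency captured inside a UDF or `mapPartitions` function fails before any test logic runs. A minimal sketch of the failure mode:

```python
import pickle
from unittest.mock import MagicMock


def is_picklable(obj) -> bool:
    """Return True if Spark's pickle-based serialization could ship this object."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


plain_config = {"endpoint": "https://example.test"}  # hypothetical captured value
mocked_client = MagicMock()                           # a typical test double

# Plain data survives serialization; the mock does not, which is why a
# UDF that closes over a MagicMock blows up when Spark serializes it.
```

This is why patterns like injecting plain fakes (simple classes or dicts) instead of `MagicMock`, or keeping the mocked boundary on the driver side, come up so often in Spark testing discussions.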
Test Driven Development with Databricks
I don’t like testing Databricks notebooks, and that’s a problem. I like Databricks. I like Test-Driven Development. Not in an evangelical, 100%-code-coverage-or-fail kind of way. I just find that a reasonable amount of code coverage gives me a reasonable amount of confidence. Databricks has documentation for unit testing. I tried […]
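One common way to get that "reasonable amount of confidence" is to keep notebook logic in plain, importable functions that need no cluster to test. This is a sketch of that pattern, not the post's own code; `clean_emails` is a hypothetical transform a notebook might call before building a DataFrame.

```python
def clean_emails(rows: list[dict]) -> list[dict]:
    """Drop rows without an email and normalize the rest to lowercase.

    Pure Python, so pytest can exercise it in CI with no SparkSession.
    """
    return [
        {**row, "email": row["email"].strip().lower()}
        for row in rows
        if row.get("email")
    ]


# In a notebook: spark.createDataFrame(clean_emails(raw_rows))
# In CI: import the module and assert on the function directly.
cleaned = clean_emails([{"email": " A@B.COM "}, {"email": None}])
```

The notebook then becomes a thin shell around tested functions, which is the part of TDD that survives contact with Databricks most easily.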
Spark DataFrame: Writing into Files
This blog post explores how to write a Spark DataFrame into various file formats, saving data to external storage for further analysis or sharing. Before diving into this post, have a look at my other blog posts on creating and manipulating DataFrames, as well as writing a DataFrame into tables and views. […]
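The save call itself follows one pattern across formats. Since the actual write needs a live SparkSession, the sketch below keeps the format/mode validation in a pure helper that can run anywhere, and shows the `df.write.save()` call as a comment; the helper and its format list are illustrative assumptions, not the post's code.

```python
# Formats and save modes commonly available to DataFrameWriter.
SUPPORTED_FORMATS = {"csv", "json", "parquet", "orc", "text"}
SUPPORTED_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def writer_args(fmt: str, mode: str = "overwrite") -> dict:
    """Return validated keyword arguments for DataFrame.write.save()."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    return {"format": fmt, "mode": mode}


# With a live session, the write would look like (path is hypothetical):
# df.write.save("/mnt/output/events", **writer_args("parquet"))
```

Choosing the mode up front matters: `overwrite` replaces existing output, `append` adds to it, and `errorifexists` (the default) fails fast when the path is already populated.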
Spark SQL Properties
The spark.sql.* properties are a set of configuration options specific to Spark SQL, a module within Apache Spark designed for processing structured data using SQL queries, DataFrame API, and Datasets. These properties allow users to customize various aspects of Spark SQL’s behavior, optimization strategies, and execution environment. Here’s a brief introduction to some common spark.sql.* […]
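A few of the most commonly tuned `spark.sql.*` properties can illustrate the idea. The values below are the documented defaults for recent Spark releases, shown here for illustration; the flag-building helper is a hypothetical convenience, since on a live session you would simply call `spark.conf.set(key, value)`.

```python
# Frequently tuned Spark SQL properties (values are the usual defaults).
SQL_CONF = {
    "spark.sql.shuffle.partitions": "200",               # partitions after shuffles
    "spark.sql.adaptive.enabled": "true",                # adaptive query execution
    "spark.sql.autoBroadcastJoinThreshold": "10485760",  # 10 MB broadcast-join cutoff
}


def to_submit_flags(conf: dict) -> list[str]:
    """Render a config dict as --conf flags for spark-submit."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


flags = to_submit_flags(SQL_CONF)
```

Whether set at submit time or per session, these properties are the main levers for shuffle parallelism and join strategy in Spark SQL workloads.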
Databricks Lakehouse Federation Public Preview
Sometimes, it's nice to be able to skip a step. Most data projects involve data movement before data access. Usually this is not an issue; everyone agrees that the data must be made available before it can be used. There are use cases where the data movement part is a blocker because of time, cost, […]