Agentic AI for Real-Time Pharmacovigilance on Databricks

Adverse drug reaction (ADR) detection is a primary regulatory and patient-safety priority for life sciences and health systems. Traditional pharmacovigilance methods often depend on delayed signal detection from siloed data sources and require extensive manual evidence collection. This legacy approach is time-consuming, increases the risk of patient harm, and creates significant regulatory friction. For solution architects and engineers in healthcare and finance, optimizing data infrastructure to meet these challenges is a critical objective and a real headache.

Combining the Databricks Lakehouse Platform with Agentic AI presents a transformative path forward. This approach enables a closed-loop pharmacovigilance system that detects high-quality safety signals in near-real time, autonomously collects corroborating evidence, and routes validated alerts to clinicians and safety teams with complete auditability. By unifying data and AI on a single platform through Unity Catalog, organizations can reduce time-to-signal, increase signal precision, and provide the comprehensive data lineage that regulators demand. This integrated model offers a clear advantage over fragmented data warehouses or generic cloud stacks.

The Challenges in Modern Pharmacovigilance

To build an effective pharmacovigilance system, engineers must integrate a wide variety of data types. This includes structured electronic health records (EHR) in formats like FHIR, unstructured clinical notes, insurance claims, device telemetry from wearables, lab results, genomics, and patient-reported outcomes. This process presents several technical hurdles:

  • Data Heterogeneity and Velocity: The system must handle high-velocity streams from devices and patient apps alongside periodic updates from claims and EHR systems. Managing these disparate data types and speeds without creating bottlenecks is a significant challenge.
  • Sparse and Noisy Signals: ADR mentions can be buried in unstructured notes, timestamps may conflict across sources, and confounding variables like comorbidities or polypharmacy can obscure true signals.
  • Manual Evidence Collection: When a potential signal is flagged, safety teams often must manually re-query various systems and request patient charts, a process that delays signal confirmation and response.
  • Regulatory Traceability: Every step, from detection to escalation, must be reproducible. This requires clear, auditable provenance for both the data and the models used in the analysis.

The Databricks and Agentic AI Workflow

An agentic AI framework running on the Databricks Lakehouse provides a structured, scalable solution to these problems. This system uses modular, autonomous agents that work together to implement a continuous pharmacovigilance workflow. Each agent has a specific function, from ingesting data to escalating validated signals.

Step 1: Ingest and Normalize Data

The foundation of the workflow is a unified data layer built on Delta Lake. Ingestion & Normalization Agents are responsible for continuously pulling data from various sources into the Lakehouse.

  • Continuous Ingestion: Using Lakeflow Declarative Pipelines and Spark Structured Streaming, these agents ingest real-time data from EHRs (FHIR), claims, device telemetry, and patient reports. Data can be streamed from sources like Kafka or Azure Event Hubs directly into Delta tables.
  • Data Normalization: As data is ingested, agents perform crucial normalization tasks. This includes mapping medical codes to standards like RxNorm, SNOMED, and LOINC. They also resolve patient identities across different datasets using both deterministic and probabilistic linking methods, creating a canonical event timeline for each patient. This unified view is essential for accurate signal detection.
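To make this step more concrete, here is a minimal sketch of continuous ingestion with Spark Structured Streaming into a Delta table. The Kafka broker, topic, checkpoint path, and catalog/table names are assumptions for illustration; a production implementation would more likely be expressed as a Lakeflow Declarative Pipeline with the normalization logic layered on top.

```python
from pyspark.sql import functions as F

# Assumed Kafka topic and Unity Catalog table names, for illustration only.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "device-telemetry")
    .load()
)

# Keep the event time and raw payload; downstream agents handle code mapping
# (RxNorm, SNOMED, LOINC) and identity resolution.
events = (
    raw.select(
        F.col("timestamp").alias("event_ts"),
        F.col("value").cast("string").alias("payload"),
    )
    .withColumn("ingest_date", F.to_date("event_ts"))
)

(
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/Volumes/pv/bronze/checkpoints/device_telemetry")
    .partitionBy("ingest_date")
    .toTable("pv.bronze.device_telemetry")
)
```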

Step 2: Detect Signals with Multimodal AI

Once the data is clean and unified, Signal Detection Agents apply a suite of advanced models to identify potential ADRs. This multimodal approach significantly improves precision.

  • Multimodal Detectors: The system runs several types of detectors in parallel. Clinical Large Language Models (LLMs) and fine-tuned transformers extract relevant entities and context from unstructured clinical notes. Time-series anomaly detectors monitor device telemetry for unusual patterns, such as spikes in heart rate from a wearable.
  • Causal Inference: To distinguish true causality from mere correlation, statistical and counterfactual causal engines analyze the data to assess the strength of the association between a drug and a potential adverse event.
  • Scoring and Provenance: Each potential ADR is scored with an uncertainty estimate. Crucially, the system also attaches provenance pointers that link the signal back to the specific data and model version used for detection, ensuring full traceability.
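As a small illustration of those provenance pointers, the sketch below stamps each candidate signal with the detector identity and the Delta table version it was scored against. The table names and model version are assumptions; the detection models themselves are out of scope here.

```python
from pyspark.sql import functions as F

# Capture the current Delta version of the source notes table (assumed name).
notes_version = (
    spark.sql("DESCRIBE HISTORY pv.silver.clinical_notes LIMIT 1")
    .select("version")
    .first()[0]
)

scored = (
    spark.table("pv.silver.candidate_signals")                  # assumed detector output
    .withColumn("detector_model", F.lit("pv.models.adr_ner"))   # assumed model name
    .withColumn("detector_model_version", F.lit("3"))           # assumed version
    .withColumn("source_table", F.lit("pv.silver.clinical_notes"))
    .withColumn("source_table_version", F.lit(int(notes_version)))
    .withColumn("scored_at", F.current_timestamp())
)

scored.write.mode("append").saveAsTable("pv.gold.adr_signals")
```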

Step 3: Collect Evidence Autonomously

When a candidate signal crosses a predefined confidence threshold, an Evidence Collection Agent is activated. This agent automates what is typically a manual and time-consuming process.

  • Automated Assembly: The agent automatically assembles a complete evidence package. It extracts relevant sections from patient charts, re-runs queries for lab trends, fetches associated genomics variants, and pulls specific windows of device telemetry data.
  • Targeted Data Pulls: If the initial evidence is incomplete, the agent can plan and execute targeted data pulls. For example, it could order a specific lab test, request a clinician chart review through an integrated system, or trigger a patient survey via a connected app to gather more information on symptoms and dosing adherence.

Step 4: Triage and Escalate Signals

With the evidence gathered, a Triage & Escalation Agent takes over. This agent applies business logic and risk models to determine the appropriate next step.

  • Composite Scoring: The agent aggregates all collected evidence and computes a composite risk and confidence score for the signal. It applies configurable business rules based on factors like event severity and regulatory reporting timelines.
  • Intelligent Escalation: For high-risk or ambiguous signals, the agent automatically escalates the issue to human safety teams by creating tickets in systems like Jira or ServiceNow. For clear, high-confidence signals that pose a lower operational risk, the system can be configured to auto-generate regulatory reports, such as 15-day expedited submissions, where permitted.
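A simplified sketch of how such a triage agent might combine the evidence into a composite score and pick an escalation path is shown below. The weights, thresholds, and severity handling are illustrative assumptions, not clinical or regulatory guidance.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    detector_score: float   # 0-1 output of the signal detection agents
    causal_strength: float  # 0-1 output of the causal inference engine
    completeness: float     # 0-1 share of the evidence package assembled
    severity: str           # e.g. "serious" or "non-serious"

SEVERITY_WEIGHT = {"serious": 1.0, "non-serious": 0.6}

def composite_score(e: Evidence) -> float:
    # Illustrative weighted aggregation; the weights are assumptions.
    base = 0.5 * e.detector_score + 0.3 * e.causal_strength + 0.2 * e.completeness
    return base * SEVERITY_WEIGHT.get(e.severity, 0.8)

def route(e: Evidence, review_threshold: float = 0.6, auto_report_threshold: float = 0.9) -> str:
    score = composite_score(e)
    if e.severity == "serious" or review_threshold <= score < auto_report_threshold:
        return "escalate_to_safety_team"        # e.g. open a Jira/ServiceNow ticket
    if score >= auto_report_threshold:
        return "auto_generate_regulatory_report"
    return "continue_monitoring"

print(route(Evidence(0.85, 0.7, 0.9, "serious")))  # -> escalate_to_safety_team
```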

Step 5: Enable Continuous Learning

The final agent in the workflow closes the loop, ensuring the system improves over time. The Continuous Learning Agent uses feedback from human experts to refine the AI models.

  • Feedback Integration: Outcomes from chart reviews, follow-up labs, and final regulatory adjudications are fed back into the system’s training pipelines.
  • Model Retraining and Versioning: This new data is used to retrain and refine the signal detectors and causal models. MLflow tracks these updates, versioning the new models and linking them to the training data snapshot. This creates a fully auditable and continuously improving system that meets strict regulatory standards for model governance.
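The sketch below shows one way the retraining step could be logged and versioned with MLflow against Unity Catalog. The experiment path, tags, and model name are assumptions, and synthetic features stand in for the real feedback data.

```python
import mlflow
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in features; in practice these come from the curated feedback tables.
rng = np.random.default_rng(7)
X, y = rng.normal(size=(500, 8)), rng.integers(0, 2, size=500)

mlflow.set_registry_uri("databricks-uc")               # register models in Unity Catalog
mlflow.set_experiment("/Shared/pv_signal_detector")    # assumed experiment path

with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=500).fit(X, y)
    mlflow.log_metric("train_auc", roc_auc_score(y, model.predict_proba(X)[:, 1]))
    # Link the run to the data snapshot used for training (auditability).
    mlflow.set_tag("training_table", "pv.gold.adr_feedback")   # assumed table name
    mlflow.set_tag("training_table_version", "42")             # assumed Delta version
    mlflow.sklearn.log_model(model, "model")

mlflow.register_model(f"runs:/{run.info.run_id}/model",
                      "pv.models.adr_signal_detector")         # assumed UC model name
```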

The Technical Architecture on Databricks

The power of this workflow comes from the tightly integrated components of the Databricks Lakehouse Platform.

  • Data Layer: Delta Lake serves as the single source of truth, storing versioned tables for all data types. Unity Catalog manages fine-grained access policies, including row-level masking, to protect sensitive patient information.
  • Continuous ETL & Feature Store: Delta Live Tables provide schema-aware pipelines for all data engineering tasks, while the integrated Feature Store offers managed feature views for models, ensuring consistency between training and inference.
  • Detection & Inference: Databricks provides integrated GPU clusters for training and fine-tuning clinical LLMs and other complex models. MLflow tracks experiments, registers model versions, and manages deployment metadata.
  • Agent Orchestration: Lakeflow Jobs coordinate the execution of all agent tasks, handling scheduling, retries, and dependencies. The agents themselves can be lightweight microservices or notebooks that interact with Databricks APIs.
  • Serving & Integrations: The platform offers low-latency model serving endpoints for real-time scoring. It can integrate with clinician portals via SMART-on-FHIR, ticketing systems, and messaging services to facilitate human-in-the-loop workflows.

Why This Approach Outperforms Alternatives

Architectures centered on traditional data warehouses like Snowflake often struggle with this use case because they separate storage from heavy ML compute. Tasks like LLM inference and streaming feature engineering require external GPU clusters and complex orchestration, which introduces latency, increases operational overhead, and fractures data lineage across systems. Similarly, a generic cloud stack requires significant integration effort to achieve the same level of data and model governance.

The Databricks Lakehouse co-locates multimodal data, continuous pipelines, GPU-enabled model lifecycles, and governed orchestration on a single, unified platform. This integration dramatically reduces friction and provides a practical, auditable, and scalable path to real-time pharmacovigilance. For solution architects and engineers, this means a faster, more reliable way to unlock real-time insights from complex healthcare data, ultimately improving patient safety and ensuring regulatory compliance.

Conclusion

By harnessing Databricks’ unified Lakehouse architecture and agentic AI, organizations can transform pharmacovigilance from a reactive, manual process into a proactive, intelligent system. This workflow not only accelerates adverse drug reaction detection but also streamlines evidence collection and triage, empowering teams to respond swiftly and accurately. The platform’s end-to-end traceability, scalable automation, and robust data governance support stringent regulatory demands while driving operational efficiency. Ultimately, implementing this modern approach leads to better patient outcomes, reduced risk, and a future-ready foundation for safety monitoring in life sciences.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

Agentic AI Closed-Loop Systems for N-of-1 Treatment Optimization on Databricks

Precision therapeutics for rare diseases and complex oncology cases may benefit from Agentic AI Closed-Loop (AACL) systems that enable individualized treatment optimization: a continuous process of proposing, testing, and adapting therapies for a single patient (N-of-1 trials).

N-of-1 problems are not typical for either clinicians or data systems. Type 2 diabetes in the US is more of an N-of-3.8×10^7 problem, so we’re looking at a profoundly different category of scaling. This lower number is not easier, because it implies existing treatment protocols have not been successful. N-of-1 optimization can discover effective regimens rapidly, but only with a data system that can manage dense multimodal signals (omics, time-series biosensors, lab results), provide fast model iteration, incorporate clinician-in-the-loop safety controls, and ensure rigorous provenance. We also need to consider the heavy cognitive load the clinician will be under. While traditional data analytics and machine learning algorithms will still play a key role, Agentic AI support can be invaluable.

Agentic AI Closed-Loop systems are relatively new, so let’s look at what a system designed to support this architecture would look like from the ground up.

Data Platform

First, let’s define the foundation of what we are trying to build. We need a clinical system that can deliver reproducible results with full lineage and enable safe automation to augment clinical judgement. That’s a decent overview of any clinical data system, so I feel like we’re on solid ground. I would posit that individualized treatment optimization needs shorter iteration times than the standard, because the smaller N means we have moved farther from the standard of care (SoC), so there will likely be more experiments. Further, these experiments will need more clever validations. Siloed and fragmented data stores, disconnected data, disjoint model operationalization, and heavy ETL are non-starters based on our foundational assumptions. A data lakehouse is a more appropriate architecture.

A data lakehouse is a unified data architecture that blends the low-cost, flexible storage of a data lake with the structure and management capabilities of a data warehouse. This combined approach allows organizations to store and manage both structured and unstructured data types on cost-effective cloud storage, while also providing high-performance analytics, data governance, and support for ML and AI workloads on the same data. Databricks currently has the most mature lakehouse implementation. Databricks is well known for handling multimodal data, so the variety of data is not a problem even at high volume.

Clinical processes are heavily regulated. Fortunately, Unity Catalog provides a high level of security and governance across your data, ML, and AI artifacts. Databricks provides a platform that can deliver auditable, regulatory-grade systems in a much more efficient and effective way than siloed data warehouses or other cloud data stacks. Realistically, data provenance alone is not sufficient to align the clinician’s cognitive load with the smaller N; it’s still a very hard problem. Honestly, since we have had lakehouses for some time and have not been able to reliably tackle N-of-1 at scale, the problem can’t solely be with the data system. This is where Agentic AI enters the scene.

Agentic AI

Agentic AI refers to systems of autonomous agents, modular reasoning units that plan, execute, observe, and adapt, orchestrated to complete complex workflows. Architecturally, Agentic AI running on Databricks’ Lakehouse platform uniquely enables safe, scalable N-of-1 systems by co-locating multimodal data, high-throughput model training, low-latency inference, and auditable model governance. This architecture accelerates time-to-effective therapy, reduces clinician cognitive load, and preserves regulatory-grade provenance in ways that are materially harder to deliver on siloed data warehouses or generic cloud stacks. Here are some examples of components of the Agentic AI system that might be used as a foundation for building our N-of-1 therapeutics system. There can and will be more agents, but they will likely be used to enhance or support this basic set.

  • Digital Twin Agents compile the patient’s multimodal state and historic responses.
  • Planner/Policy Agents propose treatment variants (dose, schedule, combination) using constrained optimization informed by transfer learning from cohort data.
  • Evaluation Agents collect outcome signals (biosensors, labs, imaging), compute reward/utility, and update the digital twin.
  • Safety/Compliance Agents enforce clinical constraints, route proposals for clinician review when needed, and produce provenance records.

For N-of-1 therapeutics, there are distinct advantages to designing agents to form a closed loop. Let’s discuss why.

Agentic AI Closed Loop System

Agentic AI Closed Loops (AACL)  enable AI systems to autonomously perceive, decide, act, and adapt within self-contained feedback cycles. The term “agentic” underscores the AI’s ability to proactively pursue goals without constant human oversight, while “closed loop” highlights its capacity to refine performance through internal feedback. This synergy empowers AACL systems to move beyond reactive processing, anticipating challenges and optimizing outcomes in real time. This is how we scale AI to realistically address clinician cognitive load within a highly regulated clinical framework.

  • Perception: The AI gathers information from its Digital Twin, among other sources.
  • Reasoning and Planning: Based on its goals and perceived data of the current test iteration, the AI breaks down the objective into a sequence of actionable steps.
  • Action: The AI executes its plan, often through the Planner/Policy Agents.
  • Feedback and Learning: The system evaluates the outcome of its actions through the Evaluation Agents and compares them against its goals, referencing the Safety/Compliance Agents. It then learns from this feedback to refine its internal models and improve its performance in the next cycle. 
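To make the cycle concrete, here is a toy closed-loop skeleton in plain Python. The agents are stub functions, the dosing rule and simulated response are purely illustrative assumptions, and nothing here is clinical guidance.

```python
import random

def digital_twin_summary(history):
    """Perception: condense the patient's recent response history."""
    return {"last_response": history[-1]["response"] if history else None}

def planner_propose(state, start_dose=10, max_dose=40):
    """Planning: bounded dose adjustment (illustrative policy only)."""
    last = state["last_response"]
    if last is None:
        return {"dose": start_dose}
    return {"dose": min(max_dose, start_dose + round(20 * (1 - last)))}

def safety_check(proposal, max_dose=40):
    """Safety/Compliance: enforce hard constraints before acting."""
    return proposal["dose"] <= max_dose

def evaluate(proposal):
    """Evaluation: collect outcome signals (simulated here); 1.0 is an ideal response."""
    return {"dose": proposal["dose"], "response": random.random()}

history = []
for cycle in range(5):
    state = digital_twin_summary(history)        # Perception
    proposal = planner_propose(state)            # Reasoning and planning
    if not safety_check(proposal):               # Clinician-in-the-loop gate
        continue                                 # in practice: route for review, log provenance
    history.append(evaluate(proposal))           # Action + feedback into the twin
```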

AACL systems are modular frameworks. Let’s wrap up with a proposed reference architecture for an AACL system on Databricks.

AACL on Databricks

We’ll start with a practical implementation of the data layer. Delta Lake provides versioned tables for EHR (FHIR-parquet), structured labs, medication history, genomics variants, and treatment metadata. Time-series data like high-cardinality biosensor streams can be ingested via Spark Structured Streaming into Delta tables using time partitioning and compaction. Databricks Lakeflow is a solid tool for this. Patient and cohort embeddings can be stored as vector columns or integrated with a co-located vector index.

The Feature and ETL Layer builds on Lakeflow’s capabilities. A declarative syntax and a UI provide a low-code solution for building continuous pipelines that normalize clinical codes and compute rolling features like time-windowed response metrics. Databricks Feature Store patterns enable reusable feature views for model inputs and predictors.
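As a small example of those rolling features, the sketch below computes 24-hour windowed heart-rate statistics with PySpark and writes them to a feature table. The catalog, table, and column names are assumptions.

```python
from pyspark.sql import functions as F

# Assumed biosensor table with columns: patient_id, ts (timestamp), heart_rate.
biosensor = spark.table("n1.silver.biosensor_readings")

rolling = (
    biosensor
    .groupBy("patient_id", F.window("ts", "24 hours", "1 hour").alias("w"))
    .agg(
        F.avg("heart_rate").alias("hr_mean_24h"),
        F.stddev("heart_rate").alias("hr_std_24h"),
        F.max("heart_rate").alias("hr_max_24h"),
    )
    .select(
        "patient_id",
        F.col("w.end").alias("feature_time"),
        "hr_mean_24h", "hr_std_24h", "hr_max_24h",
    )
)

rolling.write.mode("overwrite").saveAsTable("n1.features.hr_rolling_24h")
```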

Databricks provides distributed GPU clusters for the model and agent layer, as well as access to foundation and custom AI models. Lakeflow Jobs orchestrate agent execution, coordinate microservices (consent UI, clinician portal, device provisioning), and manage retries.

MLflow manages most of the heavy lifting for serving and integration. You can serve low-latency policy and summarization endpoints while supporting canary deployments and A/B testing. The integration endpoints can supply secure APIs for EHR actionability (SMART on FHIR) and clinician dashboards. You can also ensure the system meets audit and governance standards and practices using the MLflow Model Registry as well as Unity Catalog for data and model access control.
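On the serving side, a hedged sketch of querying a deployed policy endpoint through the MLflow deployments client is shown below. The endpoint name and input schema are assumptions and depend entirely on how the policy model was logged.

```python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Hypothetical endpoint and payload; match the signature of your logged model.
response = client.predict(
    endpoint="n1-policy-agent",
    inputs={"dataframe_records": [{
        "patient_id": "p-001",
        "hr_mean_24h": 72.4,
        "hr_std_24h": 5.1,
        "current_dose": 20,
    }]},
)
print(response)
```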

Conclusion

Agentic AI closed-loop systems on a Databricks lakehouse offer an auditable, scalable foundation for rapid N-of-1 treatment optimization in precision therapeutics—especially for rare disease and complex oncology—by co-locating multimodal clinical data (omics, biosensors, labs), distributed GPU training, low-latency serving, and model governance (MLflow, Unity Catalog). Implementing Digital Twin, Planner/Policy, Evaluation, and Safety agents in a closed-loop workflow shortens iteration time, reduces clinician cognitive load, and preserves provenance for regulatory compliance, while reusable feature/ETL patterns, time-series versioning (Delta Lake), and vector indexes enable robust validation and canary deployments. Start with a strong data layer, declarative pipelines, and modular agent orchestration, then iterate with clinician oversight and governance to responsibly scale individualized N-of-1 optimizations and accelerate patient-specific outcomes.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

 

 

 

A Recipe to Boost Predictive Modeling Efficiency

Implementing predictive analytical insights has become essential for organizations to operate efficiently and remain relevant. What is important while doing this, though, is to be agile and adaptable: what holds valid for a period can easily become obsolete over time, and what is characteristic of one group of customers can vary widely across a diverse audience. Therefore, going from an envisioned innovative business idea to a working AI/ML model requires a mechanism that allows for a rapid, AI-driven approach.

In this post, I explain how Databricks, GitHub Copilot, and the Visual Studio Code IDE (VS Code) together offer an elevated experience for implementing predictive ML models efficiently. Even with minimal coding and data science experience, one can build, test, and deploy predictive models. The synergy between GitHub Copilot in VS Code, MLflow, and Databricks Experiments is remarkable. Here is how this approach goes.

Prerequisites

Before starting, there are a few one-time setup steps to configure VS Code so it’s well-connected to a Databricks instance. The aim here is to leverage Databricks compute (Serverless works too) which provides easy access to various Unity Catalog components (such as tables, files, and ML models).
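As a rough sketch of that one-time setup, the snippet below opens a Databricks Connect session from VS Code against serverless compute and runs a quick smoke test. It assumes databricks-connect is installed and a Databricks CLI profile (host plus auth) is already configured; the serverless() builder option requires a recent databricks-connect release, so check your version's documentation.

```python
from databricks.connect import DatabricksSession

# Uses the default CLI profile; serverless() is assumed to be available in your
# databricks-connect version.
spark = DatabricksSession.builder.serverless(True).getOrCreate()

# Smoke test: query a sample Unity Catalog table via the remote serverless compute.
print(spark.table("samples.nyctaxi.trips").limit(5).count())
```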

Define the Predictive Modeling Agent Prompt in Natural Language

Use the GitHub Copilot Agent with an elaborate plain-language prompt that provides the information it needs to devise the complete solution. This is where the real effort lies. Below are the important points to include in the agent prompt that I found produce a more successful outcome with fewer iterations.

  • Data Sources: Tell the Agent about the source data, and not just in technical terms but also functionally so it considers the business domain that it applies to. You can provide the table names where it will source data from in the Unity Catalog and Schema. It also helps to explain the main columns in the source tables and what the significance of each column is. This enables the agent to make more informed decisions on how to use the source data and whether it will need to transform it. The explanations also result in better feature engineering decisions to feed into the ML models.
  • Explain the Intended Outcome: Here is where one puts their innovative idea in words. What is the business outcome? What type of prediction are you looking for? Are there multiple insights that need to be determined? Are there certain features of the historical data that need to be given greater weight when determining the next best action or a probability of an event occurring? In addition to predicting events, are you interested in knowing the expected timeline for an event to occur?
  • Databricks Artifact Organization: If you’re looking to stick to standards followed in managing Databricks content, you can provide additional directions as part of the prompt. For instance, what are the exact names to use for notebooks, tables, models, etc. It also helps to be explicit about how VS Code will run the code. Instructing it to use Databricks Connect with a default serverless compute configuration eliminates the need to manually set up a Databricks connection through code. In addition, instructing the agent to leverage the Databricks Experiments capability to enable model accessibility through the Databricks UI ensures that one can easily monitor model progress and metrics.
  • ML Model Types to Consider: Experiments in Databricks are a great way of effectively comparing several algorithms simultaneously (e.g., Random Forest, XGBoost, Logistic Regression, etc.). If you have a good idea of what type of ML algorithms are applicable for your use case, you can include one or more of these in the prompt so the generated experiment is more tailored. Alternatively, let the agent recommend several ML models that are most suitable for the use case.
  • Operationalizing the Models: In the same prompt, one can provide instructions on choosing the most accurate model, registering it in Unity Catalog, and applying it to new batch or streaming data inferences. You can also be specific about which activities will be organized together as combined vs. separate notebooks for ease of scheduling and maintenance.
  • Synthetic Data Generation: Sometimes data is not readily available to experiment with, but one has a good idea of what it will look like. Here is where Copilot and the Python Faker library are advantageous in synthesizing mock data that mimics real data. This may be necessary not just for creating experiments but for testing models as well. Including instructions in the prompt for what type of synthetic data to generate allows Copilot to integrate cells in the notebook for that purpose.

With all the necessary details included in the prompt, Copilot is able to interpret the intent and generate a structured Python notebook with organized cells to handle:

  • Data Sourcing and Preprocessing
  • Feature Engineering
  • ML Experiment Setup
  • Model Training and Evaluation
  • Model Registration and Deployment

All of this is orchestrated from your local VS Code environment, but executed on Databricks compute, ensuring scalability and access to enterprise-grade resources.
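The generated notebook will differ per use case, but a condensed version of the experiment-setup and training cells might look like the sketch below. The experiment path is hypothetical, and synthetic data stands in for the features produced by the sourcing and feature-engineering cells.

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and binary label.
X, y = make_classification(n_samples=1000, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("/Shared/churn_prediction")   # hypothetical experiment path

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, model in candidates.items():
    with mlflow.start_run(run_name=name):
        model.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        mlflow.log_param("model_type", name)
        mlflow.log_metric("test_auc", auc)
        mlflow.sklearn.log_model(model, "model")   # best run is registered downstream
```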

The Benefits

Following are key benefits to this approach:

  • Minimal Coding Required: This applies not just for the initial model tuning and deployment but for improvement iterations also. If there is a need to tweak the model, just follow up with the Copilot Agent in VS Code to adjust the original Databricks notebooks, retest and deploy them.
  • Enhanced Productivity: By leveraging the Databricks Experiments APIs, we’re able to automate tasks like creating experiments, logging parameters, metrics, and artifacts within training scripts, and integrate MLflow tracking into CI/CD pipelines. This allows for seamless, repeatable workflows without manual intervention. Programmatically registering, updating, and managing model versions in the MLflow Model Registry is also streamlined through the APIs used from VS Code.
  • Leverage User-Friendly UI Features in Databricks Experiments: Even though the ML approach described here is ultimately driven by auto-generated code, that doesn’t mean we’re unable to take advantage of the rich Databricks Experiments UI. As the code executes in VS Code on Databricks compute, we’re able to log in to the Databricks interactive environment to inspect individual runs, review logged parameters, metrics, and artifacts, and compare different runs side-by-side to debug models or understand experimental results.

In summary, the synergy between GitHub Copilot, VS Code, and Databricks empowers users to go from idea to deployed ML models in hours, not weeks. By combining the intuitive coding assistance of GitHub Copilot with the robust infrastructure of Databricks and the flexibility of VS Code, predictive modeling becomes accessible and scalable.

Salesforce to Databricks: A Deep Dive into Integration Strategies

Supplementing Salesforce with Databricks as an enterprise Lakehouse solution brings advantages for various personas across an organization. Customer experience data is highly valued when it comes to driving personalized customer journeys that leverage company-wide applications beyond Salesforce. From enhanced customer satisfaction to tailored engagements and offerings that drive business renewals and expansions, the advantages are hard to miss. Databricks maps data from a variety of enterprise apps, including those used by Sales, Marketing, and Finance. Consequently, layering Databricks generative AI and predictive ML capabilities on top provides easily accessible best-fit recommendations that help eliminate challenges and highlight success areas within your company’s customer base.

In this blog, I elaborate on the different methods whereby Salesforce data is made accessible from within Databricks. While accessing Databricks data from Salesforce is possible, it is not the topic of this post and will perhaps be tackled in a later blog. I have focused on the built-in capabilities within both Salesforce and Databricks and have therefore excluded 3rd party data integration platforms. There are three main ways to achieve this integration:

  1. Databricks Lakeflow Ingestion from Salesforce
  2. Databricks Query Federation from Salesforce Data Cloud
  3. Databricks Files Sharing from Salesforce Data Cloud

Choosing the best approach to use depends on your use case. The decision is driven by several factors, such as the expected latency of accessing the latest Salesforce data, the complexity of the data transformations needed, and the volume of Salesforce data of interest. And it may very well be that more than one method is implemented to cater for different requirements.

While the first method copies the raw Salesforce data over to Databricks, methods 2 and 3 offer no-copy alternatives, thus leveraging Salesforce Data Cloud itself as the raw data layer. The no-copy alternatives are attractive in that they leverage Salesforce’s native capability of managing its own data lake, eliminating the overhead of redoing that effort. However, there are limitations, depending on the use case. The matrix below presents how each method compares when factoring in the key criteria for integration.

| Criteria | Lakeflow Ingestion | Salesforce Data Cloud Query Federation | Salesforce Data Cloud File Sharing |
| --- | --- | --- | --- |
| Type | Data Ingestion | Zero-Copy | Zero-Copy |
| Supports Salesforce Data Cloud as a Source? | ✔ Yes | ✔ Yes | ✔ Yes |
| Incremental Data Refreshes | ✔ Automated processing into Databricks based on SF standard timestamp fields; formula fields always require a full refresh | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) |
| Processing of Soft Deletes | ✔ Yes, supported incrementally | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) |
| Processing of Hard Deletes | Requires a full refresh | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) |
| Query Response Time | ✔ Best; data is queried from a local copy and processed within Databricks | ⚠ Slower; query response depends on SF Data Cloud and data travels across networks | ⚠ Slower; data travels across networks |
| Supports Real-Time Querying? | No; the pipeline runs on a schedule to copy data (e.g., hourly, daily) | ✔ Yes; live query execution on SF Data Cloud (the Data Cloud DLO is refreshed from Salesforce modules in batches, streaming (every 3 min), or in real time) | ✔ Yes; live data sourced from SF Data Cloud (the Data Cloud DLO is refreshed from Salesforce modules in batches, streaming (every 3 min), or in real time) |
| Supports Databricks Streaming Pipelines? | ✔ Yes, with Declarative Pipelines into streaming tables (DLT), running as micro-batch jobs | No | No |
| Suitable for High Data Volume? | ✔ Yes; the SF Bulk API is called for high data volumes such as initial loads, and the SF REST API for lower-volume incremental loads | No; reliant on JDBC query pushdown limitations and SF performance | ⚠ Moderate; more suitable than Query Federation for zero-copy with high volumes of data |
| Supports Data Transformation | ⚠ No direct transformation; ingests SF objects as-is, with transformation downstream in the Declarative Pipeline | ✔ Yes; Databricks pushes queries to Salesforce using the JDBC protocol | ✔ Yes; transformations execute on Databricks compute |
| Protocol | SF REST API and Bulk API over HTTPS | JDBC over HTTPS | Salesforce Data Cloud DaaS APIs over HTTPS (file-based access) |
| Scalability | Up to 250 objects per pipeline; multiple pipelines are allowed | Depends on SF Data Cloud performance when running transformations with multiple objects | Up to 250 Data Cloud objects may be included in a data share; up to 10 data shares |
| Salesforce Prerequisites | API-enabled Salesforce user with access to the desired objects | Salesforce Data Cloud must be available; Data Cloud DMOs mapped to DLOs with Streams or other methods for Data Lake population; JDBC API access to Data Cloud enabled | Salesforce Data Cloud must be available; Data Cloud DMOs mapped to DLOs with Streams or other methods for Data Lake population; a data share target created in SF with the shared objects |
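For method 2, a hedged sketch of wiring up query federation from a Databricks notebook is shown below. The connection type is the one documented for Lakehouse Federation with Salesforce Data Cloud, but the option names and values here are placeholders that vary by authentication method, so confirm them against the current documentation before use.

```python
# All identifiers and OPTIONS values below are placeholders / assumptions.
spark.sql("""
  CREATE CONNECTION IF NOT EXISTS sfdc_data_cloud_conn
  TYPE salesforce_data_cloud
  OPTIONS (
    client_id     '<connected-app-client-id>',
    client_secret '<connected-app-client-secret>',
    instance_url  'https://<your-org>.my.salesforce.com'
  )
""")

# A catalog-level option (e.g. the Data Cloud dataspace) may also be required.
spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS sfdc_data_cloud
  USING CONNECTION sfdc_data_cloud_conn
""")

# Queries run against Data Cloud in place; no data is copied into Databricks.
spark.sql("SHOW SCHEMAS IN sfdc_data_cloud").show()
```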

If you’re looking for guidance on leveraging Databricks with Salesforce, reach out to Perficient for a discussion with Salesforce and Databricks specialists.

Databricks Lakebase – Database Branching in Action

What is Databricks Lakebase?

Databricks Lakebase is a Postgres OLTP engine integrated into the Databricks Data Intelligence Platform. A database instance is a compute type that provides fully managed storage and compute resources for a Postgres database. Lakebase uses an architecture that separates compute and storage, which allows independent scaling while supporting low-latency (<10 ms), high-concurrency transactions.

Databricks has integrated this powerful Postgres engine with sophisticated capabilities gained through Databricks’ recent acquisition of Neon. Lakebase is fully managed by Databricks, which means no infrastructure has to be provisioned or maintained separately. In addition to being a traditional OLTP engine, Lakebase offers the following features:

  • Openness: Lakebase is built on open-source standards.
  • Storage and compute separation: Lakebase stores data in data lakes in open formats, enabling storage and compute to scale independently.
  • Serverless: Lakebase is lightweight, meaning it can scale up and down instantly based on load. It can scale down to zero, at which point the cost is for data storage only; no compute cost applies.
  • Modern development workflow: Branching a database is as simple as branching a code repository, and it happens near-instantly.
  • Built for AI Agents: Lakebase is designed to support a large number of AI agents. Its branching and checkpointing capabilities enable AI agents to experiment and rewind to any point in time.
  • Lakehouse Integration: Lakebase makes it easy to combine operational, analytical, and AI systems without complex ETL pipelines.

In this article, we shall discuss in detail about how database branching feature works in Lakebase.

Database Branching

Database branching is one of the unique features introduced in Lakebase that enables branching out a database. It mirrors exactly how a code branch is created from an existing branch in a repository.

Branching a database is useful for creating an isolated test environment or for point-in-time recovery. Lakebase uses a copy-on-write branching mechanism to create an instant zero-copy clone of the database, with dedicated compute to operate on that branch. With a zero-copy clone, a branch of a parent database of any size can be created instantly.

The child branch is managed independently of the parent branch. With an isolated child branch, one can test and debug against a copy of production data. Though the parent and child databases appear separate, both instances physically point to the same data pages. Under the hood, the child database references the data pages the parent points to. When data changes in the child branch, a new data page is created with the changes, and it is visible only to that branch. Changes made in a branch do not reflect in the parent branch.

How branching works

The below diagrams represent how database branching works under the hood:

[Diagram: Database Branching]

[Diagram: Database Branching Updates]

Lakebase in action

Here is a demonstration of how a Lakebase instance can be created, how an instance is branched out, and how table changes behave.

To create a Lakebase instance, log in to Databricks and navigate to Compute -> OLTP Database tab -> click the “Create New Instance” button.

[Screenshot: Create New Instance]

[Screenshot: Create New Instance - Success]

Click “New Query” to launch the SQL Editor for the PostgreSQL database. In the current instance, let’s create a new table and add some records.

[Screenshot: pginstance1 - create table]

[Screenshot: pginstance1 - query table]

Let’s create a database branch “pginstance2” from instance “pginstance1”. Go to Compute -> OLTP Database -> Create Database instance.

Enter a new instance name and expand “Advanced Settings” -> enable the “Create from parent” option -> enter the source instance name “pginstance1”.

Under “Include data from parent up to”, select the “Current point in time” option. Here, we could also choose any specific point in time.

[Screenshot: Create pginstance2]

[Screenshot: pginstance2 created successfully]

Launch the SQL Editor from the pginstance2 database instance and query the tbl_user_profile table.

[Screenshot: pginstance2 - query tbl_user_profile]

Now, let’s insert a new record and update an existing record in the tbl_user_profile table in pginstance2.

[Screenshot: pginstance2 - update tbl_user_profile]

Now, let’s switch back to the parent database instance pginstance1 and query the tbl_user_profile table. The table in pginstance1 should still have only 3 records. All the changes made to tbl_user_profile should be visible only in pginstance2.

[Screenshot: pginstance1 - query tbl_user_profile]

Conclusion

Database changes made in one branch do not impact or reflect in another branch, thereby providing clear isolation of databases at scale. Currently, Lakebase does not have a feature to merge database branches. However, Databricks is committed to and working toward database merge capability in the near future.

Celebrating Perficient’s Third Databricks Champion

We’re excited to welcome Bamidele James as Perficient’s newest and third Databricks Champion!  His technical expertise, community engagement, advocacy, and mentorship have made a profound impact on the Databricks ecosystem.

His Nomination Journey

Bamidele’s journey through the nomination process was rigorous. It required evidence that he has successfully delivered multiple Databricks projects, received several certifications, completed an instructor-led training course, and participated in a panel interview with the Databricks committee.

What This Achievement Means

This achievement represents peer and leadership recognition of Bamidele’s knowledge, contributions, and dedication to building strong partnerships. It also brings him a sense of purpose and pride to know that his work has made a real impact, and his continuous efforts are appreciated.

Contributing to Databricks’ and Perficient’s Growth

Bamidele plays a pivotal role in helping our clients unlock the full potential of Databricks by aligning Perficient’s Databricks capabilities with their business goals. He enables enterprise customers to accelerate their data and AI transformation to deliver measurable outcomes like reduced time-to-insight, improved operational efficiency, and increased revenue. In addition, Bamidele has led workshops, held executive briefings, and developed proofs of concept that help our clients drive adoption and deepen customer engagement.

“Being a Databricks Champion affirms that my contributions, knowledge, and dedication to building strong partnerships are recognized by my peers and leadership.” – Bamidele James, Technical Architect

Skills Key to This Achievement

Many skills and proficiencies—including data engineering and architecture, machine learning and AI, cloud platforms, data governance and security, solution selling, stakeholder management, and strategic thinking—played a part in Bamidele becoming a Databricks Champion. To anyone wishing to follow a similar path, Bamidele recommends mastering the platform, attaining deep technical expertise, and focusing on real-world impact.

Looking Ahead

Bamidele looks forward to using Databricks to create innovative tools and solutions that drive success for our clients. He’s also excited about trends and Databricks innovations including multi-tab notebooks, Databricks Lakeflow, the new SQL interface, and SQL pipeline syntax.

Perficient + Databricks

Perficient is proud to be a trusted Databricks elite consulting partner with more than 130 certified consultants. We specialize in delivering tailored data engineering, analytics, and AI solutions that unlock value and drive business transformation.

Learn more about our Databricks partnership.

 

 

 

Unlocking Business Success with Databricks One

Business users don’t use notebooks. Full stop. And for that reason, most organizations don’t have business users accessing the Databricks UI. This has always been a fundamental flaw in Databricks’ push to democratize data and AI. This disconnect is almost enshrined in the medallion architecture: Bronze is for system accounts, data scientists with notebooks use the Silver layer, and Gold is for business users with reporting tools. This approach has been enough to take an organization part of the way towards self-service analytics. This approach is not working for GenAI, though. This was a major frustration with Genie Spaces. It was a tool made for business users but embedded in an IT interface. Databricks One is looking to change all that.

Using Databricks One

Databricks One is a unified platform experience that provides business users with a single point of entry into their data ecosystem. It removes technical complexity and offers a curated environment to interact with data, AI models, dashboards, and apps efficiently. Core features of Databricks One include:

  • AI/BI Dashboards: Users can view, explore, and drill into key KPIs and metrics without technical setup.
  • AI/BI Genie: A conversational AI interface allowing users to ask natural language questions like “Why did sales drop in April?” or “What are the top-performing regions?”
  • Custom Databricks Apps: Tailored applications that combine analytics, workflows, and AI models to meet specific business needs.
  • Content Browsing by Domain: Content is organized into relevant business areas such as “Customer 360” and “Marketing Campaign Performance,” fostering easy discovery and collaboration.

Administering Databricks One

Administrators can give users access to Databricks One via a consumer access entitlement. This is a basic, read-only entry point for business users that gives access to a simplified workspace that focuses on consuming dashboards, Genie spaces and Apps. Naturally, users will be working with Unity Catalog’s unified data access controls to maintain governance and security.

Conclusion

This is a very short blog because I try not to comment too early on pre-release features and Databricks One is scheduled for a beta release later this summer. This is more than just an incremental feature for a lot of our enterprise clients, though. I am looking at Databricks One as a fundamental architectural component for large enterprise implementations. I feel this is a huge step forward for practical data and intelligence democratization and I was just too excited to wait for more details.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

Understanding Clean Rooms: A Comparative Analysis Between Databricks and Snowflake

Clean rooms” have emerged as a pivotal data sharing innovation with both Databricks and Snowflake providing enterprise alternatives.

Clean rooms are secure environments designed to allow multiple parties to collaborate on data analysis without exposing sensitive details of data. They serve as a sandbox where participants can perform computations on shared datasets while keeping raw data isolated and secure. Clean rooms are especially beneficial in scenarios like cross-company research collaborations, ad measurement in marketing, and secure financial data exchanges.

Uses of Clean Rooms:

  • Data Privacy: Ensures that sensitive information is not revealed while still enabling data analysis.
  • Collaborative Analytics: Allows organizations to combine insights without sharing the actual data, which is vital in sectors like finance, healthcare, and advertising.
  • Regulatory Compliance: Assists in meeting stringent data protection norms such as GDPR and CCPA by maintaining data sovereignty.

Clean Rooms vs. Data Sharing

While clean rooms provide an environment for secure analysis, data sharing typically involves the actual exchange of data between parties. Here are the major differences:

  • Security:
    • Clean Rooms: Offer a higher level of security by allowing analysis without exposing raw data.
    • Data Sharing: Involves sharing of datasets, which requires robust encryption and access management to ensure security.
  • Control:
    • Clean Rooms: Data remains under the control of the originating party, and only aggregated results or specific analyses are shared.
    • Data Sharing: Data consumers can retain and further use shared datasets, often requiring complex agreements on usage.
  • Flexibility:
    • Clean Rooms: Provide flexibility in analytics without the need to copy or transfer data.
    • Data Sharing: Offers more direct access, but less flexibility in data privacy management.

High-Level Comparison: Databricks vs. Snowflake

Implementation

Databricks:

  1. Setup and Configuration:
    • Utilize existing Databricks workspace
    • Create a new Clean Room environment within the workspace
    • Configure Delta Lake tables for shared data
  2. Data Preparation:
    • Use Databricks’ data engineering capabilities to ETL and anonymize data
    • Leverage Delta Lake for ACID transactions and data versioning
  3. Access Control:
    • Implement fine-grained access controls using Unity Catalog
    • Set up row-level and column-level security
  4. Collaboration:
    • Share Databricks notebooks for collaborative analysis
    • Use MLflow for experiment tracking and model management
  5. Analysis:
    • Utilize Spark for distributed computing
    • Support for SQL, Python, R, and Scala in the same environment

Snowflake:

  1. Setup and Configuration:
    • Set up a separate Snowflake account for the Clean Room
    • Create shared databases and views
  2. Data Preparation:
    • Use Snowflake’s data engineering features or external tools for ETL
    • Load prepared data into Snowflake tables
  3. Access Control:
    • Implement Snowflake’s role-based access control
    • Use secure views and row access policies
  4. Collaboration:
  5. Analysis:
    • Primarily SQL-based analysis
    • Use Snowpark for more advanced analytics in Python or Java
Business and IT Overhead

Databricks:

  • Lower overhead if already using Databricks for other data tasks
  • Unified platform for data engineering, analytics, and ML
  • May require more specialized skills for advanced Spark operations

Snowflake:

  • Easier setup and management for pure SQL users
  • Less overhead for traditional data warehousing tasks
  • Might need additional tools for complex data preparation and ML workflows
Cost Considerations

Databricks:

  • More flexible pricing based on compute usage
  • Can optimize costs with proper cluster management
  • Potential for higher costs with intensive compute operations

Snowflake:

  • Predictable pricing with credit-based system
  • Separate storage and compute pricing
  • Costs can escalate quickly with heavy query usage
Security and Governance

Databricks:

  • Unity Catalog provides centralized governance across clouds
  • Native integration with Delta Lake for ACID compliance
  • Comprehensive audit logging and lineage tracking

Snowflake:

  • Strong built-in security features
  • Automated data encryption and key rotation
  • Detailed access history and query logging
Data Format and Flexibility

Databricks:

  • Supports various data types (structured, semi-structured, unstructured)
  • Supports various file formats (Parquet, Iceberg, CSV, JSON, images, etc.)
  • Better suited for large-scale data processing and transformations

Snowflake:

  • Optimized for structured and semi-structured data
  • Excellent performance for SQL queries on large datasets
  • May require additional effort for unstructured data handling
Advanced Analytics, AI and ML

Databricks:

  • Native support for advanced analytics and AI/ML workflows
  • Integrated with popular AI/ML libraries and MLflow
  • Easier to implement end-to-end AI/ML pipelines

Snowflake:

  • Requires additional tools or Snowpark for advanced analytics
  • Integration with external ML platforms needed for comprehensive ML workflows
  • Strengths lie more in data warehousing than in ML operations
Scalability

Databricks:

  • Auto-scaling of compute clusters and serverless compute options
  • Better suited for processing very large datasets and complex computations

Snowflake:

  • Automatic scaling and performance optimization
  • May face limitations with extremely complex analytical workloads

Use Case Example: Financial Services Research Collaboration

Consider a research department within a financial services firm that wants to collaborate with other institutions on developing market insights through data analytics. They face a challenge: sharing proprietary and sensitive financial data without compromising security or privacy. Here’s how utilizing a clean room can solve this:

Implementation in Databricks:

  • Integration: By setting up a clean room in Databricks, the research department can securely integrate its datasets with those of other institutions, allowing data insights to be shared with precise access controls.
  • Analysis: Researchers from various departments can perform joint analyses on combined datasets without ever directly accessing each other’s raw data.
  • Security and Compliance: Databricks’ security features such as encryption, audit logging, and RBAC will ensure that all collaborations comply with regulatory standards (a governance sketch follows this list).
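As a hedged sketch of the access-control piece referenced above, the statements below apply a Unity Catalog row filter and column mask to a shared table from a notebook. The catalog, schema, table, column, and group names are assumptions for illustration.

```python
# Row filter: partner analysts only see EMEA rows; internal users see everything.
spark.sql("""
  CREATE OR REPLACE FUNCTION research.governance.region_filter(region STRING)
  RETURN IF(is_account_group_member('partner_analysts'), region = 'EMEA', TRUE)
""")
spark.sql("""
  ALTER TABLE research.shared.market_trades
  SET ROW FILTER research.governance.region_filter ON (region)
""")

# Column mask: hash counterparty identifiers for anyone outside internal research.
spark.sql("""
  CREATE OR REPLACE FUNCTION research.governance.mask_counterparty(counterparty_id STRING)
  RETURN CASE WHEN is_account_group_member('internal_research')
              THEN counterparty_id ELSE sha2(counterparty_id, 256) END
""")
spark.sql("""
  ALTER TABLE research.shared.market_trades
  ALTER COLUMN counterparty_id SET MASK research.governance.mask_counterparty
""")
```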

Through this setup, the financial services firm’s research department can achieve meaningful collaboration and derive deeper insights from joint analyses, all while maintaining data privacy and adhering to compliance requirements.

By leveraging clean rooms, organizations in highly regulated industries can unlock new opportunities for innovation and data-driven decision-making without the risks associated with traditional data sharing methods.

Conclusion

Both Databricks and Snowflake offer robust solutions for implementing this financial research collaboration use case, but with different strengths and considerations.

Databricks excels in scenarios requiring advanced analytics, machine learning, and flexible data processing, making it well-suited for research departments with diverse analytical needs. It offers a more comprehensive platform for end-to-end data science workflows and is particularly advantageous for organizations already invested in the Databricks ecosystem.

Snowflake, on the other hand, shines in its simplicity and ease of use for traditional data warehousing and SQL-based analytics. Its strong data sharing capabilities and familiar SQL interface make it an attractive option for organizations primarily focused on structured data analysis and those with less complex machine learning requirements.

Regardless of the chosen platform, the implementation of Clean Rooms represents a significant step forward in enabling secure, compliant, and productive data collaboration in the financial sector. As data privacy regulations continue to evolve and the need for cross-institutional research grows, solutions like these will play an increasingly critical role in driving innovation while protecting sensitive information.

Perficient is both a Databricks Elite Partner and a Snowflake Premier Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

 

Transforming Your Data Strategy with Databricks Apps: A New Frontier

I’ve been coding in notebooks for so long, I forgot how much I missed a nice, deployed application. I also didn’t realize how this was limiting my solution space. Then I started working with Databricks Apps.

Databricks Apps are designed to extend the functionality of the Databricks platform, providing users with enriched features and capabilities tailored to specific data needs. These apps can significantly enhance the data processing and analysis experience, offering bespoke solutions to address complex business requirements.

Key Features of Databricks Apps

  1. Custom Solutions for Diverse Needs: Databricks Apps are built to cater to a wide range of use cases, from data transformation and orchestration to predictive analytics and AI-based insights. This versatility allows organizations to deploy applications that directly align with their specific business objectives.
  2. Seamless Integration: The apps integrate smoothly within the existing Databricks environment, maintaining the platform’s renowned ease of use and ensuring that deployment does not disrupt current data processes. This seamless integration is crucial for maintaining operational efficiency and minimizing transition challenges.
  3. Scalability and Flexibility: Databricks Apps are designed to scale with your organization’s needs, ensuring that as your data requirements grow, the solutions deployed through these apps can expand to meet those demands without compromising performance.
  4. Enhanced Collaboration: By leveraging apps that foster collaboration, teams can work more effectively across different departments, sharing insights and aligning strategic goals with more precision and cohesion.

Benefits for Architects

  1. Tailored Data Solutions: Databricks Apps enable architects to deploy tailored solutions that address their unique data challenges, ensuring that technical capabilities are closely aligned with strategic business goals.
  2. Accelerated Analytics Workflow: By using specialized apps, organizations can significantly speed up their data analytics workflows, leading to faster insights and more agile decision-making processes, essential in today’s fast-paced business environment.
  3. Cost Efficiency: The capability to integrate custom-built apps reduces the need for additional third-party tools, potentially lowering overall costs and simplifying vendor management.
  4. Future-Proofing Data Strategies: With the rapid evolution of technology, having access to a continuously expanding library of Databricks Apps helps organizations stay ahead of trends and adapt swiftly to new data opportunities and challenges.

Strategies for Effectively Leveraging Databricks Apps

To maximize the potential of Databricks Apps, CIOs and CDOs should consider the following approaches:

  • Identify Specific Use Cases: Before adopting new apps, identify the specific data operations and challenges your organization is facing. This targeted approach ensures that the apps you choose provide the most value.
  • Engage with App Developers: Collaborate with app developers who specialize in delivering comprehensive solutions tailored to your industry. Their expertise can enhance the implementation process and provide insights into best practices.
  • Promote Cross-Department Collaboration: Encourage departments across your organization to utilize these apps collaboratively. The synergistic use of advanced data solutions can drive more insightful analyses and foster a unified strategic direction.
  • Assess ROI Regularly: Continuously assess the return on investment from using Databricks Apps. This evaluation will help in determining their effectiveness and in making data-driven decisions regarding future app deployments.

Conclusion

Databricks Apps present a powerful opportunity for CIOs and CDOs to refine and advance their data strategies by offering tailored, scalable, and integrated solutions. By embracing these tools, organizations can transform their data-driven operations to gain a competitive edge in an increasingly complex business landscape.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

]]>
https://blogs.perficient.com/2025/06/24/transforming-data-strategy-databricks-apps/feed/ 1 383415
Exploring the Free Edition of Databricks: A Risk-Free Approach to Enterprise AI https://blogs.perficient.com/2025/06/24/explore-databricks-free-edition-risk-free-analytics/ https://blogs.perficient.com/2025/06/24/explore-databricks-free-edition-risk-free-analytics/#respond Tue, 24 Jun 2025 20:53:39 +0000 https://blogs.perficient.com/?p=383411

Databricks announced a full, free version of the platform at the Data and AI Summit. While the Free Edition is targeted to students and hobbyists, I also see opportunities where enterprise architects can effectively evangelize Databricks without going through Procurement for a license. Choosing the right platform to manage, analyze, and extract insights from massive datasets is crucial, especially with new and emerging GenAI use cases. We have seen many clients paralyzed by the combination of moving to a cloud database, comparing and contrasting the different offerings, and doing all of this analysis with only a very murky picture of what the new AI-driven future holds. The Community Edition has always been free, but it has not been feature-complete. With its new Free Edition, Databricks presents an exceptional opportunity for organizations to test its capabilities with no financial commitment or risk.

What is Databricks Free Edition?

The Free Edition of Databricks is designed to provide users with full access to Databricks’ core functionalities, allowing them to explore, experiment, and evaluate the platform’s potential without any initial investment. This edition is an excellent entry point for organizations looking to understand how Databricks can fit into their data strategy, providing a hands-on experience with the platform’s features.

Key Features of Databricks Free Edition

  1. Simplified Setup and Onboarding: The Free Edition offers a straightforward setup process. Users can easily create an account and start exploring Databricks’ environment in a matter of minutes. This ease of access is ideal for decision-makers who want to quickly assess Databricks’ capabilities.
  2. Complete Workspace Experience: Users of the Free Edition get access to a complete workspace, which includes all the necessary tools for data engineering, data science, and machine learning. This enables organizations to evaluate the entire data lifecycle on the Databricks platform.
  3. Scalability and Performance: While the Free Edition is designed for evaluation purposes, it still provides a glimpse into the scalability and performance efficiency that Databricks is known for. Organizations can run small-scale analytics and machine learning tests to gauge how the platform handles data processing and computation tasks.
  4. Community Support and Resources: Users can benefit from the extensive Databricks community, which offers support, tutorials, and shared resources. This can be particularly valuable for organizations exploring Databricks for the first time and wanting to leverage shared knowledge.
  5. No Time Constraints: Unlike typical trial versions, the Free Edition does not impose a time limit, allowing organizations to explore the platform at their own pace. This flexibility is essential for CIOs and CDOs who might need extended periods to evaluate the platform’s potential fully.

Benefits for CIOs and CDOs

  1. Risk-Free Evaluation: The primary advantage of the Free Edition is the risk-free nature of the exploration. CIOs and CDOs can test the platform’s capabilities without signing contracts or making financial commitments, aligning with their careful budget management strategies.
  2. Strategic Insights for Data Strategy: By exploring Databricks firsthand, decision-makers can gain strategic insights into how the platform integrates with existing systems and processes. This understanding is crucial when considering a transition to a new data analytics platform.
  3. Hands-On Experience: Direct interaction with Databricks helps bridge the gap between executive strategy and technical implementation. By experiencing the platform themselves, developers and architects can better champion its adoption across the organization.
  4. Pre-Deployment Testing: The Free Edition enables organizations to test specific use cases and data workflows, helping identify any challenges or concerns before full deployment. This pre-deployment testing ensures that any transition to Databricks is smooth and well-informed.
  5. Benchmarking Against Other Solutions: As organizations evaluate various data platforms, the Free Edition allows Databricks to be benchmarked against other solutions in the market. This comparison can be crucial in making informed decisions that align with long-term strategic goals.

Maximizing the Use of Databricks Free Edition

To maximize the benefits of Databricks Free Edition, CIOs and CDOs should consider the following strategies:

  • Define Use Cases: Before diving into the platform, define specific use cases you want to test. This could include data processing efficiency, machine learning model training, or real-time analytics capabilities. Clear objectives will provide focus and measurable outcomes; a minimal smoke-test sketch follows this list.
  • Leverage Community Resources: Engage with the Databricks community to explore case studies, tutorials, and shared solutions that can offer fresh perspectives and innovative ideas.
  • Collaborate with Data Teams: Involve your data engineering and science teams early in the evaluation process. Their input and expertise will be invaluable in testing and providing feedback on the platform’s performance.
  • Evaluate Integration Points: During your exploration, assess how well Databricks integrates with existing systems and cloud services within your organization. Seamless integration is vital for minimizing disruption and maximizing workflow efficiency.
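As one hypothetical example of the first point, a "data processing efficiency" use case can be reduced to a timed aggregation over synthetic data in a Free Edition notebook. Everything below (row count, column names) is an arbitrary placeholder rather than a benchmark recommendation.

```python
# A minimal, hypothetical smoke test for a "data processing efficiency" use case.
# Run in a Databricks notebook; row count and column names are placeholders.
import time

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Databricks notebook

# Generate a modest synthetic dataset appropriate for Free Edition compute.
df = spark.range(0, 5_000_000).withColumn("segment", F.col("id") % 100)

start = time.time()
agg = (
    df.groupBy("segment")
      .agg(F.count("*").alias("rows"), F.avg("id").alias("avg_id"))
      .collect()
)
print(f"Aggregated {len(agg)} segments in {time.time() - start:.1f}s")
```

Pairing each defined use case with a simple number like this keeps the evaluation tied to measurable outcomes rather than impressions.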

Conclusion

The Databricks Free Edition is an invaluable opportunity for CIOs and CDOs to explore the transformative potential of big data analytics on a leading platform without any associated risks.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

]]>
https://blogs.perficient.com/2025/06/24/explore-databricks-free-edition-risk-free-analytics/feed/ 0 383411
Exploring Lakebase: Databricks’ Next-Gen AI-Native OLTP Database https://blogs.perficient.com/2025/06/22/introduction-to-databricks-lakebase-for-ai-driven-applications/ https://blogs.perficient.com/2025/06/22/introduction-to-databricks-lakebase-for-ai-driven-applications/#respond Mon, 23 Jun 2025 01:02:29 +0000 https://blogs.perficient.com/?p=383327

Lakebase is Databricks’ OLTP database and the latest addition to its ML/AI offering. Databricks has steadily added components to support its AI platform, including on the data side. The Feature Store has been available for some time as a governed, centralized repository that manages machine learning features throughout their lifecycle. Mosaic AI Vector Search is a vector index optimized for storing and retrieving embeddings, particularly for similarity searches and RAG use cases.

What’s Old is New Again

AI’s need for data demands that transactional and analytical workflows no longer be viewed as separate entities. Traditional OLTP databases were never designed for the speed and flexibility that today’s AI applications require. They often sit outside analytics frameworks, creating bottlenecks and requiring manual data integrations. Notably, databases are now being spun up by AI agents rather than human operators. The robust query response times of a transactional database now need to be matched by equally robust administrative response times for operations such as provisioning and scaling.

Lakebase addresses these challenges by revolutionizing OLTP database architecture. Its core attributes—separation of storage and compute, openness, and serverless architecture—make it a powerful tool for modern developers and data engineers.

Key Features of Lakebase

1. Openness:

Built on the open-source Postgres framework, Lakebase ensures compatibility and avoids vendor lock-in. The open ecosystem promotes innovation and provides a versatile foundation for building sophisticated data applications.

2. Separation of Storage and Compute:

Lakebase allows independent scaling of storage and computation, reducing costs and improving efficiency. Data is stored in open formats within data lakes, offering flexibility and eliminating proprietary data lock-in.

3. Serverless Architecture:

Lakebase is designed for elasticity. It scales up or down automatically, even to zero, ensuring you’re only paying for what you use, making it a cost-effective solution.

4. Integrated with AI and the Lakehouse:

Tight integration with the Lakehouse platform means there is no need for complex ETL pipelines. Operational and analytical data flows are synchronized in real time, providing a seamless experience for deploying AI and machine learning models.

5. AI-Ready:

The database design caters specifically to AI agents, supporting operations by large AI and agent teams through branching and checkpoint capabilities. This makes development, experimentation, and deployment faster and more reliable.

Use Cases and Benefits

1. Real-Time Applications:

From e-commerce systems managing inventory while providing instant recommendations, to financial services executing automated trades, Lakebase supports low-latency operations critical for real-time decision-making.

2. AI and Machine Learning:

With built-in AI and machine learning capabilities, Lakebase supports feature engineering and real-time model serving, thus accelerating AI project deployments.

3. Industry Applications:

Different sectors like healthcare, retail, and manufacturing can leverage Lakebase’s seamless data integration to enhance workflows, improve customer relations, and automate processes based on real-time insights.

Getting Started with Lakebase

Setting up Lakebase on Databricks is a straightforward process. With a few clicks, users can provision PostgreSQL-compatible instances and begin exploring powerful data solutions. Key setup steps include enabling Lakebase in the Admin Console, configuring database instances, and utilizing the Lakebase dashboard for management.
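Because the instances are Postgres-compatible, connecting once one is provisioned looks like connecting to any other Postgres database. The sketch below uses psycopg2; the host, database name, credentials, and table are placeholders you would replace with your instance’s connection details.

```python
# Minimal sketch: connect to a provisioned Lakebase (Postgres-compatible)
# instance with a standard driver and run a simple OLTP-style workload.
# Host, credentials, database name, and table are illustrative placeholders.
import psycopg2

conn = psycopg2.connect(
    host="<your-lakebase-instance-host>",
    port=5432,
    dbname="<your-database>",
    user="<your-user>",
    password="<your-token-or-password>",
    sslmode="require",
)

with conn, conn.cursor() as cur:
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS orders (
            order_id SERIAL PRIMARY KEY,
            customer_id INT NOT NULL,
            status TEXT NOT NULL DEFAULT 'NEW'
        )
        """
    )
    cur.execute(
        "INSERT INTO orders (customer_id) VALUES (%s) RETURNING order_id",
        (42,),
    )
    print("created order", cur.fetchone()[0])

conn.close()
```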

Conclusion

Lakebase is not just a database; it’s a paradigm shift for OLTP systems in the age of AI. By integrating seamless data flow, offering flexible scaling, and supporting advanced AI capabilities, Lakebase empowers organizations to rethink and innovate their data architecture. Now is the perfect moment to explore Lakebase, unlocking new possibilities for intelligent and real-time data applications.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

]]>
https://blogs.perficient.com/2025/06/22/introduction-to-databricks-lakebase-for-ai-driven-applications/feed/ 0 383327
Lakeflow: Revolutionizing SCD2 Pipelines with Change Data Capture (CDC) https://blogs.perficient.com/2025/06/21/lakeflow-revolutionizing-scd2-pipelines-with-change-data-capture-cdc/ https://blogs.perficient.com/2025/06/21/lakeflow-revolutionizing-scd2-pipelines-with-change-data-capture-cdc/#respond Sun, 22 Jun 2025 00:56:47 +0000 https://blogs.perficient.com/?p=383315

Several breakthrough announcements emerged at DAIS 2025, but the Lakeflow updates around building robust pipelines had the most immediate impact on my current code. Specifically, I can now see a clear path to persisting SCD2 (Slowly Changing Dimension Type 2) tables in the silver layer from mutable data sources. If this sentence resonates with you, we share a common challenge. If not, it soon will.

Maintaining history through Change Data Capture is critical for both AI and foundational use cases like Single View of the Customer. However, the current state of Delta Live Tables (DLT) pipelines only allows streaming tables to maintain SCD2 logic, while most data sources permit updates. Let’s dive into the technical challenges and how Lakeflow Connect is solving them.

Slowly Changing Dimensions

There are two options for managing changes: SCD1 and SCD2.

  1. SCD Type 1 is focused on keeping only the latest data. This approach involves overwriting old data with new data whenever a change occurs. No history of changes is kept, and only the latest version of the data is available. This is useful when the history of changes isn’t important, such as correcting errors or updating non-critical fields like customer email addresses or maintaining lookup tables.
  2. SCD Type 2 keeps the historical versions of data. This approach maintains a historical record of data changes by creating additional records to capture different versions of the data over time. Each version of the data is timestamped or tagged with metadata that allows users to trace when a change occurred. This is useful when it’s important to track the evolution of data, such as tracking customer address changes over time for analysis purposes.

While basic operational reporting can get by with SCD1, almost any analytic approach will benefit from history. ML models suffer from a lack of data, and AI is more likely to hallucinate without historical context. Let’s look at a simple example.

Monday Morning Dataset:

id | name | state
 1 | John | NY
 2 | Jane | CA
 3 | Juan | PA

Tuesday Update: John moves from New York to New Jersey.

id | name | state
 1 | John | NJ
 2 | Jane | CA
 3 | Juan | PA
  • SCD1 Result: Overwrites John’s state, leaving only three records.
  • SCD2 Result: Retains John’s NY record and adds a new NJ record, resulting in four records.

The important thing to understand here is that having John’s full lifecycle is almost certainly valuable from an analytical perspective. The small additional storage cost is negligible compared to the opportunity lost by simply overwriting the data. As a general rule, I like to keep SCD2 tables in the silver layer of the medallion architecture. However, DLT pipelines have had some issues with this scenario.

Challenges with the APPLY CHANGES API

In the current state, SCD updates are managed through the APPLY CHANGES API. This API is more effective than Spark’s MERGE INTO statement. MERGE INTO is relatively straightforward until you start to factor in edge cases. For example, what if there are several updates to the same key in the same microbatch? What if the changes arrive out of order? How do you handle DELETEs? Worse, how do you handle out-of-order DELETEs? APPLY CHANGES handles these edge cases for you, but it only works with append-only data.
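To ground the discussion, here is a minimal sketch of the APPLY CHANGES Python API producing an SCD2 target from an append-only source, which is the scenario it handles well today. The table names and the operation and sequence_num columns are illustrative assumptions, and the code only runs inside a DLT pipeline, where spark and dlt are provided.

```python
# Minimal sketch of the current APPLY CHANGES Python API in a DLT pipeline.
# Table and column names are illustrative; `spark` and `dlt` are provided by
# the DLT runtime, so this is not runnable as a standalone script.
import dlt
from pyspark.sql.functions import col, expr

@dlt.view
def customers_cdc_bronze():
    # Append-only change feed already landed in bronze.
    return spark.readStream.table("bronze.customers_cdc")

# Target streaming table that will hold SCD2 history in silver.
dlt.create_streaming_table("customers_silver_scd2")

dlt.apply_changes(
    target="customers_silver_scd2",
    source="customers_cdc_bronze",
    keys=["id"],                                    # business key
    sequence_by=col("sequence_num"),                # deterministic ordering
    apply_as_deletes=expr("operation = 'DELETE'"),  # handles (out-of-order) deletes
    except_column_list=["operation", "sequence_num"],
    stored_as_scd_type=2,                           # keep full history
)
```

The sequence_by column is what lets the engine resolve the same-microbatch and out-of-order edge cases deterministically.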

In its current state, a DLT pipeline creates a Directed Acyclic Graph (DAG) for all the tables and views in the pipeline using the metadata of those resources. Only the metadata. In many pipelines, the data from the source RDBMS has already been ingested into bronze and is refreshed daily. Let’s look at our sample dataset. On Monday, I run the DLT pipeline. While the pipeline is aware of the table’s metadata, it does not have access to its contents. Imagine a MERGE statement where no current records exist: everything is an insert. Now imagine processing the next day’s data. Again, since only the metadata is loaded into the DAG, APPLY CHANGES has no prior record of John. Effectively, only SCD1 tables can be created from mutable data sources, because the data itself is not available at that point.

The new Lakeflow process provides a mechanism where CDC can be used with the Lakeflow Connector to drive SCD2 semantics even with mutable data.

What is Change Data Capture (CDC)?

Change Data Capture (CDC) is a data integration pattern that captures changes in a source system, such as inserts, updates, and deletes, through a CDC feed. The CDC feed stores a list of changes rather than the whole dataset, which is a performance opportunity. Most transactional databases, like SQL Server, Oracle, and MySQL, can generate CDC feeds automatically. When a row in the source table is updated, new rows are written to the CDC feed containing only the changes, plus some metadata such as the operation type (UPDATE or DELETE) and a column that can be used to deterministically order changes, like a sequence number. There is also an update to APPLY CHANGES, called AUTO CDC INTO, covered in the next section.
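Before looking at that API, here is what Tuesday’s CDC feed for the John example might look like, carrying only the changed row plus that metadata. The operation and sequence_num column names are illustrative, since real feeds vary by source database.

```python
# Illustrative only: Tuesday's CDC feed rows for the John example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cdc_feed = spark.createDataFrame(
    [(1, "John", "NJ", "UPDATE", 2)],  # only the changed row, not the whole table
    "id INT, name STRING, state STRING, operation STRING, sequence_num INT",
)
cdc_feed.show()
```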

AUTO CDC INTO

There are actually two APIs: AUTO CDC and AUTO CDC FROM SNAPSHOT. They have the same syntax as APPLY CHANGES, but they can now handle more use cases correctly. You may have already guessed that AUTO CDC FROM SNAPSHOT shares its method signature with APPLY CHANGES FROM SNAPSHOT. Unlike its predecessor, however, the AUTO CDC API supports periodic ingestion of snapshots with each pipeline update. Because the data itself, and not just the metadata, is made available to the call, there is enough information to correctly populate the SCD2 dataset.
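As a sketch, and assuming the Python entry point mirrors apply_changes as the shared syntax suggests, migrating the earlier example might look like the following. The function name create_auto_cdc_flow and all table and column names are assumptions to verify against the current Lakeflow documentation.

```python
# Hypothetical sketch of the AUTO CDC counterpart to the earlier
# apply_changes() example; verify names against the current Lakeflow docs.
import dlt
from pyspark.sql.functions import col, expr

dlt.create_streaming_table("customers_silver_scd2")

dlt.create_auto_cdc_flow(                       # assumed Python name for AUTO CDC INTO
    target="customers_silver_scd2",
    source="customers_cdc_bronze",              # e.g., a CDC feed from Lakeflow Connect
    keys=["id"],
    sequence_by=col("sequence_num"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=2,
)
```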

Conclusion

Lakeflow Connect is a game-changer for data engineers, enabling SCD2 tables in the silver layer even with mutable data sources. By leveraging CDC and the new AUTO CDC INTO API, you can maintain historical data accurately, ensuring your AI and ML models have the context they need to perform optimally.

The future of data engineering is here, and it’s built on Lakeflow Connect.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

 

]]>
https://blogs.perficient.com/2025/06/21/lakeflow-revolutionizing-scd2-pipelines-with-change-data-capture-cdc/feed/ 0 383315