David Callaghan, Author at Perficient Blogs

IngestIQ – Hadoop to Databricks AI-powered Migration

Organizations are migrating from their on-premises, legacy Hadoop data lakes to a more modern data architecture to take advantage of AI and fulfill the long-awaited promise of unlocking business value from semi-structured and unstructured data. Databricks tends to be the modern platform of choice for Hadoop migrations due to core architectural similarities. Apache Spark has its roots in Hadoop, and its developers founded Databricks. There is a pretty good chance you are already using Parquet as your file format in HDFS. The two even share the Hive Metastore for data abstraction and discovery.

Teams tasked with migrating from their legacy Hadoop platforms to Databricks face unique and unexpected challenges, since Hadoop is a platform, not just a database. In fact, approaching this as a database migration hides most of the technical challenges and can lead to a fundamental misunderstanding of the scope of the project. This is particularly true when you consider Hive only as a lift-and-shift to Databricks. In many cases, it makes more sense to focus on the data movement rather than the data storage. Imagine an Oozie-first approach to a Hadoop migration.

Change your mindset from a data platform migration to a business process modernization, and read on.

Introducing IngestIQ

IngestIQ leverages cutting-edge AI models available in Databricks to ingest and translate a variety of workflows across the Hadoop ecosystem into an innovative Intermediate Domain Specific Language (iDSL). This AI-centric transformation yields a business-first perspective on data workflows, uncovering the underlying business intents and dataset value. With AI at its core, IngestIQ empowers a human-on-the-loop (HOTL) model to make precise, informed decisions that prioritize modernization and high-impact migratory strategies.

  • Traditional tools like Oozie, Airflow, and NiFi often encode complex operational logic rather than business rules, obscuring the true business value. By utilizing AI-driven insights, IngestIQ transforms these workflows into an iDSL that highlights business relevance, enabling stakeholders to make strategic, value-driven decisions. AI enhances the HOTL’s ability to discern critical, redundant, or obsolete jobs, focusing efforts on strategically significant modernization. This prioritization prevents misallocation of resources towards low-impact migrations, optimizing computational and storage costs while emphasizing data security, compliance, and business-critical areas.

Why this matters

  • Oozie deployments often encode operational logic, not business intent. Translating to an iDSL makes intent explicit, enabling business owners to triage what matters.
  • Human review reduces risk of incorrectly migrating jobs that are no longer needed or that embed obsolete business rules.
  • Column-level prioritization prevents over-migration of low-value data and focuses security, lineage, and Unity Catalog efforts where business impact is highest.
  • Provides auditable, repeatable decisioning and a clear path from discovery to production cutover in Databricks.

IngestIQ’s AI-Driven Capabilities

  1. Comprehensive Ingestion & AI-Powered Analysis:
    • AI algorithms process diverse inputs from Oozie XML workflows, Apache Airflow DAGs, and Apache NiFi flows. Both static analyses and AI-enhanced runtime assessments map job dependencies, execution metrics, and data lineage (a minimal parsing sketch for the Oozie case follows this list).
  2. Business-First AI Representation with iDSL:
    • The iDSL leverages AI to generate concise, business-centric representations of data workflows. This AI-driven translation surfaces transformation intents and dataset significance clearly, ensuring decisions align closely with strategic goals.
  3. AI-Based Triage & Workflow Optimization:
    • IngestIQ uses AI and machine learning classifiers to intelligently identify and optimize redundant, outdated, or misaligned workflows, supported by AI-derived evidence and confidence metrics.
  4. AI-Enhanced HOTL Interface:
    • Equipped with AI-powered dashboards and predictive analytics, the HOTL interface enables stakeholders to navigate prioritized actions efficiently.
  5. Data-Driven Business-Priority AI Ranking:
    • A sophisticated AI model evaluates workflows across multiple criteria—business criticality, usage patterns, technical debt, cost, and compliance pressures. This advanced AI prioritization focuses on the most impactful areas first.
  6. Automated AI Workflow Generation:
    • From AI-optimized iDSL inputs, IngestIQ automates the generation of Spark templates, migration scripts, and compliance documents that seamlessly integrate into CI/CD pipelines for robust, secure implementation.
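
To ground the static-analysis piece of capability 1, here is a minimal sketch, not IngestIQ's actual parser, that pulls action names, action types, and success/error transitions out of an Oozie workflow.xml using only the Python standard library; the file path is an assumption.

import xml.etree.ElementTree as ET

def parse_oozie_workflow(path: str) -> dict:
    """Extract action names, types, and transitions from an Oozie workflow.xml."""
    root = ET.parse(path).getroot()
    local = lambda tag: tag.split("}")[-1]  # drop the XML namespace prefix
    jobs = {}
    for node in root:
        if local(node.tag) != "action":
            continue
        action_type, ok_to, error_to = None, None, None
        for child in node:
            name = local(child.tag)
            if name == "ok":
                ok_to = child.get("to")
            elif name == "error":
                error_to = child.get("to")
            else:
                action_type = name  # e.g. hive, spark, shell, fs
        jobs[node.get("name")] = {"type": action_type, "ok": ok_to, "error": error_to}
    return jobs

# Example: actions = parse_oozie_workflow("workflow.xml")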

Example flow (end-to-end)

  1. Ingest Oozie metadata and execution logs => parse into ASTs and runtime profiles.
  2. Generate iDSL artifacts representing jobs and transforms, store in Git (a hypothetical artifact is sketched after this list).
  3. Run triage models and rules => produce candidate list with evidence and priority scores.
  4. HOTL reviews, annotates, and approves actions via UI; approvals create commits.
  5. Approved artifacts trigger code & migration artifact generation (Spark templates, Delta migration scripts, Unity Catalog manifests).
  6. CI pipeline runs tests (unit, differential), security checks, and human approval gates.
  7. Deploy to Databricks staging; run parallel validation with Hadoop outputs; upon pass, cutover per schedule.
  8. Capture telemetry to refine triage models and priority weighting.
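
Because the iDSL itself is not published, the artifact below is purely hypothetical; every field name is an assumption chosen to illustrate the business-first framing that step 2 produces.

# Hypothetical iDSL artifact for one migrated job (illustrative schema only).
idsl_artifact = {
    "job_id": "daily_claims_load",
    "source_system": "oozie",
    "business_intent": "Refresh the claims fact table used by finance reporting",
    "inputs": ["hdfs:///warehouse/claims_raw"],
    "outputs": ["main.finance.claims_fact"],
    "transforms": [
        {"op": "filter", "expr": "claim_status = 'FINAL'"},
        {"op": "scd2_merge", "keys": ["claim_id"]},
    ],
    "triage": {"business_criticality": "high", "confidence": 0.82, "recommendation": "migrate"},
}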

Conclusion

The IngestIQ Accelerator provides a pragmatic, auditable bridge between legacy Hadoop operational workflows and business-led modernization. By making intent explicit and placing a human-on-the-loop for final decisions, organizations get the speed and repeatability of automated translation without sacrificing governance or business risk management. Column-level prioritization ensures effort and controls focus on data that matters most—reducing cost, improving security posture, and accelerating value realization on Databricks.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

Base Is Loaded: Bridging OLTP and OLAP with Lakebase and PySpark

For years, the Lakehouse paradigm has successfully collapsed the wall between data warehouses and data lakes. We have unified streaming and batch, structured and unstructured data, all under one roof. Yet we often find ourselves hitting a familiar, frustrating wall: the gap between the analytical plane (OLAP) and the transactional plane (OLTP). In my latest project, the client wanted Databricks to serve both as an analytics platform and as the backend powering their front-end React web app. There is a sample Databricks App that uses NodeJS for the front end and FastAPI for a Python backend that connects to Lakebase; the sample ToDo app performs CRUD operations out of the box. I opened a new Databricks Query object, connected to the Lakebase compute, and verified the data. It’s hard to overstate how cool this seemed.

The next logical step was to build a declarative pipeline that would flow the data Lakebase received from the POST, PUT and GET requests through the Bronze layer, for data quality checks, into Silver for SCD2-style history and then into Gold where it would be available to end users through AI/BI Genie and PowerBI reports as well as being the source for a sync table back to Lakebase to serve GET statements. I created a new declarative pipeline in a source-controlled asset bundle and started building. Then I stopped building. That’s not supported. You actually need to communicate with Lakebase from a notebook using the SDK. A newer SDK than Serverless provides, no less.

A couple of caveats. At the time of this writing, I’m using Azure Databricks, so I only have access to Lakebase Provisioned and not Lakebase Autoscaling. And it’s still in Public Preview; maybe GA is different. Or, not. Regardless, I have to solve the problem on my desk today, and simply having the database isn’t enough. We need a robust way to interact with it programmatically from our notebooks and pipelines.

In this post, I want to walk through a Python architectural pattern I’ve developed—BaseIsLoaded. This includes pipeline configurations, usage patterns, and a PySpark class: LakebaseClient. This class serves two critical functions: it acts as a CRUD wrapper for notebook-based application logic, and, more importantly, it functions as a bridge to turn a standard Postgres table into a streaming source for declarative pipelines.

The Connectivity Challenge: Identity-Native Auth

The first hurdle in any database integration is authentication. In the enterprise, we are moving away from hardcoded credentials and .pgpass files. We want identity-native authentication. The LakebaseClient handles this by leveraging the databricks.sdk. Instead of managing static secrets, the class generates short-lived tokens on the fly.

Look at the _ensure_connection_info method in the provided code snippet:

def _ensure_connection_info(self, spark: SparkSession, value: Any):
    # Populate ``self._conn_info`` with the Lakebase endpoint and a temporary token
    if self._conn_info is None:
        w = WorkspaceClient()  # from databricks.sdk
        instance_name = "my_lakebase"  # Example instance
        instance = w.database.get_database_instance(name=instance_name)
        cred = w.database.generate_database_credential(
            request_id=str(uuid.uuid4()), instance_names=[instance_name]
        )
        self._conn_info = {
            "host": instance.read_write_dns,
            "dbname": "databricks_postgres",
            "password": cred.token,  # Ephemeral token
            # ...
        }

This encapsulates the complexity of finding the endpoint and authenticating and allows us to enforce a “zero-trust” model within our code. The notebook or job running this code inherits the permissions of the service principal or user executing it, requesting a token valid only for that session.

Operationalizing DDL: Notebooks as Migration Scripts

One of the strongest use cases for Lakebase is managing application state or configuration for data products. However, managing the schema of a Postgres database usually requires an external migration tool (like Flyway or Alembic).

To keep the development lifecycle contained within Databricks, I extended the class to handle safe DDL execution. The class includes methods like create_table, alter_table_add_column, and create_index.

These methods use psycopg2.sql to handle identifier quoting safely. In a multi-tenant environment where table names might be dynamically generated based on business units or environments, either by human or agentic developers, SQL injection via table names is a real risk.

def create_table(self, schema: str, table: str, columns: List[str]):
    ddl = psql.SQL("CREATE TABLE IF NOT EXISTS {}.{} ( {} )").format(
        psql.Identifier(schema),
        psql.Identifier(table),
        psql.SQL(", ").join(psql.SQL(col) for col in columns)
    )
    self.execute_ddl(ddl.as_string(self._get_connection()))

This allows a Databricks Notebook to serve as an idempotent deployment script. You can define your schema in code and execute it as part of a “Setup” task in a Databricks Workflow, ensuring the OLTP layer exists before the ETL pipeline attempts to read from or write to it.
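
As a sketch of that setup-task pattern, a notebook might call the DDL helper directly; the constructor arguments mirror the ones used later in this post, and the column definitions are illustrative assumptions rather than a required schema.

# Illustrative "Setup" task: idempotent because create_table uses CREATE TABLE IF NOT EXISTS.
client = LakebaseClient(
    table_name="public.sales_targets",
    checkpoint_column="updated_at",
    checkpoint_store="system.control_plane.ingestion_offsets",
)
client.create_table(
    schema="public",
    table="sales_targets",
    columns=[
        "target_id BIGSERIAL PRIMARY KEY",
        "region TEXT NOT NULL",
        "target_amount NUMERIC(18,2)",
        "updated_at TIMESTAMPTZ DEFAULT now()",
    ],
)

Because the DDL is guarded by IF NOT EXISTS, rerunning the setup task is a no-op, which is exactly what makes it safe as the first step of a workflow.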

The Core Innovation: Turning Postgres into a Micro-Batch Stream

The most significant value of this architecture is the load_new_data method.

Standard JDBC connections in Spark are designed for throughput, not politeness. They default to reading the entire table or, if you attempt to parallelize reads via partitioning, they spawn multiple executors that can quickly exhaust the connection limit of Lakebase. By contrast, LakebaseClient runs intentionally on the driver using a single connection.

This solves a common dilemma we run into with our enterprise clients: if you have a transactional table (e.g., an orders table or a pipeline_audit log) in Lakebase and want to ingest it into Delta Lake incrementally, you usually have to introduce Kafka, Debezium, or complex CDC tools. If you have worked for a large, regulated company, you can appreciate the value of not asking for things.

Instead, LakebaseClient implements a lightweight “Client-Side CDC” pattern. It relies on a monotonic column (a checkpoint_column, such as an auto-incrementing ID or a modification_timestamp) to fetch only what has changed since the last run.

1. State Management with Delta

The challenge with custom polling logic is: where do you store the offset? If the cluster restarts, how does the reader know where it left off?

I solved this by using Delta Lake itself as the state store for the Postgres reader. The _persist_checkpoint and _load_persisted_checkpoint methods use a small Delta table to track the last_checkpoint for every source.

def _persist_checkpoint(self, spark: SparkSession, value: Any):
    # ... logic to create table if not exists ...
    # Upsert (merge) last checkpoint into a Delta table
    spark.sql(f"""
        MERGE INTO {self.checkpoint_store} t
        USING _cp_upsert_ s
        ON t.source_id = s.source_id
        WHEN MATCHED THEN UPDATE SET t.last_checkpoint = s.last_checkpoint
        WHEN NOT MATCHED THEN INSERT ...
    """)

This creates a robust cycle: The pipeline reads from Lakebase, processes the data, and commits the offset to Delta. This ensures exactly-once processing semantics (conceptually) for your custom ingestion logic.

2. The Micro-Batch Logic

The load_new_data method brings it all together. It creates a psycopg2 cursor, queries only the rows where checkpoint_col > last_checkpoint, limits the fetch size (to prevent OOM errors on the driver), and converts the result into a Spark DataFrame.

    if self.last_checkpoint is not None:
        query = psql.SQL(
            "SELECT * FROM {} WHERE {} > %s ORDER BY {} ASC{}"
        ).format(...)
        params = (self.last_checkpoint,)

By enforcing an ORDER BY on the monotonic column, we ensure that if we crash mid-batch, we simply resume from the last successfully processed ID.
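
For completeness, here is a condensed sketch of how the rest of load_new_data might assemble the batch. It assumes the query and connection handling shown above, an active SparkSession named spark, and the helpers named elsewhere in this post (_persist_checkpoint, _postgres_type_to_spark); fetch_limit is a hypothetical attribute used to bound driver memory, and the real method may differ in detail.

from pyspark.sql.types import StructField, StructType

cur = self._get_connection().cursor()
cur.execute(query, params)
rows = cur.fetchmany(self.fetch_limit)  # cap the batch to protect the driver
schema = StructType([
    StructField(col.name, self._postgres_type_to_spark(col.type_code), True)
    for col in cur.description
])
df = spark.createDataFrame(rows, schema=schema)  # empty rows yield an empty DataFrame
if not df.isEmpty():
    new_ckpt = df.agg({self.checkpoint_column: "max"}).collect()[0][0]
    self._persist_checkpoint(spark, new_ckpt)
    self.last_checkpoint = new_ckpt
return df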

Integration with Declarative Pipelines

So, how do we use this in a real-world enterprise scenario?

Imagine you have a “Control Plane” app running on a low-cost cluster that allows business users to update “Sales Targets” via a Streamlit app (backed by Lakebase). You want these targets to immediately impact your “Sales Reporting” Delta Live Table (DLT) pipeline.

Instead of a full refresh of the sales_targets table every hour, you can run a continuous or scheduled job using LakebaseClient.

The Workflow:

  1. Instantiation:
    lb_source = LakebaseClient(
        table_name="public.sales_targets",
        checkpoint_column="updated_at",
        checkpoint_store="system.control_plane.ingestion_offsets"
    )
    
  2. Ingestion Loop: You can wrap load_new_data in a simple loop or a scheduled task.
    # Fetch micro-batch
    df_new_targets = lb_source.load_new_data()
    
    if not df_new_targets.isEmpty():
        # Append to Bronze Delta Table
        df_new_targets.write.format("delta").mode("append").saveAsTable("bronze.sales_targets")
    
  3. Downstream DLT: Your main ETL pipeline simply reads from bronze.sales_targets as a standard streaming source. The LakebaseClient acts as the connector, effectively “streaming” changes from the OLTP layer into the Bronze layer.
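
A minimal downstream declarative-pipeline sketch for step 3 might look like the following; the table names and the data-quality expectation are illustrative assumptions.

import dlt

@dlt.table(name="silver_sales_targets", comment="Streaming copy of business-entered targets")
@dlt.expect_or_drop("positive_target", "target_amount > 0")  # illustrative quality rule
def silver_sales_targets():
    # Reads the Bronze table appended by the ingestion loop above as a stream.
    return spark.readStream.table("bronze.sales_targets")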

Architectural Considerations and Limitations

While this class provides a powerful bridge, as architects, we must recognize the boundaries.

  1. It is not a Debezium Replacement: This approach relies on “Query-based CDC.” It cannot capture hard deletes (unless you use soft-delete flags), and it relies on the checkpoint_column being strictly monotonic. If your application inserts data with past timestamps, this reader will miss them. My first use case was pretty simple: just a single API client performing CRUD operations. For true transaction log mining, you still need logical replication slots (which Lakebase supports, but requires a more complex setup).
  2. Schema Inference: The _postgres_type_to_spark method in the code provides a conservative mapping. Postgres has rich types (like JSONB, HSTORE, and custom enums), and this class defaults unknown types to StringType; a simplified sketch of such a mapping follows this list. This is intentional design: it shifts the schema validation burden to the Bronze-to-Silver transformation in Delta, preventing the ingestion job from failing due to exotic Postgres types. I can see adding support for JSONB before this project is over, though.
  3. Throughput: This runs on the driver or a single executor node (depending on how you parallelize calls). It is designed for “Control Plane” data—thousands of rows per minute, not millions of rows per second. Do not use this to replicate a high-volume trading ledger; use standard ingestion tools for that.
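
Here is the simplified stand-in promised above; the real _postgres_type_to_spark may differ in signature and coverage (for example, it might key off psycopg2 type codes rather than type names), and the mapping keys are assumptions.

from pyspark.sql.types import (
    BooleanType, DateType, DoubleType, LongType, StringType, TimestampType,
)

_PG_TO_SPARK = {
    "boolean": BooleanType(),
    "bigint": LongType(),
    "integer": LongType(),
    "double precision": DoubleType(),
    "numeric": DoubleType(),  # lossy but safe for Bronze; tighten in Silver
    "date": DateType(),
    "timestamp with time zone": TimestampType(),
}

def postgres_type_to_spark(pg_type_name: str):
    # Unknown or exotic types (JSONB, HSTORE, enums) fall back to strings so the
    # ingestion job never fails; Bronze-to-Silver handles stricter validation.
    return _PG_TO_SPARK.get(pg_type_name.lower(), StringType())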

Conclusion

Lakebase fills the critical OLTP void in the Databricks ecosystem. However, a database is isolated until it is integrated. The BaseIsLoaded pattern demonstrated here offers a lightweight, Pythonic way to knit this transactional layer into your analytical backbone.

By abstracting authentication, safely handling DDL, and implementing stateful micro-batching via Delta-backed checkpoints, we can build data applications that are robust, secure, and entirely contained within the Databricks control plane. It allows us to stop treating application state as an “external problem” and start treating it as a native part of the Lakehouse architecture. Because, at the end of the day, adding Apps plus Lakebase to your toolbelt is too much fun to let a little glue code stand in your way.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

Agentic AI for Real-Time Pharmacovigilance on Databricks

Adverse drug reaction (ADR) detection is a primary regulatory and patient-safety priority for life sciences and health systems. Traditional pharmacovigilance methods often depend on delayed signal detection from siloed data sources and require extensive manual evidence collection. This legacy approach is time-consuming, increases the risk of patient harm, and creates significant regulatory friction. For solution architects and engineers in healthcare and finance, optimizing data infrastructure to meet these challenges is a critical objective and a real headache.

Combining the Databricks Lakehouse Platform with Agentic AI presents a transformative path forward. This approach enables a closed-loop pharmacovigilance system that detects high-quality safety signals in near-real time, autonomously collects corroborating evidence, and routes validated alerts to clinicians and safety teams with complete auditability. By unifying data and AI on a single platform through Unity Catalog, organizations can reduce time-to-signal, increase signal precision, and provide the comprehensive data lineage that regulators demand. This integrated model offers a clear advantage over fragmented data warehouses or generic cloud stacks.

The Challenges in Modern Pharmacovigilance

To build an effective pharmacovigilance system, engineers must integrate a wide variety of data types. This includes structured electronic health records (EHR) in formats like FHIR, unstructured clinical notes, insurance claims, device telemetry from wearables, lab results, genomics, and patient-reported outcomes. This process presents several technical hurdles:

  • Data Heterogeneity and Velocity: The system must handle high-velocity streams from devices and patient apps alongside periodic updates from claims and EHR systems. Managing these disparate data types and speeds without creating bottlenecks is a significant challenge.
  • Sparse and Noisy Signals: ADR mentions can be buried in unstructured notes, timestamps may conflict across sources, and confounding variables like comorbidities or polypharmacy can obscure true signals.
  • Manual Evidence Collection: When a potential signal is flagged, safety teams often must manually re-query various systems and request patient charts, a process that delays signal confirmation and response.
  • Regulatory Traceability: Every step, from detection to escalation, must be reproducible. This requires clear, auditable provenance for both the data and the models used in the analysis.

The Databricks and Agentic AI Workflow

An agentic AI framework running on the Databricks Lakehouse provides a structured, scalable solution to these problems. This system uses modular, autonomous agents that work together to implement a continuous pharmacovigilance workflow. Each agent has a specific function, from ingesting data to escalating validated signals.

Step 1: Ingest and Normalize Data

The foundation of the workflow is a unified data layer built on Delta Lake. Ingestion & Normalization Agents are responsible for continuously pulling data from various sources into the Lakehouse.

  • Continuous Ingestion: Using Lakeflow Declarative Pipelines and Spark Structured Streaming, these agents ingest real-time data from EHRs (FHIR), claims, device telemetry, and patient reports. Data can be streamed from sources like Kafka or Azure Event Hubs directly into Delta tables.
  • Data Normalization: As data is ingested, agents perform crucial normalization tasks. This includes mapping medical codes to standards like RxNorm, SNOMED, and LOINC. They also resolve patient identities across different datasets using both deterministic and probabilistic linking methods, creating a canonical event timeline for each patient. This unified view is essential for accurate signal detection.
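
A minimal sketch of the continuous-ingestion bullet above, assuming a Kafka topic of FHIR JSON bundles and a Databricks notebook where spark is the active session; the broker, topic, and table names are illustrative.

from pyspark.sql import functions as F

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "fhir-adverse-events")
    .load()
)

bronze = raw.select(
    F.col("key").cast("string").alias("patient_key"),
    F.col("value").cast("string").alias("fhir_bundle_json"),  # parsed and normalized downstream
    F.col("timestamp").alias("ingested_at"),
)

(bronze.writeStream
    .format("delta")
    .option("checkpointLocation", "/Volumes/safety/bronze/_checkpoints/fhir_adverse_events")
    .toTable("safety.bronze.fhir_adverse_events"))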

Step 2: Detect Signals with Multimodal AI

Once the data is clean and unified, Signal Detection Agents apply a suite of advanced models to identify potential ADRs. This multimodal approach significantly improves precision.

  • Multimodal Detectors: The system runs several types of detectors in parallel. Clinical Large Language Models (LLMs) and fine-tuned transformers extract relevant entities and context from unstructured clinical notes. Time-series anomaly detectors monitor device telemetry for unusual patterns, such as spikes in heart rate from a wearable.
  • Causal Inference: To distinguish true causality from mere correlation, statistical and counterfactual causal engines analyze the data to assess the strength of the association between a drug and a potential adverse event.
  • Scoring and Provenance: Each potential ADR is scored with an uncertainty estimate. Crucially, the system also attaches provenance pointers that link the signal back to the specific data and model version used for detection, ensuring full traceability.

Step 3: Collect Evidence Autonomously

When a candidate signal crosses a predefined confidence threshold, an Evidence Collection Agent is activated. This agent automates what is typically a manual and time-consuming process.

  • Automated Assembly: The agent automatically assembles a complete evidence package. It extracts relevant sections from patient charts, re-runs queries for lab trends, fetches associated genomics variants, and pulls specific windows of device telemetry data.
  • Targeted Data Pulls: If the initial evidence is incomplete, the agent can plan and execute targeted data pulls. For example, it could order a specific lab test, request a clinician chart review through an integrated system, or trigger a patient survey via a connected app to gather more information on symptoms and dosing adherence.

Step 4: Triage and Escalate Signals

With the evidence gathered, a Triage & Escalation Agent takes over. This agent applies business logic and risk models to determine the appropriate next step.

  • Composite Scoring: The agent aggregates all collected evidence and computes a composite risk and confidence score for the signal. It applies configurable business rules based on factors like event severity and regulatory reporting timelines.
  • Intelligent Escalation: For high-risk or ambiguous signals, the agent automatically escalates the issue to human safety teams by creating tickets in systems like Jira or ServiceNow. For clear, high-confidence signals that pose a lower operational risk, the system can be configured to auto-generate regulatory reports, such as 15-day expedited submissions, where permitted.

Step 5: Enable Continuous Learning

The final agent in the workflow closes the loop, ensuring the system improves over time. The Continuous Learning Agent uses feedback from human experts to refine the AI models.

  • Feedback Integration: Outcomes from chart reviews, follow-up labs, and final regulatory adjudications are fed back into the system’s training pipelines.
  • Model Retraining and Versioning: This new data is used to retrain and refine the signal detectors and causal models. MLflow tracks these updates, versioning the new models and linking them to the training data snapshot. This creates a fully auditable and continuously improving system that meets strict regulatory standards for model governance.
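
As a sketch of the retrain-and-version step, assuming a scikit-learn style detector and a Unity Catalog model registry; the model name, metric value, and table version below are illustrative.

import mlflow
from sklearn.linear_model import LogisticRegression

mlflow.set_registry_uri("databricks-uc")  # register versions in Unity Catalog

X, y = [[0.1], [0.9], [0.2], [0.8]], [0, 1, 0, 1]  # stand-in for real adjudicated features
with mlflow.start_run(run_name="adr_detector_retrain"):
    model = LogisticRegression().fit(X, y)
    mlflow.log_param("training_table_version", 417)  # Delta snapshot used for training
    mlflow.log_metric("val_auprc", 0.91)             # illustrative evaluation result
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="safety.models.adr_signal_detector",
    )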

The Technical Architecture on Databricks

The power of this workflow comes from the tightly integrated components of the Databricks Lakehouse Platform.

  • Data Layer: Delta Lake serves as the single source of truth, storing versioned tables for all data types. Unity Catalog manages fine-grained access policies, including row-level masking, to protect sensitive patient information.
  • Continuous ETL & Feature Store: Delta Live Tables provide schema-aware pipelines for all data engineering tasks, while the integrated Feature Store offers managed feature views for models, ensuring consistency between training and inference.
  • Detection & Inference: Databricks provides integrated GPU clusters for training and fine-tuning clinical LLMs and other complex models. MLflow tracks experiments, registers model versions, and manages deployment metadata.
  • Agent Orchestration: Lakeflow Jobs coordinate the execution of all agent tasks, handling scheduling, retries, and dependencies. The agents themselves can be lightweight microservices or notebooks that interact with Databricks APIs.
  • Serving & Integrations: The platform offers low-latency model serving endpoints for real-time scoring. It can integrate with clinician portals via SMART-on-FHIR, ticketing systems, and messaging services to facilitate human-in-the-loop workflows.

Why This Approach Outperforms Alternatives

Architectures centered on traditional data warehouses like Snowflake often struggle with this use case because they separate storage from heavy ML compute. Tasks like LLM inference and streaming feature engineering require external GPU clusters and complex orchestration, which introduces latency, increases operational overhead, and fractures data lineage across systems. Similarly, a generic cloud stack requires significant integration effort to achieve the same level of data and model governance.

The Databricks Lakehouse co-locates multimodal data, continuous pipelines, GPU-enabled model lifecycles, and governed orchestration on a single, unified platform. This integration dramatically reduces friction and provides a practical, auditable, and scalable path to real-time pharmacovigilance. For solution architects and engineers, this means a faster, more reliable way to unlock real-time insights from complex healthcare data, ultimately improving patient safety and ensuring regulatory compliance.

Conclusion

By harnessing Databricks’ unified Lakehouse architecture and agentic AI, organizations can transform pharmacovigilance from a reactive, manual process into a proactive, intelligent system. This workflow not only accelerates adverse drug reaction detection but also streamlines evidence collection and triage, empowering teams to respond swiftly and accurately. The platform’s end-to-end traceability, scalable automation, and robust data governance support stringent regulatory demands while driving operational efficiency. Ultimately, implementing this modern approach leads to better patient outcomes, reduced risk, and a future-ready foundation for safety monitoring in life sciences.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

Agentic AI Closed-Loop Systems for N-of-1 Treatment Optimization on Databricks

Precision therapeutics for rare diseases as well as complex oncology cases is an area that may benefit from Agentic AI Closed-Loop (AACL) systems to enable individualized treatment optimization — a continuous process of proposing, testing, and adapting therapies for a single patient (N-of-1 trials).

N-of-1 problems are not typical for either clinicians or data systems. Type 2 diabetes in the US is more of an N-of-3.8×10^7 problem, so we’re looking at a profoundly different category of scaling. This lower number is not easier, because it implies existing treatment protocols have not been successful. N-of-1 optimization can discover effective regimens rapidly, but only with a data system that can manage dense multimodal signals (omics, time-series biosensors, lab results), provide fast model iteration, incorporate clinician-in-the-loop safety controls, and ensure rigorous provenance. We also need to consider the heavy cognitive load the clinician will be under. While traditional data analytics and machine learning algorithms will still play a key role, Agentic AI support can be invaluable.

Agentic AI Closed-Loop systems are relatively new, so let’s look at what a system designed to support this architecture would look like from the ground up.

Data Platform

First, let’s define the foundation of what we are trying to build. We need a clinical system that can deliver reproducible results with full lineage and enable safe automation to augment clinical judgement. That’s a decent overview of any clinical data system, so I feel like we’re on solid ground. I would posit that individualized treatment optimizations need a reduced iteration time from the standard, just because the smaller N means we have moved farther from the standard of care (SoC), so there will likely be more experiments. Further, these experiments will need more clever validations. Siloed and fragmented data stores, disconnected data, disjoint model operationalization and heavy ETL are non-starters based on our foundational assumptions. A data lakehouse is a more appropriate architecture.

A data lakehouse is a unified data architecture that blends the low-cost, flexible storage of a data lake with the structure and management capabilities of a data warehouse. This combined approach allows organizations to store and manage both structured and unstructured data types on cost-effective cloud storage, while also providing high-performance analytics, data governance, and support for ML and AI workloads on the same data. Databricks currently has the most mature lakehouse implementation. Databricks is well known for handling multimodal data, so the variety of data is not a problem even at high volume.

Clinical processes are heavily regulated. Fortunately, Unity Catalog provides a high level of security and governance across your data, ML, and AI artifacts. Databricks provides a platform that can deliver auditable, regulatory-grade systems in a much more efficient and effective way than siloed data warehouses or other cloud data stacks. Realistically, data provenance alone is not sufficient to align the clinician’s cognitive load with the smaller N; it’s still a very hard problem. Honestly, since we have had lakehouses for some time and have not been able to reliably tackle N-of-1 at scale, the problem can’t solely be with the data system. This is where Agentic AI enters the scene.

Agentic AI

Agentic AI refers to systems of autonomous agents, modular reasoning units that plan, execute, observe, and adapt, orchestrated to complete complex workflows. Architecturally, Agentic AI running on Databricks’ Lakehouse platform uniquely enables safe, scalable N-of-1 systems by co-locating multimodal data, high-throughput model training, low-latency inference, and auditable model governance. This architecture accelerates time-to-effective therapy, reduces clinician cognitive load, and preserves regulatory-grade provenance in ways that are materially harder to deliver on siloed data warehouses or generic cloud stacks. Here are some examples of components of the Agentic AI system that might be used as a foundation for building our N-of-1 therapeutics system. There can and will be more agents, but they will likely be used to enhance or support this basic set.

  • Digital Twin Agents compile the patient’s multimodal state and historic responses.
  • Planner/Policy Agents propose treatment variants (dose, schedule, combination) using constrained optimization informed by transfer learning from cohort data.
  • Evaluation Agents collect outcome signals (biosensors, labs, imaging), compute reward/utility, and update the digital twin.
  • Safety/Compliance Agents enforce clinical constraints, route proposals for clinician review when needed, and produce provenance records.

For N-of-1 therapeutics, there are distinct advantages to designing agents to form a closed loop. Let’s discuss why.

Agentic AI Closed Loop System

Agentic AI Closed Loops (AACL)  enable AI systems to autonomously perceive, decide, act, and adapt within self-contained feedback cycles. The term “agentic” underscores the AI’s ability to proactively pursue goals without constant human oversight, while “closed loop” highlights its capacity to refine performance through internal feedback. This synergy empowers AACL systems to move beyond reactive processing, anticipating challenges and optimizing outcomes in real time. This is how we scale AI to realistically address clinician cognitive load within a highly regulated clinical framework.

  • Perception: The AI gathers information from its Digital Twin, among other sources.
  • Reasoning and Planning: Based on its goals and perceived data of the current test iteration, the AI breaks down the objective into a sequence of actionable steps.
  • Action: The AI executes its plan, often through the Planner/Policy Agents.
  • Feedback and Learning: The system evaluates the outcome of its actions through the Evaluation Agents and compares them against its goals, referencing the Safety/Compliance Agents. It then learns from this feedback to refine its internal models and improve its performance in the next cycle. 
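
A schematic skeleton of one loop iteration, not a clinical implementation: the agent classes and method names are assumptions used only to show how the four phases above connect.

from dataclasses import dataclass

@dataclass
class Proposal:
    dose_mg: float
    schedule: str
    requires_review: bool = True

def run_iteration(twin, planner, evaluator, safety):
    state = twin.current_state()              # Perception: digital twin snapshot
    proposal = planner.propose(state)         # Reasoning and planning
    if not safety.approve(proposal, state):   # Compliance gate before any action
        return safety.route_for_review(proposal)
    outcome = evaluator.observe(proposal)     # Action and outcome collection
    twin.update(proposal, outcome)            # Feedback into the twin
    planner.learn(state, proposal, outcome)   # Learning for the next cycle
    return outcome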

AACL systems are modular frameworks. Let’s wrap up with a proposed reference architecture for an AACL system using Databricks.

AACL on Databricks

We’ll start with a practical implementation of the data layer. Delta Lake provides versioned tables for EHR (FHIR-parquet), structured labs, medication history, genomics variants, and treatment metadata. Time-series data like high-cardinality biosensor streams can be ingested via Spark Structured Streaming into Delta tables using time-partitioning and compaction. Databricks Lakeflow is a solid tool for this. Patient and cohort embeddings can be stored as vector columns or integrated with a co-located vector index.

The Feature and ETL Layer builds on Lakeflow’s capabilities. A declarative syntax and a UI provide a low-code solution for building continuous pipelines that normalize clinical codes and compute rolling features like time-windowed response metrics. The Databricks Feature Store patterns enable reusable feature views for inputs and predictors.

Databricks provides distributed GPU clusters for the model and agent layer, as well as access to foundation and custom AI models. Lakeflow Jobs orchestrate agent execution, coordinate microservices (consent UI, clinician portal, device provisioning), and manage retries.

MLflow manages most of the heavy lifting for serving and integration. You can serve low-latency policy and summarization endpoints while supporting canary deployments and A/B testing. The integration endpoints can supply secure APIs for EHR actionability (SMART on FHIR) and clinician dashboards. You can also ensure the system meets audit and governance standards and practices using the MLflow Model Registry as well as Unity Catalog for data/model access control.

Conclusion

Agentic AI closed-loop systems on a Databricks lakehouse offer an auditable, scalable foundation for rapid N-of-1 treatment optimization in precision therapeutics—especially for rare disease and complex oncology—by co-locating multimodal clinical data (omics, biosensors, labs), distributed GPU training, low-latency serving, and model governance (MLflow, Unity Catalog). Implementing Digital Twin, Planner/Policy, Evaluation, and Safety agents in a closed-loop workflow shortens iteration time, reduces clinician cognitive load, and preserves provenance for regulatory compliance, while reusable feature/ETL patterns, time-series versioning (Delta Lake), and vector indexes enable robust validation and canary deployments. Start with a strong data layer, declarative pipelines, and modular agent orchestration, then iterate with clinician oversight and governance to responsibly scale individualized N-of-1 optimizations and accelerate patient-specific outcomes.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

Unlocking Business Success with Databricks One

Business users don’t use notebooks. Full stop. And for that reason, most organizations don’t have business users accessing the Databricks UI. This has always been a fundamental flaw in Databricks’ push to democratize data and AI. This disconnect is almost enshrined in the medallion architecture: Bronze is for system accounts, data scientists with notebooks use the Silver layer, and Gold is for business users with reporting tools. This approach has been enough to take an organization part of the way towards self-service analytics. This approach is not working for GenAI, though. This was a major frustration with Genie Spaces. It was a tool made for business users but embedded in an IT interface. Databricks One is looking to change all that.

Using Databricks One

Databricks One is a unified platform experience that provides business users with a single point of entry into their data ecosystem. It removes technical complexity and offers a curated environment to interact with data, AI models, dashboards, and apps efficiently. Core features of Databricks One include:

  • AI/BI Dashboards: Users can view, explore, and drill into key KPIs and metrics without technical setup.
  • AI/BI Genie: A conversational AI interface allowing users to ask natural language questions like “Why did sales drop in April?” or “What are the top-performing regions?”
  • Custom Databricks Apps: Tailored applications that combine analytics, workflows, and AI models to meet specific business needs.
  • Content Browsing by Domain: Content is organized into relevant business areas such as “Customer 360” and “Marketing Campaign Performance,” fostering easy discovery and collaboration.

Administering Databricks One

Administrators can give users access to Databricks One via a consumer access entitlement. This is a basic, read-only entry point for business users that gives access to a simplified workspace that focuses on consuming dashboards, Genie spaces and Apps. Naturally, users will be working with Unity Catalog’s unified data access controls to maintain governance and security.

Conclusion

This is a very short blog because I try not to comment too early on pre-release features, and Databricks One is scheduled for a beta release later this summer. This is more than just an incremental feature for a lot of our enterprise clients, though. I am looking at Databricks One as a fundamental architectural component for large enterprise implementations. I feel this is a huge step forward for practical data and intelligence democratization, and I was just too excited to wait for more details.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

Unlocking the Power of MLflow 3.0 in Databricks for GenAI

Databricks recently announced support for MLflow 3.0, which features a range of enhancements that redefine model management for enterprises. Integrated seamlessly into Databricks, MLflow is an open-source platform designed to manage the complete machine learning lifecycle. It provides tools to track experiments, package code into reproducible runs, and share and deploy models. With the launch of MLflow 3.0, enterprises can expect state-of-the-art improvements in experiment tracking and evaluative capabilities on the Databricks Lakehouse platform. Let’s dive into the key enhancements from a GenAI perspective.

Comprehensive Tracing for GenAI Apps

One of the standout features in MLflow 3.0 is the introduction of comprehensive tracing capabilities for GenAI applications. This feature allows developers to observe and debug their AI apps with unprecedented clarity.

Key Benefits:

  • One-line instrumentation for over 20 popular libraries, including OpenAI, LangChain, and Anthropic
  • Complete execution visibility, capturing prompts, responses, latency, and costs
  • Production-ready implementation that works seamlessly in both development and production environments
  • OpenTelemetry compatibility for flexible data export and ownership

Use Case: A financial services company developing a chatbot for customer inquiries can use MLflow 3.0’s tracing to monitor the bot’s interactions, ensuring compliance with regulatory requirements and identifying areas for improvement.
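
As a sketch of the one-line instrumentation idea, assuming a recent MLflow release with tracing support and the OpenAI Python client installed; the experiment path, model, and prompt are illustrative.

import mlflow
from openai import OpenAI

mlflow.openai.autolog()                      # one line: prompts, responses, and latency are traced
mlflow.set_experiment("/Shared/support-bot-traces")

client = OpenAI()                            # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Why did sales drop in April?"}],
)
print(response.choices[0].message.content)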

Automated Quality Evaluation

MLflow 3.0 introduces automated evaluation using LLM judges, replacing manual testing with AI-powered assessments that match human expertise.

Key Features:

  • Pre-built judges for safety, hallucination detection, relevance, and correctness
  • Custom judges tailored to specific business requirements
  • Ability to train judges to align with domain experts’ judgment

Use Case: A healthcare AI startup can leverage these automated evaluations to ensure that their GenAI models provide accurate and safe medical information, which is crucial for maintaining trust and regulatory compliance.

Production Data Feedback Loop

MLflow 3.0 enables teams to turn every production interaction into an opportunity for improvement through integrated feedback and evaluation workflows.

Key Capabilities:

  • Expert feedback collection through reviewing, labeling, and live testing
  • End-user feedback capture with links to full execution context
  • Conversion of problematic traces into test cases for continuous improvement

Use Case: An e-commerce company can use this feature to collect and analyze customer interactions with their AI-powered product recommendation system, continuously refining the model based on real-world usage.

Enterprise-Grade Lifecycle Management

MLflow 3.0 provides comprehensive versioning, tracking, and governance tools for GenAI applications.

Key Features:

  • LoggedModels for tracking code, parameters, and evaluation metrics
  • Full lineage linking traces, evaluations, and feedback to specific versions
  • Upcoming Prompt Registry for centralized prompt management and A/B testing
  • Integration with Unity Catalog for enterprise-level governance

Use Case: A multinational corporation developing multiple GenAI applications can use these lifecycle management features to ensure consistency, compliance, and efficient collaboration across global teams.

Enhanced Integration with Databricks Ecosystem

MLflow 3.0’s GenAI features are deeply integrated with the Databricks platform, offering additional benefits for enterprise users.

Key Integrations:

  • Unity Catalog for unified governance of AI assets
  • Data Intelligence for connecting GenAI data to business data in the Databricks Lakehouse
  • Mosaic AI Agent Serving for production deployment with scalability and operational rigor

Use Case: A large retail company can leverage these integrations to deploy and manage GenAI models that analyze customer behavior, connecting insights from their AI models directly to their business intelligence systems.

Conclusion

Together, these capabilities (comprehensive tracing, automated quality evaluation, production feedback loops, and enterprise-grade lifecycle management) make MLflow 3.0 on Databricks a practical foundation for building, evaluating, and governing GenAI applications at scale.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

Leveraging Model Context Protocol (MCP) for AI Efficiency in Databricks

Model Context Protocol (MCP) is reshaping the way AI agents interface with data and tools, providing a robust framework for standardization and interoperability. As AI continues to permeate business landscapes, MCP offers particular advantages in creating scalable, efficient AI systems. This blog explores what MCP is, its role in the AI landscape, and focuses on implementing MCP within Databricks.

Understanding Model Context Protocol (MCP)

Model Context Protocol (MCP) is an open-source standard that facilitates the connection of AI agents with various tools, resources, and contextual information. The primary advantage of MCP is its standardization, which enables the reuse of tools across different agents, whether internally developed or third-party solutions. This ability to integrate tools from various sources makes MCP a versatile choice for modern enterprises. To quote their Introduction (and FAQ):

Think of MCP as a universal adapter for AI applications, similar to what USB-C is for physical devices. USB-C acts as a universal adapter to connect devices to various peripherals and accessories. Similarly, MCP provides a standardized way to connect AI applications to different data and tools.

The Importance of MCP in AI

In the current AI ecosystem, there is often a challenge in ensuring models remain consistent, interoperable, and integrated. MCP addresses these challenges by establishing a framework that enhances the lifecycle of models. For solution architects, this means deploying AI solutions that are adaptable and sustainable in the long term. MCP’s role in AI includes:

  • Standardization: Allows the creation and reuse of tools across different AI agents.
  • Interoperability: Facilitates seamless integration across different components of an AI ecosystem.
  • Adaptability: Enables AI models to easily adjust to changing business requirements and data patterns.
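
To make the standardization point concrete, MCP messages are JSON-RPC 2.0; the two payloads below show how any MCP client discovers and invokes a server's tools, with the tool name and arguments being illustrative assumptions.

import json

# Ask an MCP server what tools it exposes.
list_tools = {"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}

# Invoke one of those tools by name with structured arguments.
call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "lookup_customer", "arguments": {"customer_id": "C-1042"}},
}

print(json.dumps(call_tool, indent=2))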

MCP Design Principles

There are several core design principles that inform the MCP architecture and implementation.

  • Servers should be extremely easy to build
  • Servers should be highly composable
  • Servers should not be able to read the whole conversation, nor “see into” other servers
  • Features can be added to servers and clients progressively

MCP Implementation In Databricks

Databricks provides significant support for MCP, making it a powerful platform for architects looking to implement these protocols. Below are strategies for utilizing MCP in Databricks:

  1. Managed MCP Servers: Databricks offers managed MCP servers that connect AI agents to enterprise data, maintaining security through enforced Unity Catalog permissions. This not only simplifies the connection process but also ensures that data privacy and governance requirements are consistently met. Managed MCP servers in Databricks come in several types.
  2. Custom MCP Servers: Besides leveraging managed servers, you can host your own MCP server as a Databricks app, which is particularly useful if existing MCP servers exist within your organization. Hosting involves setting up an HTTP-compatible transport server and configuring the app within Databricks using Python and Bash scripts.
  3. Building and Deploying MCP Agents: Developing agents within Databricks involves using standard SDKs and Python libraries such as databricks-mcp, which simplify authentication and connection to MCP servers. A typical implementation involves setting up authentication, installing necessary dependencies, and writing custom agent code to connect and interact with MCP servers securely. Here’s a sample workflow for building an agent:
    • Authentication with OAuth: Establish a secure connection to your Databricks workspace.
    • Use Databricks SDKs: To implement MCP servers and engage with data using tools like Unity Catalog.

    Deployment of these agents involves ensuring all necessary resources, such as Unity Catalog functions and Vector Search indexes, are specified for optimal performance.

Databricks’ pricing for managed MCP servers hinges on the type of computations—Unity Catalog functions and Genie use serverless SQL compute pricing, while custom servers follow Databricks Apps pricing.

Conclusion

The Model Context Protocol (MCP) is a transformative approach within the AI industry, promising improved standardization and interoperability for complex AI systems. Implementing MCP on Databricks offers a comprehensive pathway to deploy and manage AI agents efficiently and securely. By utilizing both managed and custom MCP servers, alongside robust capabilities of Databricks, companies can realize their AI ambitions with greater ease and security, ensuring their models are both effective today and prepared for future technological strides.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

Understanding Clean Rooms: A Comparative Analysis Between Databricks and Snowflake

“Clean rooms” have emerged as a pivotal data-sharing innovation, with both Databricks and Snowflake providing enterprise alternatives.

Clean rooms are secure environments designed to allow multiple parties to collaborate on data analysis without exposing sensitive details of data. They serve as a sandbox where participants can perform computations on shared datasets while keeping raw data isolated and secure. Clean rooms are especially beneficial in scenarios like cross-company research collaborations, ad measurement in marketing, and secure financial data exchanges.

Uses of Clean Rooms:

  • Data Privacy: Ensures that sensitive information is not revealed while still enabling data analysis.
  • Collaborative Analytics: Allows organizations to combine insights without sharing the actual data, which is vital in sectors like finance, healthcare, and advertising.
  • Regulatory Compliance: Assists in meeting stringent data protection norms such as GDPR and CCPA by maintaining data sovereignty.

Clean Rooms vs. Data Sharing

While clean rooms provide an environment for secure analysis, data sharing typically involves the actual exchange of data between parties. Here are the major differences:

  • Security:
    • Clean Rooms: Offer a higher level of security by allowing analysis without exposing raw data.
    • Data Sharing: Involves sharing of datasets, which requires robust encryption and access management to ensure security.
  • Control:
    • Clean Rooms: Data remains under the control of the originating party, and only aggregated results or specific analyses are shared.
    • Data Sharing: Data consumers can retain and further use shared datasets, often requiring complex agreements on usage.
  • Flexibility:
    • Clean Rooms: Provide flexibility in analytics without the need to copy or transfer data.
    • Data Sharing: Offers more direct access, but less flexibility in data privacy management.

High-Level Comparison: Databricks vs. Snowflake

Implementation
Databricks:
  1. Setup and Configuration:
    • Utilize existing Databricks workspace
    • Create a new Clean Room environment within the workspace
    • Configure Delta Lake tables for shared data
  2. Data Preparation:
    • Use Databricks’ data engineering capabilities to ETL and anonymize data
    • Leverage Delta Lake for ACID transactions and data versioning
  3. Access Control:
    • Implement fine-grained access controls using Unity Catalog
    • Set up row-level and column-level security
  4. Collaboration:
    • Share Databricks notebooks for collaborative analysis
    • Use MLflow for experiment tracking and model management
  5. Analysis:
    • Utilize Spark for distributed computing
    • Support for SQL, Python, R, and Scala in the same environment
Snowflake:
  1. Setup and Configuration:
    • Set up a separate Snowflake account for the Clean Room
    • Create shared databases and views
  2. Data Preparation:
    • Use Snowflake’s data engineering features or external tools for ETL
    • Load prepared data into Snowflake tables
  3. Access Control:
    • Implement Snowflake’s role-based access control
    • Use secure views and row access policies
  4. Collaboration:
  5. Analysis:
    • Primarily SQL-based analysis
    • Use Snowpark for more advanced analytics in Python or Java
Business and IT Overhead
Databricks:
  • Lower overhead if already using Databricks for other data tasks
  • Unified platform for data engineering, analytics, and ML
  • May require more specialized skills for advanced Spark operations
Snowflake:
  • Easier setup and management for pure SQL users
  • Less overhead for traditional data warehousing tasks
  • Might need additional tools for complex data preparation and ML workflows
Cost Considerations
Databricks Snowflake
  • More flexible pricing based on compute usage
  • Can optimize costs with proper cluster management
  • Potential for higher costs with intensive compute operations
  • Predictable pricing with credit-based system
  • Separate storage and compute pricing
  • Costs can escalate quickly with heavy query usage
Security and Governance
Databricks Snowflake
  • Unity Catalog provides centralized governance across clouds
  • Native integration with Delta Lake for ACID compliance
  • Comprehensive audit logging and lineage tracking
  • Strong built-in security features
  • Automated data encryption and key rotation
  • Detailed access history and query logging
Data Format and Flexibility
Databricks Snowflake
  • Supports various data formats (structured, semi-structured, unstructured)
  • Supports various file formats (Parquet, Iceberg, csv,json, images, etc.)
  • Better suited for large-scale data processing and transformations
  • Optimized for structured and semi-structured data
  • Excellent performance for SQL queries on large datasets
  • May require additional effort for unstructured data handling
Advanced Analytics, AI and ML
Databricks Snowflake
  • Native support for advanced analytics and AI/ML workflows
  • Integrated with popular AI/ML libraries and MLflow
  • Easier to implement end-to-end AI/ML pipeline
  • Requires additional tools or Snowpark for advanced analytics
  • Integration with external ML platforms needed for comprehensive ML workflows
  • Strengths lie more in data warehousing than in ML operations
Scalability
Databricks Snowflake
  • Auto-scaling of compute clusters and serverless compute options
  • Better suited for processing very large datasets and complex computations
  • Automatic scaling and performance optimization
  • May face limitations with extremely complex analytical workloads

Use Case Example: Financial Services Research Collaboration

Consider a research department within a financial services firm that wants to collaborate with other institutions on developing market insights through data analytics. They face a challenge: sharing proprietary and sensitive financial data without compromising security or privacy. Here’s how utilizing a clean room can solve this:

Implementation in Databricks:

  • Integration: By setting up a clean room in Databricks, the research department can securely combine its datasets with those of other institutions, sharing data insights under precise access controls.
  • Analysis: Researchers from the participating institutions can perform joint analyses on the combined datasets without ever directly accessing each other’s raw data (a minimal sketch follows this list).
  • Security and Compliance: Databricks’ security features, such as encryption, audit logging, and RBAC, ensure that all collaborations comply with regulatory standards.
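As a minimal sketch of such a joint analysis, the query below returns only aggregate insights and suppresses small groups so that no individual position can be inferred. The table name (clean_room.shared.trades) and columns (sector, notional) are hypothetical placeholders, and the code assumes a Databricks notebook where spark is predefined.

  # Minimal sketch: aggregate-only analysis over a shared clean room table.
  # Table and column names are hypothetical; assumes `spark` is predefined.
  from pyspark.sql import functions as F

  trades = spark.table("clean_room.shared.trades")

  sector_insights = (
      trades
      .groupBy("sector")                     # aggregate, never row-level detail
      .agg(
          F.count("*").alias("trade_count"),
          F.avg("notional").alias("avg_notional"),
      )
      .filter(F.col("trade_count") >= 25)    # suppress small groups to limit re-identification risk
  )

  sector_insights.show()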

Through this setup, the financial services firm’s research department can achieve meaningful collaboration and derive deeper insights from joint analyses, all while maintaining data privacy and adhering to compliance requirements.

By leveraging clean rooms, organizations in highly regulated industries can unlock new opportunities for innovation and data-driven decision-making without the risks associated with traditional data sharing methods.

Conclusion

Both Databricks and Snowflake offer robust solutions for implementing this financial research collaboration use case, but with different strengths and considerations.

Databricks excels in scenarios requiring advanced analytics, machine learning, and flexible data processing, making it well-suited for research departments with diverse analytical needs. It offers a more comprehensive platform for end-to-end data science workflows and is particularly advantageous for organizations already invested in the Databricks ecosystem.

Snowflake, on the other hand, shines in its simplicity and ease of use for traditional data warehousing and SQL-based analytics. Its strong data sharing capabilities and familiar SQL interface make it an attractive option for organizations primarily focused on structured data analysis and those with less complex machine learning requirements.

Regardless of the chosen platform, the implementation of Clean Rooms represents a significant step forward in enabling secure, compliant, and productive data collaboration in the financial sector. As data privacy regulations continue to evolve and the need for cross-institutional research grows, solutions like these will play an increasingly critical role in driving innovation while protecting sensitive information.

Perficient is both a Databricks Elite Partner and a Snowflake Premier Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

 

Transforming Your Data Strategy with Databricks Apps: A New Frontier https://blogs.perficient.com/2025/06/24/transforming-data-strategy-databricks-apps/ https://blogs.perficient.com/2025/06/24/transforming-data-strategy-databricks-apps/#comments Tue, 24 Jun 2025 21:10:30 +0000 https://blogs.perficient.com/?p=383415

I’ve been coding in notebooks for so long, I forgot how much I missed a nice, deployed application. I also didn’t realize how this was limiting my solution space. Then I started working with Databricks Apps.

Databricks Apps are designed to extend the functionality of the Databricks platform, providing users with enriched features and capabilities tailored to specific data needs. These apps can significantly enhance the data processing and analysis experience, offering bespoke solutions to address complex business requirements.
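To ground what a deployed app can look like before walking through the key features, here is a minimal sketch of a Streamlit script that a Databricks App could serve. It assumes Streamlit and the Databricks SQL connector are declared as dependencies and that connection details arrive through environment variables; the variable names, table name, and columns below are placeholders rather than an official contract, and a real deployment would pair this file with the app configuration that tells Databricks Apps how to launch it.

  # app.py: minimal Streamlit sketch for a Databricks App (illustrative only).
  # Environment variable names, the table, and its columns are placeholders.
  import os
  import streamlit as st
  from databricks import sql  # databricks-sql-connector

  st.title("Daily Trade Volume")

  # Open a connection to a SQL warehouse using injected connection details.
  with sql.connect(
      server_hostname=os.environ["DBX_HOST"],      # placeholder variable name
      http_path=os.environ["DBX_HTTP_PATH"],       # placeholder variable name
      access_token=os.environ["DBX_TOKEN"],        # placeholder variable name
  ) as conn:
      with conn.cursor() as cur:
          cur.execute(
              "SELECT trade_date, SUM(notional) AS volume "
              "FROM demo.gold.trades GROUP BY trade_date ORDER BY trade_date"
          )
          rows = cur.fetchall()

  # Render a simple chart of the aggregated results (volume is the second column).
  st.line_chart({"volume": [row[1] for row in rows]})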

Key Features of Databricks Apps

  1. Custom Solutions for Diverse Needs: Databricks Apps are built to cater to a wide range of use cases, from data transformation and orchestration to predictive analytics and AI-based insights. This versatility allows organizations to deploy applications that directly align with their specific business objectives.
  2. Seamless Integration: The apps integrate smoothly within the existing Databricks environment, maintaining the platform’s renowned ease of use and ensuring that deployment does not disrupt current data processes. This seamless integration is crucial for maintaining operational efficiency and minimizing transition challenges.
  3. Scalability and Flexibility: Databricks Apps are designed to scale with your organization’s needs, ensuring that as your data requirements grow, the solutions deployed through these apps can expand to meet those demands without compromising performance.
  4. Enhanced Collaboration: By leveraging apps that foster collaboration, teams can work more effectively across different departments, sharing insights and aligning strategic goals with more precision and cohesion.

Benefits for Architects

  1. Tailored Data Solutions: Databricks Apps enables architects to deploy tailored solutions that meet their unique data challenges, ensuring that technical capabilities are closely aligned with strategic business goals.
  2. Accelerated Analytics Workflow: By using specialized apps, organizations can significantly speed up their data analytics workflows, leading to faster insights and more agile decision-making processes, essential in today’s fast-paced business environment.
  3. Cost Efficiency: The capability to integrate custom-built apps reduces the need for additional third-party tools, potentially lowering overall costs and simplifying vendor management.
  4. Future-Proofing Data Strategies: With the rapid evolution of technology, having access to a continuously expanding library of Databricks Apps helps organizations stay ahead of trends and adapt swiftly to new data opportunities and challenges.

Strategies for Effectively Leveraging Databricks Apps

To maximize the potential of Databricks Apps, CIOs and CDOs should consider the following approaches:

  • Identify Specific Use Cases: Before adopting new apps, identify the specific data operations and challenges your organization is facing. This targeted approach ensures that the apps you choose provide the most value.
  • Engage with App Developers: Collaborate with app developers who specialize in delivering comprehensive solutions tailored to your industry. Their expertise can enhance the implementation process and provide insights into best practices.
  • Promote Cross-Department Collaboration: Encourage departments across your organization to utilize these apps collaboratively. The synergistic use of advanced data solutions can drive more insightful analyses and foster a unified strategic direction.
  • Assess ROI Regularly: Continuously assess the return on investment from using Databricks Apps. This evaluation will help in determining their effectiveness and in making data-driven decisions regarding future app deployments.

Conclusion

Databricks Apps present a powerful opportunity for CIOs and CDOs to refine and advance their data strategies by offering tailored, scalable, and integrated solutions. By embracing these tools, organizations can transform their data-driven operations to gain a competitive edge in an increasingly complex business landscape.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

Exploring the Free Edition of Databricks: A Risk-Free Approach to Enterprise AI https://blogs.perficient.com/2025/06/24/explore-databricks-free-edition-risk-free-analytics/ https://blogs.perficient.com/2025/06/24/explore-databricks-free-edition-risk-free-analytics/#respond Tue, 24 Jun 2025 20:53:39 +0000 https://blogs.perficient.com/?p=383411

Databricks announced a full, free version of the platform at the Data and AI Summit. While the Free Edition is targeted to students and hobbyists, I also see opportunities where enterprise architects can effectively evangelize Databricks without going through Procurement for a license. Choosing the right platform to manage, analyze, and extract insights from massive datasets is crucial, especially with new and emerging GenAI use cases. We have seen many clients paralyzed by the combination of moving to a cloud database, comparing and contrasting the different offerings, and doing all of this analysis with only a very murky picture of what the new AI-driven future holds. The Community Edition has always been free, but it has not been feature-complete. With its new Free Edition, Databricks presents an exceptional opportunity for organizations to test its capabilities with no financial commitment or risk.

What is Databricks Free Edition?

The Free Edition of Databricks is designed to provide users with full access to Databricks’ core functionalities, allowing them to explore, experiment, and evaluate the platform’s potential without any initial investment. This edition is an excellent entry point for organizations looking to understand how Databricks can fit into their data strategy, providing a hands-on experience with the platform’s features.

Key Features of Databricks Free Edition

  1. Simplified Setup and Onboarding: The Free Edition offers a straightforward setup process. Users can easily create an account and start exploring Databricks’ environment in a matter of minutes. This ease of access is ideal for decision-makers who want to quickly assess Databricks’ capabilities.
  2. Complete Workspace Experience: Users of the Free Edition get access to a complete workspace, which includes all the necessary tools for data engineering, data science, and machine learning. This enables organizations to evaluate the entire data lifecycle on the Databricks platform.
  3. Scalability and Performance: While the Free Edition is designed for evaluation purposes, it still provides a glimpse into the scalability and performance efficiency that Databricks is known for. Organizations can run small-scale analytics and machine learning tests to gauge how the platform handles data processing and computation tasks (a minimal example follows this list).
  4. Community Support and Resources: Users can benefit from the extensive Databricks community, which offers support, tutorials, and shared resources. This can be particularly valuable for organizations exploring Databricks for the first time and wanting to leverage shared knowledge.
  5. No Time Constraints: Unlike typical trial versions, the Free Edition does not impose a time limit, allowing organizations to explore the platform at their own pace. This flexibility is essential for CIOs and CDOs who might need extended periods to evaluate the platform’s potential fully.
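As a minimal example of the kind of small-scale test mentioned above, the snippet below generates synthetic data, aggregates it, and persists a table, which is enough to get a feel for notebook ergonomics and query behavior. It assumes a Databricks notebook where spark is predefined, and the catalog and schema names are placeholders for wherever you have write access.

  # Minimal smoke test for the Free Edition: generate data, aggregate, persist.
  # Assumes `spark` is predefined (Databricks notebook); catalog/schema are placeholders.
  from pyspark.sql import functions as F

  events = (
      spark.range(1_000_000)
      .withColumn("user_id", F.col("id") % 10_000)
      .withColumn("amount", F.rand() * 100)
  )

  totals = events.groupBy("user_id").agg(F.sum("amount").alias("total_amount"))

  totals.write.mode("overwrite").saveAsTable("workspace.default.free_edition_smoke_test")
  print(spark.table("workspace.default.free_edition_smoke_test").count())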

Benefits for CIOs and CDOs

  1. Risk-Free Evaluation: The primary advantage of the Free Edition is the risk-free nature of the exploration. CIOs and CDOs can test the platform’s capabilities without signing contracts or making financial commitments, aligning with their careful budget management strategies.
  2. Strategic Insights for Data Strategy: By exploring Databricks firsthand, decision-makers can gain strategic insights into how the platform integrates with existing systems and processes. This understanding is crucial when considering a transition to a new data analytics platform.
  3. Hands-On Experience: Direct interaction with Databricks helps bridge the gap between executive strategy and technical implementation. By experiencing the platform themselves, developers and architects can better champion its adoption across the organization.
  4. Pre-Deployment Testing: The Free Edition enables organizations to test specific use cases and data workflows, helping identify any challenges or concerns before full deployment. This pre-deployment testing ensures that any transition to Databricks is smooth and well-informed.
  5. Benchmarking Against Other Solutions: As organizations evaluate various data platforms, the Free Edition allows Databricks to be benchmarked against other solutions in the market. This comparison can be crucial in making informed decisions that align with long-term strategic goals.

Maximizing the Use of Databricks Free Edition

To maximize the benefits of Databricks Free Edition, CIOs and CDOs should consider the following strategies:

  • Define Use Cases: Before diving into the platform, define specific use cases you want to test. This could include data processing efficiency, machine learning model training, or real-time analytics capabilities. Clear objectives will provide focus and measurable outcomes.
  • Leverage Community Resources: Engage with the Databricks community to explore case studies, tutorials, and shared solutions that can offer fresh perspectives and innovative ideas.
  • Collaborate with Data Teams: Involve your data engineering and science teams early in the evaluation process. Their input and expertise will be invaluable in testing and providing feedback on the platform’s performance.
  • Evaluate Integration Points: During your exploration, assess how well Databricks integrates with existing systems and cloud services within your organization. Seamless integration is vital for minimizing disruption and maximizing workflow efficiency.

Conclusion

The Databricks Free Edition is an invaluable opportunity for CIOs and CDOs to explore the transformative potential of big data analytics on a leading platform without any associated risks.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

Exploring Lakebase: Databricks’ Next-Gen AI-Native OLTP Database https://blogs.perficient.com/2025/06/22/introduction-to-databricks-lakebase-for-ai-driven-applications/ https://blogs.perficient.com/2025/06/22/introduction-to-databricks-lakebase-for-ai-driven-applications/#respond Mon, 23 Jun 2025 01:02:29 +0000 https://blogs.perficient.com/?p=383327

Lakebase is Databricks’ OLTP database and the latest member of its ML/AI offering. Databricks has incorporated various components to support its AI platform, including data components. The Feature Store has been available for some time as a governed, centralized repository that manages machine learning features throughout their lifecycle. Mosaic AI Vector Search is a vector index optimized for storing and retrieving embeddings, particularly for similarity searches and RAG use cases.

What’s Old is New Again

AI’s need for data demands that transactional and analytical workflows no longer be viewed as separate entities. Traditional OLTP databases were never designed to meet the speed and flexibility required by AI applications today. They often exist outside analytics frameworks, creating bottlenecks and requiring manual data integrations. Notably, databases are now being spun up by AI agents rather than human operators. The robustness of the transactional database’s query response time now needs to be augmented with an equally robust administrative response time.

Lakebase addresses these challenges by revolutionizing OLTP database architecture. Its core attributes—separation of storage and compute, openness, and serverless architecture—make it a powerful tool for modern developers and data engineers.

Key Features of Lakebase

1. Openness:

Built on the open-source Postgres framework, Lakebase ensures compatibility and avoids vendor lock-in. The open ecosystem promotes innovation and provides a versatile foundation for building sophisticated data applications.

2. Separation of Storage and Compute:

Lakebase allows independent scaling of storage and computation, reducing costs and improving efficiency. Data is stored in open formats within data lakes, offering flexibility and eliminating proprietary data lock-in.

3. Serverless Architecture:

Lakebase is designed for elasticity. It scales up or down automatically, even to zero, ensuring you’re only paying for what you use, making it a cost-effective solution.

4. Integrated with AI and the Lakehouse:

Swift integration with the Lakehouse platform means no need for complex ETL pipelines. Operational and analytical data flows are synchronized in real-time, providing a seamless experience for deploying AI and machine learning models.

5. AI-Ready:

The database design caters specifically to AI agents, facilitating massive AI team operations through branching and checkpoint capabilities. This makes development, experimentation, and deployment faster and more reliable.

Use Cases and Benefits

1. Real-Time Applications:

From e-commerce systems managing inventory while providing instant recommendations, to financial services executing automated trades, Lakebase supports low-latency operations critical for real-time decision-making.

2. AI and Machine Learning:

With built-in AI and machine learning capabilities, Lakebase supports feature engineering and real-time model serving, thus accelerating AI project deployments.

3. Industry Applications:

Different sectors like healthcare, retail, and manufacturing can leverage Lakebase’s seamless data integration to enhance workflows, improve customer relations, and automate processes based on real-time insights.

Getting Started with Lakebase

Setting up Lakebase on Databricks is a straightforward process. With a few clicks, users can provision PostgreSQL-compatible instances and begin exploring powerful data solutions. Key setup steps include enabling Lakebase in the Admin Console, configuring database instances, and utilizing the Lakebase dashboard for management.
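Because Lakebase is built on Postgres, standard Postgres tooling should work with it largely unchanged. As a minimal sketch, assuming an instance has already been provisioned and its connection details copied from the workspace, a plain psycopg2 session might look like the following; the host, database, user, and credential values are placeholders, not an official recipe.

  # Minimal sketch: talk to a Lakebase (Postgres-compatible) instance with psycopg2.
  # Host, database, user, and credential values are placeholders copied from your
  # own instance; this is not an official connection recipe.
  import psycopg2

  conn = psycopg2.connect(
      host="instance-name.database.cloud.databricks.com",  # placeholder
      dbname="databricks_postgres",                         # placeholder
      user="someone@example.com",                           # placeholder
      password="<oauth-token-or-password>",                 # placeholder
      sslmode="require",
  )

  with conn, conn.cursor() as cur:
      # A typical OLTP-style write and read.
      cur.execute("""
          CREATE TABLE IF NOT EXISTS recommendations (
              user_id BIGINT PRIMARY KEY,
              item_id BIGINT NOT NULL,
              score   DOUBLE PRECISION NOT NULL
          )
      """)
      cur.execute(
          "INSERT INTO recommendations (user_id, item_id, score) VALUES (%s, %s, %s) "
          "ON CONFLICT (user_id) DO UPDATE SET item_id = EXCLUDED.item_id, score = EXCLUDED.score",
          (42, 1001, 0.93),
      )
      cur.execute("SELECT item_id, score FROM recommendations WHERE user_id = %s", (42,))
      print(cur.fetchone())

  conn.close()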

Conclusion

Lakebase is not just a database; it’s a paradigm shift for OLTP systems in the age of AI. By integrating seamless data flow, offering flexible scaling, and supporting advanced AI capabilities, Lakebase empowers organizations to rethink and innovate their data architecture. Now is the perfect moment to explore Lakebase, unlocking new possibilities for intelligent and real-time data applications.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

Lakeflow: Revolutionizing SCD2 Pipelines with Change Data Capture (CDC) https://blogs.perficient.com/2025/06/21/lakeflow-revolutionizing-scd2-pipelines-with-change-data-capture-cdc/ https://blogs.perficient.com/2025/06/21/lakeflow-revolutionizing-scd2-pipelines-with-change-data-capture-cdc/#respond Sun, 22 Jun 2025 00:56:47 +0000 https://blogs.perficient.com/?p=383315

Several breakthrough announcements emerged at DAIS 2025, but the Lakeflow updates around building robust pipelines had the most immediate impact on my current code. Specifically, I can now see a clear path to persisting SCD2 (Slowly Changing Dimension Type 2) tables in the silver layer from mutable data sources. If this sentence resonates with you, we share a common challenge. If not, it soon will.

Maintaining history through Change Data Capture is critical for both AI and foundational use cases like Single View of the Customer. However, Delta Live Tables (DLT) pipelines have so far only allowed append-only streaming tables to maintain SCD2 logic, while most data sources permit updates. Let’s dive into the technical challenges and how Lakeflow Connect is solving them.

Slowly Changing Dimensions

There are two options for managing changes: SCD1 and SCD2.

  1. SCD Type 1 is focused on keeping only the latest data. This approach involves overwriting old data with new data whenever a change occurs. No history of changes is kept, and only the latest version of the data is available. This is useful when the history of changes isn’t important, such as correcting errors or updating non-critical fields like customer email addresses or maintaining lookup tables.
  2. SCD Type 2 keeps the historical versions of data. This approach maintains a historical record of data changes by creating additional records to capture different versions of the data over time. Each version of the data is timestamped or tagged with metadata that allows users to trace when a change occurred. This is useful when it’s important to track the evolution of data, such as tracking customer address changes over time for analysis purposes.

While basic operational reporting can get by with SCD1, almost any analytic approach benefits from history. ML models suffer from a lack of data, and AI is more likely to hallucinate without it. Let’s look at a simple example.

Monday Morning Dataset:

id  name  state
1   John  NY
2   Jane  CA
3   Juan  PA

Tuesday Update: John moves from New York to New Jersey.

id  name  state
1   John  NJ
2   Jane  CA
3   Juan  PA
  • SCD1 Result: Overwrites John’s state, leaving only three records.
  • SCD2 Result: Retains John’s NY record and adds a new NJ record, resulting in four records.

The important thing to understand here is that having John’s lifecycle is almost certainly valuable from an analytical perspective. The small additional storage cost is negligible compared to the opportunity lost by simply overwriting the data. As a general rule in the medallion architecture, I like to have SCD2 tables in the silver layer. However, there were some issues with DLT pipelines around this scenario.

Challenges with the APPLY CHANGES API

In the current state, SCD updates are managed through the APPLY CHANGES API. This API was more effective than Spark’s MERGE INTO statement. MERGE INTO is relatively straightforward until you start to factor in edge cases. For example, what if there are several updates to the same key in the same microbatch? What if the changes come in out of order? How do you handle DELETEs? Worse, how do you handle out-of-order DELETEs? However, APPLY CHANGES only worked for append-only data.

In its current state, a DLT pipeline builds a Directed Acyclic Graph (DAG) for all the tables and views in the pipeline using only the metadata of those resources. In many pipelines, the data from the source RDBMS has already been ingested into bronze and is refreshed daily. Let’s look at our sample dataset. On Monday, I run the DLT pipeline. While the pipeline is aware of the table’s metadata, it does not have access to its contents. Imagine a MERGE statement where no current records exist: everything is an insert. Now imagine processing the next day’s data. Again, since only the metadata is loaded into the DAG, APPLY CHANGES has no prior record of John. Effectively, only SCD1 tables can be created from mutable data sources, because the prior data is not available when the graph is built.

The new Lakeflow process provides a mechanism where CDC can be used with the Lakeflow Connector to drive SCD2 semantics even with mutable data.

What is Change Data Capture (CDC)?

Change Data Capture (CDC) is a data integration pattern that captures changes in a source system, such as inserts, updates, and deletes, through a CDC feed. The CDC feed stores a list of changes rather than the whole dataset, which is a performance opportunity. Most transactional databases, like SQL Server, Oracle, and MySQL, can generate CDC feeds automatically. When a row in the source table is updated, the CDC feed receives only the changed rows, plus metadata such as the operation type (UPDATE or DELETE) and a column that deterministically identifies ordering, like a sequence number. There is also an update to APPLY CHANGES called AUTO CDC INTO.
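To make the shape of a CDC feed concrete, here is a small sketch of what Tuesday’s feed for the example above could look like as a DataFrame. The operation and sequence_num column names are illustrative, since real feeds vary by source system, and the code assumes a Databricks notebook where spark is predefined.

  # Minimal sketch of Tuesday's CDC feed rows for the example dataset: only the
  # changed row is present, plus operation and ordering metadata. Column names
  # are illustrative; real feeds vary by source system. Assumes `spark` is predefined.
  cdc_feed = spark.createDataFrame(
      [
          # (id, name, state, operation, sequence_num)
          (1, "John", "NJ", "UPDATE", 2),   # John moved from NY to NJ
      ],
      schema="id INT, name STRING, state STRING, operation STRING, sequence_num INT",
  )
  cdc_feed.show()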

AUTO CDC INTO

There are actually two APIs: AUTO CDC and AUTO CDC FROM SNAPSHOT. They have the same syntax as APPLY CHANGES, but they can now correctly handle more use cases. You may have already guessed that AUTO CDC FROM SNAPSHOT has the same method signature as APPLY CHANGES FROM SNAPSHOT. However, the AUTO CDC API supports periodic ingestion of snapshots with each pipeline update. Since data, and not just metadata, is made available to the API, the call has sufficient information to correctly populate the SCD2 dataset.
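As a sketch of what the pipeline code looks like, here is the established DLT Python API for an SCD2 target (dlt.apply_changes); per the note above, the newer AUTO CDC calls keep the same parameters, so the shape carries over. The source table, key, and sequencing column names are placeholders.

  # Minimal SCD2 sketch using the DLT Python API. The source table, key, and
  # sequencing column are placeholders; per the post, the newer AUTO CDC API keeps
  # the same parameters, so this shape carries over.
  import dlt
  from pyspark.sql.functions import col, expr

  @dlt.view
  def customer_cdc_feed():
      # Read the CDC feed landed in bronze (placeholder table name).
      return spark.readStream.table("bronze.customer_cdc")

  dlt.create_streaming_table("silver_customers_scd2")

  dlt.apply_changes(
      target="silver_customers_scd2",
      source="customer_cdc_feed",
      keys=["id"],
      sequence_by=col("sequence_num"),
      apply_as_deletes=expr("operation = 'DELETE'"),
      except_column_list=["operation", "sequence_num"],
      stored_as_scd_type=2,   # keep full history: old rows are end-dated, new rows appended
  )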

Conclusion

Lakeflow Connect is a game-changer for data engineers, enabling SCD2 tables in the silver layer even with mutable data sources. By leveraging CDC and the new AUTO CDC INTO API, you can maintain historical data accurately, ensuring your AI and ML models have the context they need to perform optimally.

The future of data engineering is here, and it’s built on Lakeflow Connect.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

 
