Agentic AI for Real-Time Pharmacovigilance on Databricks

Adverse drug reaction (ADR) detection is a primary regulatory and patient-safety priority for life sciences and health systems. Traditional pharmacovigilance methods often depend on delayed signal detection from siloed data sources and require extensive manual evidence collection. This legacy approach is time-consuming, increases the risk of patient harm, and creates significant regulatory friction. For solution architects and engineers in healthcare and finance, optimizing data infrastructure to meet these challenges is a critical objective and a real headache.

Combining the Databricks Lakehouse Platform with Agentic AI presents a transformative path forward. This approach enables a closed-loop pharmacovigilance system that detects high-quality safety signals in near-real time, autonomously collects corroborating evidence, and routes validated alerts to clinicians and safety teams with complete auditability. By unifying data and AI on a single platform through Unity Catalog, organizations can reduce time-to-signal, increase signal precision, and provide the comprehensive data lineage that regulators demand. This integrated model offers a clear advantage over fragmented data warehouses or generic cloud stacks.

The Challenges in Modern Pharmacovigilance

To build an effective pharmacovigilance system, engineers must integrate a wide variety of data types. This includes structured electronic health records (EHR) in formats like FHIR, unstructured clinical notes, insurance claims, device telemetry from wearables, lab results, genomics, and patient-reported outcomes. This process presents several technical hurdles:

  • Data Heterogeneity and Velocity: The system must handle high-velocity streams from devices and patient apps alongside periodic updates from claims and EHR systems. Managing these disparate data types and speeds without creating bottlenecks is a significant challenge.
  • Sparse and Noisy Signals: ADR mentions can be buried in unstructured notes, timestamps may conflict across sources, and confounding variables like comorbidities or polypharmacy can obscure true signals.
  • Manual Evidence Collection: When a potential signal is flagged, safety teams often must manually re-query various systems and request patient charts, a process that delays signal confirmation and response.
  • Regulatory Traceability: Every step, from detection to escalation, must be reproducible. This requires clear, auditable provenance for both the data and the models used in the analysis.

The Databricks and Agentic AI Workflow

An agentic AI framework running on the Databricks Lakehouse provides a structured, scalable solution to these problems. This system uses modular, autonomous agents that work together to implement a continuous pharmacovigilance workflow. Each agent has a specific function, from ingesting data to escalating validated signals.

Step 1: Ingest and Normalize Data

The foundation of the workflow is a unified data layer built on Delta Lake. Ingestion & Normalization Agents are responsible for continuously pulling data from various sources into the Lakehouse.

  • Continuous Ingestion: Using Lakeflow Declarative Pipelines and Spark Structured Streaming, these agents ingest real-time data from EHRs (FHIR), claims, device telemetry, and patient reports. Data can be streamed from sources like Kafka or Azure Event Hubs directly into Delta tables.
  • Data Normalization: As data is ingested, agents perform crucial normalization tasks. This includes mapping medical codes to standards like RxNorm, SNOMED, and LOINC. They also resolve patient identities across different datasets using both deterministic and probabilistic linking methods, creating a canonical event timeline for each patient. This unified view is essential for accurate signal detection.
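To make this step more concrete, here is a minimal sketch of continuous ingestion with Spark Structured Streaming into a Delta table. The Kafka broker, topic, checkpoint path, and catalog/table names are assumptions for illustration; a production implementation would more likely be expressed as a Lakeflow Declarative Pipeline with the normalization logic layered on top.

```python
from pyspark.sql import functions as F

# Assumed Kafka topic and Unity Catalog table names, for illustration only.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "device-telemetry")
    .load()
)

# Keep the event time and raw payload; downstream agents handle code mapping
# (RxNorm, SNOMED, LOINC) and identity resolution.
events = (
    raw.select(
        F.col("timestamp").alias("event_ts"),
        F.col("value").cast("string").alias("payload"),
    )
    .withColumn("ingest_date", F.to_date("event_ts"))
)

(
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/Volumes/pv/bronze/checkpoints/device_telemetry")
    .partitionBy("ingest_date")
    .toTable("pv.bronze.device_telemetry")
)
```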

Step 2: Detect Signals with Multimodal AI

Once the data is clean and unified, Signal Detection Agents apply a suite of advanced models to identify potential ADRs. This multimodal approach significantly improves precision.

  • Multimodal Detectors: The system runs several types of detectors in parallel. Clinical Large Language Models (LLMs) and fine-tuned transformers extract relevant entities and context from unstructured clinical notes. Time-series anomaly detectors monitor device telemetry for unusual patterns, such as spikes in heart rate from a wearable.
  • Causal Inference: To distinguish true causality from mere correlation, statistical and counterfactual causal engines analyze the data to assess the strength of the association between a drug and a potential adverse event.
  • Scoring and Provenance: Each potential ADR is scored with an uncertainty estimate. Crucially, the system also attaches provenance pointers that link the signal back to the specific data and model version used for detection, ensuring full traceability.
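As a small illustration of those provenance pointers, the sketch below stamps each candidate signal with the detector identity and the Delta table version it was scored against. The table names and model version are assumptions; the detection models themselves are out of scope here.

```python
from pyspark.sql import functions as F

# Capture the current Delta version of the source notes table (assumed name).
notes_version = (
    spark.sql("DESCRIBE HISTORY pv.silver.clinical_notes LIMIT 1")
    .select("version")
    .first()[0]
)

scored = (
    spark.table("pv.silver.candidate_signals")                  # assumed detector output
    .withColumn("detector_model", F.lit("pv.models.adr_ner"))   # assumed model name
    .withColumn("detector_model_version", F.lit("3"))           # assumed version
    .withColumn("source_table", F.lit("pv.silver.clinical_notes"))
    .withColumn("source_table_version", F.lit(int(notes_version)))
    .withColumn("scored_at", F.current_timestamp())
)

scored.write.mode("append").saveAsTable("pv.gold.adr_signals")
```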

Step 3: Collect Evidence Autonomously

When a candidate signal crosses a predefined confidence threshold, an Evidence Collection Agent is activated. This agent automates what is typically a manual and time-consuming process.

  • Automated Assembly: The agent automatically assembles a complete evidence package. It extracts relevant sections from patient charts, re-runs queries for lab trends, fetches associated genomics variants, and pulls specific windows of device telemetry data.
  • Targeted Data Pulls: If the initial evidence is incomplete, the agent can plan and execute targeted data pulls. For example, it could order a specific lab test, request a clinician chart review through an integrated system, or trigger a patient survey via a connected app to gather more information on symptoms and dosing adherence.

Step 4: Triage and Escalate Signals

With the evidence gathered, a Triage & Escalation Agent takes over. This agent applies business logic and risk models to determine the appropriate next step.

  • Composite Scoring: The agent aggregates all collected evidence and computes a composite risk and confidence score for the signal. It applies configurable business rules based on factors like event severity and regulatory reporting timelines.
  • Intelligent Escalation: For high-risk or ambiguous signals, the agent automatically escalates the issue to human safety teams by creating tickets in systems like Jira or ServiceNow. For clear, high-confidence signals that pose a lower operational risk, the system can be configured to auto-generate regulatory reports, such as 15-day expedited submissions, where permitted.
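A simplified sketch of how such a triage agent might combine the evidence into a composite score and pick an escalation path is shown below. The weights, thresholds, and severity handling are illustrative assumptions, not clinical or regulatory guidance.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    detector_score: float   # 0-1 output of the signal detection agents
    causal_strength: float  # 0-1 output of the causal inference engine
    completeness: float     # 0-1 share of the evidence package assembled
    severity: str           # e.g. "serious" or "non-serious"

SEVERITY_WEIGHT = {"serious": 1.0, "non-serious": 0.6}

def composite_score(e: Evidence) -> float:
    # Illustrative weighted aggregation; the weights are assumptions.
    base = 0.5 * e.detector_score + 0.3 * e.causal_strength + 0.2 * e.completeness
    return base * SEVERITY_WEIGHT.get(e.severity, 0.8)

def route(e: Evidence, review_threshold: float = 0.6, auto_report_threshold: float = 0.9) -> str:
    score = composite_score(e)
    if e.severity == "serious" or review_threshold <= score < auto_report_threshold:
        return "escalate_to_safety_team"        # e.g. open a Jira/ServiceNow ticket
    if score >= auto_report_threshold:
        return "auto_generate_regulatory_report"
    return "continue_monitoring"

print(route(Evidence(0.85, 0.7, 0.9, "serious")))  # -> escalate_to_safety_team
```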

Step 5: Enable Continuous Learning

The final agent in the workflow closes the loop, ensuring the system improves over time. The Continuous Learning Agent uses feedback from human experts to refine the AI models.

  • Feedback Integration: Outcomes from chart reviews, follow-up labs, and final regulatory adjudications are fed back into the system’s training pipelines.
  • Model Retraining and Versioning: This new data is used to retrain and refine the signal detectors and causal models. MLflow tracks these updates, versioning the new models and linking them to the training data snapshot. This creates a fully auditable and continuously improving system that meets strict regulatory standards for model governance.
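The sketch below shows one way the retraining step could be logged and versioned with MLflow against Unity Catalog. The experiment path, tags, and model name are assumptions, and synthetic features stand in for the real feedback data.

```python
import mlflow
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in features; in practice these come from the curated feedback tables.
rng = np.random.default_rng(7)
X, y = rng.normal(size=(500, 8)), rng.integers(0, 2, size=500)

mlflow.set_registry_uri("databricks-uc")               # register models in Unity Catalog
mlflow.set_experiment("/Shared/pv_signal_detector")    # assumed experiment path

with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=500).fit(X, y)
    mlflow.log_metric("train_auc", roc_auc_score(y, model.predict_proba(X)[:, 1]))
    # Link the run to the data snapshot used for training (auditability).
    mlflow.set_tag("training_table", "pv.gold.adr_feedback")   # assumed table name
    mlflow.set_tag("training_table_version", "42")             # assumed Delta version
    mlflow.sklearn.log_model(model, "model")

mlflow.register_model(f"runs:/{run.info.run_id}/model",
                      "pv.models.adr_signal_detector")         # assumed UC model name
```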

The Technical Architecture on Databricks

The power of this workflow comes from the tightly integrated components of the Databricks Lakehouse Platform.

  • Data Layer: Delta Lake serves as the single source of truth, storing versioned tables for all data types. Unity Catalog manages fine-grained access policies, including row-level masking, to protect sensitive patient information.
  • Continuous ETL & Feature Store: Delta Live Tables provide schema-aware pipelines for all data engineering tasks, while the integrated Feature Store offers managed feature views for models, ensuring consistency between training and inference.
  • Detection & Inference: Databricks provides integrated GPU clusters for training and fine-tuning clinical LLMs and other complex models. MLflow tracks experiments, registers model versions, and manages deployment metadata.
  • Agent Orchestration: Lakeflow Jobs coordinate the execution of all agent tasks, handling scheduling, retries, and dependencies. The agents themselves can be lightweight microservices or notebooks that interact with Databricks APIs.
  • Serving & Integrations: The platform offers low-latency model serving endpoints for real-time scoring. It can integrate with clinician portals via SMART-on-FHIR, ticketing systems, and messaging services to facilitate human-in-the-loop workflows.

Why This Approach Outperforms Alternatives

Architectures centered on traditional data warehouses like Snowflake often struggle with this use case because they separate storage from heavy ML compute. Tasks like LLM inference and streaming feature engineering require external GPU clusters and complex orchestration, which introduces latency, increases operational overhead, and fractures data lineage across systems. Similarly, a generic cloud stack requires significant integration effort to achieve the same level of data and model governance.

The Databricks Lakehouse co-locates multimodal data, continuous pipelines, GPU-enabled model lifecycles, and governed orchestration on a single, unified platform. This integration dramatically reduces friction and provides a practical, auditable, and scalable path to real-time pharmacovigilance. For solution architects and engineers, this means a faster, more reliable way to unlock real-time insights from complex healthcare data, ultimately improving patient safety and ensuring regulatory compliance.

Conclusion

By harnessing Databricks’ unified Lakehouse architecture and agentic AI, organizations can transform pharmacovigilance from a reactive, manual process into a proactive, intelligent system. This workflow not only accelerates adverse drug reaction detection but also streamlines evidence collection and triage, empowering teams to respond swiftly and accurately. The platform’s end-to-end traceability, scalable automation, and robust data governance support stringent regulatory demands while driving operational efficiency. Ultimately, implementing this modern approach leads to better patient outcomes, reduced risk, and a future-ready foundation for safety monitoring in life sciences.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

Agentic AI Closed-Loop Systems for N-of-1 Treatment Optimization on Databricks

Precision therapeutics for rare diseases and complex oncology cases may benefit from Agentic AI Closed-Loop (AACL) systems that enable individualized treatment optimization: a continuous process of proposing, testing, and adapting therapies for a single patient (N-of-1 trials).

N-of-1 problems are not typical for either clinicians or data systems. Type 2 diabetes in the US is more of an N-of-3.8×10^7 problem, so we’re looking at a profoundly different category of scaling. This lower number is not easier, because it implies existing treatment protocols have not been successful. N-of-1 optimization can discover effective regimens rapidly, but only with a data system that can manage dense multimodal signals (omics, time-series biosensors, lab results), provide fast model iteration, incorporate clinician-in-the-loop safety controls, and ensure rigorous provenance. We also need to consider the heavy cognitive load the clinician will be under. While traditional data analytics and machine learning algorithms will still play a key role, Agentic AI support can be invaluable.

Agentic AI Closed-Loop systems are relatively new, so let’s look at what a system designed to support this architecture would look like from the ground up.

Data Platform

First, let’s define the foundation of what we are trying to build. We need a clinical system that can deliver reproducible results with full lineage and enable safe automation to augment clinical judgement. That’s a decent overview of any clinical data system, so I feel like we’re on solid ground. I would posit that individualized treatment optimization needs shorter iteration times than the standard, because the smaller N means we have moved farther from the standard of care (SoC), so there will likely be more experiments. Further, these experiments will need more clever validations. Siloed and fragmented data stores, disconnected data, disjoint model operationalization, and heavy ETL are non-starters based on our foundational assumptions. A data lakehouse is a more appropriate architecture.

A data lakehouse is a unified data architecture that blends the low-cost, flexible storage of a data lake with the structure and management capabilities of a data warehouse. This combined approach allows organizations to store and manage both structured and unstructured data types on cost-effective cloud storage, while also providing high-performance analytics, data governance, and support for ML and AI workloads on the same data. Databricks currently has the most mature lakehouse implementation. Databricks is well known for handling multimodal data, so the variety of data is not a problem even at high volume.

Clinical processes are heavily regulated. Fortunately, Unity Catalog provides a high level of security and governance across your data, ML, and AI artifacts. Databricks provides a platform that can deliver auditable, regulatory-grade systems in a much more efficient and effective way than siloed data warehouses or other cloud data stacks. Realistically, data provenance alone is not sufficient to align the clinician’s cognitive load with the smaller N; it’s still a very hard problem. Honestly, since we have had lakehouses for some time and have not been able to reliably tackle N-of-1 at scale, the problem can’t solely be with the data system. This is where Agentic AI enters the scene.

Agentic AI

Agentic AI refers to systems of autonomous agents, modular reasoning units that plan, execute, observe, and adapt, orchestrated to complete complex workflows. Architecturally, Agentic AI running on Databricks’ Lakehouse platform uniquely enables safe, scalable N-of-1 systems by co-locating multimodal data, high-throughput model training, low-latency inference, and auditable model governance. This architecture accelerates time-to-effective therapy, reduces clinician cognitive load, and preserves regulatory-grade provenance in ways that are materially harder to deliver on siloed data warehouses or generic cloud stacks. Here are some examples of components of the Agentic AI system that might be used as a foundation for building our N-of-1 therapeutics system. There can and will be more agents, but they will likely be used to enhance or support this basic set.

  • Digital Twin Agents compile the patient’s multimodal state and historic responses.
  • Planner/Policy Agents propose treatment variants (dose, schedule, combination) using constrained optimization informed by transfer learning from cohort data.
  • Evaluation Agents collect outcome signals (biosensors, labs, imaging), compute reward/utility, and update the digital twin.
  • Safety/Compliance Agents enforce clinical constraints, route proposals for clinician review when needed, and produce provenance records.

For N-of-1 therapeutics, there are distinct advantages to designing agents to form a closed loop. Let’s discuss why.

Agentic AI Closed Loop System

Agentic AI Closed Loops (AACL)  enable AI systems to autonomously perceive, decide, act, and adapt within self-contained feedback cycles. The term “agentic” underscores the AI’s ability to proactively pursue goals without constant human oversight, while “closed loop” highlights its capacity to refine performance through internal feedback. This synergy empowers AACL systems to move beyond reactive processing, anticipating challenges and optimizing outcomes in real time. This is how we scale AI to realistically address clinician cognitive load within a highly regulated clinical framework.

  • Perception: The AI gathers information from its Digital Twin, among other sources.
  • Reasoning and Planning: Based on its goals and perceived data of the current test iteration, the AI breaks down the objective into a sequence of actionable steps.
  • Action: The AI executes its plan, often through the Planner/Policy Agents.
  • Feedback and Learning: The system evaluates the outcome of its actions through the Evaluation Agents and compares them against its goals, referencing the Safety/Compliance Agents. It then learns from this feedback to refine its internal models and improve its performance in the next cycle. 
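To make the cycle concrete, here is a toy closed-loop skeleton in plain Python. The agents are stub functions, the dosing rule and simulated response are purely illustrative assumptions, and nothing here is clinical guidance.

```python
import random

def digital_twin_summary(history):
    """Perception: condense the patient's recent response history."""
    return {"last_response": history[-1]["response"] if history else None}

def planner_propose(state, start_dose=10, max_dose=40):
    """Planning: bounded dose adjustment (illustrative policy only)."""
    last = state["last_response"]
    if last is None:
        return {"dose": start_dose}
    return {"dose": min(max_dose, start_dose + round(20 * (1 - last)))}

def safety_check(proposal, max_dose=40):
    """Safety/Compliance: enforce hard constraints before acting."""
    return proposal["dose"] <= max_dose

def evaluate(proposal):
    """Evaluation: collect outcome signals (simulated here); 1.0 is an ideal response."""
    return {"dose": proposal["dose"], "response": random.random()}

history = []
for cycle in range(5):
    state = digital_twin_summary(history)        # Perception
    proposal = planner_propose(state)            # Reasoning and planning
    if not safety_check(proposal):               # Clinician-in-the-loop gate
        continue                                 # in practice: route for review, log provenance
    history.append(evaluate(proposal))           # Action + feedback into the twin
```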

AACL systems are modular frameworks. Let’s wrap up with a proposed reference architecture for an AACL system on Databricks.

AACL on Databricks

We’ll start with a practical implementation of the data layer. Delta Lake provides versioned tables for EHR (FHIR-parquet), structured labs, medication history, genomics variants, and treatment metadata. Time-series data like high-cardinality biosensor streams can be ingested via Spark Structured Streaming into Delta tables using time partitioning and compaction. Databricks Lakeflow is a solid tool for this. Patient and cohort embeddings can be stored as vector columns or integrated with a co-located vector index.

The Feature and ETL Layer builds on Lakeflow’s capabilities. A declarative syntax and a UI provide a low-code solution for building continuous pipelines that normalize clinical codes and compute rolling features like time-windowed response metrics. Databricks Feature Store patterns enable reusable feature views for model inputs and predictors.
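As a small example of those rolling features, the sketch below computes 24-hour windowed heart-rate statistics with PySpark and writes them to a feature table. The catalog, table, and column names are assumptions.

```python
from pyspark.sql import functions as F

# Assumed biosensor table with columns: patient_id, ts (timestamp), heart_rate.
biosensor = spark.table("n1.silver.biosensor_readings")

rolling = (
    biosensor
    .groupBy("patient_id", F.window("ts", "24 hours", "1 hour").alias("w"))
    .agg(
        F.avg("heart_rate").alias("hr_mean_24h"),
        F.stddev("heart_rate").alias("hr_std_24h"),
        F.max("heart_rate").alias("hr_max_24h"),
    )
    .select(
        "patient_id",
        F.col("w.end").alias("feature_time"),
        "hr_mean_24h", "hr_std_24h", "hr_max_24h",
    )
)

rolling.write.mode("overwrite").saveAsTable("n1.features.hr_rolling_24h")
```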

Databricks provides distributed GPU clusters for the model and agent layer, as well as access to foundation and custom AI models. Lakeflow Jobs orchestrate agent execution, coordinate microservices (consent UI, clinician portal, device provisioning), and manage retries.

MLflow manages most of the heavy lifting for serving and integration. You can serve low-latency policy and summarization endpoints while supporting canary deployments and A/B testing. The integration endpoints can supply secure APIs for EHR actionability (SMART on FHIR) and clinician dashboards. You can also ensure the system meets audit and governance standards and practices using the MLflow Model Registry as well as Unity Catalog for data and model access control.
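On the serving side, a hedged sketch of querying a deployed policy endpoint through the MLflow deployments client is shown below. The endpoint name and input schema are assumptions and depend entirely on how the policy model was logged.

```python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Hypothetical endpoint and payload; match the signature of your logged model.
response = client.predict(
    endpoint="n1-policy-agent",
    inputs={"dataframe_records": [{
        "patient_id": "p-001",
        "hr_mean_24h": 72.4,
        "hr_std_24h": 5.1,
        "current_dose": 20,
    }]},
)
print(response)
```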

Conclusion

Agentic AI closed-loop systems on a Databricks lakehouse offer an auditable, scalable foundation for rapid N-of-1 treatment optimization in precision therapeutics—especially for rare disease and complex oncology—by co-locating multimodal clinical data (omics, biosensors, labs), distributed GPU training, low-latency serving, and model governance (MLflow, Unity Catalog). Implementing Digital Twin, Planner/Policy, Evaluation, and Safety agents in a closed-loop workflow shortens iteration time, reduces clinician cognitive load, and preserves provenance for regulatory compliance, while reusable feature/ETL patterns, time-series versioning (Delta Lake), and vector indexes enable robust validation and canary deployments. Start with a strong data layer, declarative pipelines, and modular agent orchestration, then iterate with clinician oversight and governance to responsibly scale individualized N-of-1 optimizations and accelerate patient-specific outcomes.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

 

 

 

A Recipe to Boost Predictive Modeling Efficiency

Implementing predictive analytical insights has become essential for organizations to operate efficiently and remain relevant. What is important while doing this, though, is to be agile and adaptable: what holds valid for a period can easily become obsolete over time, and what is characteristic of one group of customers can vary widely across a diverse audience. Therefore, going from an envisioned innovative business idea to a working AI/ML model requires a mechanism that allows for a rapid, AI-driven approach.

In this post, I explain how Databricks, GitHub Copilot, and the Visual Studio Code IDE (VS Code) together offer an elevated experience for implementing predictive ML models efficiently. Even with minimal coding and data science experience, one can build, test, and deploy predictive models. The synergy between GitHub Copilot in VS Code, MLflow, and Databricks Experiments is remarkable. Here is how this approach goes.

Prerequisites

Before starting, there are a few one-time setup steps to configure VS Code so it’s well-connected to a Databricks instance. The aim here is to leverage Databricks compute (Serverless works too) which provides easy access to various Unity Catalog components (such as tables, files, and ML models).
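As a rough sketch of that one-time setup, the snippet below opens a Databricks Connect session from VS Code against serverless compute and runs a quick smoke test. It assumes databricks-connect is installed and a Databricks CLI profile (host plus auth) is already configured; the serverless() builder option requires a recent databricks-connect release, so check your version's documentation.

```python
from databricks.connect import DatabricksSession

# Uses the default CLI profile; serverless() is assumed to be available in your
# databricks-connect version.
spark = DatabricksSession.builder.serverless(True).getOrCreate()

# Smoke test: query a sample Unity Catalog table via the remote serverless compute.
print(spark.table("samples.nyctaxi.trips").limit(5).count())
```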

Define the Predictive Modeling Agent Prompt in Natural Language

Use the GitHub Copilot Agent with an elaborate plain-language prompt that provides the information it needs to devise the complete solution. This is where the real effort lies. Below are the important points to include in the agent prompt that I found produce a more successful outcome with fewer iterations.

  • Data Sources: Tell the Agent about the source data, and not just in technical terms but also functionally so it considers the business domain that it applies to. You can provide the table names where it will source data from in the Unity Catalog and Schema. It also helps to explain the main columns in the source tables and what the significance of each column is. This enables the agent to make more informed decisions on how to use the source data and whether it will need to transform it. The explanations also result in better feature engineering decisions to feed into the ML models.
  • Explain the Intended Outcome: Here is where one puts their innovative idea in words. What is the business outcome? What type of prediction are you looking for? Are there multiple insights that need to be determined? Are there certain features of the historical data that need to be given greater weight when determining the next best action or a probability of an event occurring? In addition to predicting events, are you interested in knowing the expected timeline for an event to occur?
  • Databricks Artifact Organization: If you’re looking to stick to standards followed in managing Databricks content, you can provide additional directions as part of the prompt. For instance, what are the exact names to use for notebooks, tables, models, etc. It also helps to be explicit about how VS Code will run the code. Instructing it to use Databricks Connect with a default serverless compute configuration eliminates the need to manually set up a Databricks connection through code. In addition, instructing the agent to leverage the Databricks Experiments capability to enable model accessibility through the Databricks UI ensures that one can easily monitor model progress and metrics.
  • ML Model Types to Consider: Experiments in Databricks are a great way of effectively comparing several algorithms simultaneously (e.g., Random Forest, XGBoost, Logistic Regression, etc.). If you have a good idea of what type of ML algorithms are applicable for your use case, you can include one or more of these in the prompt so the generated experiment is more tailored. Alternatively, let the agent recommend several ML models that are most suitable for the use case.
  • Operationalizing the Models: In the same prompt, one can provide instructions on choosing the most accurate model, registering it in Unity Catalog, and applying it to new batch or streaming data inferences. You can also be specific about which activities will be organized together as combined vs. separate notebooks for ease of scheduling and maintenance.
  • Synthetic Data Generation: Sometimes data is not readily available to experiment with, but one has a good idea of what it will look like. Here is where Copilot and the Python Faker library are advantageous in synthesizing mock data that mimics real data. This may be necessary not just for creating experiments but for testing models as well. Including instructions in the prompt for what type of synthetic data to generate allows Copilot to integrate cells in the notebook for that purpose.

With all the necessary details included in the prompt, Copilot is able to interpret the intent and generate a structured Python notebook with organized cells to handle:

  • Data Sourcing and Preprocessing
  • Feature Engineering
  • ML Experiment Setup
  • Model Training and Evaluation
  • Model Registration and Deployment

All of this is orchestrated from your local VS Code environment, but executed on Databricks compute, ensuring scalability and access to enterprise-grade resources.
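The generated notebook will differ per use case, but a condensed version of the experiment-setup and training cells might look like the sketch below. The experiment path is hypothetical, and synthetic data stands in for the features produced by the sourcing and feature-engineering cells.

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and binary label.
X, y = make_classification(n_samples=1000, n_features=12, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

mlflow.set_experiment("/Shared/churn_prediction")   # hypothetical experiment path

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, model in candidates.items():
    with mlflow.start_run(run_name=name):
        model.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        mlflow.log_param("model_type", name)
        mlflow.log_metric("test_auc", auc)
        mlflow.sklearn.log_model(model, "model")   # best run is registered downstream
```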

The Benefits

Following are key benefits to this approach:

  • Minimal Coding Required: This applies not just for the initial model tuning and deployment but for improvement iterations also. If there is a need to tweak the model, just follow up with the Copilot Agent in VS Code to adjust the original Databricks notebooks, retest and deploy them.
  • Enhanced Productivity: By leveraging the Databricks Experiments APIs, we’re able to automate tasks like creating experiments, logging parameters, metrics, and artifacts within training scripts, and integrate MLflow tracking into CI/CD pipelines. This allows for seamless, repeatable workflows without manual intervention. Programmatically registering, updating, and managing model versions in the MLflow Model Registry is also streamlined through the APIs used from VS Code.
  • Leverage User-Friendly UI Features in Databricks Experiments: Even though the ML approach described here is ultimately driven by auto-generated code, that doesn’t mean we’re unable to take advantage of the rich Databricks Experiments UI. As the code executes in VS Code on Databricks compute, we’re able to log in to the Databricks interactive environment to inspect individual runs, review logged parameters, metrics, and artifacts, and compare different runs side-by-side to debug models or understand experimental results.

In summary, the synergy between GitHub Copilot, VS Code, and Databricks empowers users to go from idea to deployed ML models in hours, not weeks. By combining the intuitive coding assistance of GitHub Copilot with the robust infrastructure of Databricks and the flexibility of VS Code, predictive modeling becomes accessible and scalable.

Salesforce to Databricks: A Deep Dive into Integration Strategies

Supplementing Salesforce with Databricks as an enterprise Lakehouse solution brings advantages for various personas across an organization. Customer experience data is highly valued when it comes to driving personalized customer journeys that leverage company-wide applications beyond Salesforce. From enhanced customer satisfaction to tailored engagements and offerings that drive business renewals and expansions, the advantages are hard to miss. Databricks maps data from a variety of enterprise apps, including those used by Sales, Marketing, and Finance. Consequently, layering Databricks generative AI and predictive ML capabilities on top provides easily accessible best-fit recommendations that help eliminate challenges and highlight success areas within your company’s customer base.

In this blog, I elaborate on the different methods whereby Salesforce data is made accessible from within Databricks. While accessing Databricks data from Salesforce is possible, it is not the topic of this post and will perhaps be tackled in a later blog. I have focused on the built-in capabilities within both Salesforce and Databricks and have therefore excluded 3rd party data integration platforms. There are three main ways to achieve this integration:

  1. Databricks Lakeflow Ingestion from Salesforce
  2. Databricks Query Federation from Salesforce Data Cloud
  3. Databricks Files Sharing from Salesforce Data Cloud

Choosing the best approach to use depends on your use case. The decision is driven by several factors, such as the expected latency of accessing the latest Salesforce data, the complexity of the data transformations needed, and the volume of Salesforce data of interest. And it may very well be that more than one method is implemented to cater for different requirements.

While the first method copies the raw Salesforce data over to Databricks, methods 2 and 3 offer no-copy alternatives, thus leveraging Salesforce Data Cloud itself as the raw data layer. The no-copy alternatives are attractive in that they leverage Salesforce’s native capability of managing its own data lake, eliminating the overhead of redoing that effort. However, there are limitations, depending on the use case. The matrix below presents how each method compares when factoring in the key criteria for integration.

| Criteria | Lakeflow Ingestion | Salesforce Data Cloud Query Federation | Salesforce Data Cloud File Sharing |
| --- | --- | --- | --- |
| Type | Data Ingestion | Zero-Copy | Zero-Copy |
| Supports Salesforce Data Cloud as a Source? | ✔ Yes | ✔ Yes | ✔ Yes |
| Incremental Data Refreshes | ✔ Automated processing into Databricks based on SF standard timestamp fields; formula fields always require a full refresh | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) |
| Processing of Soft Deletes | ✔ Yes, supported incrementally | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) |
| Processing of Hard Deletes | Requires a full refresh | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) | ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks) |
| Query Response Time | ✔ Best; data is queried from a local copy and processed within Databricks | ⚠ Slower; query response depends on SF Data Cloud and data travels across networks | ⚠ Slower; data travels across networks |
| Supports Real-Time Querying? | No; the pipeline runs on a schedule to copy data (e.g., hourly, daily) | ✔ Yes; live query execution on SF Data Cloud (the Data Cloud DLO is refreshed from Salesforce modules in batches, streaming (every 3 min), or in real time) | ✔ Yes; live data sourced from SF Data Cloud (the Data Cloud DLO is refreshed from Salesforce modules in batches, streaming (every 3 min), or in real time) |
| Supports Databricks Streaming Pipelines? | ✔ Yes, with Declarative Pipelines into streaming tables (DLT), running as micro-batch jobs | No | No |
| Suitable for High Data Volume? | ✔ Yes; the SF Bulk API is called for high data volumes such as initial loads, and the SF REST API for lower-volume incremental loads | No; reliant on JDBC query pushdown limitations and SF performance | ⚠ Moderate; more suitable than Query Federation for zero-copy with high volumes of data |
| Supports Data Transformation | ⚠ No direct transformation; ingests SF objects as-is, with transformation downstream in the Declarative Pipeline | ✔ Yes; Databricks pushes queries to Salesforce using the JDBC protocol | ✔ Yes; transformations execute on Databricks compute |
| Protocol | SF REST API and Bulk API over HTTPS | JDBC over HTTPS | Salesforce Data Cloud DaaS APIs over HTTPS (file-based access) |
| Scalability | Up to 250 objects per pipeline; multiple pipelines are allowed | Depends on SF Data Cloud performance when running transformations with multiple objects | Up to 250 Data Cloud objects may be included in a data share; up to 10 data shares |
| Salesforce Prerequisites | API-enabled Salesforce user with access to the desired objects | Salesforce Data Cloud must be available; Data Cloud DMOs mapped to DLOs with Streams or other methods for Data Lake population; JDBC API access to Data Cloud enabled | Salesforce Data Cloud must be available; Data Cloud DMOs mapped to DLOs with Streams or other methods for Data Lake population; a data share target created in SF with the shared objects |
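For method 2, a hedged sketch of wiring up query federation from a Databricks notebook is shown below. The connection type is the one documented for Lakehouse Federation with Salesforce Data Cloud, but the option names and values here are placeholders that vary by authentication method, so confirm them against the current documentation before use.

```python
# All identifiers and OPTIONS values below are placeholders / assumptions.
spark.sql("""
  CREATE CONNECTION IF NOT EXISTS sfdc_data_cloud_conn
  TYPE salesforce_data_cloud
  OPTIONS (
    client_id     '<connected-app-client-id>',
    client_secret '<connected-app-client-secret>',
    instance_url  'https://<your-org>.my.salesforce.com'
  )
""")

# A catalog-level option (e.g. the Data Cloud dataspace) may also be required.
spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS sfdc_data_cloud
  USING CONNECTION sfdc_data_cloud_conn
""")

# Queries run against Data Cloud in place; no data is copied into Databricks.
spark.sql("SHOW SCHEMAS IN sfdc_data_cloud").show()
```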

If you’re looking for guidance on leveraging Databricks with Salesforce, reach out to Perficient for a discussion with Salesforce and Databricks specialists.

Databricks Lakebase – Database Branching in Action

What is Databricks Lakebase?

Databricks Lakebase is a Postgres OLTP engine integrated into the Databricks Data Intelligence Platform. A database instance is a compute type that provides fully managed storage and compute resources for a Postgres database. Lakebase uses an architecture that separates compute and storage, which allows independent scaling while supporting low-latency (<10 ms), high-concurrency transactions.

Databricks has integrated this powerful Postgres engine with sophisticated capabilities gained through Databricks’ recent acquisition of Neon. Lakebase is fully managed by Databricks, which means no infrastructure has to be provisioned or maintained separately. In addition to being a traditional OLTP engine, Lakebase offers the following features:

  • Openness: Lakebase is built on open-source standards.
  • Storage and compute separation: Lakebase stores data in data lakes in open formats, enabling storage and compute to scale independently.
  • Serverless: Lakebase is lightweight, meaning it can scale up and down instantly based on load. It can scale down to zero, at which point the cost is for data storage only; no compute cost applies.
  • Modern development workflow: Branching a database is as simple as branching a code repository, and it happens near-instantly.
  • Built for AI Agents: Lakebase is designed to support a large number of AI agents. Its branching and checkpointing capabilities enable AI agents to experiment and rewind to any point in time.
  • Lakehouse Integration: Lakebase makes it easy to combine operational, analytical, and AI systems without complex ETL pipelines.

In this article, we shall discuss in detail about how database branching feature works in Lakebase.

Database Branching

Database branching is one of the unique features introduced in Lakebase that enables branching out a database. It mirrors exactly how a code branch is created from an existing branch in a repository.

Branching a database is useful for creating an isolated test environment or for point-in-time recovery. Lakebase uses a copy-on-write branching mechanism to create an instant zero-copy clone of the database, with dedicated compute to operate on that branch. With a zero-copy clone, a branch of a parent database of any size can be created instantly.

The child branch is managed independently of the parent branch. With an isolated child branch, one can test and debug against a copy of production data. Though the parent and child databases appear separate, both instances physically point to the same data pages. Under the hood, the child database references the data pages the parent points to. When data changes in the child branch, a new data page is created with the changes, and it is visible only to that branch. Changes made in a branch do not reflect in the parent branch.

How branching works

The below diagrams represent how database branching works under the hood:

[Diagram: Database Branching]

[Diagram: Database Branching Updates]

Lakebase in action

Here is a demonstration of how a Lakebase instance can be created, how an instance is branched out, and how table changes behave.

To create a Lakebase instance, log in to Databricks and navigate to Compute -> OLTP Database tab -> click the “Create New Instance” button.

[Screenshot: Create New Instance]

[Screenshot: Create New Instance - Success]

Click “New Query” to launch the SQL Editor for the PostgreSQL database. In the current instance, let’s create a new table and add some records.

[Screenshot: pginstance1 - create table]

[Screenshot: pginstance1 - query table]

Let’s create a database branch “pginstance2” from instance “pginstance1”. Go to Compute -> OLTP Database -> Create Database instance.

Enter a new instance name and expand “Advanced Settings” -> enable the “Create from parent” option -> enter the source instance name “pginstance1”.

Under “Include data from parent up to”, select the “Current point in time” option. Here, we could also choose any specific point in time.

[Screenshot: Create pginstance2]

[Screenshot: pginstance2 created successfully]

Launch the SQL Editor from the pginstance2 database instance and query the tbl_user_profile table.

[Screenshot: pginstance2 - query tbl_user_profile]

Now, let’s insert a new record and update an existing record in the tbl_user_profile table in pginstance2.

[Screenshot: pginstance2 - update tbl_user_profile]

Now, let’s switch back to the parent database instance pginstance1 and query the tbl_user_profile table. The table in pginstance1 should still have only 3 records. All the changes made to tbl_user_profile should be visible only in pginstance2.

[Screenshot: pginstance1 - query tbl_user_profile]

Conclusion

Database changes made in one branch do not impact or reflect in another branch, thereby providing clear isolation of databases at scale. Currently, Lakebase does not have a feature to merge database branches. However, Databricks is committed to and working toward database merge capability in the near future.

Celebrating Perficient’s Third Databricks Champion

We’re excited to welcome Bamidele James as Perficient’s newest and third Databricks Champion!  His technical expertise, community engagement, advocacy, and mentorship have made a profound impact on the Databricks ecosystem.

His Nomination Journey

Bamidele’s journey through the nomination process was rigorous. It required evidence that he has successfully delivered multiple Databricks projects, received several certifications, completed an instructor-led training course, and participated in a panel interview with the Databricks committee.

What This Achievement Means

This achievement represents peer and leadership recognition of Bamidele’s knowledge, contributions, and dedication to building strong partnerships. It also brings him a sense of purpose and pride to know that his work has made a real impact, and his continuous efforts are appreciated.

Contributing to Databricks’ and Perficient’s Growth

Bamidele plays a pivotal role in helping our clients unlock the full potential of Databricks by aligning Perficient’s Databricks capabilities with their business goals. He enables enterprise customers to accelerate their data and AI transformation to deliver measurable outcomes like reduced time-to-insight, improved operational efficiency, and increased revenue. In addition, Bamidele has led workshops, held executive briefings, and developed proofs of concept that help our clients drive adoption and deepen customer engagement.

“Being a Databricks Champion affirms that my contributions, knowledge, and dedication to building strong partnerships are recognized by my peers and leadership.” – Bamidele James, Technical Architect

Skills Key to This Achievement

Many skills and proficiencies—including data engineering and architecture, machine learning and AI, cloud platforms, data governance and security, solution selling, stakeholder management, and strategic thinking—played a part in Bamidele becoming a Databricks Champion. To anyone wishing to follow a similar path, Bamidele recommends mastering the platform, attaining deep technical expertise, and focusing on real-world impact.

Looking Ahead

Bamidele looks forward to using Databricks to create innovative tools and solutions that drive success for our clients. He’s also excited about trends and Databricks innovations including multi-tab notebooks, Databricks Lakeflow, the new SQL interface, and SQL pipeline syntax.

Perficient + Databricks

Perficient is proud to be a trusted Databricks elite consulting partner with more than 130 certified consultants. We specialize in delivering tailored data engineering, analytics, and AI solutions that unlock value and drive business transformation.

Learn more about our Databricks partnership.

 

 

 

Unlocking Business Success with Databricks One

Business users don’t use notebooks. Full stop. And for that reason, most organizations don’t have business users accessing the Databricks UI. This has always been a fundamental flaw in Databricks’ push to democratize data and AI. This disconnect is almost enshrined in the medallion architecture: Bronze is for system accounts, data scientists with notebooks use the Silver layer, and Gold is for business users with reporting tools. This approach has been enough to take an organization part of the way towards self-service analytics. This approach is not working for GenAI, though. This was a major frustration with Genie Spaces. It was a tool made for business users but embedded in an IT interface. Databricks One is looking to change all that.

Using Databricks One

Databricks One is a unified platform experience that provides business users with a single point of entry into their data ecosystem. It removes technical complexity and offers a curated environment to interact with data, AI models, dashboards, and apps efficiently. Core features of Databricks One include:

  • AI/BI Dashboards: Users can view, explore, and drill into key KPIs and metrics without technical setup.
  • AI/BI Genie: A conversational AI interface allowing users to ask natural language questions like “Why did sales drop in April?” or “What are the top-performing regions?”
  • Custom Databricks Apps: Tailored applications that combine analytics, workflows, and AI models to meet specific business needs.
  • Content Browsing by Domain: Content is organized into relevant business areas such as “Customer 360” and “Marketing Campaign Performance,” fostering easy discovery and collaboration.

Administering Databricks One

Administrators can give users access to Databricks One via a consumer access entitlement. This is a basic, read-only entry point for business users that gives access to a simplified workspace that focuses on consuming dashboards, Genie spaces and Apps. Naturally, users will be working with Unity Catalog’s unified data access controls to maintain governance and security.

Conclusion

This is a very short blog because I try not to comment too early on pre-release features and Databricks One is scheduled for a beta release later this summer. This is more than just an incremental feature for a lot of our enterprise clients, though. I am looking at Databricks One as a fundamental architectural component for large enterprise implementations. I feel this is a huge step forward for practical data and intelligence democratization and I was just too excited to wait for more details.

Perficient is a Databricks Elite Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

Understanding Clean Rooms: A Comparative Analysis Between Databricks and Snowflake

Clean rooms” have emerged as a pivotal data sharing innovation with both Databricks and Snowflake providing enterprise alternatives.

Clean rooms are secure environments designed to allow multiple parties to collaborate on data analysis without exposing sensitive details of data. They serve as a sandbox where participants can perform computations on shared datasets while keeping raw data isolated and secure. Clean rooms are especially beneficial in scenarios like cross-company research collaborations, ad measurement in marketing, and secure financial data exchanges.

Uses of Clean Rooms:

  • Data Privacy: Ensures that sensitive information is not revealed while still enabling data analysis.
  • Collaborative Analytics: Allows organizations to combine insights without sharing the actual data, which is vital in sectors like finance, healthcare, and advertising.
  • Regulatory Compliance: Assists in meeting stringent data protection norms such as GDPR and CCPA by maintaining data sovereignty.

Clean Rooms vs. Data Sharing

While clean rooms provide an environment for secure analysis, data sharing typically involves the actual exchange of data between parties. Here are the major differences:

  • Security:
    • Clean Rooms: Offer a higher level of security by allowing analysis without exposing raw data.
    • Data Sharing: Involves sharing of datasets, which requires robust encryption and access management to ensure security.
  • Control:
    • Clean Rooms: Data remains under the control of the originating party, and only aggregated results or specific analyses are shared.
    • Data Sharing: Data consumers can retain and further use shared datasets, often requiring complex agreements on usage.
  • Flexibility:
    • Clean Rooms: Provide flexibility in analytics without the need to copy or transfer data.
    • Data Sharing: Offers more direct access, but less flexibility in data privacy management.

High-Level Comparison: Databricks vs. Snowflake

Implementation

Databricks:

  1. Setup and Configuration:
    • Utilize existing Databricks workspace
    • Create a new Clean Room environment within the workspace
    • Configure Delta Lake tables for shared data
  2. Data Preparation:
    • Use Databricks’ data engineering capabilities to ETL and anonymize data
    • Leverage Delta Lake for ACID transactions and data versioning
  3. Access Control:
    • Implement fine-grained access controls using Unity Catalog
    • Set up row-level and column-level security
  4. Collaboration:
    • Share Databricks notebooks for collaborative analysis
    • Use MLflow for experiment tracking and model management
  5. Analysis:
    • Utilize Spark for distributed computing
    • Support for SQL, Python, R, and Scala in the same environment

Snowflake:

  1. Setup and Configuration:
    • Set up a separate Snowflake account for the Clean Room
    • Create shared databases and views
  2. Data Preparation:
    • Use Snowflake’s data engineering features or external tools for ETL
    • Load prepared data into Snowflake tables
  3. Access Control:
    • Implement Snowflake’s role-based access control
    • Use secure views and row access policies
  4. Collaboration:
  5. Analysis:
    • Primarily SQL-based analysis
    • Use Snowpark for more advanced analytics in Python or Java
Business and IT Overhead

Databricks:

  • Lower overhead if already using Databricks for other data tasks
  • Unified platform for data engineering, analytics, and ML
  • May require more specialized skills for advanced Spark operations

Snowflake:

  • Easier setup and management for pure SQL users
  • Less overhead for traditional data warehousing tasks
  • Might need additional tools for complex data preparation and ML workflows
Cost Considerations

Databricks:

  • More flexible pricing based on compute usage
  • Can optimize costs with proper cluster management
  • Potential for higher costs with intensive compute operations

Snowflake:

  • Predictable pricing with credit-based system
  • Separate storage and compute pricing
  • Costs can escalate quickly with heavy query usage
Security and Governance

Databricks:

  • Unity Catalog provides centralized governance across clouds
  • Native integration with Delta Lake for ACID compliance
  • Comprehensive audit logging and lineage tracking

Snowflake:

  • Strong built-in security features
  • Automated data encryption and key rotation
  • Detailed access history and query logging
Data Format and Flexibility

Databricks:

  • Supports various data types (structured, semi-structured, unstructured)
  • Supports various file formats (Parquet, Iceberg, CSV, JSON, images, etc.)
  • Better suited for large-scale data processing and transformations

Snowflake:

  • Optimized for structured and semi-structured data
  • Excellent performance for SQL queries on large datasets
  • May require additional effort for unstructured data handling
Advanced Analytics, AI and ML

Databricks:

  • Native support for advanced analytics and AI/ML workflows
  • Integrated with popular AI/ML libraries and MLflow
  • Easier to implement end-to-end AI/ML pipelines

Snowflake:

  • Requires additional tools or Snowpark for advanced analytics
  • Integration with external ML platforms needed for comprehensive ML workflows
  • Strengths lie more in data warehousing than in ML operations
Scalability

Databricks:

  • Auto-scaling of compute clusters and serverless compute options
  • Better suited for processing very large datasets and complex computations

Snowflake:

  • Automatic scaling and performance optimization
  • May face limitations with extremely complex analytical workloads

Use Case Example: Financial Services Research Collaboration

Consider a research department within a financial services firm that wants to collaborate with other institutions on developing market insights through data analytics. They face a challenge: sharing proprietary and sensitive financial data without compromising security or privacy. Here’s how utilizing a clean room can solve this:

Implementation in Databricks:

  • Integration: By setting up a clean room in Databricks, the research department can securely integrate its datasets with those of other institutions, allowing data insights to be shared with precise access controls.
  • Analysis: Researchers from various departments can perform joint analyses on combined datasets without ever directly accessing each other’s raw data.
  • Security and Compliance: Databricks’ security features such as encryption, audit logging, and RBAC will ensure that all collaborations comply with regulatory standards (a governance sketch follows this list).
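As a hedged sketch of the access-control piece referenced above, the statements below apply a Unity Catalog row filter and column mask to a shared table from a notebook. The catalog, schema, table, column, and group names are assumptions for illustration.

```python
# Row filter: partner analysts only see EMEA rows; internal users see everything.
spark.sql("""
  CREATE OR REPLACE FUNCTION research.governance.region_filter(region STRING)
  RETURN IF(is_account_group_member('partner_analysts'), region = 'EMEA', TRUE)
""")
spark.sql("""
  ALTER TABLE research.shared.market_trades
  SET ROW FILTER research.governance.region_filter ON (region)
""")

# Column mask: hash counterparty identifiers for anyone outside internal research.
spark.sql("""
  CREATE OR REPLACE FUNCTION research.governance.mask_counterparty(counterparty_id STRING)
  RETURN CASE WHEN is_account_group_member('internal_research')
              THEN counterparty_id ELSE sha2(counterparty_id, 256) END
""")
spark.sql("""
  ALTER TABLE research.shared.market_trades
  ALTER COLUMN counterparty_id SET MASK research.governance.mask_counterparty
""")
```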

Through this setup, the financial services firm’s research department can achieve meaningful collaboration and derive deeper insights from joint analyses, all while maintaining data privacy and adhering to compliance requirements.

By leveraging clean rooms, organizations in highly regulated industries can unlock new opportunities for innovation and data-driven decision-making without the risks associated with traditional data sharing methods.

Conclusion

Both Databricks and Snowflake offer robust solutions for implementing this financial research collaboration use case, but with different strengths and considerations.

Databricks excels in scenarios requiring advanced analytics, machine learning, and flexible data processing, making it well-suited for research departments with diverse analytical needs. It offers a more comprehensive platform for end-to-end data science workflows and is particularly advantageous for organizations already invested in the Databricks ecosystem.

Snowflake, on the other hand, shines in its simplicity and ease of use for traditional data warehousing and SQL-based analytics. Its strong data sharing capabilities and familiar SQL interface make it an attractive option for organizations primarily focused on structured data analysis and those with less complex machine learning requirements.

Regardless of the chosen platform, the implementation of Clean Rooms represents a significant step forward in enabling secure, compliant, and productive data collaboration in the financial sector. As data privacy regulations continue to evolve and the need for cross-institutional research grows, solutions like these will play an increasingly critical role in driving innovation while protecting sensitive information.

Perficient is both a Databricks Elite Partner and a Snowflake Premier Partner. Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock your data’s full potential across your enterprise.

 

Transforming Your Data Strategy with Databricks Apps: A New Frontier

I’ve been coding in notebooks for so long, I forgot how much I missed a nice, deployed application. I also didn’t realize how this was limiting my solution space. Then I started working with Databricks Apps.

Databricks Apps are designed to extend the functionality of the Databricks platform, providing users with enriched features and capabilities tailored to specific data needs. These apps can significantly enhance the data processing and analysis experience, offering bespoke solutions to address complex business requirements.

Key Features of Databricks Apps

  1. Custom Solutions for Diverse Needs: Databricks Apps are built to cater to a wide range of use cases, from data transformation and orchestration to predictive analytics and AI-based insights. This versatility allows organizations to deploy applications that directly align with their specific business objectives.
  2. Seamless Integration: The apps integrate smoothly within the existing Databricks environment, maintaining the platform’s renowned ease of use and ensuring that deployment does not disrupt current data processes. This seamless integration is crucial for maintaining operational efficiency and minimizing transition challenges.
  3. Scalability and Flexibility: Databricks Apps are designed to scale with your organization’s needs, ensuring that as your data requirements grow, the solutions deployed through these apps can expand to meet those demands without compromising performance.
  4. Enhanced Collaboration: By leveraging apps that foster collaboration, teams can work more effectively across different departments, sharing insights and aligning strategic goals with more precision and cohesion.

Benefits for Architects

  1. Tailored Data Solutions: Databricks Apps enable architects to deploy tailored solutions that address their unique data challenges, ensuring that technical capabilities are closely aligned with strategic business goals.
  2. Accelerated Analytics Workflow: By using specialized apps, organizations can significantly speed up their data analytics workflows, leading to faster insights and more agile decision-making processes, essential in today’s fast-paced business environment.
  3. Cost Efficiency: The capability to integrate custom-built apps reduces the need for additional third-party tools, potentially lowering overall costs and simplifying vendor management.
  4. Future-Proofing Data Strategies: With the rapid evolution of technology, having access to a continuously expanding library of Databricks Apps helps organizations stay ahead of trends and adapt swiftly to new data opportunities and challenges.

Strategies for Effectively Leveraging Databricks Apps

To maximize the potential of Databricks Apps, CIOs and CDOs should consider the following approaches:

  • Identify Specific Use Cases: Before adopting new apps, identify the specific data operations and challenges your organization is facing. This targeted approach ensures that the apps you choose provide the most value.
  • Engage with App Developers: Collaborate with app developers who specialize in delivering comprehensive solutions tailored to your industry. Their expertise can enhance the implementation process and provide insights into best practices.
  • Promote Cross-Department Collaboration: Encourage departments across your organization to utilize these apps collaboratively. The synergistic use of advanced data solutions can drive more insightful analyses and foster a unified strategic direction.
  • Assess ROI Regularly: Continuously assess the return on investment from using Databricks Apps. This evaluation will help in determining their effectiveness and in making data-driven decisions regarding future app deployments.

Conclusion

Databricks Apps present a powerful opportunity for CIOs and CDOs to refine and advance their data strategies by offering tailored, scalable, and integrated solutions. By embracing these tools, organizations can transform their data-driven operations to gain a competitive edge in an increasingly complex business landscape.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

]]>
https://blogs.perficient.com/2025/06/24/transforming-data-strategy-databricks-apps/feed/ 1 383415
Exploring the Free Edition of Databricks: A Risk-Free Approach to Enterprise AI https://blogs.perficient.com/2025/06/24/explore-databricks-free-edition-risk-free-analytics/ https://blogs.perficient.com/2025/06/24/explore-databricks-free-edition-risk-free-analytics/#respond Tue, 24 Jun 2025 20:53:39 +0000 https://blogs.perficient.com/?p=383411

Databricks announced a full, free version of the platform at the Data and AI Summit. While the Free Edition is targeted to students and hobbyists, I also see opportunities where enterprise architects can effectively evangelize Databricks without going through Procurement for a license. Choosing the right platform to manage, analyze, and extract insights from massive datasets is crucial, especially with new and emerging GenAI use cases. We have seen many clients paralyzed by the combination of moving to a cloud database, comparing and contrasting the different offerings, and doing all of this analysis with only a very murky picture of what the new AI-driven future holds. The Community Edition has always been free, but it has not been feature-complete. With its new Free Edition, Databricks presents an exceptional opportunity for organizations to test its capabilities with no financial commitment or risk.

What is Databricks Free Edition?

The Free Edition of Databricks is designed to provide users with full access to Databricks’ core functionalities, allowing them to explore, experiment, and evaluate the platform’s potential without any initial investment. This edition is an excellent entry point for organizations looking to understand how Databricks can fit into their data strategy, providing a hands-on experience with the platform’s features.

Key Features of Databricks Free Edition

  1. Simplified Setup and Onboarding: The Free Edition offers a straightforward setup process. Users can easily create an account and start exploring Databricks’ environment in a matter of minutes. This ease of access is ideal for decision-makers who want to quickly assess Databricks’ capabilities.
  2. Complete Workspace Experience: Users of the Free Edition get access to a complete workspace, which includes all the necessary tools for data engineering, data science, and machine learning. This enables organizations to evaluate the entire data lifecycle on the Databricks platform.
  3. Scalability and Performance: While the Free Edition is designed for evaluation purposes, it still provides a glimpse into the scalability and performance efficiency that Databricks is known for. Organizations can run small-scale analytics and machine learning tests to gauge how the platform handles data processing and computation tasks.
  4. Community Support and Resources: Users can benefit from the extensive Databricks community, which offers support, tutorials, and shared resources. This can be particularly valuable for organizations exploring Databricks for the first time and wanting to leverage shared knowledge.
  5. No Time Constraints: Unlike typical trial versions, the Free Edition does not impose a time limit, allowing organizations to explore the platform at their own pace. This flexibility is essential for CIOs and CDOs who might need extended periods to evaluate the platform’s potential fully.

Benefits for CIOs and CDOs

  1. Risk-Free Evaluation: The primary advantage of the Free Edition is the risk-free nature of the exploration. CIOs and CDOs can test the platform’s capabilities without signing contracts or making financial commitments, aligning with their careful budget management strategies.
  2. Strategic Insights for Data Strategy: By exploring Databricks firsthand, decision-makers can gain strategic insights into how the platform integrates with existing systems and processes. This understanding is crucial when considering a transition to a new data analytics platform.
  3. Hands-On Experience: Direct interaction with Databricks helps bridge the gap between executive strategy and technical implementation. By experiencing the platform themselves, developers and architects can better champion its adoption across the organization.
  4. Pre-Deployment Testing: The Free Edition enables organizations to test specific use cases and data workflows, helping identify any challenges or concerns before full deployment. This pre-deployment testing ensures that any transition to Databricks is smooth and well-informed.
  5. Benchmarking Against Other Solutions: As organizations evaluate various data platforms, the Free Edition allows Databricks to be benchmarked against other solutions in the market. This comparison can be crucial in making informed decisions that align with long-term strategic goals.

Maximizing the Use of Databricks Free Edition

To maximize the benefits of Databricks Free Edition, CIOs and CDOs should consider the following strategies:

  • Define Use Cases: Before diving into the platform, define specific use cases you want to test. This could include data processing efficiency, machine learning model training, or real-time analytics capabilities. Clear objectives will provide focus and measurable outcomes; a minimal smoke-test sketch follows this list.
  • Leverage Community Resources: Engage with the Databricks community to explore case studies, tutorials, and shared solutions that can offer fresh perspectives and innovative ideas.
  • Collaborate with Data Teams: Involve your data engineering and science teams early in the evaluation process. Their input and expertise will be invaluable in testing and providing feedback on the platform’s performance.
  • Evaluate Integration Points: During your exploration, assess how well Databricks integrates with existing systems and cloud services within your organization. Seamless integration is vital for minimizing disruption and maximizing workflow efficiency.
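As one hypothetical example of the first point, a "data processing efficiency" use case can be reduced to a timed aggregation over synthetic data in a Free Edition notebook. Everything below (row count, column names) is an arbitrary placeholder rather than a benchmark recommendation.

```python
# A minimal, hypothetical smoke test for a "data processing efficiency" use case.
# Run in a Databricks notebook; row count and column names are placeholders.
import time

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Databricks notebook

# Generate a modest synthetic dataset appropriate for Free Edition compute.
df = spark.range(0, 5_000_000).withColumn("segment", F.col("id") % 100)

start = time.time()
agg = (
    df.groupBy("segment")
      .agg(F.count("*").alias("rows"), F.avg("id").alias("avg_id"))
      .collect()
)
print(f"Aggregated {len(agg)} segments in {time.time() - start:.1f}s")
```

Pairing each defined use case with a simple number like this keeps the evaluation tied to measurable outcomes rather than impressions.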

Conclusion

The Databricks Free Edition is an invaluable opportunity for CIOs and CDOs to explore the transformative potential of big data analytics on a leading platform without any associated risks.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

]]>
https://blogs.perficient.com/2025/06/24/explore-databricks-free-edition-risk-free-analytics/feed/ 0 383411
Exploring Lakebase: Databricks’ Next-Gen AI-Native OLTP Database https://blogs.perficient.com/2025/06/22/introduction-to-databricks-lakebase-for-ai-driven-applications/ https://blogs.perficient.com/2025/06/22/introduction-to-databricks-lakebase-for-ai-driven-applications/#respond Mon, 23 Jun 2025 01:02:29 +0000 https://blogs.perficient.com/?p=383327

Lakebase is Databricks’ OLTP database and the latest addition to its ML/AI offering. Databricks has steadily added components to support its AI platform, including on the data side. The Feature Store has been available for some time as a governed, centralized repository that manages machine learning features throughout their lifecycle. Mosaic AI Vector Search is a vector index optimized for storing and retrieving embeddings, particularly for similarity searches and RAG use cases.

What’s Old is New Again

AI’s need for data demands that transactional and analytical workflows no longer be viewed as separate entities. Traditional OLTP databases were never designed for the speed and flexibility that today’s AI applications require. They often sit outside analytics frameworks, creating bottlenecks and requiring manual data integrations. Notably, databases are now being spun up by AI agents rather than human operators. The robust query response times of a transactional database now need to be matched by equally robust administrative response times for operations such as provisioning and scaling.

Lakebase addresses these challenges by revolutionizing OLTP database architecture. Its core attributes—separation of storage and compute, openness, and serverless architecture—make it a powerful tool for modern developers and data engineers.

Key Features of Lakebase

1. Openness:

Built on the open-source Postgres framework, Lakebase ensures compatibility and avoids vendor lock-in. The open ecosystem promotes innovation and provides a versatile foundation for building sophisticated data applications.

2. Separation of Storage and Compute:

Lakebase allows independent scaling of storage and computation, reducing costs and improving efficiency. Data is stored in open formats within data lakes, offering flexibility and eliminating proprietary data lock-in.

3. Serverless Architecture:

Lakebase is designed for elasticity. It scales up or down automatically, even to zero, ensuring you’re only paying for what you use, making it a cost-effective solution.

4. Integrated with AI and the Lakehouse:

Tight integration with the Lakehouse platform means there is no need for complex ETL pipelines. Operational and analytical data flows are synchronized in real time, providing a seamless experience for deploying AI and machine learning models.

5. AI-Ready:

The database design caters specifically to AI agents, supporting operations by large AI and agent teams through branching and checkpoint capabilities. This makes development, experimentation, and deployment faster and more reliable.

Use Cases and Benefits

1. Real-Time Applications:

From e-commerce systems managing inventory while providing instant recommendations, to financial services executing automated trades, Lakebase supports low-latency operations critical for real-time decision-making.

2. AI and Machine Learning:

With built-in AI and machine learning capabilities, Lakebase supports feature engineering and real-time model serving, thus accelerating AI project deployments.

3. Industry Applications:

Different sectors like healthcare, retail, and manufacturing can leverage Lakebase’s seamless data integration to enhance workflows, improve customer relations, and automate processes based on real-time insights.

Getting Started with Lakebase

Setting up Lakebase on Databricks is a straightforward process. With a few clicks, users can provision PostgreSQL-compatible instances and begin exploring powerful data solutions. Key setup steps include enabling Lakebase in the Admin Console, configuring database instances, and utilizing the Lakebase dashboard for management.
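Because the instances are Postgres-compatible, connecting once one is provisioned looks like connecting to any other Postgres database. The sketch below uses psycopg2; the host, database name, credentials, and table are placeholders you would replace with your instance’s connection details.

```python
# Minimal sketch: connect to a provisioned Lakebase (Postgres-compatible)
# instance with a standard driver and run a simple OLTP-style workload.
# Host, credentials, database name, and table are illustrative placeholders.
import psycopg2

conn = psycopg2.connect(
    host="<your-lakebase-instance-host>",
    port=5432,
    dbname="<your-database>",
    user="<your-user>",
    password="<your-token-or-password>",
    sslmode="require",
)

with conn, conn.cursor() as cur:
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS orders (
            order_id SERIAL PRIMARY KEY,
            customer_id INT NOT NULL,
            status TEXT NOT NULL DEFAULT 'NEW'
        )
        """
    )
    cur.execute(
        "INSERT INTO orders (customer_id) VALUES (%s) RETURNING order_id",
        (42,),
    )
    print("created order", cur.fetchone()[0])

conn.close()
```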

Conclusion

Lakebase is not just a database; it’s a paradigm shift for OLTP systems in the age of AI. By integrating seamless data flow, offering flexible scaling, and supporting advanced AI capabilities, Lakebase empowers organizations to rethink and innovate their data architecture. Now is the perfect moment to explore Lakebase, unlocking new possibilities for intelligent and real-time data applications.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

]]>
https://blogs.perficient.com/2025/06/22/introduction-to-databricks-lakebase-for-ai-driven-applications/feed/ 0 383327
Lakeflow: Revolutionizing SCD2 Pipelines with Change Data Capture (CDC) https://blogs.perficient.com/2025/06/21/lakeflow-revolutionizing-scd2-pipelines-with-change-data-capture-cdc/ https://blogs.perficient.com/2025/06/21/lakeflow-revolutionizing-scd2-pipelines-with-change-data-capture-cdc/#respond Sun, 22 Jun 2025 00:56:47 +0000 https://blogs.perficient.com/?p=383315

Several breakthrough announcements emerged at DAIS 2025, but the Lakeflow updates around building robust pipelines had the most immediate impact on my current code. Specifically, I can now see a clear path to persisting SCD2 (Slowly Changing Dimension Type 2) tables in the silver layer from mutable data sources. If this sentence resonates with you, we share a common challenge. If not, it soon will.

Maintaining history through Change Data Capture is critical for both AI and foundational use cases like Single View of the Customer. However, the current state of Delta Live Tables (DLT) pipelines only allows streaming tables to maintain SCD2 logic, while most data sources permit updates. Let’s dive into the technical challenges and how Lakeflow Connect is solving them.

Slowly Changing Dimensions

There are two options for managing changes: SCD1 and SCD2.

  1. SCD Type 1 is focused on keeping only the latest data. This approach involves overwriting old data with new data whenever a change occurs. No history of changes is kept, and only the latest version of the data is available. This is useful when the history of changes isn’t important, such as correcting errors or updating non-critical fields like customer email addresses or maintaining lookup tables.
  2. SCD Type 2 keeps the historical versions of data. This approach maintains a historical record of data changes by creating additional records to capture different versions of the data over time. Each version of the data is timestamped or tagged with metadata that allows users to trace when a change occurred. This is useful when it’s important to track the evolution of data, such as tracking customer address changes over time for analysis purposes.

While basic operational reporting can get by with SCD1, almost any analytic approach will benefit from history. ML models suffer from a lack of data, and AI is more likely to hallucinate without historical context. Let’s look at a simple example.

Monday Morning Dataset:

id | name | state
 1 | John | NY
 2 | Jane | CA
 3 | Juan | PA

Tuesday Update: John moves from New York to New Jersey.

id | name | state
 1 | John | NJ
 2 | Jane | CA
 3 | Juan | PA
  • SCD1 Result: Overwrites John’s state, leaving only three records.
  • SCD2 Result: Retains John’s NY record and adds a new NJ record, resulting in four records.

The important thing to understand here is that having John’s full lifecycle is almost certainly valuable from an analytical perspective. The small additional storage cost is negligible compared to the opportunity lost by simply overwriting the data. As a general rule, I like to keep SCD2 tables in the silver layer of the medallion architecture. However, DLT pipelines have had some issues with this scenario.

Challenges with the APPLY CHANGES API

In the current state, SCD updates are managed through the APPLY CHANGES API. This API is more effective than Spark’s MERGE INTO statement. MERGE INTO is relatively straightforward until you start to factor in edge cases. For example, what if there are several updates to the same key in the same microbatch? What if the changes arrive out of order? How do you handle DELETEs? Worse, how do you handle out-of-order DELETEs? APPLY CHANGES handles these edge cases for you, but it only works with append-only data.
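To ground the discussion, here is a minimal sketch of the APPLY CHANGES Python API producing an SCD2 target from an append-only source, which is the scenario it handles well today. The table names and the operation and sequence_num columns are illustrative assumptions, and the code only runs inside a DLT pipeline, where spark and dlt are provided.

```python
# Minimal sketch of the current APPLY CHANGES Python API in a DLT pipeline.
# Table and column names are illustrative; `spark` and `dlt` are provided by
# the DLT runtime, so this is not runnable as a standalone script.
import dlt
from pyspark.sql.functions import col, expr

@dlt.view
def customers_cdc_bronze():
    # Append-only change feed already landed in bronze.
    return spark.readStream.table("bronze.customers_cdc")

# Target streaming table that will hold SCD2 history in silver.
dlt.create_streaming_table("customers_silver_scd2")

dlt.apply_changes(
    target="customers_silver_scd2",
    source="customers_cdc_bronze",
    keys=["id"],                                    # business key
    sequence_by=col("sequence_num"),                # deterministic ordering
    apply_as_deletes=expr("operation = 'DELETE'"),  # handles (out-of-order) deletes
    except_column_list=["operation", "sequence_num"],
    stored_as_scd_type=2,                           # keep full history
)
```

The sequence_by column is what lets the engine resolve the same-microbatch and out-of-order edge cases deterministically.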

In its current state, a DLT pipeline creates a Directed Acyclic Graph (DAG) for all the tables and views in the pipeline using the metadata of those resources. Only the metadata. In many pipelines, the data from the source RDBMS has already been ingested into bronze and is refreshed daily. Let’s look at our sample dataset. On Monday, I run the DLT pipeline. While the pipeline is aware of the table’s metadata, it does not have access to its contents. Imagine a MERGE statement where no current records exist: everything is an insert. Now imagine processing the next day’s data. Again, since only the metadata is loaded into the DAG, APPLY CHANGES has no prior record of John. Effectively, only SCD1 tables can be created from mutable data sources, because the data itself is not available at that point.

The new Lakeflow process provides a mechanism where CDC can be used with the Lakeflow Connector to drive SCD2 semantics even with mutable data.

What is Change Data Capture (CDC)?

Change Data Capture (CDC) is a data integration pattern that captures changes in a source system, such as inserts, updates, and deletes, through a CDC feed. The CDC feed stores a list of changes rather than the whole dataset, which is a performance opportunity. Most transactional databases, like SQL Server, Oracle, and MySQL, can generate CDC feeds automatically. When a row in the source table is updated, new rows are written to the CDC feed containing only the changes, plus some metadata such as the operation type (UPDATE or DELETE) and a column that can be used to deterministically order changes, like a sequence number. There is also an update to APPLY CHANGES, called AUTO CDC INTO, covered in the next section.
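Before looking at that API, here is what Tuesday’s CDC feed for the John example might look like, carrying only the changed row plus that metadata. The operation and sequence_num column names are illustrative, since real feeds vary by source database.

```python
# Illustrative only: Tuesday's CDC feed rows for the John example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cdc_feed = spark.createDataFrame(
    [(1, "John", "NJ", "UPDATE", 2)],  # only the changed row, not the whole table
    "id INT, name STRING, state STRING, operation STRING, sequence_num INT",
)
cdc_feed.show()
```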

AUTO CDC INTO

There are actually two APIs: AUTO CDC and AUTO CDC FROM SNAPSHOT. They have the same syntax as APPLY CHANGES, but they can now handle more use cases correctly. You may have already guessed that AUTO CDC FROM SNAPSHOT shares its method signature with APPLY CHANGES FROM SNAPSHOT. Unlike its predecessor, however, the AUTO CDC API supports periodic ingestion of snapshots with each pipeline update. Because the data itself, and not just the metadata, is made available to the call, there is enough information to correctly populate the SCD2 dataset.
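As a sketch, and assuming the Python entry point mirrors apply_changes as the shared syntax suggests, migrating the earlier example might look like the following. The function name create_auto_cdc_flow and all table and column names are assumptions to verify against the current Lakeflow documentation.

```python
# Hypothetical sketch of the AUTO CDC counterpart to the earlier
# apply_changes() example; verify names against the current Lakeflow docs.
import dlt
from pyspark.sql.functions import col, expr

dlt.create_streaming_table("customers_silver_scd2")

dlt.create_auto_cdc_flow(                       # assumed Python name for AUTO CDC INTO
    target="customers_silver_scd2",
    source="customers_cdc_bronze",              # e.g., a CDC feed from Lakeflow Connect
    keys=["id"],
    sequence_by=col("sequence_num"),
    apply_as_deletes=expr("operation = 'DELETE'"),
    stored_as_scd_type=2,
)
```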

Conclusion

Lakeflow Connect is a game-changer for data engineers, enabling SCD2 tables in the silver layer even with mutable data sources. By leveraging CDC and the new AUTO CDC INTO API, you can maintain historical data accurately, ensuring your AI and ML models have the context they need to perform optimally.

The future of data engineering is here, and it’s built on Lakeflow Connect.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

 

]]>
https://blogs.perficient.com/2025/06/21/lakeflow-revolutionizing-scd2-pipelines-with-change-data-capture-cdc/feed/ 0 383315