Championing Innovation as a Newly Named Databricks Champion (15 Sep 2025)

At Perficient, we believe that championing innovation begins with the bold leaders who live it every day. Today, we’re proud to recognize Madhu Mohan Kommu, a key driver in our Databricks Center of Excellence (CoE), for being named a Databricks Champion, one of the most coveted recognitions in the Databricks ecosystem.

This honor represents more than technical mastery; it reflects strategic impact, thought leadership, and the power to drive transformation across industries through smart architecture and scalable data solutions. Achieving Databricks Champion status unlocks priority access to exclusive events, speaking engagements, and community collaboration. It’s a mark of excellence reserved for those shaping the future of data and AI, with Madhu as a stellar example.

The Journey Behind the Recognition

Earning Champion status was no small feat. Madhu’s five-month journey began with his first Databricks certification and culminated in a nomination based on real customer impact, platform leadership, and consistent contributions to Perficient’s Databricks CoE. The nomination process spotlighted Madhu’s technical depth, thought leadership, and innovation across enterprise engagements.

From Spark to Strategy: A Legacy of Impact

Since 2019, Madhu has led initiatives for our enterprise clients, delivering platform modernization, transformation frameworks, and cutting-edge data quality solutions. His expertise in the Spark Distributed Processing Framework, combined with deep knowledge of PySpark and Unity Catalog, has made him a cornerstone in delivering high-value, AI-powered outcomes across industries.

“Personally, it’s a proud and rewarding milestone I’ve always aspired to achieve. Professionally, it elevates my credibility and brings visibility to my work in the industry. Being recognized as a Champion validates years of dedication and impact.” – Madhu Mohan Kommu, Technical Architect

Strengthening Perficient’s Position

Madhu’s recognition significantly strengthens Perficient’s role as a strategic Databricks partner, expanding our influence across regions, deepening pre-sales and enablement capabilities, and empowering customer engagement at scale. His leadership amplifies our ability to serve clients with precision and purpose.

Looking Ahead: Agentic AI & Beyond

Next up? Madhu plans to lead Perficient’s charge in Agentic AI within Databricks pipelines, designing use cases that deliver measurable savings in time, cost, and process efficiency. These efforts will drive value for both existing and future clients, making AI innovation more accessible and impactful than ever.

Advice for Future Champions

Madhu’s advice for those on a similar path is to embrace continuous learning, collaborate across teams, and actively contribute to Perficient’s Databricks CoE.

What’s Hot in Databricks Innovation

From Lakehouse Federation to Mosaic AI and DBRX, Madhu stays at the forefront of game-changing trends. He sees these innovations not just as tools, but as catalysts for redefining business intelligence.

Madhu’s story is a powerful reflection of how Perficient continues to lead with purpose, vision, and excellence in the Databricks community.

Perficient + Databricks

Perficient is proud to be a trusted Databricks Elite consulting partner with hundreds of certified consultants. We specialize in delivering tailored data engineering, analytics, and AI solutions that unlock value and drive business transformation.

Learn more about our Databricks partnership.

Why Databricks SQL Serverless is not PCI-DSS compliant (24 Jul 2025)

Databricks supports a wide range of compliance standards to meet the needs of highly regulated industries, including:

  • HIPAA (Health Insurance Portability and Accountability Act)
  • PCI-DSS (Payment Card Industry Data Security Standard)
  • FedRAMP High & Moderate
  • DoD IL5
  • IRAP (Australia)
  • GDPR (EU)
  • CCPA (California)

However, I was surprised to read that Databricks Serverless workloads are not covered for PCI-DSS (see Databricks PCI DSS Compliance) and became curious about the reason behind it. Based on my research, I believe I understand why, and I would like to share the reasoning here.

To begin with, let's understand the different Databricks SQL Warehouse types and their capabilities.

Pro SQL Warehouse

    • Supports Photon and Predictive IO
    • Does not support Intelligent Workload Management (IWM)
    • Compute resources reside in the user's cloud account
    • Less responsive to query demand
    • Cannot auto-scale rapidly; startup takes roughly 2-4 minutes
    • Suitable when custom-defined networking is required or when connecting to databases within the user's network

Classic SQL Warehouse

    • Supports Photon
    • Does not support Predictive IO or Intelligent Workload Management
    • Compute resources reside in the user's cloud account
    • Provides entry-level performance and is less performant than Pro and Serverless SQL warehouses
    • Cannot auto-scale rapidly; startup takes roughly 4 minutes
    • Suitable for running interactive, exploratory queries with entry-level performance

Serverless SQL Warehouse

    • Supports Photon, Predictive IO, and Intelligent Workload Management
    • Compute resources reside in the Databricks cloud account
    • Highly responsive to query demand
    • Rapid auto-scaling and a rapid startup time of 4-6 seconds
    • Suitable for time-sensitive ETL, business intelligence, and exploratory analysis use cases
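
The warehouse type is fixed at creation time. As a rough illustration (not an official recipe), the sketch below uses the Databricks SQL Warehouses API with a personal access token to create a Serverless warehouse and a Pro warehouse; the sizing and auto-stop values are placeholder assumptions.

import requests

HOST = "https://<databricks-instance>.cloud.databricks.com"   # placeholder workspace URL
TOKEN = "<Your-PAT>"                                           # placeholder personal access token

def create_warehouse(name: str, serverless: bool) -> dict:
    """Create a SQL warehouse; serverless=True requests Serverless compute,
    serverless=False keeps compute in the customer cloud account (Pro)."""
    payload = {
        "name": name,
        "cluster_size": "Small",                 # illustrative size
        "warehouse_type": "PRO",
        "enable_serverless_compute": serverless,
        "auto_stop_mins": 10,                    # illustrative auto-stop setting
    }
    resp = requests.post(
        f"{HOST}/api/2.0/sql/warehouses",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
    )
    resp.raise_for_status()
    return resp.json()

# Example: Serverless for BI responsiveness, Pro for workloads that must stay
# inside the customer network boundary (e.g., PCI-scoped data).
print(create_warehouse("bi-serverless-wh", serverless=True))
print(create_warehouse("pci-pro-wh", serverless=False))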

Databricks SQL (Classic/Pro)

[Diagram: Databricks SQL Classic/Pro warehouses with compute in the customer cloud account]

  • In Databricks SQL (Classic/Pro) warehouses, compute resources in the customer account are leveraged.
  • When running workloads using Databricks SQL (Classic/Pro), data is processed by compute resources that are managed by the customer.
  • Customers have more control over, and better monitoring of, the compute resources.
  • Data being processed also resides within the network boundary of the customer cloud account.

Databricks SQL (Serverless)

[Diagram: Databricks SQL Serverless warehouses with compute in the Databricks cloud account]

  • In a Databricks SQL (Serverless) warehouse, compute resources in the Databricks account are leveraged.
  • Serverless compute operates on a multi-tenant architecture, where compute resources are shared across different customers.
  • Compute resources are entirely managed by Databricks, and customers have less control and monitoring ability over the networking and compute resources.
  • Data from different customers' workloads is processed within the compute resources of the Databricks account.
  • Though customers have less control over the compute, they benefit greatly from the capabilities that Serverless warehouses offer.

Final View

  • PCI-DSS requires strict isolation of environments handling cardholder data, which is difficult to guarantee in a shared setup.
  • It mandates restricted and monitored network access, especially for systems handling payment data.
  • It requires fine-grained control and auditing, which is more feasible in dedicated or customer-managed environments.
  • Databricks recommends using Classic or Pro warehouses with dedicated VPCs, private networking, and enhanced security controls for PCI DSS-compliant workloads.
  • Additionally, Databricks continues to invest in stronger isolation boundaries within Serverless compute.
A Recipe to Boost Predictive Modeling Efficiency (22 Jul 2025)

Implementing predictive analytical insights has become essential for organizations to operate efficiently and remain relevant. What matters while doing this, though, is being agile and adaptable: what holds valid for one period can easily become obsolete over time, and what is characteristic of one group of customers can vary widely across a diverse audience. Therefore, going from an envisioned business idea to a working AI/ML model requires a mechanism that supports a rapid, AI-driven approach.

In this post, I explain how Databricks, GitHub Copilot, and the Visual Studio Code IDE (VS Code) together offer an elevated experience for implementing predictive ML models efficiently. Even with minimal coding and data science experience, one can build, test, and deploy predictive models. The synergy between GitHub Copilot inside VS Code, MLflow, and Databricks Experiments is remarkable. Here is how the approach works.

Prerequisites

Before starting, there are a few one-time setup steps to configure VS Code so it’s well-connected to a Databricks instance. The aim here is to leverage Databricks compute (Serverless works too) which provides easy access to various Unity Catalog components (such as tables, files, and ML models).

Define the Predictive Modeling Agent Prompt in Natural Language

Use the GitHub Copilot Agent with an elaborate plain-language prompt that provides the information it needs to devise the complete solution. This is where the real effort lies. Below are the important points to include in the agent prompt that I found produce a more successful outcome with fewer iterations.

  • Data Sources: Tell the Agent about the source data, and not just in technical terms but also functionally so it considers the business domain that it applies to. You can provide the table names where it will source data from in the Unity Catalog and Schema. It also helps to explain the main columns in the source tables and what the significance of each column is. This enables the agent to make more informed decisions on how to use the source data and whether it will need to transform it. The explanations also result in better feature engineering decisions to feed into the ML models.
  • Explain the Intended Outcome: Here is where one puts their innovative idea in words. What is the business outcome? What type of prediction are you looking for? Are there multiple insights that need to be determined? Are there certain features of the historical data that need to be given greater weight when determining the next best action or a probability of an event occurring? In addition to predicting events, are you interested in knowing the expected timeline for an event to occur?
  • Databricks Artifact Organization: If you want to stick to the standards you follow for managing Databricks content, you can provide additional directions as part of the prompt, for instance, the exact names to use for notebooks, tables, models, etc. It also helps to be explicit about how VS Code will run the code. Instructing it to use Databricks Connect with a default Serverless compute configuration eliminates the need to manually set up a Databricks connection in code. In addition, instructing the agent to leverage the Databricks Experiments capability, so the model is accessible through the Databricks UI, ensures that one can easily monitor model progress and metrics.
  • ML Model Types to Consider: Experiments in Databricks are a great way of effectively comparing several algorithms simultaneously (e.g., Random Forest, XGBoost, Logistic Regression, etc.). If you have a good idea of what type of ML algorithms are applicable for your use case, you can include one or more of these in the prompt so the generated experiment is more tailored. Alternatively, let the agent recommend several ML models that are most suitable for the use case.
  • Operationalizing the Models: In the same prompt, one can provide instructions on choosing the most accurate model, registering it in Unity Catalog, and applying it to new batch or streaming inference. You can also be specific about which activities are organized together in combined vs. separate notebooks for ease of scheduling and maintenance.
  • Synthetic Data Generation: Sometimes data is not readily available to experiment with, but one has a good idea of what it will look like. This is where Copilot and the Python Faker library are advantageous in synthesizing mockup data that mimics real data. This may be necessary not just for creating experiments but for testing models as well. Including instructions in the prompt for what type of synthetic data to generate allows Copilot to integrate cells in the notebook for that purpose.

With all the necessary details included in the prompt, Copilot is able to interpret the intent and generate a structured Python notebook with organized cells to handle:

  • Data Sourcing and Preprocessing
  • Feature Engineering
  • ML Experiment Setup
  • Model Training and Evaluation
  • Model Registration and Deployment

All of this is orchestrated from your local VS Code environment, but executed on Databricks compute, ensuring scalability and access to enterprise-grade resources.
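
To make this concrete, here is a minimal sketch of the kind of experiment code such a prompt typically yields. It is illustrative rather than the exact output Copilot will generate: the experiment path is a hypothetical placeholder, synthetic data stands in for the sourced feature table, and the candidate models and metric are just examples.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature table produced in earlier cells.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical experiment path; runs appear under Databricks Experiments in the UI.
mlflow.set_experiment("/Users/<you>/churn_prediction_experiment")

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, model in candidates.items():
    with mlflow.start_run(run_name=name):
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        mlflow.log_param("model_type", name)
        mlflow.log_metric("test_auc", auc)
        mlflow.sklearn.log_model(model, artifact_path="model")

Each run then shows up under the experiment in the Databricks UI, where runs can be compared side by side before the best model is registered and deployed.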

The Benefits

Following are key benefits to this approach:

  • Minimal Coding Required: This applies not just for the initial model tuning and deployment but for improvement iterations also. If there is a need to tweak the model, just follow up with the Copilot Agent in VS Code to adjust the original Databricks notebooks, retest and deploy them.
  • Enhanced Productivity: By leveraging the Databricks Experiment APIs, we're able to automate tasks like creating experiments, logging parameters, metrics, and artifacts within training scripts, and integrate MLflow tracking into CI/CD pipelines. This allows for seamless, repeatable workflows without manual intervention. Programmatically registering, updating, and managing model versions in the MLflow Model Registry is also more streamlined through the APIs used in VS Code.
  • Leverage User-Friendly UI Features in Databricks Experiments: Even though the ML approach described here is ultimately driven by auto-generated code, that doesn't mean we're unable to take advantage of the rich Databricks Experiments UI. As the code executes in VS Code on Databricks compute, we're able to log in to the Databricks interactive environment to inspect individual runs, review logged parameters, metrics, and artifacts, and compare different runs side by side to debug models or understand experimental results.

In summary, the synergy between GitHub Copilot, VS Code, and Databricks empowers users to go from idea to deployed ML models in hours, not weeks. By combining the intuitive coding assistance of GitHub Copilot with the robust infrastructure of Databricks and the flexibility of VS Code, predictive modeling becomes accessible and scalable.

Salesforce to Databricks: A Deep Dive into Integration Strategies (15 Jul 2025)

Supplementing Salesforce with Databricks as an enterprise Lakehouse solution brings advantages for various personas across an organization. Customer experience data is highly valued when it comes to driving personalized customer journeys that leverage company-wide applications beyond Salesforce. From enhanced customer satisfaction to tailored engagements and offerings that drive business renewals and expansions, the advantages are hard to miss. Databricks maps data from a variety of enterprise apps, including those used by Sales, Marketing, and Finance. Consequently, layering on Databricks Generative AI and predictive ML capabilities provides easily accessible, best-fit recommendations that help eliminate challenges and highlight success areas within your company's customer base.

In this blog, I elaborate on the different methods whereby Salesforce data is made accessible from within Databricks. While accessing Databricks data from Salesforce is possible, it is not the topic of this post and will perhaps be tackled in a later blog. I have focused on the built-in capabilities within both Salesforce and Databricks and have therefore excluded 3rd party data integration platforms. There are three main ways to achieve this integration:

  1. Databricks Lakeflow Ingestion from Salesforce
  2. Databricks Query Federation from Salesforce Data Cloud
  3. Databricks Files Sharing from Salesforce Data Cloud

Choosing the best approach to use depends on your use case. The decision is driven by several factors, such as the expected latency of accessing the latest Salesforce data, the complexity of the data transformations needed, and the volume of Salesforce data of interest. And it may very well be that more than one method is implemented to cater for different requirements.

While the first method copies the raw Salesforce data over to Databricks, methods 2 and 3 offer no-copy alternatives that leverage Salesforce Data Cloud itself as the raw data layer. The no-copy alternatives are attractive in that they rely on Salesforce's native capability of managing its own data lake, eliminating the overhead of duplicating that effort. However, there are limitations to this, depending on the use case. The comparison below shows how each method stacks up against the key criteria for integration.

Type
  • Lakeflow Ingestion: Data ingestion (data is copied into Databricks)
  • Salesforce Data Cloud Query Federation: Zero-copy
  • Salesforce Data Cloud File Sharing: Zero-copy

Supports Salesforce Data Cloud as a Source?
  • Lakeflow Ingestion: ✔ Yes
  • Query Federation: ✔ Yes
  • File Sharing: ✔ Yes

Incremental Data Refreshes
  • Lakeflow Ingestion: ✔ Automated processing into Databricks based on SF standard timestamp fields. Formula fields always require a full refresh.
  • Query Federation: ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks)
  • File Sharing: ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks)

Processing of Soft Deletes
  • Lakeflow Ingestion: ✔ Yes, supported incrementally
  • Query Federation: ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks)
  • File Sharing: ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks)

Processing of Hard Deletes
  • Lakeflow Ingestion: Requires a full refresh
  • Query Federation: ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks)
  • File Sharing: ✔ Automated in SF Data Cloud (requires custom handling if copying to Databricks)

Query Response Time
  • Lakeflow Ingestion: ✔ Best, as data is queried from a local copy and processed within Databricks
  • Query Federation: ⚠ Slower, as query response depends on SF Data Cloud and data has to travel across networks
  • File Sharing: ⚠ Slower, as data travels across networks

Supports Real-Time Querying?
  • Lakeflow Ingestion: No; the pipeline runs on a schedule to copy data (for example, hourly or daily)
  • Query Federation: ✔ Yes; live query execution on SF Data Cloud (the Data Cloud DLO is refreshed from Salesforce modules in batches, streaming (every 3 min), or in real time)
  • File Sharing: ✔ Yes; live data sourced from SF Data Cloud (the Data Cloud DLO is refreshed from Salesforce modules in batches, streaming (every 3 min), or in real time)

Supports Databricks Streaming Pipelines?
  • Lakeflow Ingestion: ✔ Yes, with Declarative Pipelines into streaming tables (DLT), which run as micro-batch jobs
  • Query Federation: No
  • File Sharing: No

Suitable for High Data Volume?
  • Lakeflow Ingestion: ✔ Yes; the SF Bulk API is called for high data volumes such as initial loads, and the SF REST API is used for lower volumes such as incremental loads
  • Query Federation: No; reliant on JDBC query pushdown limitations and SF performance
  • File Sharing: ⚠ Moderate; more suitable than Query Federation for zero-copy with high volumes of data

Supports Data Transformation?
  • Lakeflow Ingestion: ⚠ No direct transformation; ingests SF objects as is, and transformation happens downstream in the Declarative Pipeline
  • Query Federation: ✔ Yes; Databricks pushes queries to Salesforce over the JDBC protocol
  • File Sharing: ✔ Yes; transformations execute on Databricks compute

Protocol
  • Lakeflow Ingestion: SF REST API and Bulk API over HTTPS
  • Query Federation: JDBC over HTTPS
  • File Sharing: Salesforce Data Cloud DaaS APIs over HTTPS (file-based access)

Scalability
  • Lakeflow Ingestion: Up to 250 objects per pipeline; multiple pipelines are allowed
  • Query Federation: Depends on SF Data Cloud performance when running transformations across multiple objects
  • File Sharing: Up to 250 Data Cloud objects per data share; up to 10 data shares

Salesforce Prerequisites
  • Lakeflow Ingestion: API-enabled Salesforce user with access to the desired objects
  • Query Federation: Salesforce Data Cloud must be available; Data Cloud DMOs mapped to DLOs with Streams or other methods for data lake population; JDBC API access to Data Cloud enabled
  • File Sharing: Salesforce Data Cloud must be available; Data Cloud DMOs mapped to DLOs with Streams or other methods for data lake population; a data share target created in SF with the shared objects
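
To give a flavor of the zero-copy route, below is a rough sketch of querying Salesforce Data Cloud through Lakehouse Federation from a Databricks notebook. Treat it as illustrative only: the connection type, the option names (client_id, client_secret, instance_url), and the object name are assumptions based on the federation pattern, and the exact syntax should be taken from the current Databricks and Salesforce documentation.

# Illustrative sketch of query federation (method 2); option names are assumptions.
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS sf_data_cloud_conn
    TYPE salesforce_data_cloud
    OPTIONS (
        client_id '<connected-app-client-id>',
        client_secret '<connected-app-client-secret>',
        instance_url 'https://<your-org>.my.salesforce.com'
    )
""")

# Expose Data Cloud objects as a foreign catalog in Unity Catalog.
spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS sf_data_cloud
    USING CONNECTION sf_data_cloud_conn
""")

# Queries are pushed down to Salesforce Data Cloud over JDBC at execution time.
display(spark.sql(
    "SELECT * FROM sf_data_cloud.default.unified_individual__dlm LIMIT 10"  # hypothetical DMO name
))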

If you’re looking for guidance on leveraging Databricks with Salesforce, reach out to Perficient for a discussion with Salesforce and Databricks specialists.

Databricks Lakebase: Database Branching in Action (4 Jul 2025)

What is Databricks Lakebase?

Databricks Lakebase is a Postgres OLTP engine integrated into the Databricks Data Intelligence Platform. A database instance is a compute type that provides fully managed storage and compute resources for a Postgres database. Lakebase leverages an architecture that separates compute and storage, which allows independent scaling while supporting low-latency (<10 ms) and high-concurrency transactions.

Databricks has integrated this powerful Postgres engine with sophisticated capabilities that build on Databricks' recent acquisition of Neon. Lakebase is fully managed by Databricks, which means no infrastructure has to be provisioned and maintained separately. In addition to being a traditional OLTP engine, Lakebase comes with the following features:

  • Openness: Lakebase is built on open-source standards.
  • Storage and compute separation: Lakebase stores data in data lakes in open formats, enabling storage and compute to scale independently.
  • Serverless: Lakebase is lightweight, meaning it can scale up and down instantly based on load. It can scale down to zero, at which point you pay only for data storage; no compute cost applies.
  • Modern development workflow: Branching a database is as simple as branching a code repository, and it happens near instantly.
  • Built for AI agents: Lakebase is designed to support a large number of AI agents. Its branching and checkpointing capabilities enable AI agents to experiment and rewind to any point in time.
  • Lakehouse integration: Lakebase makes it easy to combine operational, analytical, and AI systems without complex ETL pipelines.

In this article, we discuss in detail how the database branching feature works in Lakebase.

Database Branching

Database branching is one of the unique features introduced in Lakebase. It lets you branch out a database in exactly the way a code branch is created from an existing branch in a repository.

Branching a database is useful for creating an isolated test environment or for point-in-time recovery. Lakebase uses a copy-on-write branching mechanism to create an instant, zero-copy clone of the database, with dedicated compute to operate on that branch. Zero-copy cloning makes it possible to branch a parent database of any size instantly.

The child branch is managed independently of the parent branch. With an isolated child branch, one can test and debug against a copy of production data. Though the parent and child databases appear separate, both instances physically point to the same data pages. Under the hood, the child database references the same data pages the parent points to. When data changes in the child branch, a new data page is created with the changes, and it is visible only to that branch. Changes made in the child branch are not reflected in the parent branch.

How branching works

The diagrams below illustrate how database branching works under the hood:

[Diagram: database branching, with parent and child branches sharing data pages]

[Diagram: database branching updates, where a change in the child branch creates a new data page]

Lakebase in action

Here is a demonstration of how a Lakebase instance can be created, how an instance is branched out, and how table changes behave.

To create a Lakebase instance, log in to Databricks and navigate to Compute -> OLTP Database tab -> click the “Create New Instance” button.

[Screenshots: creating the new instance and the creation success message]

Click “New Query” to launch the SQL Editor for the PostgreSQL database. In the current instance, let's create a new table and add some records.

[Screenshots: creating and querying the table in pginstance1]

Let's create a database branch “pginstance2” from instance “pginstance1”. Go to Compute -> OLTP Database -> Create Database instance.

Enter the new instance name and expand “Advanced Settings” -> enable the “Create from parent” option -> enter the source instance name “pginstance1”.

Under “Include data from parent up to”, select the “Current point in time” option. We could also choose any specific point in time here.

[Screenshots: creating pginstance2 from the parent and the creation success message]

Launch the SQL Editor from the pginstance2 database instance and query the tbl_user_profile table.

[Screenshot: querying the table in pginstance2]

Now, let's insert a new record and update an existing record in the tbl_user_profile table in pginstance2.

[Screenshot: inserting and updating records in pginstance2]

Now, let's switch back to the parent database instance pginstance1 and query the tbl_user_profile table. The table in pginstance1 should still contain only 3 records. All the changes made to tbl_user_profile should be visible only in pginstance2.

[Screenshot: querying the unchanged table in pginstance1]
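
For readers who prefer a scripted version of the walkthrough above, here is a minimal sketch using psycopg2. Because a Lakebase instance exposes a standard Postgres endpoint, any Postgres client works; the host names, database name, credentials, and the table columns and sample rows below are placeholders, with the real connection values taken from each instance's connection details in the Databricks UI.

import psycopg2

def run(host: str, statements: list) -> list:
    """Execute SQL statements against a Lakebase Postgres endpoint; return rows from the last SELECT."""
    conn = psycopg2.connect(
        host=host,                      # placeholder: instance-specific endpoint
        dbname="databricks_postgres",   # placeholder database name
        user="<user>",
        password="<password-or-token>",
        sslmode="require",
    )
    conn.autocommit = True
    rows = []
    with conn.cursor() as cur:
        for stmt in statements:
            cur.execute(stmt)
            if cur.description:         # the statement returned a result set
                rows = cur.fetchall()
    conn.close()
    return rows

# 1. Seed the parent instance (pginstance1) with three records.
run("<pginstance1-host>", [
    "CREATE TABLE IF NOT EXISTS tbl_user_profile (id INT PRIMARY KEY, name TEXT, city TEXT)",
    "INSERT INTO tbl_user_profile VALUES (1, 'Asha', 'Chennai'), (2, 'Ravi', 'Pune'), (3, 'Meera', 'Delhi')",
])

# 2. After branching pginstance2 from pginstance1, add and modify data on the branch only.
run("<pginstance2-host>", [
    "INSERT INTO tbl_user_profile VALUES (4, 'Kiran', 'Mumbai')",
    "UPDATE tbl_user_profile SET city = 'Bengaluru' WHERE id = 2",
])

# 3. Verify isolation: the parent still returns only the original three rows, unmodified.
print(run("<pginstance1-host>", ["SELECT * FROM tbl_user_profile ORDER BY id"]))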

Conclusion

Database changes made in one branch do not impact or appear in another branch, which provides clear isolation of databases at scale. Currently, Lakebase does not have a feature to merge database branches; however, Databricks is committed to and working toward a database merge capability in the near future.

Celebrating Perficient's Third Databricks Champion (3 Jul 2025)

We’re excited to welcome Bamidele James as Perficient’s newest and third Databricks Champion!  His technical expertise, community engagement, advocacy, and mentorship have made a profound impact on the Databricks ecosystem.

His Nomination Journey

Bamidele's journey through the nomination process was rigorous. It required evidence that he had successfully delivered multiple Databricks projects, earned several certifications, completed an instructor-led training course, and participated in a panel interview with the Databricks committee.

What This Achievement Means

This achievement represents peer and leadership recognition of Bamidele’s knowledge, contributions, and dedication to building strong partnerships. It also brings him a sense of purpose and pride to know that his work has made a real impact, and his continuous efforts are appreciated.

Contributing to Databricks’ and Perficient’s Growth

Bamidele plays a pivotal role in helping our clients unlock the full potential of Databricks by aligning Perficient’s Databricks capabilities with their business goals. He enables enterprise customers to accelerate their data and AI transformation to deliver measurable outcomes like reduced time-to-insight, improved operational efficiency, and increased revenue. In addition, Bamidele has led workshops, held executive briefings, and developed proof of concepts that help our clients drive adoption and deepen customer engagement.

“Being a Databricks Champion affirms that my contributions, knowledge, and dedication to building strong partnerships are recognized by my peers and leadership.” – Bamidele James, Technical Architect

 Skills Key to This Achievement

Many skills and proficiencies—including data engineering and architecture, machine learning and AI, cloud platforms, data governance and security, solution selling, stakeholder management, and strategic thinking—played a part in Bamidele becoming a Databricks Champion. To anyone wishing to follow a similar path, Bamidele recommends mastering the platform, attaining deep technical expertise, and focusing on real-world impact.

Looking Ahead

Bamidele looks forward to using Databricks to create innovative tools and solutions that drive success for our clients. He's also excited about trends and Databricks innovations including multi-tab notebooks, Databricks Lakeflow, the new SQL interface, and SQL pipeline syntax.

Perficient + Databricks

Perficient is proud to be a trusted Databricks elite consulting partner with more than 130 certified consultants. We specialize in delivering tailored data engineering, analytics, and AI solutions that unlock value and drive business transformation.

Learn more about our Databricks partnership.

 

 

 

Exploring Lakebase: Databricks' Next-Gen AI-Native OLTP Database (23 Jun 2025)

Lakebase is Databricks' OLTP database and the latest member of its ML/AI offering. Databricks has incorporated various components to support its AI platform, including data components. The Feature Store has been available for some time as a governed, centralized repository that manages machine learning features throughout their lifecycle. Mosaic AI Vector Search is a vector index optimized for storing and retrieving embeddings, particularly for similarity searches and RAG use cases.

What’s Old is New Again

AI’s need for data demands that transactional and analytical workflows no longer be viewed as separate entities. Traditional OLTP databases were never designed to meet the speed and flexibility required by AI applications today. They often exist outside analytics frameworks, creating bottlenecks and requiring manual data integrations. Notably, databases are now being spun up by AI agents rather than human operators. The robustness of the transactional database’s query response time now needs to be augmented with an equally robust administrative response time.

Lakebase addresses these challenges by revolutionizing OLTP database architecture. Its core attributes—separation of storage and compute, openness, and serverless architecture—make it a powerful tool for modern developers and data engineers.

Key Features of Lakebase

1. Openness:

Built on the open-source Postgres framework, Lakebase ensures compatibility and avoids vendor lock-in. The open ecosystem promotes innovation and provides a versatile foundation for building sophisticated data applications.

2. Separation of Storage and Compute:

Lakebase allows independent scaling of storage and computation, reducing costs and improving efficiency. Data is stored in open formats within data lakes, offering flexibility and eliminating proprietary data lock-in.

3. Serverless Architecture:

Lakebase is designed for elasticity. It scales up or down automatically, even to zero, ensuring you’re only paying for what you use, making it a cost-effective solution.

4. Integrated with AI and the Lakehouse:

Swift integration with the Lakehouse platform means no need for complex ETL pipelines. Operational and analytical data flows are synchronized in real-time, providing a seamless experience for deploying AI and machine learning models.

5. AI-Ready:

The database design caters specifically to AI agents, facilitating massive AI team operations through branching and checkpoint capabilities. This makes development, experimentation, and deployment faster and more reliable.

Use Cases and Benefits

1. Real-Time Applications:

From e-commerce systems managing inventory while providing instant recommendations, to financial services executing automated trades, Lakebase supports low-latency operations critical for real-time decision-making.

2. AI and Machine Learning:

With built-in AI and machine learning capabilities, Lakebase supports feature engineering and real-time model serving, thus accelerating AI project deployments.

3. Industry Applications:

Different sectors like healthcare, retail, and manufacturing can leverage Lakebase’s seamless data integration to enhance workflows, improve customer relations, and automate processes based on real-time insights.

Getting Started with Lakebase

Setting up Lakebase on Databricks is a straightforward process. With a few clicks, users can provision PostgreSQL-compatible instances and begin exploring powerful data solutions. Key setup steps include enabling Lakebase in the Admin Console, configuring database instances, and utilizing the Lakebase dashboard for management.
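
As a quick sanity check after provisioning, you can read a Lakebase table into Spark from a notebook, since the instance exposes a standard Postgres endpoint. The sketch below uses the generic Spark JDBC reader; the host, database, table, catalog names, and credentials are placeholders copied from the instance's connection details, and it assumes a PostgreSQL JDBC driver is available on the cluster.

# Minimal sketch: read an operational Lakebase (Postgres) table into a Spark DataFrame.
# All connection values are placeholders; take the real ones from the instance UI.
lakebase_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<lakebase-host>:5432/<database>?sslmode=require")
    .option("dbtable", "public.orders")          # hypothetical operational table
    .option("user", "<user>")
    .option("password", "<password-or-token>")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Combine operational rows with analytical data already in the lakehouse
# (main.sales.customers is a hypothetical Unity Catalog table).
enriched = lakebase_df.join(spark.table("main.sales.customers"), "customer_id", "left")
display(enriched.limit(10))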

Conclusion

Lakebase is not just a database; it’s a paradigm shift for OLTP systems in the age of AI. By integrating seamless data flow, offering flexible scaling, and supporting advanced AI capabilities, Lakebase empowers organizations to rethink and innovate their data architecture. Now is the perfect moment to explore Lakebase, unlocking new possibilities for intelligent and real-time data applications.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

Mastering Databricks Jobs API: Build and Orchestrate Complex Data Pipelines (6 Jun 2025)

In this post, we’ll dive into orchestrating data pipelines with the Databricks Jobs API, empowering you to automate, monitor, and scale workflows seamlessly within the Databricks platform.

Why Orchestrate with Databricks Jobs API?

When data pipelines become complex involving multiple steps—like running notebooks, updating Delta tables, or training machine learning models—you need a reliable way to automate and manage them with ease. The Databricks Jobs API offers a flexible and efficient way to automate your jobs/workflows directly within Databricks or from external systems (for example AWS Lambda or Azure Functions) using the API endpoints.

Unlike external orchestrators such as Apache Airflow, Dagster etc., which require separate infrastructure and integration, the Jobs API is built natively into the Databricks platform. And the best part? It doesn’t cost anything extra. The Databricks Jobs API allows you to fully manage the lifecycle of your jobs/workflows using simple HTTP requests.

Below is the list of API endpoints for the CRUD operations on the workflows:

  • Create: Set up new jobs with defined tasks and configurations via the POST /api/2.1/jobs/create endpoint. Define single or multi-task jobs, specifying the tasks to be executed (e.g., notebooks, JARs, Python scripts), their dependencies, and the compute resources.
  • Retrieve: Access job details, check statuses, and review run logs using GET /api/2.1/jobs/get or GET /api/2.1/jobs/list.
  • Update: Change job settings such as parameters, task sequences, or cluster details through POST /api/2.1/jobs/update and /api/2.1/jobs/reset.
  • Delete: Remove jobs that are no longer required using POST /api/2.1/jobs/delete.

These full CRUD capabilities make the Jobs API a powerful tool to automate job management completely, from creation and monitoring to modification and deletion—eliminating the need for manual handling.

Key components of a Databricks Job

  • Tasks: Individual units of work within a job, such as running a notebook, JAR, Python script, or dbt task. Jobs can have multiple tasks with defined dependencies and conditional execution.
  • Dependencies: Relationships between tasks that determine the order of execution, allowing you to build complex workflows with sequential or parallel steps.
  • Clusters: The compute resources on which tasks run. These can be ephemeral job clusters created specifically for the job or existing all-purpose clusters shared across jobs.
  • Retries: Configuration to automatically retry failed tasks to improve job reliability.
  • Scheduling: Options to run jobs on cron-based schedules, triggered events, or on demand.
  • Notifications: Alerts for job start, success, or failure to keep teams informed.

Getting started with the Databricks Jobs API

Before leveraging the Databricks Jobs API for orchestration, ensure you have access to a Databricks workspace, a valid Personal Access Token (PAT), and sufficient privileges to manage compute resources and job configurations. This guide will walk through key CRUD operations and relevant Jobs API endpoints for robust workflow automation.

1. Creating a New Job/Workflow:

To create a job, you send a POST request to the /api/2.1/jobs/create endpoint with a JSON payload defining the job configuration.

{
  "name": "Ingest-Sales-Data",
  "tasks": [
    {
      "task_key": "Ingest-CSV-Data",
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 30 9 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  },
  "email_notifications": {
    "on_failure": [
      "name@email.com"
    ]
  }
}

This JSON payload defines a Databricks job that executes a notebook-based task on a newly provisioned cluster, scheduled to run daily at 9:30 AM UTC. The components of the payload are explained below:

  • name: The name of your job.
  • tasks: An array of tasks to be executed. A job can have one or more tasks.
    • task_key: A unique identifier for the task within the job. Used for defining dependencies.
    • notebook_task: Specifies a notebook task. Other task types include spark_jar_task, spark_python_task, spark_submit_task, pipeline_task, etc.
      • notebook_path: The path to the notebook in your Databricks workspace.
      • source: The source of the notebook (e.g., WORKSPACE, GIT).
    • new_cluster: Defines the configuration for a new cluster that will be created for this job run. You can also use existing_cluster_id to use an existing all-purpose cluster (though new job clusters are recommended).
      • spark_version, node_type_id, num_workers: Standard cluster configuration options.
  • schedule: Defines the job schedule using a cron expression and timezone.
  • email_notifications: Configures email notifications for job events.

To create a Databricks workflow, the above JSON payload can be included in the body of a POST request sent to the Jobs API’s create endpoint—either using curl or programmatically via the Python requests library as shown below:

Using Curl:

curl -X POST \
  https://<databricks-instance>.cloud.databricks.com/api/2.1/jobs/create \
  -H "Authorization: Bearer <Your-PAT>" \
  -H "Content-Type: application/json" \
  -d '@workflow_config.json' #Place the above payload in workflow_config.json

Using Python requests library:

import requests
import json

# Assumes `token` holds a valid Databricks PAT and `your_json_payload` is the job
# configuration dictionary shown above.
create_response = requests.post(
    "https://<databricks-instance>.cloud.databricks.com/api/2.1/jobs/create",
    data=json.dumps(your_json_payload),
    auth=("token", token),
)
if create_response.status_code == 200:
    job_id = json.loads(create_response.content.decode("utf-8"))["job_id"]
    print("Job created with id: {}".format(job_id))
else:
    print("Job creation failed with status code: {}".format(create_response.status_code))
    print(create_response.text)

The above example demonstrated a basic single-task workflow. However, the full potential of the Jobs API lies in orchestrating multi-task workflows with dependencies. The tasks array in the job payload allows you to configure multiple dependent tasks.
For example, the following workflow defines three tasks that execute sequentially: Ingest-CSV-Data → Transform-Sales-Data → Write-to-Delta.

{
  "name": "Ingest-Sales-Data-Pipeline",
  "tasks": [
    {
      "task_key": "Ingest-CSV-Data",
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    },
    {
      "task_key": "Transform-Sales-Data",
      "depends_on": [
        {
          "task_key": "Ingest-CSV-Data"
        }
      ],
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/transform_sales_data",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    },
    {
      "task_key": "Write-to-Delta",
      "depends_on": [
        {
          "task_key": "Transform-Sales-Data"
        }
      ],
      "notebook_task": {
        "notebook_path": "/Users/name@email.com/write_to_delta_notebook",
        "source": "WORKSPACE"
      },
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 30 9 * * ?",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  },
  "email_notifications": {
    "on_failure": [
      "name@email.com"
    ]
  }
}

 



2. Updating Existing Workflows:

For modifying existing workflows, we have two endpoints: the update endpoint /api/2.1/jobs/update and the reset endpoint /api/2.1/jobs/reset. The update endpoint applies a partial update to your job. This means you can tweak parts of the job, like adding a new task or changing a cluster spec, without redefining the entire workflow. The reset endpoint, on the other hand, does a complete overwrite of the job configuration. Therefore, when resetting a job, you must provide the entire desired job configuration, including any settings you wish to keep unchanged, to avoid them being overwritten or removed entirely. Let's go over a few examples to understand these endpoints better.

2.1. Update Workflow Name & Add New Task:

Let us modify the above workflow by renaming it from Ingest-Sales-Data-Pipeline to Sales-Workflow-End-to-End, adding an input parameter source_location to the Ingest-CSV-Data task, and introducing a new task, Write-to-Postgres, which runs after the successful completion of Transform-Sales-Data.

{
  "job_id": 947766456503851,
  "new_settings": {
    "name": "Sales-Workflow-End-to-End",
    "tasks": [
      {
        "task_key": "Ingest-CSV-Data",
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
          "base_parameters": {
            "source_location": "s3://<bucket>/<key>"
          },
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      },
      {
        "task_key": "Transform-Sales-Data",
        "depends_on": [
          {
            "task_key": "Ingest-CSV-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/transform_sales_data",
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      },
      {
        "task_key": "Write-to-Delta",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/write_to_delta_notebook",
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      },
      {
        "task_key": "Write-to-Postgres",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_postgres_notebook",
          "source": "WORKSPACE"
        },
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      }
    ],
    "schedule": {
      "quartz_cron_expression": "0 30 9 * * ?",
      "timezone_id": "UTC",
      "pause_status": "UNPAUSED"
    },
    "email_notifications": {
      "on_failure": [
        "name@email.com"
      ]
    }
  }
}


2.2. Update Cluster Configuration:

Cluster startup can take several minutes, especially for larger, more complex clusters. Sharing the same cluster allows subsequent tasks to start immediately after previous ones complete, speeding up the entire workflow. Parallel tasks can also run concurrently, sharing the same cluster resources efficiently. Let's update the above workflow to share the same cluster across all the tasks.

{
  "job_id": 947766456503851,
  "new_settings": {
    "name": "Sales-Workflow-End-to-End",
    "job_clusters": [
      {
        "job_cluster_key": "shared-cluster",
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      }
    ],
    "tasks": [
      {
        "task_key": "Ingest-CSV-Data",
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/ingest_csv_notebook",
          "base_parameters": {
            "source_location": "s3://<bucket>/<key>"
          },
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Transform-Sales-Data",
        "depends_on": [
          {
            "task_key": "Ingest-CSV-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/transform_sales_data",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Delta",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path": "/Users/name@email.com/write_to_delta_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Postgres",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_postgres_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      }
    ],
    "schedule": {
      "quartz_cron_expression": "0 30 9 * * ?",
      "timezone_id": "UTC",
      "pause_status": "UNPAUSED"
    },
    "email_notifications": {
      "on_failure": [
        "name@email.com"
      ]
    }
  }
}


2.3. Update Task Dependencies:

Let's add a new task named Enrich-Sales-Data and update the dependencies as shown below:
Ingest-CSV-Data → Enrich-Sales-Data → Transform-Sales-Data → [Write-to-Delta, Write-to-Postgres]
Since we are updating the dependencies of existing tasks, we need to use the reset endpoint /api/2.1/jobs/reset.

{
  "job_id": 947766456503851,
  "new_settings": {
    "name": "Sales-Workflow-End-to-End",
    "job_clusters": [
      {
        "job_cluster_key": "shared-cluster",
        "new_cluster": {
          "spark_version": "15.4.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "num_workers": 2
        }
      }
    ],
    "tasks": [
      {
        "task_key": "Ingest-CSV-Data",
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/ingest_csv_notebook",
          "base_parameters": {
            "source_location": "s3://<bucket>/<key>"
          },
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Enrich-Sales-Data",
        "depends_on": [
          {
            "task_key": "Ingest-CSV-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/enrich_sales_data",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Transform-Sales-Data",
        "depends_on": [
          {
            "task_key": "Enrich-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/transform_sales_data",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Delta",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_delta_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      },
      {
        "task_key": "Write-to-Postgres",
        "depends_on": [
          {
            "task_key": "Transform-Sales-Data"
          }
        ],
        "notebook_task": {
          "notebook_path":"/Users/name@email.com/write_to_postgres_notebook",
          "source": "WORKSPACE"
        },
        "job_cluster_key": "shared-cluster"
      }
    ],
    "schedule": {
      "quartz_cron_expression": "0 30 9 * * ?",
      "timezone_id": "UTC",
      "pause_status": "UNPAUSED"
    },
    "email_notifications": {
      "on_failure": [
        "name@email.com"
      ]
    }
  }
}


The update endpoint is useful for minor modifications such as changing the workflow name, notebook paths, task input parameters, the job schedule, or cluster configuration details like node count, while the reset endpoint should be used for deleting existing tasks, redefining task dependencies, renaming tasks, and similar structural changes.
The update endpoint does not delete tasks or settings you omit, i.e., tasks not mentioned in the request remain unchanged, while the reset endpoint removes any fields or tasks not included in the request.
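
As a minimal illustration of the difference, the partial update below changes only the schedule and leaves every task untouched; sending the same new_settings to the reset endpoint instead would drop the tasks, because they are not included in the payload. The host and token values are placeholders.

import requests

HOST = "https://<databricks-instance>.cloud.databricks.com"
TOKEN = "<Your-PAT>"

# Partial update via /api/2.1/jobs/update: only the schedule changes.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "job_id": 947766456503851,
        "new_settings": {
            "schedule": {
                "quartz_cron_expression": "0 0 6 * * ?",   # move the daily run to 6:00 AM UTC
                "timezone_id": "UTC",
                "pause_status": "UNPAUSED",
            }
        },
    },
)
resp.raise_for_status()
print("Schedule updated for job 947766456503851")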

3. Trigger an Existing Job/Workflow:

Use the /api/2.1/jobs/run-now endpoint to trigger a job run on demand. Pass the input parameters to your notebook tasks using the notebook_params field.

curl -X POST https://<databricks-instance>/api/2.1/jobs/run-now \
  -H "Authorization: Bearer <DATABRICKS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": 947766456503851,
    "notebook_params": {
      "source_location": "s3://<bucket>/<key>"
    }
  }'

4. Get Job Status:

To check the status of a specific job run, use the /api/2.1/jobs/runs/get endpoint with the run_id. The response includes details about the run, including its state (e.g., PENDING, RUNNING, COMPLETED, FAILED etc).

curl -X GET \
  https://<databricks-instance>.cloud.databricks.com/api/2.1/jobs/runs/get?run_id=<your-run-id> \
  -H "Authorization: Bearer <Your-PAT>"

5. Delete Job:

To remove an existing Databricks workflow, simply call the POST /api/2.1/jobs/delete endpoint of the Jobs API. This allows you to programmatically clean up outdated or unnecessary jobs as part of your pipeline management strategy.

curl -X POST https://<databricks-instance>/api/2.1/jobs/delete \
  -H "Authorization: Bearer <DATABRICKS_PERSONAL_ACCESS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "job_id": 947766456503851 }'

 

Conclusion:

The Databricks Jobs API empowers data engineers to orchestrate complex workflows natively, without relying on external scheduling tools. Whether you’re automating notebook runs, chaining multi-step pipelines, or integrating with CI/CD systems, the API offers fine-grained control and flexibility. By mastering this API, you’re not just building workflows—you’re building scalable, production-grade data pipelines that are easier to manage, monitor, and evolve.

LatAm's First Databricks Champion at Perficient (4 Jun 2025)

We are thrilled to announce that Juan Cardona Ramirez has been recognized as a Databricks Partner Champion, making him the first Perficient colleague in Latin America to earn this prestigious designation. This recognition celebrates Juan’s deep technical expertise, consistent contributions, and commitment to innovation within the Databricks ecosystem.

 

Program Overview

The Databricks Partner Champion Program is a selective initiative that highlights individuals who exhibit thought leadership, demonstrate technical mastery, inspire and guide the community and deliver innovation and value. 

Champions go through a rigorous process that includes multiple certifications, in-depth training, and continuous contributions to both internal teams and external communities. The goal is to elevate professionals who are not only skilled with the Databricks platform but also help others grow through knowledge sharing, mentorship, and hands-on leadership.

 

Juan’s Journey to Champion Status

Juan’s path to becoming a Partner Champion spanned nearly two years and required significant dedication. The nomination process was initiated by a fellow Champion—David Callaghan, Perficient’s first Databricks Champion—and required Juan to: 

  • Obtain basic, associate, and professional-level certifications 
  • Complete the three-day Architect Essentials course 
  • Demonstrate continuous contributions within Perficient and the broader Databricks community 

 More than just checking boxes, Juan’s journey was driven by proactivity, discipline, and a relentless commitment to continuous learning. He consistently led internal study groups and hosted “chill and learn” sessions to help colleagues understand best practices and maximize the platform’s potential. 

 “I’m feeling grateful for this. All the effort is recognized, and it’s a big achievement for my career. I’m really proud of it, and my main objective is to get more people into this position.”  — Juan Cardona Ramirez, Technical Consultant. 

 

Impact and Vision

Juan’s recognition has significant implications—not just for his career, but for Perficient’s presence in Latin America and for the broader Databricks ecosystem in the region. His achievement reflects the growing technical excellence and leadership across our global teams. 

He has played a key role in promoting Databricks adoption with clients, aligning technology with business needs, and championing best practices. Now, as a Champion, Juan is focused on: 

  • Improving Perficient’s Databricks practice across LatAm 
  • Making the certification and development paths more accessible for aspiring engineers 
  • Leading mentorship and enablement initiatives through Perficient’s Databricks Center of Excellence 

 

Looking Ahead

Juan is especially excited about the latest innovations from Databricks, including: 

  • Advancements across the GenAI solution space 
  • Unity Catalog and enhanced data protection 
  • Revamped dashboards for simplified data visualization 
  • Genie, a powerful new tool designed to bring insights to non-technical users 

 “This kind of recognition makes our company shine brighter in the Databricks partnership, showcasing the internal talent and our amazing capabilities to take on challenges.”  — Juan Cardona Ramirez, Technical Consultant 

 

More About Our Partnership

At Perficient, we are proud to be a trusted Databricks consulting partner with almost 200 Databricks-certified consultants. Our team specializes in delivering tailored data engineering, analytics, and AI solutions that unlock value and drive business transformation. 

Learn more about our Databricks partnership here.

]]>
https://blogs.perficient.com/2025/06/04/latams-first-databricks-champion-at-perficient/feed/ 0 382370
Perficient Achieves Databricks Elite Partner Status https://blogs.perficient.com/2025/03/17/perficient-achieves-databricks-elite-partner-status/ https://blogs.perficient.com/2025/03/17/perficient-achieves-databricks-elite-partner-status/#comments Mon, 17 Mar 2025 17:18:45 +0000 https://blogs.perficient.com/?p=378385

Perficient is excited to announce that we have achieved Elite level partner status within the Databricks partner network. This top-tier recognition highlights the outstanding efforts of our talented Databricks strategists, architects, industry leaders, and engineers. It’s a testament to our data and AI expertise and unwavering commitment to delivering transformative results for our clients.

Perficient's Clinical Trial Data Collaboration Solution, developed in partnership with a top-five life sciences leader, is a great example of how we deliver fresh ideas and transformative business outcomes for our clients. Our tailored solution optimizes the integration and review of increasingly complex and ever-changing clinical trial data. Built on a modern, cloud-based architecture powered by Databricks, this robust interface equips clinical operations teams to rapidly review, analyze, and discuss all the data for clinical trials, driving faster drug development through heightened collaboration and decision-making.

Built-in Machine Learning and GenAI capabilities further boost the speed and quality of collaboration, proactively identifying outliers, patterns, and signals to enable the clinical team to make crucial decisions faster. Our solution leverages Databricks as the core for data engineering, AI/ML, and SQL capabilities, alongside a custom-built user interface and a suite of AWS services to support the platform.

“As more organizations look to build data intelligence, we are strengthening our partnership with Perficient to help companies take control of their data and put it to work with AI,” said Jason McIntyre, VP, Elite Partner Portfolios, Databricks. “Databricks is dedicated to building a strong partner ecosystem, and with Perficient’s data and AI expertise, we’re delivering exceptional solutions that address key pain points and drive growth and innovation across industries.”

"The growth and depth of our Databricks practice is a true testament to our delivery teams. They have a real drive to dive deep into the platform and stay up to date on the continuous developments and features coming out of the Databricks team. More importantly, they are out there finding ways to help our customers streamline their data teams and drive industry-changing solutions by leveraging their data in new and innovative ways," said Nick Passero, Databricks practice director.

From new deployments, migrations, and builds to advisory services, Perficient looks forward to continued growth and working closely with Databricks in 2025 and beyond. Contact us to learn more about our Databricks and healthcare & life sciences expertise and capabilities, and how we can help you transform your business.

]]>
https://blogs.perficient.com/2025/03/17/perficient-achieves-databricks-elite-partner-status/feed/ 1 378385
Accelerate the Replication of Oracle Fusion Cloud Apps Data into Databricks https://blogs.perficient.com/2025/02/25/databricks-accelerator-for-oracle-fusion-applications/ https://blogs.perficient.com/2025/02/25/databricks-accelerator-for-oracle-fusion-applications/#respond Tue, 25 Feb 2025 15:34:08 +0000 https://blogs.perficient.com/?p=377771

Following up on my previous post, which highlights different approaches to accessing Oracle Fusion Cloud Apps data from Databricks, this post presents the details of Approach D, which leverages the Perficient accelerator solution. The accelerator applies to all Oracle Fusion Cloud applications: ERP, SCM, HCM, and CX.

As demonstrated in the previous post, the Perficient accelerator differs from the other approaches in that it has minimal requirements for additional cloud platform services. The other approaches to extracting data efficiently and at scale require the deployment of additional cloud services, such as data integration/replication services and an intermediary data warehouse. With the Perficient accelerator, however, replication is driven by techniques that rely solely on native Oracle Fusion and Databricks capabilities. The accelerator consists of a Databricks workflow with configurable tasks that handle the end-to-end process of managing data replication from Oracle Fusion into the silver layer of Databricks tables. When deploying the solution, you get access to all underlying Python/SQL notebooks, which can be further customized based on your needs.
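
To make the idea of a multi-task replication workflow concrete, the sketch below shows how a job of this general shape could be defined through the Jobs API from Python. The task names, notebook paths, cluster ID, and pipeline ID are purely illustrative assumptions, not the accelerator's actual implementation.

import os
import requests

# Illustrative only: a three-step workflow (extract -> transfer -> load) expressed
# as a Jobs API 2.1 job definition. All names and paths below are hypothetical.
job_spec = {
    "name": "fusion_to_databricks_replication",
    "tasks": [
        {
            "task_key": "trigger_bicc_extract",
            "existing_cluster_id": "<cluster-id>",  # placeholder
            "notebook_task": {"notebook_path": "/Repos/accelerator/trigger_bicc_extract"},
        },
        {
            "task_key": "transfer_to_landing_zone",
            "depends_on": [{"task_key": "trigger_bicc_extract"}],
            "existing_cluster_id": "<cluster-id>",  # placeholder
            "notebook_task": {"notebook_path": "/Repos/accelerator/transfer_files"},
        },
        {
            "task_key": "load_bronze_and_silver",
            "depends_on": [{"task_key": "transfer_to_landing_zone"}],
            "pipeline_task": {"pipeline_id": "<dlt-pipeline-id>"},  # placeholder
        },
    ],
}

response = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=job_spec,
)
response.raise_for_status()
print("Created job:", response.json()["job_id"])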

Why consider deploying the Perficient Accelerator?

There are several benefits to deploying this accelerator as opposed to building data replications from Oracle Fusion from the ground up. Built with automation, the solution is future-proof and enables scalability to accommodate evolving data requirements with ease. The diagram below highlights key considerations.

Databricks Accelerator Solution Benefits

A Closer Look at How It's Done

In the Oracle Cloud: The Perficient solution leverages Oracle BI Cloud Connector (BICC), which is the preferred method of extracting data in bulk from Oracle Fusion while minimizing the impact on the Fusion application itself. Extracted data and metadata are temporarily made available in OCI Object Storage buckets for downstream processing. Archival of exported data on the OCI (Oracle Cloud Infrastructure) side is also handled automatically, if required, with purging rules.

Oracle Fusion To Databricks Data Replication Architecture

In the Databricks hosting cloud:

  • Hosted in AWS, Azure, or GCP, the accelerator's workflow job and notebooks are deployed in the Databricks workspace. The Databricks Delta table schemas, configuration, and log files are all hosted within Databricks Unity Catalog.
  • Notebooks leverage parametrized code to programmatically determine which Fusion view objects get replicated through the silver tables.
  • The Databricks workflow triggers the data extraction from Oracle Fusion BICC based on a predefined Fusion BICC job. The BICC job determines which objects get extracted.
  • Files are then transferred over from OCI to a landing zone object store in the cloud that hosts Databricks.
  • Databricks Auto Loader handles the ingestion of data into bronze Live Tables, which store the historical insert, update, and delete operations relevant to the extracted objects (see the sketch after this list).
  • Databricks silver Live Tables are then loaded from bronze via a Databricks-managed DLT pipeline. The silver tables are de-duplicated and represent the same level of data granularity for each Fusion view object as it exists in Fusion.
  • Incremental table refreshes are set up automatically leveraging Oracle Fusion object metadata that enables incremental data merges within Databricks. This includes inferring any data deletion from Oracle Fusion and processing deletions through to the silver tables.
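
As a rough illustration of the Auto Loader and DLT pattern described in the list above, the sketch below shows a bronze streaming table fed by Auto Loader and a silver table maintained with apply_changes for incremental merges and deletes. The storage path, file format, key column, sequence column, and delete flag are all assumptions for illustration, not the accelerator's actual code.

# Delta Live Tables pipeline notebook sketch (assumed names and paths throughout).
import dlt
from pyspark.sql import functions as F

@dlt.table(name="fusion_bronze_invoices", comment="Raw BICC extract files landed by Auto Loader")
def fusion_bronze_invoices():
    return (
        spark.readStream.format("cloudFiles")               # Auto Loader
        .option("cloudFiles.format", "csv")                 # BICC extracts are typically CSV
        .option("cloudFiles.inferColumnTypes", "true")
        .load("abfss://landing@storageacct.dfs.core.windows.net/fusion/invoices/")  # hypothetical path
        .withColumn("_ingested_at", F.current_timestamp())
    )

# Silver table: de-duplicated, merged incrementally, with deletes propagated.
dlt.create_streaming_table("fusion_silver_invoices")

dlt.apply_changes(
    target="fusion_silver_invoices",
    source="fusion_bronze_invoices",
    keys=["InvoiceId"],                                     # hypothetical primary key column
    sequence_by=F.col("LastUpdateDate"),                    # hypothetical ordering column
    apply_as_deletes=F.expr("operation = 'DELETE'"),        # hypothetical delete indicator
)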

Whether you start small with a few tables or need to scale to hundreds or thousands of them, the Perficient Databricks accelerator for Oracle Fusion data handles the end-to-end workflow orchestration. As a result, you spend less time on data integration and can focus your efforts on business-facing analytical data models.

For assistance with enabling data integration between Oracle Fusion Applications and Databricks, reach out to mazen.manasseh@perficient.com.

]]>
https://blogs.perficient.com/2025/02/25/databricks-accelerator-for-oracle-fusion-applications/feed/ 0 377771
How to Access Oracle Fusion Cloud Apps Data from Databricks https://blogs.perficient.com/2025/02/24/databricks-for-oracle-fusion-cloud-applications/ https://blogs.perficient.com/2025/02/24/databricks-for-oracle-fusion-cloud-applications/#respond Mon, 24 Feb 2025 15:00:09 +0000 https://blogs.perficient.com/?p=377620

Connecting to Oracle Fusion Cloud Applications data from external, non-Oracle systems like Databricks is not feasible for bulk data operations via a direct connection. However, there are several approaches to making Oracle apps data available for consumption from Databricks. What makes this task less straightforward is that Oracle Fusion Cloud Applications and Databricks live in separate clouds: Oracle Fusion apps (ERP, SCM, HCM, CX) are hosted on Oracle Cloud, while Databricks runs on AWS, Azure, or Google Cloud. Nevertheless, there are several approaches, presented in this blog, for accessing Oracle apps data from Databricks.

While there are other means of performing this integration than what I present in this post, I will be focusing on:

  1. Methods that don't require third-party tools: The focus here is on Oracle and Databricks technologies or cloud services.
  2. Methods that scale to a large number of objects and high data volumes: While there are additional means of Fusion data extraction, such as REST APIs, OTBI, or BI Publisher, these are not recommended for handling large bulk data extracts from Oracle Fusion and are therefore not part of this analysis. One or more of these techniques may still be applied, when necessary, and may co-exist with the approaches discussed in this blog.

The following diagrams summarize four different approaches on how to replicate Oracle Fusion Apps data in Databricks. Each diagram highlights the data flow, and the technologies applied.

  • Approach A: Leverages Oracle Autonomous Data Warehouse and an Oracle GoldenGate Replication Deployment
  • Approach B: Leverages Oracle Autonomous Data Warehouse and the standard Delta Sharing protocol
  • Approach C: Leverages Oracle Autonomous Data Warehouse and a direct JDBC connection from Databricks (see the sketch below).
  • Approach D: Leverages a Perficient accelerator solution using Databricks AutoLoader and DLT Pipelines. More information is available on this approach here.

[Diagrams: Oracle Fusion data flow to Databricks with (A) Oracle Autonomous DW and GoldenGate, (B) Oracle Autonomous DW and Delta Sharing, (C) Oracle Autonomous DW and JDBC, and (D) the Perficient Accelerator Solution]
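
To give a feel for Approach C, the sketch below reads one table from Oracle Autonomous Data Warehouse into Databricks over JDBC. The connection string, schema, table, and secret scope names are illustrative assumptions; an actual deployment would use your own ADW service details, with the Oracle JDBC driver installed on the cluster.

# Minimal sketch of Approach C: pull one replicated Fusion view object from
# Oracle Autonomous Data Warehouse into a Databricks table over JDBC.
# All connection details and object names below are hypothetical.
jdbc_url = "jdbc:oracle:thin:@//adw-host.adb.region.oraclecloud.com:1522/adw_service_name"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "FUSION_REPLICA.AP_INVOICES_ALL")                  # hypothetical schema.table
    .option("user", dbutils.secrets.get(scope="oracle", key="adw_user"))
    .option("password", dbutils.secrets.get(scope="oracle", key="adw_password"))
    .option("driver", "oracle.jdbc.OracleDriver")
    .load()
)

# Land the data as a managed table for downstream modeling.
df.write.mode("overwrite").saveAsTable("fusion_bronze.ap_invoices_all")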

Choosing the right approach for your use case depends on the objective of the integration and the ecosystem of cloud platforms applicable to your organization. For guidance, you may reach Perficient by leaving a comment in the form below. Our Oracle and Databricks specialists will connect with you and provide recommendations.

]]>
https://blogs.perficient.com/2025/02/24/databricks-for-oracle-fusion-cloud-applications/feed/ 0 377620