data replication Articles / Blogs / Perficient

Salesforce to Databricks: A Deep Dive into Integration Strategies

Mazen Manasseh — Tue, 15 Jul 2025 15:18:29 +0000

Supplementing Salesforce with Databricks as an enterprise Lakehouse solution brings advantages for various personas across an organization. Customer experience data is highly valued when it comes to driving personalized customer journeys leveraging company-wide applications beyond Salesforce. From enhanced customer satisfaction to tailored engagements and offerings that drive business renewals and expansions, the advantages are hard to miss. Databricks maps data from a variety of enterprise apps, including those used by Sales, Marketing and Finance. Consequently, layering Databricks Generative AI and predictive ML capabilities provide easily accessible best-fit recommendations that help eliminate challenges and highlight success areas within your company’s customer base.

In this blog, I elaborate on the different methods whereby Salesforce data is made accessible from within Databricks. While accessing Databricks data from Salesforce is possible, it is not the topic of this post and will perhaps be tackled in a later blog. I have focused on the built-in capabilities within both Salesforce and Databricks and have therefore excluded 3^rd party data integration platforms. There are three main ways to achieve this integration:

Databricks Lakeflow Ingestion from Salesforce
Databricks Query Federation from Salesforce Data Cloud
Databricks Files Sharing from Salesforce Data Cloud

Choosing the best approach to use depends on your use case. The decision is driven by several factors, such as the expected latency of accessing the latest Salesforce data, the complexity of the data transformations needed, and the volume of Salesforce data of interest. And it may very well be that more than one method is implemented to cater for different requirements.

While the first method copies the raw Salesforce data over to Databricks, methods 2 and 3 offer no-copy alternatives, thus leveraging Salesforce Data Cloud itself as the raw data layer. The no-copy alternatives are great in that they leverage Salesforce’s native capability of managing its own data lake thus eliminating overhead by redoing that effort. However, there are limitations to doing that, depending on the use case. The matrix below presents how each method compares when factoring in the key criteria for integration.

Method	Lakeflow Ingestion	Salesforce Data Cloud Query Federation	Salesforce Data Cloud File Sharing
Type	Data Ingestion	Zero-Copy	Zero-Copy
Supports Salesforce Data Cloud as a Source?	︎ Yes	︎ Yes	︎ Yes
Incremental Data Refreshes	︎ Automated processing into Databricks based on SF standard timestamp fields. Formula fields always require a full refresh of the formulas.	︎ Automated in SF Data Cloud (Requires custom handling if copying to Databricks)	︎ Automated in SF Data Cloud (Requires custom handling if copying to Databricks)
Processing of Soft Deletes	︎ Yes Supported incrementally	︎ Automated in SF Data Cloud (Requires custom handling if copying to Databricks)	︎ Automated in SF Data Cloud (Requires custom handling if copying to Databricks)
Processing of Hard Deletes	✘ Requires a full refresh	︎ Automated in SF Data Cloud (Requires custom handling if copying to Databricks)	︎ Automated in SF Data Cloud (Requires custom handling if copying to Databricks)
Query Response Time	︎ Best as data is queried from a local copy and processed within Databricks	Slower as query response is dependent on SF Data Cloud, and data has to travel across networks	Slower as data travels across networks
Supports Real-Time Querying?	✘ No The pipeline runs on a schedule to copy data for example, hourly, daily, etc.	︎ Yes Live query execution on SF Data Cloud (Data Cloud DLO is refreshed from Salesforce modules either in batches, streaming (every 3 min), or in real-time.)	︎ Yes Live data sourced from SF Data Cloud (Data Cloud DLO is refreshed from Salesforce modules either in batches, streaming (every 3 min), or in real-time.)
Supports Databricks Streaming Pipelines?	︎ Yes, With Declarative Pipelines into Streaming tables (DLT) (runs as micro-batch jobs)	✘ No	✘ No
Suitable for High Data Volume?	︎ Yes SF Bulk API is called for high data volumes such as initial loads, and SF REST API is used for lower data volumes such as limited data volume incremental loads.	✘ No Reliant on JDBC Query Pushdown limitations and SF performance	Moderate This method is more suitable than Query Federation when it comes to zero-copy with high volumes of data.
Supports Data Transformation	No direct transformation. Ingests SF objects as is. Transformation happens downstream in the Declarative Pipeline.	︎ Yes. DBRX pushes queries over to Salesforce using JDBC protocol.	︎ Yes. Transformations execute on Databricks compute
Protocol	SF REST API and Bulk API over HTTPS	JDBC over HTTPS	Salesforce Data Cloud DaaS APIs over HTTPS (file-based access)
Scalability	Up to 250 objects per pipeline. Multiple pipelines are allowed.	Depending on SF Data Cloud performance when running transformation with multiple objects	Up to 250 Data Cloud objects may be included in a data share. Up to 10 data shares.
Salesforce Prerequisites	API-enabled Salesforce user with access to desired objects	Salesforce Data Cloud must be available. Data Cloud DMOs mapped to DLOs with Streams or other methods for Data Lake population. Enable JDBC API access to Data Cloud.	Salesforce Data Cloud must be available. Data Cloud DMOs mapped to DLOs with Streams or other methods for Data Lake population. Data share target is created in SF with shared objects.

If you’re looking for guidance on leveraging Databricks with Salesforce, reach out to Perficient for a discussion with Salesforce and Databricks specialists.

Accelerate the Replication of Oracle Fusion Cloud Apps Data into Databricks

Mazen Manasseh — Tue, 25 Feb 2025 15:34:08 +0000

Following up on my previous post which highlights different approaches of accessing Oracle Fusion Cloud Apps Data from Databricks, I present in this post details of Approach D, which leverages the Perficient accelerator solution. And this accelerator applies to all Oracle Fusion Cloud applications: ERP, SCM, HCM and CX.

As demonstrated in the previous post, the Perficient accelerator differs from the other approaches in that it has minimal requirements for additional cloud platform services. The other approaches of extracting data efficiently and in a scalable manner require the deployment of additional cloud services such as data integration/replication services and an intermediary data warehouse. With the Perficient accelerator, however, replication is driven by techniques that are solely reliant on native Oracle Fusion and Databricks. The accelerator consists of a Databricks workflow with configurable tasks to handle the end-to-end process of managing data replication from Oracle Fusion into the silver layer of Databricks tables. When deploying the solution, you get access to all underlying python/SQL notebooks that can be further customized based on your needs.

Why consider deploying the Perficient Accelerator?

There are several benefits to deploying this accelerator as opposed to building data replications from Oracle Fusion from the ground up. Built with automation, the solution is future-proof and enables scalability to accommodate evolving data requirements with ease. The diagram below highlights key considerations.

A Closer Look at How Its Done

In the Oracle Cloud: The Perficient solution leverages Oracle BI Cloud Connector (BICC) which is the preferred method of extracting data in bulk from Oracle Fusion while minimizing the impact to the Fusion application itself. Extracted data and metadata is temporarily made available in the OCI Object Storage buckets for downstream processing. Archival of exported data on the OCI (Oracle Cloud Infrastructure) side is also automatically handled, if required, with purging rules.

In the Databricks hosting cloud:

Hosted in one of: AWS, Azure or GCP, the accelerator’s workflow job and notebooks are deployed in the Databricks workspace. The Databricks delta tables schema, configuration and log files are all hosted within the Databricks Unity Catalog.
Notebooks leverage parametrized code to programmatically determine which Fusion view objects get replicated through the silver tables.
The Databricks workflow triggers the data extraction from Oracle Fusion BICC based on a predefined Fusion BICC job. The BICC job determines which objects get extracted.
Files are then transferred over from OCI to a landing zone object store in the cloud that hosts Databricks.
Databricks AutoLoader handles the ingestion of data into bronze Live Tables which store historical insert, update and delete operations relevant to the extracted objects.
Databricks silver Live Tables are then loaded from bronze via a Databricks managed DLT Pipeline. The silver tables are de-duped and represent the same level of data granularity for each Fusion view object as it exists in Fusion.
Incremental table refreshes are set up automatically leveraging Oracle Fusion object metadata that enables incremental data merges within Databricks. This includes inferring any data deletion from Oracle Fusion and processing deletions through to the silver tables.

Whether starting small with a few tables or looking to easily scale to hundreds and thousands of tables, the Perficient Databricks accelerator for Oracle Fusion data handles the end-to-end workflow orchestration. As a result, you end up spending less time with data integration and focus efforts on business facing analytical data models.

For assistance with enabling data integration between Oracle Fusion Applications and Databricks, reach out to mazen.manasseh@perficient.com.

How to Access Oracle Fusion Cloud Apps Data from Databricks

Mazen Manasseh — Mon, 24 Feb 2025 15:00:09 +0000

Connecting to Oracle Fusion Cloud Applications data from external non-Oracle systems, like Databricks, is not feasible for bulk data operations via a direct connection. However, there are several approaches to making Oracle apps data available for consumption from Databricks. What makes this task less straightforward is the fact that Oracle Fusion Cloud Applications and Databricks exist in separate clouds. Oracle Fusion apps (ERP, SCM, HCM, CX) are hosted on Oracle Cloud while Databricks leverages one of AWS, Azure or Google Cloud. Nevertheless, there are several approaches that I will present in this blog on how to access Oracle Apps data from Databricks.

While there are other means of performing this integration than what I present in this post, I will be focusing on:

Methods that don’t require 3^rd party tools: The focus here is on Oracle and Databricks technologies or Cloud services.
Methods that scale to large number of objects and high data volumes: While there are additional means of Fusion data extraction such as using REST APIs, OTBI, or BI Publisher, these are not recommended methods for handling large bulk data extracts from Oracle Fusion and are therefore not part of this analysis. One or more of these techniques may still be applied though, when necessary, and may co-exist with the approaches discussed in this blog.

The following diagrams summarize four different approaches on how to replicate Oracle Fusion Apps data in Databricks. Each diagram highlights the data flow, and the technologies applied.

Approach A: Leverages Oracle Autonomous Data Warehouse and an Oracle GoldenGate Replication Deployment
Approach B: Leverages Oracle Autonomous Data Warehouse and the standard Delta Sharing protocol
Approach C: Leverages Oracle Autonomous Data Warehouse and a direct JDBC connection from Databricks.
Approach D: Leverages a Perficient accelerator solution using Databricks AutoLoader and DLT Pipelines. More information is available on this approach here.

Choosing the right approach for your use case is dependent on the objective of performing this integration and the ecosystem of cloud platforms that are applicable to your organization. For guidance on this, you may reach Perficient by leaving a comment in the form below. Our Oracle and Databricks specialists will connect with you and provide recommendations.