SAP and Databricks: Better Together

SAP Databricks matters because convenient access to governed data is what business initiatives depend on. Breaking down silos has been a drumbeat of data professionals since Hadoop, but this SAP <-> Databricks initiative may help solve one of the more intractable data engineering problems out there. SAP has a large, critical data footprint in many large enterprises, but it also has an opaque data model. Moving that data has always meant a long, painful stretch of glue work that delivered no real value along the way. As a result, projects were delayed, failed, or were never pursued, which meant a significant lost opportunity cost for the client and a potential loss of trust or confidence in the system integrator. SAP recognized this and partnered with a small handful of companies to enhance and enlarge the scope of their offering. Databricks was selected to deliver bi-directional integration with its Databricks Lakehouse platform. When I heard there was going to be a big announcement, I thought we were going to hear about a new Lakehouse Federation Connector. That would have been great; I’m a fan.

This was bigger.

Technical details are still emerging, so I’m going to try to focus on what I heard and what I think I know. I’m also going to hit on some use cases that we’ve worked on that I think could be directly impacted by this today. I think the most important takeaway for data engineers is that you can now combine SAP with your Lakehouse without pipelines. In both directions. With governance. This is big.

SAP Business Data Cloud

I don’t know much about SAP, so you can definitely learn more here. I wanted to understand the architecture from a Databricks perspective, and I was able to find some information in the Introducing SAP Databricks post on the internal Databricks blog page.

That post is when it really sank in that we were not dealing with a new Lakeflow Connector:

SAP Databricks is a native component in the SAP Business Data Cloud and will be sold by SAP as part of their SAP Business Data Cloud offering. It’s not in the diagram here, but you can actually integrate new or existing Databricks instances with SAP Databricks. I don’t want to get ahead of myself, but I would definitely consider putting that other instance of Databricks on another hyperscaler. 🙂

In my mind, the magic is the dotted line from the blue “Curated context-rich SAP data products” up through the Databricks stack.

 

Open Source Sharing

The promise of SAP Databricks is the ability to easily combine SAP data with the rest of the enterprise data. In my mind, easily means no pipelines that touch SAP. The diagram showing the integration point between SAP and Databricks indicates that Delta Sharing is the underlying enablement technology.

Delta Sharing is an open-source protocol, developed by Databricks and the Linux Foundation, that provides strong governance and security for sharing data, analytics, and AI across internal business units, cloud providers, and applications. Data remains in its original location with Delta Sharing: you are sharing live data with no replication. Delta Sharing, in combination with Unity Catalog, allows a provider to grant access to one or more recipients and dictate what data those recipients can see using row- and column-level security.
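
To make that concrete, here is a minimal sketch of what consuming a share could look like from the recipient side, using the open-source delta-sharing Python client. The profile path, share, schema, and table names are hypothetical placeholders for illustration, not details from the SAP announcement.

    # pip install delta-sharing
    import delta_sharing

    # The provider sends the recipient a profile file containing the
    # sharing server endpoint and a bearer token.
    profile = "/path/to/config.share"  # hypothetical path

    # List everything the provider has shared with this recipient.
    client = delta_sharing.SharingClient(profile)
    for table in client.list_all_tables():
        print(table.share, table.schema, table.name)

    # Read one shared table directly into a pandas DataFrame.
    # URL format: <profile>#<share>.<schema>.<table>
    table_url = f"{profile}#sap_share.finance.sales_orders"  # hypothetical names
    df = delta_sharing.load_as_pandas(table_url)
    print(df.head())

Because the data stays where it lives, the recipient reads live rows without any extraction pipeline touching SAP.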

Open Source Governance

Databricks leverages Unity Catalog for security and governance across the platform, including Delta Sharing. Unity Catalog offers strong authentication, asset-level access control, and secure credential vending to provide a single, unified, open solution for protecting structured, semi-structured, and unstructured data as well as AI assets. Unity Catalog offers a comprehensive solution for enhancing data governance, operational efficiency, and technological performance. By centralizing metadata management, access controls, and data lineage tracking, it simplifies compliance, reduces complexity, and improves query performance across diverse data environments. The seamless integration with Delta Lake unlocks advanced technical features like predictive optimization, leading to faster data access and cost savings. Unity Catalog plays a crucial role in machine learning and AI by providing centralized data governance and secure access to consistent, high-quality datasets, enabling data scientists to efficiently manage and access the data they need while ensuring compliance and data integrity throughout the model development lifecycle.
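
As a rough illustration of what that governance looks like day to day, here is a sketch of granting a group access to a schema of curated SAP data products from a Databricks notebook. The catalog, schema, table, and group names are hypothetical.

    # Run inside a Databricks notebook, where `spark` is already defined.
    # Grant a group read access to curated SAP data products.
    spark.sql("GRANT USE CATALOG ON CATALOG sap_bdc TO `analysts`")            # hypothetical catalog
    spark.sql("GRANT USE SCHEMA ON SCHEMA sap_bdc.curated TO `analysts`")      # hypothetical schema
    spark.sql("GRANT SELECT ON TABLE sap_bdc.curated.sales_orders TO `analysts`")

    # Lineage, audit logging, and any row/column-level filters are then
    # enforced by Unity Catalog wherever this table is read, including
    # when it is exposed to recipients through Delta Sharing.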

Data Warehousing

Databricks is now a first-class Data Warehouse with its Databricks SQL offering. The serverless SQL warehouses have been kind of a game changer for me because they spin up immediately and size elastically. Pro tip: now is a great time to come up with a tagging strategy. You’ll be able to easily connect your BI tool (Tableau, Power BI, etc) to the warehouse for reporting. There are also a lot of really useful AI/BI opportunities available natively now. If you remember in the introduction, I said that I would have been happy had this only been a Lakehouse Federation offering. You still have the ability to take advantage of Federation to discover, query and govern data from Snowflake, Redshift, Salesforce, Teradata and many others all from within a Databricks instance. I’m still wrapping my head around being able to query Salesforce and SAP Data in a notebook inside Databricks inside SAP.
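
For the reporting side, here is a minimal sketch of querying a serverless SQL warehouse from Python with the databricks-sql-connector package; the hostname, HTTP path, token, and table name are placeholders you would swap for your own workspace values.

    # pip install databricks-sql-connector
    from databricks import sql

    # Connection details come from the SQL warehouse's connection details tab.
    with sql.connect(
        server_hostname="adb-1234567890.12.azuredatabricks.net",  # hypothetical workspace
        http_path="/sql/1.0/warehouses/abc123",                   # hypothetical warehouse
        access_token="dapi-...",                                   # use a real token
    ) as connection:
        with connection.cursor() as cursor:
            # Join governed SAP data products with non-SAP data in one query.
            cursor.execute(
                "SELECT region, SUM(net_value) AS revenue "
                "FROM sap_bdc.curated.sales_orders "
                "GROUP BY region"
            )
            for row in cursor.fetchall():
                print(row)

The same warehouse endpoint is what Tableau or Power BI would point at, which is where that tagging strategy pays off for cost attribution.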

Mosaic AI + Joule

As a data engineer, I was most excited about zero-copy, bi-directional SAP data flow into Databricks. That is selfish because it solves my problems, but it’s also relatively short-sighted. The integration between SAP and Databricks will likely deliver the most value through agentic AI. Let’s stipulate that I believe chat is not the future of GenAI. This is not a bold statement; most people agree with me. Assistants like copilots represented a strong path forward. SAP thought so, hence Joule. It appears that SAP is leveraging the Databricks platform in general, and Mosaic AI in particular, to provide a next generation of Joule: an AI copilot infused with agents.

Conclusion

The integration of SAP and the Databricks Lakehouse represents a transformative approach to enterprise data management. By uniting the strengths of SAP’s end-to-end process management and semantically rich data with the advanced analytics and scalability of a lakehouse architecture, organizations can drive better decisions, foster innovation, and simplify their data landscapes. Whether it’s unifying SAP and non-SAP data, enabling real-time insights, or scaling AI initiatives, this partnership provides a roadmap for the future of data-driven enterprises.

Contact us to learn more about how SAP Databricks can help supercharge your enterprise.

 

Accelerate and Scale your Event Driven Architecture with GridGain

Are you looking for a way to accelerate and scale your Event Driven Architecture in the cloud? GridGain is here to help. GridGain, built on top of Apache Ignite, is a comprehensive in-memory computing platform that provides distributed caching, messaging, and compute capabilities, with enterprise-grade support. With its performance capabilities, it can increase the overall responsiveness of your EDA application, allowing you to build applications that respond quickly to changing conditions. This enables database engineers, solution architects, and developers alike to gain greater control over their system’s uptime while eliminating wasted resources due to inefficient data processing. In this blog post we will explore these performance features of GridGain and how they allow us to build better apps faster than ever before.

Introducing GridGain Performance for Event Driven Architectures

GridGain Performance is the ideal solution for cloud-based applications requiring event-driven architectures. With a powerful combination of ACID transactions, integration capabilities and cloud scalability, GridGain Performance provides the best in cloud development for quickly adapting to changing conditions. GridGain is designed to scale horizontally across multiple nodes, enabling it to handle large amounts of data and processing tasks. Its shared-nothing architecture allows it to scale out and achieve high levels of parallelism. GridGain provides additional performance features such as data locality, collocated processing, and optimized messaging. This makes GridGain ideal for cloud-based event-driven architecture as it ensures low-latency processing and high throughput, even with a large number of concurrent users.
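
As a small, concrete sketch of working with the platform, here is a key-value example using the open-source Apache Ignite thin client for Python (pyignite), which also works against GridGain since GridGain is built on Ignite. The host, port, and cache name are assumptions for illustration, not values from any particular deployment.

    # pip install pyignite
    from pyignite import Client

    # Connect to a local Ignite/GridGain node's thin-client port (10800 by default).
    client = Client()
    client.connect("127.0.0.1", 10800)

    # Create (or reuse) a distributed cache and do simple key-value operations.
    orders = client.get_or_create_cache("orders")   # hypothetical cache name
    orders.put("order-42", "SHIPPED")
    print(orders.get("order-42"))

    client.close()

Under the hood the cache is partitioned across the cluster, so the same put/get calls keep working as nodes are added.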

How GridGain Performance Improves EDA Scalability

By utilizing shared-nothing architectures and GridGain’s high-performance infrastructure, organizations can achieve scalability in a fraction of the time most traditional event-driven architectures require. A shared-nothing design lets each node work independently on its own partition of the data, while GridGain’s in-memory data fabric allows users to share data across thousands of nodes quickly and efficiently. Unlocking scalability through a shared-nothing architecture, combined with GridGain’s performance, brings businesses closer to fast, real-time insights with minimal cost and operations overhead in cloud deployments.

Key Features of GridGain Performance for EDA

GridGain provides a flexible data model that supports key-value, SQL, and compute-based processing. GridGain provides multiple persistence options such as in-memory, native persistence, and database integration. This makes it easy to store and retrieve data, even after a system restart or failure. GridGain also supports ACID transactions, ensuring data consistency and integrity. GridGain provides integrations with popular databases such as Oracle, SQL Server, and MySQL, as well as with other Apache projects such as Hadoop, Spark, and Cassandra. This makes it easy to integrate GridGain with existing systems and workflows, enabling a seamless transition to a cloud-based event-driven architecture.
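
To illustrate the SQL side of that flexible data model, here is a hedged sketch of running SQL through the same Python thin client; the table and data are made up, and options like native persistence would be enabled in the cluster configuration rather than from this client code.

    from pyignite import Client

    client = Client()
    client.connect("127.0.0.1", 10800)

    # DDL and DML run as SQL against the cluster.
    client.sql(
        "CREATE TABLE IF NOT EXISTS city (id INT PRIMARY KEY, name VARCHAR, population INT)"
    )
    client.sql(
        "INSERT INTO city (id, name, population) VALUES (?, ?, ?)",
        query_args=[1, "Chicago", 2700000],
    )

    # SELECT results come back as an iterable cursor of rows.
    for row in client.sql(
        "SELECT name, population FROM city WHERE population > ?",
        query_args=[1000000],
    ):
        print(row)

    client.close()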

Alternatives to GridGain

There are a number of popular alternatives for an Event Driven Architecture backend on the market.

Redis is an in-memory data store that can be used for caching, messaging, and real-time processing. It is a single-node, in-memory data store that can be deployed in a primary-replica or cluster architecture, while GridGain is a distributed in-memory computing platform that provides a shared-nothing architecture with distributed data structures, compute, and messaging capabilities. Redis provides a simple key-value data model, while GridGain provides a flexible data model that supports key-value, SQL, and compute-based processing. GridGain provides both in-memory and persistent storage options with support for SQL and ACID transactions. Redis does not provide support for SQL or ACID transactions.
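
For comparison, here is a tiny sketch of that key-value model using the redis-py client; the key names and the assumption of a local Redis instance are purely illustrative.

    # pip install redis
    import redis

    # Redis exposes a flat key-value model plus data-structure commands.
    r = redis.Redis(host="localhost", port=6379)   # assumed local instance
    r.set("order:42:status", "SHIPPED")
    print(r.get("order:42:status"))

    # There is no built-in SQL layer; querying by anything other than the key
    # means maintaining your own secondary structures (sets, sorted sets, etc.).
    r.sadd("orders:shipped", "order:42")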

Memcached is a distributed memory caching system that uses a client-server architecture. GridGain provides distributed data structures, compute, and messaging capabilities, with the ability to scale horizontally across multiple nodes using a shared-nothing architecture. Memcached provides a simple key-value data model, with no support for data partitioning, querying or indexing. GridGain provides a flexible data model that supports key-value, SQL, and compute-based processing. Memcached does not provide built-in persistence options. GridGain provides both in-memory and persistent storage options.

GridGain and Hazelcast are both in-memory computing platforms that provide distributed caching, messaging, and compute capabilities. Hazelcast is also a distributed in-memory computing platform, but unlike GridGain’s shared-nothing architecture, it provides a shared-data architecture in a peer-to-peer cluster configuration. Hazelcast supports persistence with options such as Hot Restart, but it does not provide support for SQL or ACID transactions.

GridGain is built on top of the open source Apache Ignite project. GridGain provides additional enterprise features and support on top of Apache Ignite, including automatic rebalancing and partitioning, additional data structures, more persistence options, optimized messaging, and additional integrations.

Summary

GridGain, built on top of Apache Ignite, is a comprehensive in-memory computing platform that provides distributed caching, messaging, and compute capabilities, with enterprise-grade support. GridGain is designed to scale horizontally across multiple nodes, enabling it to handle large amounts of data and processing tasks. Its shared-nothing architecture allows it to scale out and achieve high levels of parallelism.

GridGain provides additional performance features such as data locality, collocated processing, and optimized messaging. It also provides a flexible data model that supports key-value, SQL, and compute-based processing enabling a wide range of use cases such as caching, messaging, and distributed computing. GridGain provides multiple persistence options such as in-memory, native persistence, and database integration. This makes it easy to store and retrieve data, even after a system restart or failure. GridGain also supports ACID transactions, ensuring data consistency and integrity.

Finally, GridGain provides integrations with popular databases such as Oracle, SQL Server, and MySQL, as well as with other Apache projects such as Hadoop, Spark, and Cassandra. This makes it easy to integrate GridGain with existing systems and workflows, enabling a seamless transition to a cloud-based event-driven architecture.
