
Using Snowflake and Databricks Together

This is not another comparison of Databricks and Snowflake; those are easy to find. This is a practical guide to using Databricks and Snowflake together in your organization. Many companies have both products in place. Sometimes there is a discrepancy between the data stored in each, which creates new data silos: Databricks users want to access data that only lives in Snowflake, and vice versa. Sometimes the issue is optimization: would a given process run cheaper or faster in Snowflake rather than Databricks, or the other way around? Sometimes it's both. Regardless, plenty of organizations run both Snowflake and Databricks, and it is worthwhile to see how they can be used together.

Look Under the Hood

There are different tactical approaches to using Snowflake and Databricks together. While either direction is possible, the underlying implementation details are very different. We will also discuss accessing managed versus external data. Both Snowflake and Databricks use cloud providers for storage, and both have the concept of managed (internal) and unmanaged (external) tables. A managed resource is a file stored in the cloud whose lifecycle is controlled by Snowflake or Databricks, while an external file's lifecycle is not fully controlled by either product. In other words, if you DROP a managed table, the metadata and the underlying data are deleted; if you DROP an unmanaged table, only the metadata is affected. External tables are usually considered read-only, but interesting exceptions have emerged.

Access Snowflake Data from Databricks

Databricks Runtime 11.3 and above includes a Snowflake connector. You specify Snowflake as the format when you read from or write to Snowflake: spark.read/write.format("snowflake"). Both the Databricks and the Snowflake documentation stress the importance of storing the Snowflake credentials in Databricks secrets. The documentation page has a simple notebook example that I recommend you try first just to test connectivity. It's also a good idea to look into Lakehouse Federation, which lets you query Snowflake data (among other sources) without moving it into Databricks. Snowflake has a Secure Data Sharing feature and is starting to extend that access to other vendors (like Salesforce), but not to Databricks at this time.
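Here is a minimal sketch of what that looks like in a Databricks notebook (where spark and dbutils are already defined). The secret scope, account URL, and table names are placeholders for illustration.

```python
# Connection options for the Snowflake connector; credentials come from a
# Databricks secret scope rather than being hard-coded in the notebook.
sf_options = {
    "sfUrl": "myaccount.snowflakecomputing.com",   # hypothetical account URL
    "sfUser": dbutils.secrets.get(scope="snowflake-creds", key="user"),
    "sfPassword": dbutils.secrets.get(scope="snowflake-creds", key="password"),
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH",
}

# Read a Snowflake table into a Spark DataFrame
customers = (spark.read
             .format("snowflake")
             .options(**sf_options)
             .option("dbtable", "CUSTOMERS")
             .load())

# Write a transformed result back to Snowflake
(customers.filter("ACTIVE = TRUE")
          .write
          .format("snowflake")
          .options(**sf_options)
          .option("dbtable", "ACTIVE_CUSTOMERS")
          .mode("overwrite")
          .save())
```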

Another way to read Snowflake tables in Databricks is through external tables. External tables in Snowflake have traditionally been read-only resources, but that story changes with Apache Iceberg tables, which use the Apache Parquet file format. You can configure an external volume for Iceberg tables and then grant access in Snowflake to the external Databricks user by giving a new IAM role read-only privileges. Next, create a cluster in Databricks: provide the Spark config with the connection information the Snowflake catalog needs, set the environment variables that hold the keys for the cloud provider's storage layer, and add the Iceberg Spark runtime and the Snowflake JDBC driver as cluster libraries. No Snowflake warehouse is needed for Databricks; all the compute happens on the Databricks Spark cluster.
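As a rough sketch of the notebook side once such a cluster exists: the catalog, database, schema, and table names below are illustrative, and the Spark config keys shown in the comments should be verified against the current Snowflake Iceberg catalog documentation.

```python
# Assumed cluster-level Spark config (illustrative; set on the cluster, not in the notebook):
#   spark.sql.catalog.snowflake_catalog               org.apache.iceberg.spark.SparkCatalog
#   spark.sql.catalog.snowflake_catalog.catalog-impl  org.apache.iceberg.snowflake.SnowflakeCatalog
#   spark.sql.catalog.snowflake_catalog.uri           jdbc:snowflake://<account>.snowflakecomputing.com
# plus environment variables holding the cloud storage keys, with the Iceberg
# Spark runtime and the Snowflake JDBC driver installed as cluster libraries.

# Query the Snowflake-managed Iceberg table directly; all compute runs on the
# Databricks cluster, so no Snowflake warehouse is required.
orders = spark.sql("SELECT * FROM snowflake_catalog.analytics_db.public.orders_iceberg")
orders.groupBy("order_status").count().show()
```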

Access Databricks Data from Snowflake

Snowflake does not connect directly to Databricks; it uses an external stage to access external, or unmanaged, Databricks tables. Both Snowflake and Databricks are cloud native, and both make use of the storage options on AWS, Azure and GCP. When you create an external table in Databricks, you are combining a storage path with a storage credential. Databricks manages both the metadata and the data for a managed table, while it manages only the metadata for an external table. If you are considering this approach, it's a good idea to understand how Snowflake can integrate with Delta Lake.
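For illustration, here is a hedged sketch, using the snowflake-connector-python package, of pointing Snowflake at the storage path behind an external Databricks table. The storage integration, stage, bucket, and table names are all placeholders.

```python
import snowflake.connector

# Connect to Snowflake (credentials shown inline only for brevity)
conn = snowflake.connector.connect(
    account="myaccount",
    user="etl_user",
    password="********",
    warehouse="COMPUTE_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)
cur = conn.cursor()

# External stage over the cloud path that backs the Databricks external table
cur.execute("""
    CREATE STAGE IF NOT EXISTS databricks_bronze_stage
      URL = 's3://my-lakehouse-bucket/bronze/events/'
      STORAGE_INTEGRATION = my_s3_integration
""")

# External table over the Parquet files in that stage (read-only in Snowflake)
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_ext
      LOCATION = @databricks_bronze_stage
      FILE_FORMAT = (TYPE = PARQUET)
""")

cur.execute("SELECT COUNT(*) FROM events_ext")
print(cur.fetchone())
```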

Conclusion

Databricks can read from and write to Snowflake managed tables, query Snowflake's managed and external tables through Lakehouse Federation, read external tables, and even read and write external Iceberg tables. Snowflake can only read Databricks data natively in the sense that both products can read data from cloud storage. Snowflake can also read data in Databricks that has been made available through Delta Share.

Enterprise Considerations

I touched briefly on how each product can communicate with the other just to show the guardrails around what is possible. From a pure bake-off perspective, Databricks offers more tools for data sharing. However, we need to look at more than a feature matrix to see how data can be used more efficiently and effectively. For that, we need to evaluate these options from a security, governance and cost perspective.

Security

Early in this post, I recommended trying the sample Databricks notebook to see if you could access Snowflake. I did that for a reason: I'll bet it was hard. It's not technically difficult, but there is a lot to unpack here. Snowflake's RBAC model provides finer-grained access control within the database, while Databricks focuses on workspace- and cluster-level security. Snowflake manages data encryption and secure data access itself, while Databricks leverages the cloud provider. I pointed out that Databricks uses a secrets-management approach, while Snowflake uses encryption key management for data at rest. Databricks uses UDFs and dynamic views to support row filters and column-level masking, which would not be available to a Snowflake instance reading the underlying files as an external table. Snowflake's Enterprise Edition provides row-level security through row access policies, which I think should work with the Databricks connector. Column-level security in Snowflake, which could involve dynamic data masking or external tokenization, would also require thorough testing.
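To make the dynamic-view point concrete, here is a small sketch of column masking in Databricks; the table, view, and group names are made up. The key observation is that this masking lives in Databricks and does not travel with the underlying files if Snowflake reads them directly as an external table.

```python
# A dynamic view that exposes the email column only to members of a
# privileged group; everyone else sees a redacted value.
spark.sql("""
    CREATE OR REPLACE VIEW analytics.customers_masked AS
    SELECT
      customer_id,
      CASE
        WHEN is_account_group_member('pii_readers') THEN email
        ELSE '***REDACTED***'
      END AS email,
      region
    FROM analytics.customers
""")
```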

Governance

Databricks relies on Unity Catalog to provide a unified solution for all data and AI assets, including data lineage and access controls, and Delta Lake provides ACID transactions and data versioning. Snowflake provides Time Travel for auditing and data recovery, object tagging and data classification for compliance and privacy, data masking and row-level security, and granular access control at various levels of data storage. Basically, governance in Databricks is focused on governing data processing and analytics, while Snowflake concentrates on data storage and data access. Both take valid, strong approaches to governance, but their approaches are not the same. That doesn't mean there will be discrepancies; just that it is more likely there could be.

Cost

There are also a lot of articles about which platform is cheaper. This is not one of those articles. Objectively, they have different cost structures. Snowflake separates its compute and storage costs. Storage is billed per terabyte per month and includes active and historical data (Time Travel and Fail-safe). Compute is billed in Snowflake Credits, a time-based consumption model calculated from the type and size of the virtual warehouses used. Databricks uses Databricks Units (DBUs) to represent a unit of processing capability per hour and charges based on the type and number of nodes in a cluster; you pay for running clusters. Trying to compare the relative dollar value of a Snowflake Credit and a DBU is not very helpful, but you can make some basic assumptions.
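To show how differently the two models are structured, here is a purely illustrative calculation; every rate and quantity below is a hypothetical placeholder, not a published price, and the point is the shape of the math rather than the result.

```python
# Snowflake: storage billed per TB per month; compute billed in credits that
# accrue while a virtual warehouse of a given size is running.
storage_tb = 10
storage_rate = 23.0               # $/TB-month (hypothetical)
credits_per_hour = 4              # depends on warehouse size (hypothetical)
warehouse_hours = 200
credit_price = 3.0                # $/credit (hypothetical)
snowflake_monthly = storage_tb * storage_rate + credits_per_hour * warehouse_hours * credit_price

# Databricks: compute billed in DBUs per node-hour, with the cloud provider's
# VM and storage charges billed separately.
dbu_per_node_hour = 2.0           # hypothetical
dbu_price = 0.40                  # $/DBU (hypothetical)
vm_cost_per_node_hour = 0.50      # paid to the cloud provider (hypothetical)
nodes = 8
cluster_hours = 200
databricks_monthly = nodes * cluster_hours * (dbu_per_node_hour * dbu_price + vm_cost_per_node_hour)

print(f"Snowflake  (hypothetical): ${snowflake_monthly:,.2f}/month")
print(f"Databricks (hypothetical): ${databricks_monthly:,.2f}/month")
```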

Heavy processing workflows involving complex data transformations can be more cost effective in Databricks than in Snowflake, since these operations consume a lot of compute resources. However, if you have a lot of transformation work that would need to run on a single Spark executor node, Snowflake might have an advantage. Other workloads that require a lot of compute time, such as real-time analytics and high-concurrency workloads, are usually cheaper on Databricks. Simple data warehouse operations, as well as small or infrequent data processing tasks, will likely be cheaper on Snowflake. Despite strong marketing efforts on both sides, Snowflake is usually cheaper for data warehouse work and Databricks is usually cheaper for analytics and data processing.

Sample Options

We have briefly touched on some of the implementation details and practical enterprise considerations around using Snowflake and Databricks together. It's usually helpful to think of Databricks as a unified data analytics platform and Snowflake as a data warehouse. More simply, Databricks is good for data transformation and is used by the kinds of people who like Python and Jupyter notebooks, while Snowflake is preferred by those who like SQL and tools such as Talend. Both companies have marketing departments that would take offense at this characterization, but there we go. It does give us a good foundation for visualizing a data strategy where we break down the silos that are forming around the very products we bought to break down data silos.

Cloud Storage layer

We've seen that both products can read files from cloud storage. Each would prefer that you manage those files within its own system, but both have consumption-based pricing models. It is not unreasonable, for example, to keep the raw data in cloud storage such as S3 and let AWS manage the lifecycle, versioning, archiving and deliberate deletion of those files. If the files are not managed by any system that could execute a DROP TABLE, you essentially have an immutable Bronze layer (to borrow from the medallion architecture). Technically, the data is shared among, but not between, Databricks and Snowflake. This approach gives us only blunt governance and security tools, but they have the advantage of being foundational and shared.
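As a sketch of what "let AWS manage the lifecycle" can mean in practice, here is a hedged boto3 example; the bucket, prefix, and retention periods are illustrative.

```python
import boto3

s3 = boto3.client("s3")

# Lifecycle rule for the raw Bronze prefix: archive old files to cheaper
# storage and expire old object versions, without either Snowflake or
# Databricks ever owning (or being able to DROP) the data.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-lakehouse-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-bronze",
                "Filter": {"Prefix": "bronze/raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
            }
        ]
    },
)
```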

Bronze layer

It may be because of its open-source roots, but Databricks has the advantage in sharing and integration. If the source data cannot live entirely in cloud storage, it makes sense for Databricks to be the source of data at this layer. The rationale is that Databricks is more efficient for most ETL jobs; while there are exceptions, they are just that. I do think external tables make sense for Databricks at the Bronze layer, but that argument is not as strong once we get to Silver and Gold.
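A minimal sketch of what an external Bronze table looks like on the Databricks side; the catalog, schema, and storage path are illustrative.

```python
# Register the Bronze dataset as an external (unmanaged) Delta table.
# Databricks manages only the metadata; DROP TABLE leaves the files in place,
# and other engines can still reach the same path directly.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.bronze.events
    USING DELTA
    LOCATION 's3://my-lakehouse-bucket/bronze/events/'
""")
```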

Gold layer

The Gold layer typically holds aggregated datasets that are mostly used for reporting. Databricks has a SQL warehouse offering, but Snowflake is still the clear winner for SQL-based data warehousing use cases. Using Delta Share to make this data available in Snowflake is usually a pretty safe strategy. Aggregation usually de-identifies sensitive information, so there isn't the same need for the kind of fine-grained data controls that Snowflake has and that Databricks can't provide through Delta Share. In fact, Snowflake is strong enough in this area that you may find it common practice for Databricks to read these Gold-layer tables back from Snowflake. Which brings us to the Silver layer.
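For reference, the Databricks side of sharing a Gold table looks roughly like this; the share, recipient, and table names are illustrative, and the recipient's credential setup on the consuming side is not shown.

```python
# Create a share, add the aggregated Gold table to it, and grant a recipient
# access. The consumer then reads the shared table through Delta Sharing.
spark.sql("CREATE SHARE IF NOT EXISTS gold_reporting_share")
spark.sql("ALTER SHARE gold_reporting_share ADD TABLE lakehouse.gold.daily_revenue")
spark.sql("CREATE RECIPIENT IF NOT EXISTS reporting_consumer")
spark.sql("GRANT SELECT ON SHARE gold_reporting_share TO RECIPIENT reporting_consumer")
```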

Silver layer

The Silver layer is typically used for analytics, machine learning and AI. Databricks has the early-adopter lead in this space, but GenAI has made everyone raise their game, and Snowflake is no exception. This is where each enterprise is going to be different. Consider that you will need a unified security, governance and compliance model at this layer for shared data; a tool like Immuta could help here. Another approach is to store some Silver-layer tables as external tables so that Snowflake can use them. And as we saw earlier, if Snowflake has tables in this layer that Databricks users want to access, that is very easy to provide.

Conclusion

We’ve seen that using Snowflake and Databricks together is definitely possible, both easier and messier than you may have thought, and might be a very good idea. Snowflake and Databricks have different strengths and weaknesses depending on who is using the data and how. While it is possible to share the data at every point in its lifecycle, there are practical concerns around security, governance and cost that need careful analysis. Get in touch with us if you want to know more about how using Snowflake and Databricks together might fit in with your strategic data initiatives.


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud, with an emphasis on integrating modern data products such as HBase, Cassandra and Neo4j as the off-blockchain repository.
