Data-driven companies are finding a growing number of use cases where supplementing internal data with external datasets delivers more business value. At the same time, there are legitimate data privacy concerns that need to be addressed, particularly among regulated enterprises in the financial and healthcare sectors. This creates an opportunity for a platform where participants can share sensitive information in a secure, governed, privacy-preserving manner. A data cleanroom is one architectural model that could serve this need, and Databricks' Delta Sharing technology offers a viable implementation.
Data Cleanrooms
Google introduced Ads Data Hub in 2017. Ads Data Hub provided advertisers with impression-level insights from partners such as retailers, publishers and ad agencies in a secure, privacy-first, governed manner. This model was referred to as a data cleanroom. A data cleanroom is a secure and regulated environment where partner organizations can bring their sensitive data, which may contain PII (personally identifiable information) or PHI (protected health information), to be analyzed alongside other private data. Cleanroom member organizations keep full control over their data and can decide who to share it with, without exposing any confidential information.
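To make the idea concrete, here is a minimal sketch of one analysis a cleanroom might permit: two participants match customers on salted hashes of an identifier, and only aggregate counts above a minimum group size are released. Everything here is an illustrative assumption rather than how Ads Data Hub or Databricks implements a cleanroom: the function names, the `email` and `segment` fields, the shared salt and the threshold of 3 are all hypothetical.

```python
import hashlib
from collections import defaultdict

# Hypothetical policy: never report a group smaller than this.
MIN_GROUP_SIZE = 3


def hash_id(raw_id: str, salt: str) -> str:
    """Return a salted SHA-256 digest of a normalized identifier."""
    normalized = raw_id.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def prepare(records, salt, id_field="email"):
    """Swap the raw identifier for its hash before data leaves a participant."""
    prepared = []
    for rec in records:
        row = {k: v for k, v in rec.items() if k != id_field}
        row["hashed_id"] = hash_id(rec[id_field], salt)
        prepared.append(row)
    return prepared


def overlap_by_segment(advertiser_rows, publisher_rows,
                       min_group_size=MIN_GROUP_SIZE):
    """Count matched customers per publisher segment, suppressing small groups."""
    advertiser_ids = {row["hashed_id"] for row in advertiser_rows}
    matched = defaultdict(set)
    for row in publisher_rows:
        if row["hashed_id"] in advertiser_ids:
            matched[row["segment"]].add(row["hashed_id"])
    # Only aggregates that meet the threshold are released to participants.
    return {seg: len(ids) for seg, ids in matched.items()
            if len(ids) >= min_group_size}


# Made-up sample data for illustration only.
SALT = "shared-salt"
advertiser = prepare(
    [{"email": f"{name}@example.com"} for name in "abcde"], SALT)
publisher = prepare(
    [
        {"email": "a@example.com", "segment": "sports"},
        {"email": "b@example.com", "segment": "sports"},
        {"email": "c@example.com", "segment": "sports"},
        {"email": "d@example.com", "segment": "news"},
        {"email": "z@example.com", "segment": "news"},
    ],
    SALT,
)
print(overlap_by_segment(advertiser, publisher))  # {'sports': 3}
```

The "news" segment has only one matched customer, so it is withheld. Note that salted hashing alone is not a strong privacy guarantee; a real cleanroom would combine it with access controls and auditing.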
Databricks' open source Delta Sharing protocol allows enterprises to securely share live data, whether their data systems are on-premises, cloud-based or hybrid, and whether or not they are using Databricks. With Delta Sharing, data providers can share live data in the Apache Parquet or Delta Lake format without replicating or moving it to another system. Delta Sharing allows multiple organizations to share data safely and securely, while a centralized governance layer (Unity Catalog) makes it possible to audit access to any shared information. Fine-grained governance is critical to successful cleanroom implementations.
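On the recipient side, reading a shared table with the open source `delta-sharing` Python connector can look roughly like the sketch below. The helper names `table_url` and `load_shared_table` are hypothetical, as are the profile path and the share, schema and table names; the data provider supplies the real values. The connector is imported inside the function so the snippet can be read and run without it installed.

```python
def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' address of a shared table."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path, share, schema, table, limit=None):
    """Load a shared table into a pandas DataFrame.

    Assumes the connector is installed (pip install delta-sharing) and that
    profile_path points to a credentials file issued by the data provider.
    """
    import delta_sharing  # imported lazily; only needed when data is fetched

    url = table_url(profile_path, share, schema, table)
    return delta_sharing.load_as_pandas(url, limit=limit)


# Hypothetical names for illustration only.
print(table_url("config.share", "partner_share", "sales", "transactions"))
# config.share#partner_share.sales.transactions
```

The data is read directly from the provider's storage, so no copy has to be maintained on the recipient's side.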
Opportunities with Delta Sharing
Organizations that build data-sharing partnerships around data cleanrooms can begin to get ahead of the curve on three major emerging vectors: privacy regulations, fragmented consumer ecosystems and monetization opportunities.
Regulations surrounding data privacy, like GDPR and CCPA, as well as shifts in third-party measurement like Apple’s App Tracking Transparency Framework, have significantly changed how organizations handle data. For example, publishers, advertisers and digital advertising platforms are moving to Unified ID 2.0 in response to Google’s plan to deprecate third-party cookies in Chrome by 2023. Providing meaningful and effective mechanisms to join customer data among partner organizations will become more complex as privacy laws and practices evolve, and data cleanrooms offer a functional solution.
Consumers have more and more options for engaging with services and content, whether that means choosing between online and in-office doctor visits or using multiple devices to interact with a content provider. This fragmentation is best addressed through secure, privacy-centric collaboration. Creating a single view of a customer increasingly calls for a data cleanroom.
There are opportunities here for companies that are willing to pursue new mechanisms for monetizing data. The marketplace has a largely unmet need for privacy-compliant access to external data sources for big data analytics, without gaining direct access to the data or moving it.
Conclusion
Growing privacy regulation, data fragmentation and consumer expectations are driving the adoption of data cleanrooms across multiple industries. The Databricks Lakehouse Platform provides everything needed to build, serve and deploy a scalable and flexible data cleanroom that complies with your data privacy and governance requirements. Delta Sharing allows cleanroom participants to securely share data with others without replicating any of the information. Your data stays under your control, and you’re not stuck using one specific platform. Since all queries are executed on Databricks-hosted, privacy-safe compute, participants never have access to the raw data, which protects user information. With Unity Catalog, organizations can control who sees which data and stay within privacy requirements. With Databricks, users are not limited to SQL; they can also run complex computations and workloads in popular languages such as R, Scala and Python.