Metadata contention in Unity Catalog can occur in high-throughput Databricks environments, slowing user queries and impacting performance across the platform. Our FinOps strategy shifts performance left, yet we still find scenarios where clients experience intermittent query slowdowns, even on optimized queries. As our clients' lakehouse footprints grow, we are seeing an emerging pattern where stress on Unity Catalog creates a downstream drag on performance across the workspace. In some cases, we have identified metadata contention in Unity Catalog as a contributor to unexpected degradation in response times after controlling for more targeted optimizations.
When data ingestion and transformation pipelines rely on structural metadata changes, they introduce several stress points across Unity Catalog’s architecture. These are not isolated to the ingestion job—they ripple across the control plane and affect all users.
CREATE OR REPLACE TABLE adds to the metadata transaction load in Unity Catalog. This leads to:
CREATE OR REPLACE TABLE invalidates the current logical and physical plan cache for all compute clusters referencing the old version. This leads to:
In other blogs, I have said that predictive optimization is the reward for investing in good governance practices with Unity Catalog. One of the key enablers of predictive optimization is a current, cached logical and physical plan. Every time a table is created, a new logical and physical plan is generated for that table and related tables. This means that every time you execute CREATE OR REPLACE TABLE, you are back to square one for performance optimization. The DROP TABLE + CREATE TABLE pattern has the same net result.
This is not to say that CREATE OR REPLACE TABLE is inherently an anti-pattern. It only becomes a potential performance issue at scale (think thousands of jobs rather than hundreds). It's also not the only culprit: ALTER TABLE statements that make structural changes have a similar effect. CREATE OR REPLACE TABLE is ubiquitous in data ingestion pipelines, and it doesn't start to cause a noticeable issue until it is deeply ingrained in your developers' muscle memory. There are alternatives, though.
There are different techniques you can use that will not invalidate the plan cache.
CREATE TABLE IF NOT EXISTS + INSERT OVERWRITE is probably my first choice because there is a straight code migration path.

```sql
CREATE TABLE IF NOT EXISTS catalog.schema.table (
  id INT,
  name STRING
) USING DELTA;

INSERT OVERWRITE catalog.schema.table
SELECT * FROM staging_table;
```
MERGE INTO and COPY INTO have the metadata advantages of the prior solution and support schema evolution as well as concurrency-safe ingestion.

```sql
MERGE INTO catalog.schema.table t
USING (SELECT * FROM staging_table) s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
```sql
COPY INTO catalog.schema.table
FROM '/mnt/source/'
FILEFORMAT = PARQUET
FORMAT_OPTIONS ('mergeSchema' = 'true');
```
Temporary views never touch the Unity Catalog metastore at all, which makes them a good fit for intermediate pipeline stages:

```python
df.createOrReplaceTempView("job_tmp_view")
```
Writing into an explicit partition with INSERT INTO is another option, particularly for high-concurrency, batch-style jobs:

```sql
CREATE TABLE IF NOT EXISTS catalog.schema.import_data (
  id STRING,
  source STRING,
  load_date DATE
) PARTITIONED BY (source, load_date);

INSERT INTO catalog.schema.import_data
PARTITION (source = 'job_xyz', load_date = current_date())
SELECT * FROM staging;
```
I have summarized the different techniques you can use to minimize plan invalidation in the table below. In general, I think INSERT OVERWRITE usually works well as a drop-in replacement. You get schema evolution with MERGE INTO and COPY INTO. I am often surprised at how many tables that should be treated as temporary are persisted instead; auditing your jobs for these is a worthwhile exercise on its own. Finally, there are occasions when the Partition + INSERT paradigm is preferable to INSERT OVERWRITE, particularly for high-concurrency workloads.
Technique | Metadata Cost | Plan Invalidation | Concurrency-Safe | Schema Evolution | Notes |
---|---|---|---|---|---|
CREATE OR REPLACE TABLE | High | Yes | No | Yes | Use with caution in production |
INSERT OVERWRITE | Low | No | Yes | No | Fast for full refreshes |
MERGE INTO | Medium | No | Yes | Yes | Ideal for idempotent loads |
COPY INTO | Low | No | Yes | Yes | Great with Auto Loader |
TEMP VIEW / TEMP TABLE | None | No | Yes | N/A | Best for intermediate pipeline stages |
Partition + INSERT | Low | No | Yes | No | Efficient for batch-style jobs |
Tuning the performance characteristics of a platform is more complex than single-application performance tuning. Distributed performance is even more complicated at scale, since strategies and patterns may start to break down as volume and velocity increase.
Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.
]]>A well-known fact about data is that it is a crucial asset in an organization when managed appropriately. Data governance helps organizations manage data appropriately. Some customers say data governance is an optional best practice, not a mandatory implementation strategy.
Then, ask your customer a few questions:
Let’s explore why data governance is no longer optional in today’s data-driven world.
The world creates millions of terabytes of data every single day. However, 80% of enterprise data remains poor quality, unstructured, inaccurate, or inaccessible, leading to poor decision-making, compliance risks, and inefficiencies.
Poor data quality impacts businesses and costs millions of dollars annually due to lost productivity, missed opportunities, and regulatory fines.
50% of data scientists’ time is wasted cleaning and organizing messy data instead of deriving insights. Without governance, businesses rely on outdated, inconsistent, or redundant data, leading to poor decisions.
A data governance program ensures:
Companies face significant penalties for violating data regulations like GDPR, CCPA, and HIPAA, including substantial fines, potential criminal charges, and reputational damage. They have already paid billions of dollars in fines for data breaches and non-compliance.
Data governance programs ensure:
Most small businesses shut down within six months of a data breach, and the average cost of a data breach is now $4.45 million.
A data governance framework can help:
Bad data costs enterprises 30% of their revenue annually. Inefficient data management leads to:
Data governance programs ensure:
Most executives say their teams make decisions based on siloed data, which creates inefficiencies, misaligned strategies, and lost revenue opportunities.
A data governance program can ensure:
93% of companies that experience significant data loss without backup shut down within one year. Without governance, businesses struggle to recover critical data after a breach or system failure.
A governance program helps:
85% of AI projects fail due to poor data quality. AI models require structured, accurate, and unbiased data, which is impossible without governance.
A strong governance program:
A structured data governance approach turns enterprise data into a competitive advantage. In today’s dynamic business environment, data governance is not just a regulatory requirement—it’s a strategic advantage.
]]>Achieving end-to-end lineage in Databricks while allowing external users to access raw data can be a challenging task. In Databricks, leveraging Unity Catalog for end-to-end lineage is a best practice. However, enabling external users to access raw data while maintaining security and lineage integrity requires a well-thought-out architecture. This blog outlines a reference architecture to achieve this balance.
To meet the needs of both internal and external users, the architecture must:
The architecture starts with a shared data lake as a landing zone for raw, unprocessed data from various sources. This data lake is located in external cloud storage, such as AWS S3 or Azure Data Lake, and is independent of Databricks. Access to this data is managed using IAM roles and policies, allowing both Databricks and external users to interact with the data without overlapping permissions.
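If the Databricks side of that access is governed through Unity Catalog, the landing zone is typically registered as an external location backed by a storage credential. A minimal sketch, assuming a hypothetical bucket and credential name:

```sql
-- Register the shared landing zone so Unity Catalog can govern Databricks access to it
CREATE EXTERNAL LOCATION IF NOT EXISTS shared_landing_zone
URL 's3://shared-data-lake/landing/'
WITH (STORAGE CREDENTIAL shared_lake_credential)
COMMENT 'Raw landing zone shared with external consumers';
```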
Benefits:
The bronze layer ingests raw data from the shared data lake into Databricks. Using Delta Live Tables (DLT), data is processed and stored as managed or external Delta tables. Unity Catalog governs these tables, enforcing fine-grained access control to maintain data security and lineage. End-to-end lineage in Databricks begins with the bronze layer and can easily be maintained through the silver and gold layers by using DLT.
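To illustrate, a streaming bronze table can be declared directly against the shared landing zone using Auto Loader. This is a minimal sketch in classic DLT SQL syntax; the table name and landing path are hypothetical:

```sql
CREATE OR REFRESH STREAMING LIVE TABLE bronze_orders
COMMENT 'Raw orders ingested from the shared data lake landing zone'
AS SELECT
  *,
  current_timestamp() AS _ingested_at  -- capture ingestion time for lineage and auditing
FROM cloud_files('s3://shared-data-lake/landing/orders/', 'json');
```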
Governance:
Subsequent data processing transforms bronze data into refined (silver) and aggregated (gold) tables. These layers are exclusively managed within Databricks to ensure lineage continuity, leveraging Delta Lake’s optimization features.
Access:
This reference architecture offers a balanced approach to handling raw data access while maintaining governance and lineage within Databricks. By isolating raw data in a shared lake and managing processed data within Databricks, organizations can effectively support both internal analytics and external data sharing.
Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.
]]>Deletion Vectors will be enabled by default in Delta Live Tables (DLTs) for materialized views and streaming tables starting April 28, 2025. Predictive Optimization for DLT maintenance will also be enabled by default. This could provide both cost savings and performance improvements. Our Databricks Practice holds FinOps as a core architectural tenet, but sometimes compliance overrules cost savings.
Deletion vectors are a storage optimization feature that replaces physical deletion with soft deletion. The underlying Parquet files are immutable by design, so an entire file must be rewritten when a record is physically deleted. With a soft delete, records are instead marked as deleted in a deletion vector rather than physically removed, which is a performance boost. There is a catch once we consider data deletion within the context of regulatory compliance.
Data privacy regulations such as GDPR, HIPAA, and CCPA impose strict requirements on organizations handling personally identifiable information (PII) and protected health information (PHI). Ensuring compliant data deletion is a critical challenge for data engineering teams, especially in industries like healthcare, finance, and government. However, in regulated industries, their default implementation may introduce compliance risks that must be addressed.
Deletion Vectors in Delta Live Tables offer an efficient and scalable way to handle record deletion without requiring expensive file rewrites. Physically removing rows can cause performance degradation due to file rewrites and metadata operations. Instead of physically deleting data, a deletion vector marks records as deleted at the storage layer. These vectors ensure that deleted records are excluded from query results while Predictive Optimization improves storage performance by determining the most cost-effective time to run. There is no way to align this automated procedure with organizational retention policies. This can expose your organization to regulatory compliance risk.
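For tables where the compliance risk outweighs the performance benefit, the feature can be turned off at the table level through a Delta table property. A minimal sketch against a hypothetical table:

```sql
-- Opt a sensitive table out of deletion vectors so deletes rewrite files immediately
ALTER TABLE catalog.schema.patient_records
SET TBLPROPERTIES ('delta.enableDeletionVectors' = false);
```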
While Deletion Vectors improve performance, they present potential challenges for regulated enterprises:
Organizations that require strict compliance should implement the following measures to enforce hard deletes when necessary:
To ensure that records are permanently removed rather than just hidden:
- Run explicit DELETE operations followed by OPTIMIZE to force data compaction and file rewrites (one possible sequence is sketched below).
- Run VACUUM with a short retention period to permanently remove deleted data.
- Use REORG TABLE … APPLY (PURGE) to physically purge soft-deleted records.

Unity Catalog can help enforce compliance by:
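Tying the hard-delete steps above together, one possible sequence looks like the following sketch; the table, predicate, and retention window are hypothetical and should follow your own retention policy:

```sql
-- 1. Hard-delete the rows that must be purged
DELETE FROM catalog.schema.customer_events
WHERE customer_id = 'C-12345';

-- 2. Rewrite the affected files so soft-deleted rows are physically removed
REORG TABLE catalog.schema.customer_events APPLY (PURGE);

-- 3. Clean up superseded files; shorter windows require relaxing the default retention check
VACUUM catalog.schema.customer_events RETAIN 168 HOURS;
```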
Databricks provides system tables and information schema that can be leveraged for compliance monitoring:
Deletion Vectors in Delta Live Tables provide a modern approach to data deletion, addressing both performance and compliance concerns for regulated industries. However, their default soft-delete behavior may not align with strict data privacy regulations or internal deletion policies. Enterprises must implement additional safeguards such as physical deletion workflows, Unity Catalog tagging, and system table monitoring to ensure full compliance.
As an Elite Databricks Partner, we are here to help organizations operating under stringent data privacy laws obtain a clear understanding of Deletion Vectors’ limitations—along with proactive remediation strategies—to ensure their data deletion practices meet both legal and internal governance requirements.
Contact us to explore how we can integrate these fast-moving, new Databricks capabilities into your enterprise solutions and drive real business impact.
]]>Perficient has a FinOps mindset with Databricks, so the Automatic Liquid Clustering announcement grabbed my attention.
I’ve mentioned Liquid Clustering before when discussing the advantages of Unity Catalog beyond governance use cases. Unity Catalog: come for the data governance, stay for the predictive optimization. I am usually a fan of being able to tune the dials of Databricks. In this case, Liquid Clustering addresses the data management and query optimization aspects of cost control so simply and elegantly that I’m happy to take my hands off the controls.
Experienced Databricks data engineers are familiar with partitioning and data-skipping strategies to increase performance and reduce costs for their workloads. These topics are even in the certification exams.
Partitioning is set on table creation, while Z-Order columns are applied with the OPTIMIZE command.
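For reference, the manual approach looks something like this (a sketch; the table and column names are hypothetical):

```sql
-- Partitioning is fixed at table creation
CREATE TABLE catalog.schema.events (
  event_id    STRING,
  customer_id STRING,
  event_date  DATE
)
USING DELTA
PARTITIONED BY (event_date);

-- Z-Ordering is applied after the fact and has to be re-run as data arrives
OPTIMIZE catalog.schema.events
ZORDER BY (customer_id);
```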
Simple in theory; frustrating in practice.
In all fairness, I think most of us were partitioning wrong. In my case, I had initially approached partitioning a Delta table as if it were a Hive table or a Parquet file. This made intuitive sense to me as an early Spark developer, and I had deep knowledge of both architectures. Yet, repeatedly, I’d find myself staring wistfully into the middle distance through the ashes of another failed optimization attempt.
Databricks clearly saw that manual tuning didn’t scale. So, they introduced a better way.
Ingestion Time Clustering was introduced to address the issues with custom partitioning and Z-Ordering. This approach was taken based on their assumption that 51% of tables are partitioned on date/time keys. Now, we have a solution for about half of our workloads, which is great. But what about the other half?
Liquid Clustering addresses additional use cases beyond date/time partitioning. Addressing partitioning's limitations with concurrent write requirements was a big step forward in reliability. This is also a better solution for managing tables where access patterns change over time and potential keys may not result in well-sized partitions. It also handles tables filtered on high-cardinality columns, as Z-Ordering does, but without the additional cost. It adds the ability to manage tables with significant skew as well as tables that experience rapid growth. Databricks recommends enabling Liquid Clustering for all Delta tables, including materialized views and streaming tables. The syntax is very straightforward:
```sql
CLUSTER BY (col1)
```
It seems pretty simple: use liquid clustering everywhere and identify the column on which to cluster. How much simpler could it get?
Now, we find ourselves at a logical conclusion.
Unity Catalog collects statistics on managed tables and automatically identifies when OPTIMIZE, VACUUM, and ANALYZE maintenance operations should be run. Historical workloads for a managed table are analyzed asynchronously as an additional maintenance operation to identify candidate clustering keys.
You may have noticed from the syntax (CLUSTER BY (col1)) that Liquid Clustering is still vulnerable to changing access patterns invalidating the initial clustering key selection. This is where the automatic part comes in: clustering keys are changed when the predicted cost savings from data skipping outweigh the data clustering cost.
In other words:

```sql
CLUSTER BY AUTO
```
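In practice, the clause is attached at table creation or added to an existing table; automatic key selection relies on Unity Catalog managed tables with Predictive Optimization enabled. A sketch with hypothetical table names:

```sql
-- New tables can opt in at creation time
CREATE TABLE catalog.schema.events (
  event_id    STRING,
  customer_id STRING,
  event_date  DATE
)
CLUSTER BY AUTO;

-- Existing tables can be switched over in place
ALTER TABLE catalog.schema.orders CLUSTER BY AUTO;
```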
Data is in a very exciting but very tough place right now. Mainstream corporate acceptance of AI/ML means data engineers need to work harder than ever to get lots of data from disparate sources available to everything from SQL Warehouses to ML to RAGs to agentic solutions, while maintaining and improving on security and governance. Add downward pressure on budgets as cloud costs are perceived as too high. Optimization tuning is not a value-add at this point.
Keep Calm and Cluster by Auto.
Get in touch with us if you want to know more about how Automatic Liquid Clustering in Databricks could help you improve performance and bring costs down.
]]>
SAP Databricks is important because convenient access to governed data to support business initiatives is important. Breaking down silos has been a drumbeat of data professionals since Hadoop, but this SAP <-> Databricks initiative may help to solve one of the more intractable data engineering problems out there. SAP has a large, critical data footprint in many large enterprises. However, SAP has an opaque data model. There was always a long painful process to do the glue work required to move the data while recognizing no real value was being realized in that intermediate process. This caused a lot of projects to be delayed, fail, or not pursued resulting in a pretty significant lost opportunity cost for the client and a potential loss of trust or confidence in the system integrator. SAP recognized this and partnered with a small handful of companies to enhance and enlarge the scope of their offering. Databricks was selected to deliver bi-directional integration with their Databricks Lakehouse platform. When I heard there was going to be a big announcement, I thought we were going to hear about a new Lakehouse Federation Connector. That would have been great; I’m a fan.
This was bigger.
Technical details are still emerging, so I’m going to try to focus on what I heard and what I think I know. I’m also going to hit on some use cases that we’ve worked on that I think could be directly impacted by this today. I think the most important takeaway for data engineers is that you can now combine SAP with your Lakehouse without pipelines. In both directions. With governance. This is big.
I don’t know much about SAP, so you can definitely learn more here. I want to understand more about the architecture from a Databricks perspective and I was able to find out some information from the Introducing SAP Databricks post on the internal Databricks blog page.
This is when it really sunk in that we were not dealing with a new Lakeflow Connector; SAP Databricks is a native component in the SAP Business Data Cloud and will be sold by SAP as part of their SAP Business Data Cloud offering. It's not in the diagram here, but you can actually integrate new or existing Databricks instances with SAP Databricks. I don't want to get ahead of myself, but I would definitely consider putting that other instance of Databricks on another hyperscaler. In my mind, the magic is the dotted line from the blue "Curated context-rich SAP data products" up through the Databricks stack.
The promise of SAP Databricks is the ability to easily combine SAP data with the rest of the enterprise data. In my mind, easily means no pipelines that touch SAP. The diagram shows that the integration point between SAP and Databricks uses Delta Sharing as the underlying enablement technology.
Delta Sharing is an open-source protocol, developed by Databricks and the Linux Foundation, that provides strong governance and security for sharing data, analytics and AI across internal business units, cloud providers and applications. Data remains in its original location with Delta Sharing: you are sharing live data with no replication. Delta Share, in combination with Unity Catalog, allows a provider to grant access to one or more recipients and dictate what data can be seen by those shares using row and column-level security.
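To make that concrete, sharing a governed table with a recipient looks roughly like the following sketch; the share, table, and recipient names are hypothetical, and the sharing identifier comes from the recipient's metastore:

```sql
CREATE SHARE sap_data_products COMMENT 'Curated, context-rich SAP data products';

ALTER SHARE sap_data_products ADD TABLE sap_catalog.finance.cost_centers;

-- Databricks-to-Databricks sharing references the recipient's metastore identifier
CREATE RECIPIENT partner_lakehouse USING ID 'aws:us-east-1:<metastore-uuid>';

GRANT SELECT ON SHARE sap_data_products TO RECIPIENT partner_lakehouse;
```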
Databricks leverages Unity Catalog for security and governance across the platform including Delta Share. Unity Catalog offers strong authentication, asset-level access control and secure credential vending to provide a single, unified, open solution for protecting both (semi- & un-)structured data and AI assets. Unity Catalog offers a comprehensive solution for enhancing data governance, operational efficiency, and technological performance. By centralizing metadata management, access controls, and data lineage tracking, it simplifies compliance, reduces complexity, and improves query performance across diverse data environments. The seamless integration with Delta Lake unlocks advanced technical features like predictive optimization, leading to faster data access and cost savings. Unity Catalog plays a crucial role in machine learning and AI by providing centralized data governance and secure access to consistent, high-quality datasets, enabling data scientists to efficiently manage and access the data they need while ensuring compliance and data integrity throughout the model development lifecycle.
Databricks is now a first-class Data Warehouse with its Databricks SQL offering. The serverless SQL warehouses have been kind of a game changer for me because they spin up immediately and size elastically. Pro tip: now is a great time to come up with a tagging strategy. You’ll be able to easily connect your BI tool (Tableau, Power BI, etc) to the warehouse for reporting. There are also a lot of really useful AI/BI opportunities available natively now. If you remember in the introduction, I said that I would have been happy had this only been a Lakehouse Federation offering. You still have the ability to take advantage of Federation to discover, query and govern data from Snowflake, Redshift, Salesforce, Teradata and many others all from within a Databricks instance. I’m still wrapping my head around being able to query Salesforce and SAP Data in a notebook inside Databricks inside SAP.
As a data engineer, I was the most excited about zero-copy, bi-directional SAP data flow into Databricks. This is selfish because it solves my problems, but it's relatively short-sighted. The integration between SAP and Databricks will likely deliver the most value through Agentic AI. Let's stipulate that I believe that chat is not the future of GenAI. This is not a bold statement; most people agree with me. Assistants like co-pilots represented a strong path forward. SAP thought so, hence Joule. It appears that SAP is leveraging the Databricks platform in general, and Mosaic AI in particular, to provide a next generation of Joule that will be an AI copilot infused with agents.
In a move that has sparked intense discussion across the enterprise software landscape, Klarna announced its decision to drop both Salesforce Sales Cloud and Workday, replacing these industry-leading platforms with its own AI-driven tools. This announcement, led by CEO Sebastian Siemiatkowski, may signal a paradigm shift toward using custom AI agents to manage critical business functions such as customer relationship management (CRM) and human resources (HR). While mostly social media fodder at this point, this very public bet on SaaS replacement has raised important questions about the future of enterprise software and how Agentic AI might reshape the way businesses operate.
Klarna’s move may be a one-off internal pivot, or it may signal broader shifts that impact enterprises worldwide. Here are three ways this transition could affect the broader market:
As enterprises explore the potential of Agentic AI-driven systems, SaaS providers like Salesforce and Workday must adapt to a new reality. Klarna’s decision could be the first domino in a broader shift, forcing these companies to reconsider their own offerings and strategies. Here are three possible responses we could see from the SaaS giants:
Klarna’s decision to replace SaaS platforms with a custom AI system may represent a significant shift in the enterprise software landscape. While this move highlights the growing potential of AI to reshape key business functions, it also raises important questions about governance, compliance, and the long-term role of SaaS providers. As organizations worldwide watch Klarna’s big bet play out, it’s clear that we are entering a new phase of enterprise software evolution—one where the balance between AI, human oversight, and SaaS will be critical to success.
What do you think? Is Klarna’s move a sign of things to come, or will it encounter challenges that reaffirm the importance of traditional SaaS systems? Let’s continue the SaaS replacement conversation in the comments below!
In the rapidly evolving landscape of digital transformation, businesses are constantly seeking innovative ways to enhance their operations and gain a competitive edge. While Generative AI (GenAI) has been the hot topic since OpenAI introduced ChatGPT to the public in November 2022, a new evolution of the technology is emerging that promises to revolutionize how businesses operate: Agentic AI.
Agentic AI represents a fundamental shift in how we approach intelligence within digital systems.
Unlike the first wave of Generative AI solutions that rely heavily on prompt engineering, agentic AI possesses the ability to make autonomous decisions based on predefined goals, adapting in real-time to changing environments. This enables a deeper level of interaction, as agents are able to “think” about the steps in a more structured and planned approach. With access to web search, outputs are more researched and comprehensive, transforming both efficiency and innovation potential for business.
Key characteristics of Agentic AI include:
As technology evolves at an unprecedented rate, agentic AI is positioned to become the next big thing in tech and business transformation, building upon the foundation laid by generative AI while enhancing automation, resource utilization, scalability, and specialization across various tasks.
Central to this transformation is the concept of the Augmented Enterprise, which leverages advanced technologies to amplify human capabilities and business processes. Agentic Frameworks provide a structured approach to integrating autonomous systems and artificial intelligence (AI) into the enterprise.
Agentic Frameworks refer to the strategic models and methodologies that enable organizations to deploy and manage autonomous agents—software entities that perform tasks on behalf of users or other systems. Use cases include code development, content creation, and more.
Unlike traditional approaches that require explicit programming for each sequence of tasks, Agentic Frameworks provide the business integrations to the model and allow it to decide what system calls are appropriate to achieve the business goal.
“The integration of agentic AI through well-designed frameworks marks a pivotal moment in business evolution. It’s not just about automating tasks; it’s about creating intelligent systems that can reason, learn, and adapt alongside human workers, driving innovation and efficiency to new heights.” – Robert Bagley, Director
As we embrace the potential of agentic AI and our AI solutions begin acting on our behalf, developing robust AI strategy and governance frameworks becomes more essential. With the increasing complexity of regulatory environments, Agentic Frameworks must include mechanisms for auditability, compliance, and security, ensuring that the deployment of autonomous agents aligns with legal and ethical standards.
“In the new agentic era, the scope of AI governance and building trust should expand from ethical compliance to include procedural compliance. As these systems become more autonomous, they must both operate within ethical boundaries and align with our organizational values. This is where thoughtful governance becomes a competitive advantage.” – Robert Bagley, Director
To explore how your enterprise can benefit from Agentic Frameworks, implement appropriate governance programs, and become a truly Augmented Enterprise, reach out to Perficient’s team of experts today. Together, we can shape the future of your business in the age of agentic AI.
]]>
Databricks Unity Catalog is a unified and open governance solution for data and AI, built into the Databricks Data Intelligence Platform.
Unity Catalog offers a comprehensive solution for enhancing data governance, operational efficiency, and technological performance. By centralizing metadata management, access controls, and data lineage tracking, it simplifies compliance, reduces complexity, and improves query performance across diverse data environments. The seamless integration with Delta Lake unlocks advanced technical features like predictive optimization, leading to faster data access and cost savings. Unity Catalog plays a crucial role in machine learning and AI by providing centralized data governance and secure access to consistent, high-quality datasets, enabling data scientists to efficiently manage and access the data they need while ensuring compliance and data integrity throughout the model development lifecycle.
Unity Catalog brings governance to data across your enterprise. Lakehouse Federation capabilities in Unity Catalog allow you to discover, query, and govern data across data platforms including MySQL, PostgreSQL, Amazon Redshift, Snowflake, Azure SQL Database, Azure Synapse, Google’s BigQuery, and more from within Databricks without moving or copying the data, all within a simplified and unified experience. Unity Catalog supports advanced data-sharing capabilities with Delta Sharing, enabling secure, real-time data sharing across organizations and platforms without the need for data duplication. Additionally, Unity Catalog facilitates the creation of secure data Clean Rooms, where multiple parties can collaborate on shared datasets without compromising data privacy. Its support for multi-cloud and multi-region deployments ensures operational flexibility and reduced latency, while robust security features, including fine-grained access controls, automated compliance auditing, and encryption, help future-proof your data infrastructure.
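As an illustration, a federated source is registered with a connection and surfaced as a foreign catalog. A sketch for a hypothetical PostgreSQL source, with credentials pulled from a secret scope:

```sql
CREATE CONNECTION postgres_sales TYPE postgresql
OPTIONS (
  host 'sales-db.example.com',
  port '5432',
  user secret('federation', 'pg_user'),
  password secret('federation', 'pg_password')
);

-- The foreign catalog makes the remote database queryable and governable in Unity Catalog
CREATE FOREIGN CATALOG sales_pg
USING CONNECTION postgres_sales
OPTIONS (database 'sales');
```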
These capabilities position your organization for scalable, secure, and efficient data management, driving innovation and maintaining a competitive edge. However, this fundamental transition will need to be implemented with minimal disruption to ongoing operations. This is where the Unity Catalog Migration Tool comes into play.
UCX, or the Unity Catalog Migration Tool, is an open source project from Databricks Labs designed to streamline and automate the Unity Catalog migration process. UCX automates much of the work involved in transitioning to Unity Catalog, including migrating metadata, access controls, and governance policies. Migrating metadata ensures the enterprise will have access to data and AI assets after the transition. In addition to data, the migration tool ensures that security policies and access controls are accurately transferred and enforced in Unity Catalog. This capability is critical for maintaining data security and compliance during and after migration.
Databricks is continually developing UCX to better ensure that all your data assets, governance policies, and security controls are seamlessly transferred to Unity Catalog with minimal disruption to ongoing operations. Tooling and automation help avoid costly downtime or interruptions in data access that could impact business performance, thereby maintaining continuity and productivity. While it is true that automating these processes significantly reduces the time, effort, and cost required for migration, the process is not automatic. There needs to be evaluation, planning, quality control, change management, and additional coding and development work performed along with, and outside of, the tool. This need for knowledge and expertise is where Unity Catalog migration partners come into play.
An experienced Unity Catalog migration partner leads the process of transitioning your data assets, governance policies, and security controls by planning, executing, and managing the migration process, ensuring that it is smooth, efficient, and aligned with your organization’s data governance and security requirements. Their duties typically include assessing the current data environment, designing a custom migration strategy, executing the migration while minimizing downtime and disruptions, and providing post-migration support to optimize Unity Catalog’s features. Additionally, they offer expertise in data governance best practices and technical guidance to enhance your organization’s data management capabilities.
Databricks provides its system integrators with tools, guidance and best practices to ensure a smooth transition to Unity Catalog. Perficient has built upon those valuable resources to enable a more effective pipeline with our Unity Catalog Migration Accelerator.
Our approach to Unity Catalog migration is differentiated by our proprietary Accelerator, which includes a suite of project management artifacts and comprehensive code and data quality checks. This Accelerator streamlines the migration process by providing a structured framework that ensures all aspects of the migration are meticulously planned, tracked, and executed, reducing the risk of errors and delays. The built-in code and data quality checks automatically identify and resolve potential issues before they become problems, ensuring a seamless transition with minimal impact on business operations. By leveraging our Accelerator, clients benefit from a more efficient migration process, higher data integrity, and enhanced overall data governance, setting us apart from other Unity Catalog migration partners who may not offer such tailored and robust solutions.
In summary, Unity Catalog provides a powerful solution for modernizing data governance, enhancing performance, and supporting advanced data operations like machine learning and AI. With our specialized Unity Catalog migration services and unique Accelerator, we offer a seamless transition that optimizes data management and security while ensuring data quality and operational efficiency. If you’re ready to unlock the full potential of Unity Catalog and take your data infrastructure to the next level, contact us today to learn how we can help you achieve a smooth and successful migration. Contact us for a complimentary Migration Analysis and let’s work together on your data and AI journey!
]]>We are witnessing a sea change in the way data is managed by banks and financial institutions all over the world. Data being commoditized, and in some cases even monetized, by banks is the order of the day, though adoption still seems to need more of a push in the risk management function. Traditional risk managers, by their job definition, are highly cautious of the result sets provided by the analytics teams. I have even heard the phrase “Please check the report, I don’t understand the models and hence trust the number”.
So, in the risk function, while this is a race for data aggregation, structured data, unstructured data, data quality, data granularity, news feeds, and market overviews, it’s also a challenge from an acceptance perspective. The vision is that all of this data can be aggregated, harmonized, and used for better, faster, and more informed decision-making for Financial and Non-Financial Risk Management. The interdependencies between the risks were factors that were not considered in the “Good Old Days” of risk management (pun intended).
Based on my experience, here are the common issues that are faced by banks running a risk of not having a good risk data strategy.
1. The IT-Business tussle (“YOU don’t know what YOU are doing”)
This according to me is the biggest challenge facing traditional banks, especially in the risk function. “The Business”, in traditional banks, is treated like a larger-than-life entity that needs to be supported by IT. This notion of IT being the service provider, whilst business is the “bread-earner”, especially in the traditional banks’ risk departments; does not hold good anymore. It has been proven time and again that the two cannot function without each other and that’s what needs to be cultivated as a management mindset for strategic data management effort as well. This is a culture change, but it’s happening slowly and will have to be adapted industry-wide. It has been proven that the financial institutions with the most organized data have a significant market advantage.
2. Data Overload (“Dude! where’s my Insight”)
The primary goal of the data management, sourcing, and aggregation effort will have to be converting data into informational insights. The team analyzing the data warehouses and data lakes and aiding the analytics will have to have this one major organizational goal in mind. Banks have silos; these silos have been created due to mergers, regulations, entities, risk types, Chinese walls, data protection, land laws, or sometimes just technological challenges over time. The solution to most of this is to start with a clean slate. The management mandate for getting the right people to talk and be vested in this change is crucial; challenging, but crucial. Good old analysis techniques and brainstorming sessions for weeding out what is unnecessary and getting the right set of elements is the key. This needs an overhaul in the way the banking business has traditionally looked at data, i.e. as something that is needed for reporting. Understanding the data lineage and touchpoint systems is most crucial.
3. The CDO Dilemma (“To meta or not to meta”)
The CDO’s role in most banks is now well defined. The risk and compliance analytics and reporting division almost solely depends on the CDO function for insights on regulatory reporting and other forms of innovative data analytics. The key success factor of the CDO organization lies in allocating the right set of analysts to the business areas. A CDO analyst on the market risk side, for instance, will have to be well versed with market data, bank hierarchies, VaR calculation engines, Risk not in VaR (RNiV), and supporting reference data, in addition to the trade systems data that these data elements will have a direct or indirect impact on, not to mention the critical data elements. An additional understanding of how this would impact other forms of risk reporting, like credit risk and non-financial risk, is definitely a nice-to-have. Defining a metadata strategy for the full lineage, its touchpoints, and its transformations is a strenuous effort in analysis of systems owned by disparate teams with siloed implementation patterns over time. One fix that I have seen work is that every significant application group or team can have a senior representative for the CDO interaction. Vested stakeholder interest is turning out to be the one major success factor in the programs that have been successful. This ascertains completeness of the critical data elements definition and hence aids the data governance strategy in a holistic way.
4. The ever-changing nature of financial risk management (“What did they change now?”)
The Basel Committee recommendations have been consistent in driving the urge to reinvent processes in the risk management area. With Fundamental Review of the Trading Book (FRTB) the focus has been very clearly realigned to data processes in organizations. Whilst the big banks already had demonstrated a sound understanding of modellable risk factors based on scenarios, this time the Basel committee has also asked banks to focus on Non-Modellable Risk factors (NMRF). Add the standard approach (sensitivities defined by regulator) and internal models approach (IMA – Bank defined enhanced sensitivities), the change from entity based risk calculations to desk based is a significant paradigm shift. Single golden-source definition for transaction data along with desk structure validation seems to be a major area of concern amongst banks.
Add climate risk to the mix with the Paris accord, and RWA calculations will now need additional data points, additional models, and additional investment in external data defining the associated physical and transition risk. Data lake and big data solutions with defined critical data elements and a full log of transformations with respect to lineage are a significant investment, but will only work in favor of any further changes that come through on the regulations side. There have always been banks that have been great at this consistently and banks that lag significantly.
All in all, risk management happens to be a great use case for a greenfield CDO data strategy implementation, and these hurdles have to be handled before the ultimate Zen goal of a perfect risk data strategy can be reached. Believe me, the first step is to get the bank’s consolidated risk data strategy right, and everything else will follow.
This is a 2021 article, also published here – Risk Management Data Strategy – Insights from an Inquisitive Overseer | LinkedIn
]]>The goal of Databricks Unity Catalog is to provide centralized security and management for data and AI assets across the data lakehouse. Unity Catalog provides fine-grained access control for all the securable objects in the lakehouse: databases, tables, files, and even models. Gone are the limitations of the Hive metastore. The Unity Catalog metastore manages all data and AI assets across different workspaces and storage locations. Providing this level of access control substantially increases the quality of governance while reducing the workload involved. There is an additional target of opportunity with tagging.
Tags are metadata elements structured as key-value pairs that can be attached to any asset in the lakehouse. Tagging can make these assets more searchable, manageable and governable. A well-structured, well-executed tagging strategy can enhance data classification, enable regulatory compliance and streamline data lifecycle management. The first step is to identify a use case that could be used as a Proof of Value in your organization. A well-structured tagging strategy means that you will need buy-in and participation from multiple stakeholders, including technical resources, SMEs and a sponsor. These are five common use cases for tagging that might find some traction in a regulated enterprise because they can usually be piggy-backed off an existing or upcoming initiative:
There is always room for an additional mechanism to help safely manage PII (personally identifiable information). A basic initial implementation of tagging could be as simple as applying a PII tag to classify data based on sensitivity. These tags can then be integrated with access control policies in Unity Catalog to automatically grant or restrict access to sensitive data. Balancing the promise of data access in the lakehouse with the regulatory realities surrounding sensitive data is always difficult. Additional tools are always welcome here.
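A starting point might look like the following sketch; the table, column, and tag values are hypothetical, and the access policies that consume these tags are configured separately in Unity Catalog:

```sql
-- Classify a column that holds sensitive data
ALTER TABLE catalog.crm.customers
ALTER COLUMN email SET TAGS ('sensitivity' = 'PII');

-- Tag the table as a whole for coarse-grained classification
ALTER TABLE catalog.crm.customers
SET TAGS ('data_classification' = 'PII');
```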
Some organizations struggle with the concept of managing different environments in Databricks. This is particularly true when they are moving from a data landscape where there were specific servers for each environment. Tags can be used to identify stages (ex: dev, test, and prod). These tags can then be leveraged to implement policies and practices around moving data through different lifecycle stages. For example, masking policies or transformation steps may be different between environments. Tags can also be used to facilitate rules around deliberate destruction of sensitive data. Geo-coding data with tags to comply with European regulations is also a possible target of opportunity.
There can be a benefit in attaching descriptive tags directly to the data for cataloging and discovery even if you are already using an external tool. Adding descriptive tags like ‘customer’ or ‘marketing’ directly to the data assets themselves can make it more convenient for analysts and data scientist to perform searches and therefore more likely to be actually used.
This is related to, and can be used in conjunction with, data classification and security. Applying tags such as ‘GDPR’ or ‘HIPAA’ can make performing audits for regulators much simpler. These tags can be used in conjunction with security tags. In an increasing regulated data environment, it pays to make your data assets easy to regulate.
This tagging strategy can be used to organize data assets based on project, teams or departments. This can facilitate project management and improve collaboration by identifying which organizational unit owns or is working with a particular data asset.
There are some practical considerations when implementing a tagging program:
A well-executed tagging strategy will involve some level of automation. It is possible to manage tags in the Catalog Explorer. This can be a good way to kick the tires in the very beginning but automation is critical for a consistent, comprehensive application of the tagging strategy. Good governance is automated. While tagging is available to all securable objects, you will likely start out applying tags to tables.
The information schema tables will have the tag information. However, Databricks Runtime 13.3 and above allows tag management through SQL commands. This is the preferred mechanism because it is so much easier to use than querying the information schema. Regardless of the mechanism used, a user must have the APPLY TAG privilege on the object, the USE SCHEMA privilege on the object’s parent schema and the USE CATALOG privilege on the object’s parent catalog. This is pretty typical with Unity Catalog’s three-tiered hierarchy. If you are using SQL commands to manage tags, you can use the SET TAGS and UNSET TAGS clauses in the ALTER TABLE command.
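For example (the catalog, table, and tag names are hypothetical):

```sql
-- Apply and remove tags with ALTER TABLE (Databricks Runtime 13.3 and above)
ALTER TABLE catalog.finance.transactions
SET TAGS ('environment' = 'prod', 'owner' = 'finance');

ALTER TABLE catalog.finance.transactions UNSET TAGS ('owner');

-- Tags can also be inspected through the information schema
SELECT catalog_name, schema_name, table_name, tag_name, tag_value
FROM catalog.information_schema.table_tags
WHERE tag_name = 'environment';
```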
You can use a fairly straightforward PySpark script to loop through a set of tables, look for a certain set of column names and then apply tags as appropriate. This can be done as an initial one-time run and then automated by creating a distinct job to check for new tables and/or columns or include in existing ingestion processes. There is a lot to be gained by augmenting this pipeline from just using a script that checks for columns named ‘ssn’ to creating an ML job that looks for fields that contain social security numbers.
I’ve seen a lot of companies struggle with populating their Databricks Lakehouse with sensitive data. In their current state, databases have a very limited set of users, so only people authorized to see certain data, like PII, have access to the database that stores this information. However, the utility of a lakehouse is dramatically reduced if you don’t allow sensitive data. In most cases, it just won’t get any enterprise traction. Leveraging all of the governance and security features of Unity Catalog is a great, if not mandatory, first step. Enhancing governance and security, as well as utility, with tagging is probably going to be necessary to one degree or another in your organization to get broad usage and acceptance.
Contact us to learn more about how to build robustly governed solutions in Databricks for your organization.
]]>