Avoiding Metadata Contention in Unity Catalog
https://blogs.perficient.com/2025/04/07/avoiding-metadata-contention-in-unity-catalog/ (Mon, 07 Apr 2025)

Metadata contention in Unity Catalog can occur in high-throughput Databricks environments, slowing down user queries and impacting performance across the platform. Our FinOps strategy shifts performance left, yet we still find scenarios where clients experience intermittent query slowdowns, even on well-optimized queries. As our clients' lakehouse footprints grow, we are seeing an emerging pattern in which stress on Unity Catalog creates a downstream drag on performance across the workspace. In some cases, after controlling for more targeted optimizations, we have identified metadata contention in Unity Catalog as a contributor to unexpectedly slow response times.

How Metadata Contention Can Slow Down User Queries

When data ingestion and transformation pipelines rely on structural metadata changes, they introduce several stress points across Unity Catalog’s architecture. These are not isolated to the ingestion job—they ripple across the control plane and affect all users.

  • Control Plane Saturation – Control plane saturation, common in distributed systems like Databricks, occurs when administrative functions (such as schema updates, access control enforcement, and lineage tracking) overwhelm the control plane's processing capacity. Every structural table modification—especially those via CREATE OR REPLACE TABLE—adds to the metadata transaction load in Unity Catalog. This leads to:
    • Delayed responses from the catalog API
    • Increased latency in permission resolution
    • Slower query planning, even for unrelated queries
  • Metastore Lock Contention – Each table creation or replacement operation requires exclusive locks on the underlying metastore objects. When many jobs attempt these operations concurrently, they queue behind one another waiting for locks, and metadata operations back up across the workspace.
  • Query Plan Invalidation Cascade – CREATE OR REPLACE TABLE invalidates the current logical and physical plan cache for all compute clusters referencing the old version. This leads to:
    • Increased query planning time across clusters
    • Unpredictable performance for dashboards or interactive workloads
    • Reduced cache utilization across Spark executors
  • Schema Propagation Overhead – Structural changes to a table (e.g., column additions, type changes) must propagate to all services that rely on schema consistency.
  • Multi-tenant Cross-Job Interference – Unity Catalog is a shared control plane. When one tenant (or set of jobs) aggressively replaces tables, the metadata operations can delay or block unrelated tenants. This leads to:
    • Slow query startup times for interactive users
    • Cluster spin-up delays due to metadata prefetch slowness
    • Support escalation from unrelated teams

The CREATE OR REPLACE Reset

In other blogs, I have said that predictive optimization is the reward for investing in good governance practices with Unity Catalog. One of the key enablers of predictive optimization is a current, cached logical and physical plan. Every time a table is created, new logical and physical plans for it and related tables must be built. This means that every time you execute CREATE OR REPLACE TABLE, you are back to step one for performance optimization. The DROP TABLE + CREATE TABLE pattern has the same net result.

This is not to say that CREATE OR REPLACE TABLE is inherently an anti-pattern. It only becomes a potential performance issue at scale (think thousands of jobs rather than hundreds). It's also not the only culprit: ALTER TABLE statements that make structural changes have a similar effect. CREATE OR REPLACE TABLE is ubiquitous in data ingestion pipelines, and by the time it starts causing noticeable issues it is deeply ingrained in your developers' muscle memory. There are alternatives, though.
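
A minimal PySpark sketch of the pattern in question is shown below, using hypothetical table and staging names; each run replaces the table definition, discarding cached logical and physical plans and generating metastore writes in Unity Catalog.

# Minimal sketch of the plan-cache-resetting pattern discussed above.
# Table and staging names are hypothetical; in a Databricks notebook the
# `spark` session already exists, so getOrCreate() simply returns it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each run replaces the table definition, invalidating cached plans and
# adding to the metadata transaction load in Unity Catalog.
spark.sql("""
    CREATE OR REPLACE TABLE catalog.schema.daily_events
    USING DELTA
    AS SELECT * FROM staging_events
""")

# The DROP TABLE + CREATE TABLE variant has the same net effect.
spark.sql("DROP TABLE IF EXISTS catalog.schema.daily_events")
spark.sql("""
    CREATE TABLE catalog.schema.daily_events
    USING DELTA
    AS SELECT * FROM staging_events
""")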

Summary of Alternatives

There are different techniques you can use that will not invalidate the plan cache.

  • Use CREATE TABLE IF NOT EXISTS + INSERT OVERWRITE. This is probably my first choice because there is a straightforward code migration path.
CREATE TABLE IF NOT EXISTS catalog.schema.table (
  id INT,
  name STRING
) USING DELTA;

INSERT OVERWRITE catalog.schema.table
SELECT * FROM staging_table;
  • Both MERGE INTO and COPY INTO have the metadata advantages of the prior solution and also support schema evolution and concurrency-safe ingestion.
MERGE INTO catalog.schema.table t
USING (SELECT * FROM staging_table) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
COPY INTO catalog.schema.table
FROM '/mnt/source/'
FILEFORMAT = PARQUET
FORMAT_OPTIONS ('mergeSchema' = 'true');
  • Consider whether you need to persist the data beyond the life of the job. If not, use temporary views or tables. This avoids Unity Catalog entirely, since there is no metadata overhead.
df.createOrReplaceTempView("job_tmp_view")
  • While I prefer to let Unity Catalog handle partitioning strategies in the Silver and Gold layers, you can implement a partitioning scheme in your ingestion logic to keep the metadata stable. This is helpful for high-concurrency workloads.
CREATE TABLE IF NOT EXISTS catalog.schema.import_data (
  id STRING,
  source STRING,
  load_date DATE
) PARTITIONED BY (source, load_date);

INSERT INTO catalog.schema.import_data
PARTITION (source = 'job_xyz', load_date = current_date())
SELECT * FROM staging;

I have summarized the different techniques you can use to minimize plan invalidation. In general, INSERT OVERWRITE works well as a drop-in replacement. You get schema evolution with MERGE INTO and COPY INTO. I am often surprised at how many tables that should be treated as temporary are persisted; this is a worthwhile exercise to go through with your jobs. Finally, there are occasions when the Partition + INSERT pattern is preferable to INSERT OVERWRITE, particularly for high-concurrency workloads.

Technique | Metadata Cost | Plan Invalidation | Concurrency-Safe | Schema Evolution | Notes
CREATE OR REPLACE TABLE | High | Yes | No | Yes | Use with caution in production
INSERT OVERWRITE | Low | No | Yes | No | Fast for full refreshes
MERGE INTO | Medium | No | Yes | Yes | Ideal for idempotent loads
COPY INTO | Low | No | Yes | Yes | Great with Auto Loader
TEMP VIEW / TEMP TABLE | None | No | Yes | N/A | Best for intermediate pipeline stages
Partition + INSERT | Low | No | Yes | No | Efficient for batch-style jobs

Conclusion

Tuning the performance characteristics of a platform is more complex than single-application performance tuning. Distributed performance is even more complicated at scale, since strategies and patterns may start to break down as volume and velocity increase.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

Why Do Organizations Need Data Governance?
https://blogs.perficient.com/2025/04/05/why-do-organizations-need-data-governance/ (Sat, 05 Apr 2025)

It is well known that data, when managed appropriately, is a crucial asset for an organization, and data governance is what enables that appropriate management. Yet some customers treat data governance as an optional best practice rather than a mandatory implementation strategy.

Then, ask your customer a few questions:

  • Is your data reliable or trustworthy?
  • Is your data compliant?
  • Is your business protected?
  • Is your data holding you back from making business decisions?
  • Why are you taking that risk?

Let’s explore why data governance is no longer optional in today’s data-driven world.

Common Challenges with Organizational Data

The world creates millions of terabytes of data every single day. However, 80% of enterprise data is of poor quality, unstructured, inaccurate, or inaccessible, leading to poor decision-making, compliance risks, and inefficiencies.

Poor data quality impacts businesses and costs millions of dollars annually due to lost productivity, missed opportunities, and regulatory fines.

Learn How Data Governance Impacts Organizations  

1. Improved Data Quality & Decision-Making

50% of data scientists’ time is wasted cleaning and organizing messy data instead of deriving insights. Without governance, businesses rely on outdated, inconsistent, or redundant data, leading to poor decisions.

A data governance program ensures:

  • Data accuracy, consistency, and reliability across all departments
  • Standardized data entry, storage, and usage policies
  • Reduction in data duplication, errors, and conflicting information

2. Regulatory Compliance & Risk Mitigation

Companies face significant penalties for violating data regulations like GDPR, CCPA, and HIPAA, including substantial fines, potential criminal charges, and reputational damage. Collectively, they have paid billions of dollars in fines for data breaches and non-compliance.

Data governance programs ensure:

  • Proper data classification and retention policies
  • Compliance with industry regulations and security standards
  • Clear data ownership and accountability

3. Enhanced Data Security & Protection Against Breaches

Most small businesses shut down within six months of a data breach, and the average cost of a data breach is now $4.45 million.

A data governance framework can help:

  • Define who has access to what data and when
  • Encrypt and protect sensitive customer and financial data
  • Establish incident response protocols for breaches

4. Increased Operational Efficiency & Cost Savings

Bad data costs enterprises 30% of their revenue annually. Inefficient data management leads to:

  • Wasted employee hours searching for or fixing data
  • Siloed departments working with conflicting data
  • Delays in automation, AI, and analytics

Data governance programs ensure:

  • A single, authoritative source of truth for all teams
  • Elimination of redundant and duplicate data entries
  • Streamlined AI and analytics workflows

5. Breaking Down Data Silos Across Departments

Most executives say their teams make decisions based on siloed data, which creates inefficiencies, misaligned strategies, and lost revenue opportunities.

A data governance program can ensure:

  • A common data language across business units
  • Seamless integration between data platforms (ERP, CRM, Cloud)
  • Cross-functional collaboration for AI and automation projects

6. Better Risk Management & Disaster Recovery

93% of companies that experience significant data loss without backup shut down within one year. Without governance, businesses struggle to recover critical data after a breach or system failure.

A governance program helps:

  • Track data lineage for accountability
  • Ensure data backups and disaster recovery protocols
  • Identify high-risk data and apply extra security layers

7. AI & Digital Transformation Readiness

85% of AI projects fail due to poor data quality. AI models require structured, accurate, and unbiased data, which is impossible without governance.

A strong governance program:

  • Optimizes data for AI, ML, and predictive analytics
  • Prevents bias, inaccuracies, and redundancies in AI models
  • Ensures data is FAIR (Findable, Accessible, Interoperable, and Reusable)


Conclusion

Without Data Governance

  • Data turns from an asset into a liability
  • Inaccurate analytics leads to poor decision-making
  • Security risks lead to compliance violations & data breaches
  • Operational inefficiencies lead to wasted resources & duplicated efforts

With Data Governance

  • Trustworthy, accurate data for better decisions
  • Compliance with GDPR, CCPA, HIPAA, and more
  • Seamless collaboration across teams
  • Scalability as your business grows

A structured data governance approach turns enterprise data into a competitive advantage. In today’s dynamic business environment, data governance is not just a regulatory requirement—it’s a strategic advantage.

End-to-End Lineage and External Raw Data Access in Databricks
https://blogs.perficient.com/2025/03/31/eference-architecture-end-to-end-lineage-external-raw-data-access-databricks/ (Mon, 31 Mar 2025)

Achieving end-to-end lineage in Databricks while allowing external users to access raw data can be a challenging task. In Databricks, leveraging Unity Catalog for end-to-end lineage is a best practice. However, enabling external users to access raw data while maintaining security and lineage integrity requires a well-thought-out architecture. This blog outlines a reference architecture to achieve this balance.

Key Requirements

To meet the needs of both internal and external users, the architecture must:

  1. Maintain end-to-end lineage within Databricks using Unity Catalog.
  2. Allow external users to access raw data without compromising governance.
  3. Secure data while maintaining flexibility for different use cases.

Recommended Architecture

1. Shared Raw Data Lake (Pre-Bronze)

The architecture starts with a shared data lake as a landing zone for raw, unprocessed data from various sources. This data lake is located in external cloud storage, such as AWS S3 or Azure Data Lake, and is independent of Databricks. Access to this data is managed using IAM roles and policies, allowing both Databricks and external users to interact with the data without overlapping permissions.
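
As an illustration only (the storage path, external location, credential, and group names below are hypothetical, and the IAM setup itself is cloud-specific), registering that shared landing zone in Unity Catalog so Databricks can read it in place might look like this:

# Hypothetical names throughout; assumes a storage credential has already
# been created by a metastore admin. In a Databricks notebook the `spark`
# session already exists, so getOrCreate() simply returns it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the shared pre-bronze landing zone as an external location so
# Databricks reads it in place, with no data copied into the Lakehouse.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS raw_landing_zone
    URL 's3://shared-raw-data-lake/landing/'
    WITH (STORAGE CREDENTIAL raw_lake_credential)
""")

# Grant read-only file access to the group that runs ingestion jobs.
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION raw_landing_zone TO `data-engineers`")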

Benefits:

  • External users can access raw data without direct entry into the Databricks Lakehouse.
  • Secure and isolated raw data management.
  • Maintains data availability for non-Databricks consumers.

2. Bronze Layer (Managed by Databricks)

The bronze layer ingests raw data from the shared data lake into Databricks. Using Delta Live Tables (DLT), data is processed and stored as managed or external Delta tables. Unity Catalog governs these tables, enforcing fine-grained access control to maintain data security and lineage. End-to-end lineage in Databricks begins with the bronze layer and can easily be maintained through silver and gold by using DLT.
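
A minimal Delta Live Tables sketch of that bronze ingestion step is shown below. The path and table names are hypothetical, and the code assumes it runs inside a DLT pipeline, where the dlt module and the spark session are provided.

# Runs inside a Delta Live Tables pipeline; the `dlt` module is only
# available there. Path and table names are hypothetical examples.
import dlt
from pyspark.sql.functions import current_timestamp

@dlt.table(
    name="bronze_raw_events",
    comment="Raw events ingested from the shared pre-bronze data lake."
)
def bronze_raw_events():
    # Auto Loader incrementally picks up new files from the shared lake;
    # Unity Catalog governs the resulting table and records its lineage.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://shared-raw-data-lake/landing/events/")
        .withColumn("ingested_at", current_timestamp())
    )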

Governance:

  • Permissions are enforced through Unity Catalog.
  • Data versioning and lineage tracking are maintained within Databricks.

3. Silver and Gold Layers (Processed Data)

Subsequent data processing transforms bronze data into refined (silver) and aggregated (gold) tables. These layers are exclusively managed within Databricks to ensure lineage continuity, leveraging Delta Lake’s optimization features.

Access:

  • Internal users access data through Unity Catalog with appropriate permissions.
  • External users do not have direct access to these curated layers, preserving data quality.

Access Patterns

  • External Users: Access raw data from the shared data lake through configured IAM policies. No direct access to Databricks-managed bronze tables.
  • Internal Users: Access the full data pipeline from bronze to gold within Databricks, leveraging Unity Catalog for secure and controlled access.

Why This Architecture Works

  • Security: Separates raw data from managed bronze, reducing exposure.
  • Governance: Unity Catalog maintains strict access control and lineage.
  • Performance: Internal data processing benefits from Delta Lake optimizations, while raw data remains easily accessible for external systems.

End-to-end lineage in Databricks

This reference architecture offers a balanced approach to handling raw data access while maintaining governance and lineage within Databricks. By isolating raw data in a shared lake and managing processed data within Databricks, organizations can effectively support both internal analytics and external data sharing.

Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.

Deletion Vectors in Delta Live Tables: Identifying and Remediating Compliance Risks
https://blogs.perficient.com/2025/03/27/deletion-vectors-databricks-compliance/ (Thu, 27 Mar 2025)

Deletion Vectors will be enabled by default in Delta Live Tables (DLTs) for materialized views and streaming tables starting April 28, 2025. Predictive Optimization for DLT maintenance will also be enabled by default. This could provide both cost savings and performance improvements. Our Databricks Practice holds FinOps as a core architectural tenet, but sometimes compliance overrules cost savings.

Deletion vectors are a storage optimization feature that replaces physical deletion with soft deletion. The underlying Parquet files are immutable by design, so physically deleting a record requires rewriting the entire file. With a soft delete, records are instead marked as deleted in a deletion vector rather than being physically removed, which is a performance boost. There is a catch, though, once we consider data deletion within the context of regulatory compliance.

Data privacy regulations such as GDPR, HIPAA, and CCPA impose strict requirements on organizations handling personally identifiable information (PII) and protected health information (PHI). Ensuring compliant data deletion is a critical challenge for data engineering teams, especially in industries like healthcare, finance, and government. In regulated industries, the default deletion vector implementation may introduce compliance risks that must be addressed.

What Are Deletion Vectors?

Deletion Vectors in Delta Live Tables offer an efficient and scalable way to handle record deletion without requiring expensive file rewrites. Physically removing rows can cause performance degradation due to file rewrites and metadata operations. Instead of physically deleting data, a deletion vector marks records as deleted at the storage layer. These vectors ensure that deleted records are excluded from query results, while Predictive Optimization improves storage performance by determining the most cost-effective time to run maintenance operations. There is no way to align this automated schedule with organizational retention policies, which can expose your organization to regulatory compliance risk.

Compliance Risks and Potential Issues

While Deletion Vectors improve performance, they present potential challenges for regulated enterprises:

  • Failure to Meet GDPR “Right to be Forgotten” Requirements: GDPR mandates that personal data be fully erased upon request. If data is only hidden via Deletion Vectors and not permanently removed from storage, organizations may face compliance violations.
  • Conflict with Internal Deletion Policies: Enterprises with strict internal policies requiring irreversible deletion may find Deletion Vectors inadequate since they do not physically remove the data.
  • Risk of Data Recovery: Since Deletion Vectors work by marking records as deleted rather than erasing them, it is possible that backup systems, log retention, or forensic tools could restore data that should have been permanently deleted.
  • Cross-Region Data Residency Compliance: Enterprises operating in multiple jurisdictions with strict data localization laws need to ensure that deleted data is not retained in non-compliant locations.
  • Lack of Transparency in Audits: If deletion is managed via metadata instead of physical removal, auditors may require additional proof that data is permanently inaccessible.
  • Impact of Predictive Optimizations: Databricks employs predictive optimizations that may retain deleted records longer than expected for performance reasons, creating additional challenges in enforcing hard deletes.

Remediating Compliance Issues with Deletion Vectors

Organizations that require strict compliance should implement the following measures to enforce hard deletes when necessary:

1. Forcing Hard Deletes When Required

To ensure that records are permanently removed rather than just hidden:

  • Run DELETE operations followed by OPTIMIZE to force data compaction and file rewrites.
  • Use VACUUM with a short retention period to permanently remove deleted data files.
  • Periodically rewrite tables using REORG TABLE … APPLY (PURGE) to physically purge soft-deleted records. A sketch of this sequence follows.
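
A minimal sketch of that hard-delete sequence is shown below. The table name and retention settings are hypothetical, and shortening the VACUUM retention window should be a deliberate, policy-approved decision rather than a default.

# Hypothetical table name; in a Databricks notebook the `spark` session
# already exists, so getOrCreate() simply returns it. Shortening VACUUM
# retention below 7 days requires explicitly disabling the safety check.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Soft-delete the records subject to an erasure request.
spark.sql("DELETE FROM catalog.schema.customer_data WHERE customer_id = 12345")

# 2. Rewrite files so soft-deleted rows are physically dropped.
spark.sql("REORG TABLE catalog.schema.customer_data APPLY (PURGE)")

# 3. Compact the remaining files.
spark.sql("OPTIMIZE catalog.schema.customer_data")

# 4. Remove the old files that still contain the deleted rows.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
spark.sql("VACUUM catalog.schema.customer_data RETAIN 0 HOURS")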

2. Tracking and Managing Deletion via Unity Catalog

Unity Catalog can help enforce compliance by:

  • Using table and column tagging to flag PII, PHI, or sensitive data (a tagging sketch follows this list).
  • Creating policy-based access controls to manage deletion workflows.
  • Logging deletion events for auditing and regulatory reporting.
  • Identifying Predictive Optimization Retention Risks: Predictive optimizations in Databricks may delay data removal for efficiency, requiring policy-driven overrides to ensure compliance.
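
As a small illustration of the tagging approach (the table, column, and tag names are hypothetical; tag keys should follow your own governance taxonomy):

# Hypothetical names; in a Databricks notebook the `spark` session already
# exists. Assumes the current user has permission to set tags in Unity Catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Flag the whole table as containing personal data subject to deletion SLAs.
spark.sql("""
    ALTER TABLE catalog.schema.customer_data
    SET TAGS ('data_classification' = 'pii', 'deletion_policy' = 'hard_delete_30d')
""")

# Flag an individual column as PHI for finer-grained policy enforcement.
spark.sql("""
    ALTER TABLE catalog.schema.customer_data
    ALTER COLUMN diagnosis_code SET TAGS ('data_classification' = 'phi')
""")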

3. Monitoring Deletion Status via System Tables

Databricks provides system tables and an information schema that can be leveraged for compliance monitoring (a monitoring sketch follows this list):

  • delta.deleted_files: Tracks deleted files and metadata changes.
  • delta.table_history: Maintains a record of all operations performed on the table, allowing auditors to verify deletion processes.
  • SHOW CREATE TABLE: Helps confirm if a table uses Deletion Vectors or requires a different deletion strategy.
  • Predictive Optimization Insights: System tables may provide visibility into optimization delays affecting hard delete execution.
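
As one hedged example of that monitoring, the sketch below relies on commands that are generally available for Delta tables (DESCRIBE HISTORY and SHOW TBLPROPERTIES) rather than on the specific system tables named above; the table name is hypothetical.

# Hypothetical table name; in a Databricks notebook the `spark` session
# already exists, so getOrCreate() simply returns it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Review the operations audit trail for the table: DELETE, REORG, OPTIMIZE,
# and VACUUM events appear here with timestamps and user identities.
history = spark.sql("DESCRIBE HISTORY catalog.schema.customer_data")
history.select("timestamp", "operation", "operationParameters", "userName").show(truncate=False)

# Check whether deletion vectors are enabled for the table, which tells you
# whether a plain DELETE performs only a soft delete.
spark.sql("SHOW TBLPROPERTIES catalog.schema.customer_data").show(truncate=False)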

Conclusion

Deletion Vectors in Delta Live Tables provide a modern approach to data deletion, addressing both performance and compliance concerns for regulated industries. However, their default soft-delete behavior may not align with strict data privacy regulations or internal deletion policies. Enterprises must implement additional safeguards such as physical deletion workflows, Unity Catalog tagging, and system table monitoring to ensure full compliance.

As an Elite Databricks Partner, we are here to help organizations operating under stringent data privacy laws obtain a clear understanding of Deletion Vectors’ limitations—along with proactive remediation strategies—to ensure their data deletion practices meet both legal and internal governance requirements.

Contact us to explore how we can integrate these fast-moving, new Databricks capabilities into your enterprise solutions and drive real business impact.

How Automatic Liquid Clustering Supports Databricks FinOps at Scale
https://blogs.perficient.com/2025/03/13/how-automatic-liquid-clustering-supports-databricks-finops-at-scale/ (Thu, 13 Mar 2025)

Perficient has a FinOps mindset with Databricks, so the Automatic Liquid Clustering announcement grabbed my attention.

I’ve mentioned Liquid Clustering before when discussing the advantages of Unity Catalog beyond governance use cases. Unity Catalog: come for the data governance, stay for the predictive optimization. I am usually a fan of being able to tune the dials of Databricks. In this case, Liquid Clustering addresses the data management and query optimization aspects of cost control so simply and elegantly that I’m happy to take my hands off the controls.

Manual Tuning: The Struggle Is Real

Experienced Databricks data engineers are familiar with partitioning and data-skipping strategies to increase performance and reduce costs for their workloads. These topics are even in the certification exams.

  • Partitioning involves taking a very large table (1TB or greater) and breaking it down into smaller 1GB chunks based on one or more columns—this method is best for low-cardinality columns.
  • Data-skipping uses statistics stored in the metadata of a table to intelligently find relevant data.
  • Z-Ordering goes even further than data-skipping and co-locates similar information in high-cardinality columns in the same file, improving I/O efficiency.

Partitioning is set on table creation, while Z-Order columns are applied with the OPTIMIZE command.

Simple in theory; frustrating in practice.

In all fairness, I think most of us were partitioning wrong. In my case, I had initially approached partitioning a Delta table as if it were a Hive table or a Parquet file. This made intuitive sense to me as an early Spark developer, and I had deep knowledge of both architectures. Yet, repeatedly, I’d find myself staring wistfully into the middle distance through the ashes of another failed optimization attempt.

  • Queries slowed as access patterns evolved.
  • Optimization efforts produced inconsistent benefits.
  • Z-Ordering introduced write amplification and higher compute costs since it isn’t incremental or on-write.

Databricks clearly saw that manual tuning didn’t scale. So, they introduced a better way.

Ingestion Time Clustering: A Step in the Right Direction

Ingestion Time Clustering was introduced to address the issues with custom partitioning and Z-Ordering. This approach was taken based on their assumption that 51% of tables are partitioned on date/time keys. Now, we have a solution for about half of our workloads, which is great. But what about the other half?

Liquid Clustering: Smarter, Broader Optimization

Liquid Clustering addresses additional use cases beyond date/time partitioning. Resolving partitioning's limitations around concurrent writes was a big step forward in reliability. It is also a better solution for managing tables where access patterns change over time and where candidate keys may not produce well-sized partitions. It handles tables filtered on high-cardinality columns, as Z-Ordering does, but without the additional cost, and it can manage tables with significant skew as well as tables that experience rapid growth. Databricks recommends enabling Liquid Clustering for all Delta tables, including materialized views and streaming tables. The syntax is very straightforward:

CLUSTER BY (col1)

It seems pretty simple: use liquid clustering everywhere and identify the column on which to cluster. How much simpler could it get?
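
To make the syntax concrete, here is a minimal sketch with hypothetical table and column names that shows declaring a clustering key at creation time and changing it later as access patterns evolve:

# Hypothetical table and column names; in a Databricks notebook the `spark`
# session already exists, so getOrCreate() simply returns it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Declare the clustering key when the table is created.
spark.sql("""
    CREATE TABLE IF NOT EXISTS catalog.schema.orders (
        order_id BIGINT,
        customer_id BIGINT,
        order_date DATE,
        amount DECIMAL(18, 2)
    )
    CLUSTER BY (customer_id)
""")

# Change the clustering key later as access patterns evolve; newly written
# data and subsequent OPTIMIZE runs use the new key.
spark.sql("ALTER TABLE catalog.schema.orders CLUSTER BY (order_date)")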

Automatic Liquid Clustering: Supports Databricks FinOps at Scale

Now, we find ourselves at a logical conclusion.

Unity Catalog collects statistics on managed tables and automatically identifies when OPTIMIZE, VACUUM, and ANALYZE maintenance operations should be run. Historical workloads for a managed table are analyzed asynchronously as an additional maintenance operation to inform candidates of clustering keys.

You may have noticed from the syntax (CLUSTER BY (col1)) that Liquid Clustering is still vulnerable to changing access patterns invalidating the initial clustering key selection. With automatic liquid clustering, clustering keys are changed when the predicted cost savings from data skipping outweigh the cost of re-clustering the data.

In other words,

CLUSTER BY AUTO

Final Thoughts: Keep Calm and Cluster by Auto

Data is in a very exciting but very tough place right now. Mainstream corporate acceptance of AI/ML means data engineers need to work harder than ever to make lots of data from disparate sources available to everything from SQL warehouses to ML to RAGs to agentic solutions, all while maintaining and improving security and governance. Add to that downward pressure on budgets, as cloud costs are perceived as too high. Manual optimization tuning is not a value-add at this point.

Keep Calm and Cluster by Auto.

Want help implementing this in your Databricks environment?

Get in touch with us if you want to know more about how Automatic Liquid Clustering in Databricks could help you improve performance and bring costs down.

 

SAP and Databricks: Better Together
https://blogs.perficient.com/2025/02/13/sap-and-databricks-better-together-3-2/ (Thu, 13 Feb 2025)

SAP Databricks matters because convenient access to governed data to support business initiatives matters. Breaking down silos has been a drumbeat of data professionals since Hadoop, but this SAP <-> Databricks initiative may help solve one of the more intractable data engineering problems out there. SAP has a large, critical data footprint in many large enterprises, but an opaque data model. Moving that data has always meant a long, painful glue-work process in which no real value was realized. This caused a lot of projects to be delayed, to fail, or to never be pursued, resulting in a significant lost opportunity cost for the client and a potential loss of trust or confidence in the system integrator. SAP recognized this and partnered with a small handful of companies to enhance and enlarge the scope of its offering. Databricks was selected to deliver bi-directional integration with its Databricks Lakehouse platform. When I heard there was going to be a big announcement, I thought we were going to hear about a new Lakehouse Federation Connector. That would have been great; I'm a fan.

This was bigger.

Technical details are still emerging, so I’m going to try to focus on what I heard and what I think I know. I’m also going to hit on some use cases that we’ve worked on that I think could be directly impacted by this today. I think the most important takeaway for data engineers is that you can now combine SAP with your Lakehouse without pipelines. In both directions. With governance. This is big.

SAP Business Data Cloud

I don't know much about SAP, so I will defer to SAP's own materials for the details. I want to understand more about the architecture from a Databricks perspective, and I was able to find some information from the Introducing SAP Databricks post on the internal Databricks blog page.

This is when it really sank in that we were not dealing with a new Lakeflow Connector:

SAP Databricks is a native component in the SAP Business Data Cloud and will be sold by SAP as part of their SAP Business Data Cloud offering. It’s not in the diagram here, but you can actually integrate new or existing Databricks instances with SAP Databricks. I don’t want to get ahead of myself, but I would definitely consider putting that other instance of Databricks on another hyperscaler. 🙂

In my mind, the magic is the dotted line from the blue “Curated context-rich SAP data products” up through the Databricks stack.

 

Open Source Sharing

The promise of SAP Databricks is the ability to easily combine SAP data with the rest of the enterprise data. In my mind, easily means no pipelines that touch SAP. The diagram showing the integration point between SAP and Databricks makes clear that Delta Sharing is the underlying enablement technology.

Delta Sharing is an open-source protocol, developed by Databricks and the Linux Foundation, that provides strong governance and security for sharing data, analytics, and AI across internal business units, cloud providers, and applications. Data remains in its original location with Delta Sharing: you are sharing live data with no replication. Delta Sharing, in combination with Unity Catalog, allows a provider to grant access to one or more recipients and dictate what data those recipients can see using row- and column-level security.
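
The SAP-specific wiring is handled inside SAP Business Data Cloud, but as a generic illustration of the Delta Sharing primitives involved (the share, recipient, and table names below are hypothetical, and this is not the SAP setup itself), a provider-side configuration looks roughly like this:

# Generic Delta Sharing illustration with hypothetical names; this is not
# the SAP Business Data Cloud setup itself. Assumes the ambient `spark`
# session and a Unity Catalog metastore admin role.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a share and add the live table to it; no data is copied.
spark.sql("CREATE SHARE IF NOT EXISTS curated_sap_products")
spark.sql("ALTER SHARE curated_sap_products ADD TABLE catalog.gold.sap_product_margins")

# Create a recipient and grant it read access to the share.
spark.sql("CREATE RECIPIENT IF NOT EXISTS analytics_partner")
spark.sql("GRANT SELECT ON SHARE curated_sap_products TO RECIPIENT analytics_partner")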

Open Source Governance

Databricks leverages Unity Catalog for security and governance across the platform including Delta Share. Unity Catalog offers strong authentication, asset-level access control and secure credential vending to provide a single, unified, open solution for protecting both (semi- & un-)structured data and AI assets. Unity Catalog offers a comprehensive solution for enhancing data governance, operational efficiency, and technological performance. By centralizing metadata management, access controls, and data lineage tracking, it simplifies compliance, reduces complexity, and improves query performance across diverse data environments. The seamless integration with Delta Lake unlocks advanced technical features like predictive optimization, leading to faster data access and cost savings. Unity Catalog plays a crucial role in machine learning and AI by providing centralized data governance and secure access to consistent, high-quality datasets, enabling data scientists to efficiently manage and access the data they need while ensuring compliance and data integrity throughout the model development lifecycle.

Data Warehousing

Databricks is now a first-class Data Warehouse with its Databricks SQL offering. The serverless SQL warehouses have been kind of a game changer for me because they spin up immediately and size elastically. Pro tip: now is a great time to come up with a tagging strategy. You’ll be able to easily connect your BI tool (Tableau, Power BI, etc) to the warehouse for reporting. There are also a lot of really useful AI/BI opportunities available natively now. If you remember in the introduction, I said that I would have been happy had this only been a Lakehouse Federation offering. You still have the ability to take advantage of Federation to discover, query and govern data from Snowflake, Redshift, Salesforce, Teradata and many others all from within a Databricks instance. I’m still wrapping my head around being able to query Salesforce and SAP Data in a notebook inside Databricks inside SAP.

Mosaic AI + Joule

As a data engineer, I was most excited about zero-copy, bi-directional SAP data flow into Databricks. This is selfish because it solves my problems, but it is relatively short-sighted. The integration between SAP and Databricks will likely deliver the most value through agentic AI. Let's stipulate that I believe chat is not the future of GenAI. This is not a bold statement; most people agree with me. Assistants like copilots represent a stronger path forward. SAP thought so, hence Joule. It appears that SAP is leveraging the Databricks platform in general, and Mosaic AI in particular, to provide a next generation of Joule: an AI copilot infused with agents.

Conclusion

The integration of SAP  and the Databricks Lakehouse represents a transformative approach to enterprise data management. By uniting the strengths of SAP’s end-to-end process management and semantically rich data with the advanced analytics and scalability of a lakehouse architecture, organizations can drive better decisions, foster innovation, and simplify their data landscapes. Whether it’s unifying SAP and non-SAP data, enabling real-time insights, or scaling AI initiatives, this partnership provides a roadmap for the future of data-driven enterprises.

Contact us to learn more about how SAP Databricks can help supercharge your enterprise.

 

A New Era of AI Agents in the Enterprise?
https://blogs.perficient.com/2024/10/22/a-new-era-of-custom-ai-in-the-enterprise/ (Tue, 22 Oct 2024)

In a move that has sparked intense discussion across the enterprise software landscape, Klarna announced its decision to drop both Salesforce Sales Cloud and Workday, replacing these industry-leading platforms with its own AI-driven tools. This announcement, led by CEO Sebastian Siemiatkowski, may signal a paradigm shift toward using custom AI agents to manage critical business functions such as customer relationship management (CRM) and human resources (HR). While mostly social media fodder at this point, this very public bet on SaaS replacement has raised important questions about the future of enterprise software and how Agentic AI might reshape the way businesses operate.

AI Agents – Impact on Enterprises

Klarna's move may be a one-off internal pivot, or it may signal broader shifts that impact enterprises worldwide. Here are three ways this transition could affect the broader market:

  1. Customized AI Over SaaS for Competitive Differentiation Enterprises are always on the lookout for ways to differentiate themselves from the competition. Klarna’s decision may reflect an emerging trend: companies developing custom Agentic AI solutions to better tailor workflows and processes to their specific needs. The advantage here lies in having a system that is purpose-built for an organization’s unique requirements, potentially driving innovation and efficiencies that are difficult to achieve with out-of-the-box software. However, this approach also raises challenges. Building Agentic AI solutions in-house requires significant technical expertise, resources, and time. Not all companies will have the bandwidth to undertake such a transformation, but for those who do, it could become a key differentiator in terms of operational efficiency and personalized customer experiences.
  2. Shift in Vendor Relationships and Power Dynamics If more enterprises follow Klarna’s lead, we could see a shift in the traditional vendor-client dynamic. For years, businesses have relied on SaaS providers like Salesforce and Workday to deliver highly specialized, integrated solutions. However, AI-driven automation might diminish the need for comprehensive, multi-purpose platforms. Instead, companies might lean towards modular, lightweight tech stacks powered by AI agents, allowing for greater control and flexibility. This shift could weaken the power and influence of SaaS providers if enterprises increasingly build customized systems in-house. On the other hand, it could also lead to new forms of partnership between AI providers and SaaS companies, where AI becomes a layer on top of existing systems rather than a full replacement.
  3. Greater Focus on Data and Compliance Risks With AI agents handling sensitive business functions like customer management and HR, companies like Klarna must ensure that data governance, compliance, and security are up to the task. This shift toward Agentic AI requires robust mechanisms to manage customer and employee data, especially in industries with stringent regulatory requirements, like finance and healthcare. Marc Benioff, Salesforce’s CEO, raised these concerns directly, questioning how Klarna will handle compliance, governance, and institutional memory. AI might automate many processes, but without the proper safeguards, it could introduce new risks that legacy SaaS providers have long addressed. Enterprises looking to follow Klarna’s example will need to rethink how they manage these critical issues within their AI-driven frameworks.

AI Agents – SaaS Vendors Respond

As enterprises explore the potential of Agentic AI-driven systems, SaaS providers like Salesforce and Workday must adapt to a new reality. Klarna’s decision could be the first domino in a broader shift, forcing these companies to reconsider their own offerings and strategies. Here are three possible responses we could see from the SaaS giants:

  1. Doubling Down on AI Integration Salesforce and Workday are not standing still. In fact, both companies are already integrating AI into their platforms. Salesforce’s Einstein and the newly introduced Agentforce are examples of AI-powered tools designed to enhance customer interactions and automate tasks. We might see a rapid acceleration of these efforts, with SaaS providers emphasizing Agentic AI-driven features that keep businesses within their ecosystems rather than prompting them to build in-house solutions. However, as Benioff pointed out, the key might be blending AI with human oversight rather than replacing humans altogether. This hybrid approach will allow Salesforce and Workday to differentiate themselves from pure AI solutions by ensuring that critical human elements—like decision-making, customer empathy, and regulatory knowledge—are never lost.
  2. Building Modular and Lightweight Offerings Klarna’s move underscores the desire for flexibility and control over tech stacks. In response, SaaS companies may offer more modular, API-driven solutions that allow enterprises to mix and match components based on their needs. This would enable businesses to take advantage of best-in-class SaaS features without being locked into a monolithic platform. By offering modular systems, Salesforce and Workday could cater to enterprises looking to integrate AI while maintaining the core advantages of established SaaS infrastructure—such as compliance, security, and data management.
  3. Strengthening Data Governance and Compliance as Key Differentiators As AI grows in influence, data governance, compliance, and security will become critical battlegrounds for SaaS providers. SaaS companies like Salesforce and Workday have spent years building trusted systems that comply with various regulatory frameworks. Klarna’s AI approach will be closely scrutinized to ensure it meets these same standards, and any slip-ups could provide an opening for SaaS vendors to argue that their systems remain the gold standard for enterprise-grade compliance. By doubling down on their strengths in these areas, SaaS vendors could position themselves as the safer, more reliable option for enterprises that handle sensitive or regulated data. This approach could attract companies that are hesitant to take the AI plunge without fully understanding the risks.

What’s Next?

Klarna’s decision to replace SaaS platforms with a custom AI system may represent a significant shift in the enterprise software landscape. While this move highlights the growing potential of AI to reshape key business functions, it also raises important questions about governance, compliance, and the long-term role of SaaS providers. As organizations worldwide watch Klarna’s big bet play out, it’s clear that we are entering a new phase of enterprise software evolution—one where the balance between AI, human oversight, and SaaS will be critical to success.

What do you think? Is Klarna's move a sign of things to come, or will it encounter challenges that reaffirm the importance of traditional SaaS systems? Let's continue the SaaS replacement conversation in the comments below!

Agentic AI: The New Frontier in GenAI
https://blogs.perficient.com/2024/09/27/agentic-ai-the-new-frontier-in-genai/ (Fri, 27 Sep 2024)

In the rapidly evolving landscape of digital transformation, businesses are constantly seeking innovative ways to enhance their operations and gain a competitive edge. While Generative AI (GenAI) has been the hot topic since OpenAI introduced ChatGPT to the public in November 2022, a new evolution of the technology is emerging that promises to revolutionize how businesses operate: Agentic AI. 

What is Agentic AI? 

Agentic AI represents a fundamental shift in how we approach intelligence within digital systems.  

Unlike the first wave of Generative AI solutions that rely heavily on prompt engineering, agentic AI possesses the ability to make autonomous decisions based on predefined goals, adapting in real-time to changing environments. This enables a deeper level of interaction, as agents are able to “think” about the steps in a more structured and planned approach. With access to web search, outputs are more researched and comprehensive, transforming both efficiency and innovation potential for business. 

Key characteristics of Agentic AI include: 

  • Autonomy: Ability to perform tasks independently based on predefined goals or dynamically changing circumstances.
  • Adaptability: Learns from interactions, outcomes, and feedback to make better decisions in the future.
  • Proactivity: Not only responds to commands but can anticipate needs, automate tasks, and solve problems proactively.

As technology evolves at an unprecedented rate, agentic AI is positioned to become the next big thing in tech and business transformation, building upon the foundation laid by generative AI while enhancing automation, resource utilization, scalability, and specialization across various tasks. 

Leveraging Agentic Frameworks 

Central to this transformation is the concept of the Augmented Enterprise, which leverages advanced technologies to amplify human capabilities and business processes. Agentic Frameworks provide a structured approach to integrating autonomous systems and artificial intelligence (AI) into the enterprise. 

Agentic Frameworks refer to the strategic models and methodologies that enable organizations to deploy and manage autonomous agents—software entities that perform tasks on behalf of users or other systems. Use cases include code development, content creation, and more.  

Unlike traditional approaches that require explicit programming for each sequence of tasks, Agentic Frameworks provide the business integrations to the model and allow it to decide what system calls are appropriate to achieve the business goal.  

“The integration of agentic AI through well-designed frameworks marks a pivotal moment in business evolution. It’s not just about automating tasks; it’s about creating intelligent systems that can reason, learn, and adapt alongside human workers, driving innovation and efficiency to new heights.” – Robert Bagley, Director 

Governance and Ethical Considerations 

As we embrace the potential of agentic AI and our AI solutions begin acting on our behalf, developing robust AI strategy and governance frameworks becomes more essential. With the increasing complexity of regulatory environments, Agentic Frameworks must include mechanisms for auditability, compliance, and security, ensuring that the deployment of autonomous agents aligns with legal and ethical standards. 

“In the new agentic era, the scope of AI governance and building trust should expand from ethical compliance to include procedural compliance. As these systems become more autonomous, they must both operate within ethical boundaries and align with our organizational values. This is where thoughtful governance becomes a competitive advantage.” – Robert Bagley, Director 

To explore how your enterprise can benefit from Agentic Frameworks, implement appropriate governance programs, and become a truly Augmented Enterprise, reach out to Perficient’s team of experts today. Together, we can shape the future of your business in the age of agentic AI. 

 

Maximize Your Data Management with Unity Catalog
https://blogs.perficient.com/2024/08/23/unity-catalog-migration-tools-benefits/ (Fri, 23 Aug 2024)

Databricks Unity Catalog is a unified and open governance solution for data and AI, built into the Databricks Data Intelligence Platform.

Unity Catalog offers a comprehensive solution for enhancing data governance, operational efficiency, and technological performance. By centralizing metadata management, access controls, and data lineage tracking, it simplifies compliance, reduces complexity, and improves query performance across diverse data environments. The seamless integration with Delta Lake unlocks advanced technical features like predictive optimization, leading to faster data access and cost savings. Unity Catalog plays a crucial role in machine learning and AI by providing centralized data governance and secure access to consistent, high-quality datasets, enabling data scientists to efficiently manage and access the data they need while ensuring compliance and data integrity throughout the model development lifecycle.

Unity Catalog brings governance to data across your enterprise. Lakehouse Federation capabilities in Unity Catalog allow you to discover, query, and govern data across data platforms including MySQL, PostgreSQL, Amazon Redshift, Snowflake, Azure SQL Database, Azure Synapse, Google’s BigQuery, and more from within Databricks without moving or copying the data, all within a simplified and unified experience. Unity Catalog supports advanced data-sharing capabilities with Delta Sharing, enabling secure, real-time data sharing across organizations and platforms without the need for data duplication. Additionally, Unity Catalog facilitates the creation of secure data Clean Rooms, where multiple parties can collaborate on shared datasets without compromising data privacy. Its support for multi-cloud and multi-region deployments ensures operational flexibility and reduced latency, while robust security features, including fine-grained access controls, automated compliance auditing, and encryption, help future-proof your data infrastructure.
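
As a brief illustration of the Lakehouse Federation workflow mentioned above (the connection, secret scope, warehouse, and database names are hypothetical, and option keys vary by source system), connecting an external Snowflake database might look roughly like this:

# Hypothetical names and credentials; option keys differ per source system.
# Assumes the ambient `spark` session and appropriate Unity Catalog privileges.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Define a connection to the external system (Snowflake in this sketch).
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS snowflake_conn TYPE snowflake
    OPTIONS (
        host 'myaccount.snowflakecomputing.com',
        port '443',
        sfWarehouse 'COMPUTE_WH',
        user 'svc_databricks',
        password secret('federation-scope', 'snowflake-password')
    )
""")

# Expose a database from that system as a foreign catalog; queries run
# against the source with no data movement, governed by Unity Catalog.
spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS snowflake_sales
    USING CONNECTION snowflake_conn
    OPTIONS (database 'SALES')
""")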

These capabilities position your organization for scalable, secure, and efficient data management, driving innovation and maintaining a competitive edge. However, this fundamental transition will need to be implemented with minimal disruption to ongoing operations. This is where the Unity Catalog Migration Tool comes into play.

Unity Catalog Migration Tool

UCX, or the Unity Catalog Migration Tool, is an open-source project from Databricks Labs designed to streamline and automate the Unity Catalog migration process. UCX automates much of the work involved in transitioning to Unity Catalog, including migrating metadata, access controls, and governance policies. Migrating metadata ensures the enterprise will still have access to its data and AI assets after the transition. In addition to data, the migration tool ensures that security policies and access controls are accurately transferred and enforced in Unity Catalog. This capability is critical for maintaining data security and compliance during and after migration.

Databricks is continually developing UCX to better ensure that all your data assets, governance policies, and security controls are seamlessly transferred to Unity Catalog with minimal disruption to ongoing operations. Tooling and automation help avoid costly downtime or interruptions in data access that could impact business performance, thereby maintaining continuity and productivity. While automating these processes significantly reduces the time, effort, and cost of migration, the process is not automatic: evaluation, planning, quality control, change management, and additional coding and development work must happen alongside, and outside of, the tool. Supplying that knowledge and expertise is where Unity Catalog Migration Partners come into play.

Unity Catalog Migration Partner

An experienced Unity Catalog migration partner plans, executes, and manages the transition of your data assets, governance policies, and security controls, ensuring the migration is smooth, efficient, and aligned with your organization’s data governance and security requirements. Their duties typically include assessing the current data environment, designing a custom migration strategy, executing the migration while minimizing downtime and disruption, and providing post-migration support to optimize Unity Catalog’s features. Additionally, they offer expertise in data governance best practices and technical guidance to enhance your organization’s data management capabilities.

Databricks provides its system integrators with tools, guidance and best practices to ensure a smooth transition to Unity Catalog. Perficient has built upon those valuable resources to enable a more effective pipeline with our Unity Catalog Migration Accelerator.

Unity Catalog Migration Accelerator

Our approach to Unity Catalog migration is differentiated by our proprietary Accelerator, which includes a suite of project management artifacts and comprehensive code and data quality checks. This Accelerator streamlines the migration process by providing a structured framework that ensures all aspects of the migration are meticulously planned, tracked, and executed, reducing the risk of errors and delays. The built-in code and data quality checks automatically identify and resolve potential issues before they become problems, ensuring a seamless transition with minimal impact on business operations. By leveraging our Accelerator, clients benefit from a more efficient migration process, higher data integrity, and enhanced overall data governance, setting us apart from other Unity Catalog migration partners who may not offer such tailored and robust solutions.

In summary, Unity Catalog provides a powerful solution for modernizing data governance, enhancing performance, and supporting advanced data operations like machine learning and AI. With our specialized Unity Catalog migration services and unique Accelerator, we offer a seamless transition that optimizes data management and security while ensuring data quality and operational efficiency. If you’re ready to unlock the full potential of Unity Catalog and take your data infrastructure to the next level, contact us today to learn how we can help you achieve a smooth and successful migration. Contact us for a complimentary Migration Analysis and let’s work together on your data and AI journey!

Risk Management Data Strategy – Insights from an Inquisitive Overseer https://blogs.perficient.com/2024/08/19/risk-management-data-strategy/ https://blogs.perficient.com/2024/08/19/risk-management-data-strategy/#comments Mon, 19 Aug 2024 14:47:52 +0000 https://blogs.perficient.com/?p=367560

We are witnessing a sea change in the way data is managed by banks and financial institutions all over the world. Data being commoditized, and in some cases even monetized, by banks is the order of the day. However, adoption still needs a push in the risk management function. Traditional risk managers, by their job definition, are highly cautious of the result sets provided by the analytics teams. I have even heard the phrase “Please check the report, I don’t understand the models and hence trust the number”.

So, while the risk function is racing toward data aggregation (structured data, unstructured data, data quality, data granularity, news feeds, market overviews), it also faces a challenge of acceptance. The vision is that all of this data can be aggregated, harmonized, and used for better, faster, and more informed decision-making across financial and non-financial risk management. The interdependencies between the risks were factors that were not considered in the “Good Old Days” of risk management (pun intended).

Based on my experience, here are the common issues faced by banks that run the risk of not having a good risk data strategy.

1. The IT-Business tussle (“YOU don’t know what YOU are doing”)

This, in my view, is the biggest challenge facing traditional banks, especially in the risk function. “The Business”, in traditional banks, is treated like a larger-than-life entity that needs to be supported by IT. This notion of IT as the service provider and business as the “bread-earner” no longer holds, especially in traditional banks’ risk departments. It has been proven time and again that the two cannot function without each other, and that interdependence needs to be cultivated as a management mindset for the strategic data management effort as well. This is a culture change, but it is happening slowly and will have to be adopted industry-wide. It has been proven that the financial institutions with the most organized data have a significant market advantage.

2. Data Overload (“Dude! where’s my Insight”)

The primary goal of the data management, sourcing, and aggregation effort has to be converting data into informational insights. The teams analyzing the data warehouses and data lakes and supporting the analytics must keep this one major organizational goal in mind. Banks have silos, and these silos have been created by mergers, regulations, entities, risk types, Chinese walls, data protection, land laws, or sometimes just technological challenges over time. The solution to most of this is to start with a clean slate. A management mandate to get the right people talking and vested in this change is challenging but crucial. Good old analysis techniques and brainstorming sessions for weeding out what is unnecessary and arriving at the right set of elements are the key. This requires an overhaul in the way the banking business has traditionally looked at data, i.e., as something needed only for reporting. Understanding the data lineage and the touchpoint systems is most crucial.

3. The CDO Dilemma (“To meta or not to meta”)

The CDO’s role in most banks is now well defined. The risk and compliance analytics and reporting division depends almost solely on the CDO function for insights on regulatory reporting and other forms of innovative data analytics. The key success factor of the CDO organization lies in allocating the right set of analysts to the business areas. A CDO analyst on the market risk side, for instance, will have to be well versed in market data, bank hierarchies, VaR calculation engines, Risk not in VaR (RNiV), and the supporting reference data, in addition to the trade systems data that these data elements directly or indirectly impact, not to mention the critical data elements themselves. An additional understanding of how this would impact other forms of risk reporting, like credit risk and non-financial risk, is definitely a nice-to-have. Defining a metadata strategy for the full lineage, its touchpoints, and its transformations is a strenuous effort, requiring analysis of systems owned by disparate teams with siloed implementation patterns built up over time. One fix I have seen work is for every significant application group or team to have a senior representative for the CDO interaction. Vested stakeholder interest is turning out to be the one major success factor in the programs that have been successful. This ensures completeness of the critical data element definitions and hence aids the data governance strategy in a holistic way.

4. The ever-changing nature of financial risk management (“What did they change now?”)

The Basel Committee recommendations have consistently driven the urge to reinvent processes in the risk management area. With the Fundamental Review of the Trading Book (FRTB), the focus has very clearly been realigned to data processes in organizations. While the big banks had already demonstrated a sound understanding of modellable risk factors based on scenarios, this time the Basel Committee has also asked banks to focus on non-modellable risk factors (NMRF). Add the standard approach (sensitivities defined by the regulator) and the internal models approach (IMA, bank-defined enhanced sensitivities), and the change from entity-based to desk-based risk calculations is a significant paradigm shift. A single golden-source definition for transaction data, along with desk structure validation, seems to be a major area of concern among banks.

Add climate risk to the mix with the Paris Accord, and RWA calculations will now need additional data points, additional models, and additional investment in external data defining the associated physical and transition risks. Data lake / big data solutions with defined critical data elements and a full log of transformations with respect to lineage are a significant investment, but they will only work in your favor as more changes come through on the regulatory side. There have always been banks that are consistently great at this and banks that lag significantly.

All in all, risk management happens to be a great use case for a greenfield CDO data strategy implementation, and these hurdles have to be handled before reaching the ultimate Zen goal of a perfect risk data strategy. Believe me, the first step is to get the bank’s consolidated risk data strategy right, and everything else will follow.

 

This is a 2021 article, also published here: Risk Management Data Strategy – Insights from an Inquisitive Overseer | LinkedIn

Data Lake Governance with Tagging in Databricks Unity Catalog https://blogs.perficient.com/2024/02/29/data-lake-governance-with-tagging-in-databricks-unity-catalog/ https://blogs.perficient.com/2024/02/29/data-lake-governance-with-tagging-in-databricks-unity-catalog/#respond Thu, 29 Feb 2024 17:12:46 +0000 https://blogs.perficient.com/?p=357919

The goal of Databricks Unity Catalog is to provide centralized security and management for data and AI assets across the data lakehouse. Unity Catalog provides fine-grained access control for all the securable objects in the lakehouse: databases, tables, files, and even models. Gone are the limitations of the Hive metastore. The Unity Catalog metastore manages all data and AI assets across different workspaces and storage locations. Providing this level of access control substantially increases the quality of governance while reducing the workload involved. There is an additional target of opportunity with tagging.

Tagging Overview

Tags are metadata elements structured as key-value pairs that can be attached to any asset in the lakehouse. Tagging can make these assets more searchable, manageable, and governable. A well-structured, well-executed tagging strategy can enhance data classification, enable regulatory compliance, and streamline data lifecycle management. The first step is to identify a use case that could serve as a Proof of Value in your organization. A well-structured tagging strategy means that you will need buy-in and participation from multiple stakeholders, including technical resources, SMEs, and a sponsor. These are five common use cases for tagging that might find some traction in a regulated enterprise because they can usually be piggybacked off an existing or upcoming initiative:

  • Data Classification and Security
  • Data Lifecycle Management
  • Data Cataloging and Discovery
  • Compliance and Regulation
  • Project Management and Collaboration

Data Classification and Security

There is always room for an additional mechanism to help safely manage PII (personally identifiable information). A basic initial implementation of tagging could be as simple as applying a PII tag to classify data based on sensitivity. These tags can then be integrated with access control policies in Unity Catalog to automatically grant or restrict access to sensitive data. Balancing the promise of data access in the lakehouse with the regulatory realities surrounding sensitive data is always difficult. Additional tools are always welcome here.
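
As a concrete illustration, the snippet below tags a column as PII and then uses the tag views in information_schema to produce an audit list of every column carrying that classification, which access-control reviews can be run against. It is a minimal sketch assuming Databricks Runtime 13.3+, a notebook where spark is predefined, and the APPLY TAG, USE SCHEMA, and USE CATALOG privileges; the catalog, schema, table, column, and tag names are illustrative, not prescribed by this article.

    # Classify a column as PII with Unity Catalog tags (names are illustrative).
    table = "main.crm.customers"

    # Tag the table and the sensitive column.
    spark.sql(f"ALTER TABLE {table} SET TAGS ('sensitivity' = 'pii')")
    spark.sql(f"ALTER TABLE {table} ALTER COLUMN email SET TAGS ('sensitivity' = 'pii')")

    # Audit: list every column in the catalog currently classified as PII.
    display(spark.sql("""
        SELECT schema_name, table_name, column_name
        FROM main.information_schema.column_tags
        WHERE tag_name = 'sensitivity' AND tag_value = 'pii'
    """))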

Data Lifecycle Management

Some organizations struggle with the concept of managing different environments in Databricks. This is particularly true when they are moving from a data landscape where there were specific servers for each environment. Tags can be used to identify stages (ex: dev, test, and prod). These tags can then be leveraged to implement policies and practices around moving data through different lifecycle stages. For example, masking policies or transformation steps may be different between environments. Tags can also be used to facilitate rules around deliberate destruction of sensitive data. Geo-coding data with tags to comply with European regulations is also a possible target of opportunity.

Data Cataloging and Discovery

There can be a benefit in attaching descriptive tags directly to the data for cataloging and discovery even if you are already using an external tool. Adding descriptive tags like ‘customer’ or ‘marketing’ directly to the data assets themselves can make it more convenient for analysts and data scientists to perform searches, and therefore makes the assets more likely to actually be used.
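
As a sketch of what that convenience looks like in practice, the query below finds every table tagged with a hypothetical ‘domain’ key; the same pattern against the information_schema tag views also works for compliance tags such as ‘GDPR’ in the next use case. The catalog name and the tag key and value are assumptions for illustration.

    # Discover data assets by descriptive tag (catalog and tag names illustrative).
    display(spark.sql("""
        SELECT catalog_name, schema_name, table_name
        FROM main.information_schema.table_tags
        WHERE tag_name = 'domain' AND tag_value = 'marketing'
    """))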

Compliance and Regulation

This is related to, and can be used in conjunction with, data classification and security. Applying tags such as ‘GDPR’ or ‘HIPAA’ can make audits for regulators much simpler. These tags can be used in conjunction with security tags. In an increasingly regulated data environment, it pays to make your data assets easy to regulate.

Project Management and Collaboration

This tagging strategy can be used to organize data assets by project, team, or department. This can facilitate project management and improve collaboration by identifying which organizational unit owns or is working with a particular data asset.

Implementation

There are some practical considerations when implementing a tagging program:

  • each securable object has a limit of twenty tags
  • the maximum length of a tag is 255 characters, with no special characters allowed
  • you can only search by using exact match (pattern-matching would have really been nice here)

A well-executed tagging strategy will involve some level of automation. It is possible to manage tags in the Catalog Explorer. This can be a good way to kick the tires in the very beginning but automation is critical for a consistent, comprehensive application of the tagging strategy. Good governance is automated. While tagging is available to all securable objects, you will likely start out applying tags to tables.

The information schema tables will have the tag information. However, Databricks Runtime 13.3 and above allows tag management through SQL commands. This is the preferred mechanism because it is so much easier to use than querying the information schema. Regardless of the mechanism used, a user must have the APPLY TAG privilege on the object, the USE SCHEMA privilege on the object’s parent schema and the USE CATALOG privilege on the object’s parent catalog. This is pretty typical with Unity Catalog’s three-tiered hierarchy. If you are using SQL commands to manage tags, you can use the SET TAGS and UNSET TAGS clauses in the ALTER TABLE command.
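
Here is a minimal sketch of that SQL surface, assuming Databricks Runtime 13.3+, that you administer the parent catalog and schema, and that the APPLY TAG privilege can be granted at the table level in your workspace; the table, tag, and group names are placeholders.

    # Allow a steward group to manage tags on a specific table (names illustrative).
    spark.sql("GRANT APPLY TAG ON TABLE main.crm.customers TO `data-stewards`")

    # Attach a key-value tag, then remove it later with UNSET TAGS.
    spark.sql("ALTER TABLE main.crm.customers SET TAGS ('lifecycle' = 'prod')")
    spark.sql("ALTER TABLE main.crm.customers UNSET TAGS ('lifecycle')")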

You can use a fairly straightforward PySpark script to loop through a set of tables, look for a certain set of column names, and then apply tags as appropriate. This can be done as an initial one-time run and then automated, either by creating a distinct job that checks for new tables and/or columns or by including it in existing ingestion processes. There is a lot to be gained by augmenting this pipeline over time, from a script that simply checks for columns named ‘ssn’ to an ML job that looks for fields that contain social security numbers.
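
Below is a minimal sketch of that kind of script, assuming a notebook context where spark is predefined; the catalog and schema names, the list of sensitive column names, and the tag key and value are all illustrative assumptions.

    # Scan a schema for likely-sensitive column names and tag them as PII.
    # Catalog/schema names, the name heuristics, and the tag are illustrative.
    CATALOG, SCHEMA = "main", "crm"
    SENSITIVE_NAMES = {"ssn", "social_security_number", "tax_id", "phone"}

    columns = spark.sql(f"""
        SELECT table_name, column_name
        FROM {CATALOG}.information_schema.columns
        WHERE table_schema = '{SCHEMA}'
    """).collect()

    for row in columns:
        if row.column_name.lower() in SENSITIVE_NAMES:
            spark.sql(
                f"ALTER TABLE {CATALOG}.{SCHEMA}.{row.table_name} "
                f"ALTER COLUMN {row.column_name} SET TAGS ('sensitivity' = 'pii')"
            )

Scheduled as a job after ingestion, the same loop keeps newly landed tables and columns classified without manual intervention.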

Conclusion

I’ve seen a lot of companies struggle with populating their Databricks Lakehouse with sensitive data. In their prior state, databases had a very limited set of users, so only people who were authorized to see certain data, like PII, had access to the database that stored this information. However, the utility of a lakehouse is dramatically reduced if you don’t allow sensitive data; in most cases, it just won’t get any enterprise traction. Leveraging all of the governance and security features of Unity Catalog is a great, if not mandatory, first step. Enhancing governance and security, as well as utility, with tagging will probably be necessary to one degree or another in your organization to get broad usage and acceptance.

Contact us to learn more about how to build robustly governed solutions in Databricks for your organization.
