DataOps with IBM

DataOps seeks to deliver high-quality data fast in the same way that DevOps delivers high-quality code fast. The names are similar and the goals are similar, but the implementation is very different. Code quality can be measured using similar tools across multiple projects. Data quality, by contrast, is a mission-critical, enterprise-wide effort. That effort has consistently proven too much for most enterprises, and most enterprise data quality and governance initiatives end up as science projects. DataOps is a call to action, but the path is not yet clearly defined. IBM DataOps offers both a way of thinking about DataOps and tooling for implementation using the IBM Global Chief Data Office’s (GCDO) automated metadata generation (AMG) tool and the IBM Watson Knowledge Catalog.

What is it?

DataOps is still a relatively new concept, so it makes sense to define the term clearly using a platform-agnostic source.

DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization.

— Gartner

DataOps, then, is concerned with the steps between data collection and analysis. For IBM, this falls under catalog and metadata management. These steps form an ordered process, a data pipeline, which is often the bottleneck in delivering high-quality data to end users:

Data curation and governance
Data quality and master data management
Data integration, replication and visualization
Self-service data preparation and testing

DataOps is a collaboration among data consumers, data engineers and subject matter experts. Implement key performance indicators (KPIs) in order to track improvement over time:

Know your data: Data Inventory KPIs
Trust your data: Data Quality KPIs
Use your data: Data Flow KPIs
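
As a rough illustration of how the three KPI families above might be tracked, here is a minimal Python sketch. The Dataset fields, the function names and the specific formulas (share of cataloged datasets, share of rows passing quality rules, average delivery time) are assumptions made for this example; the article does not prescribe how IBM computes these KPIs.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical record describing one dataset in the organization."""
    name: str
    cataloged: bool          # Know: is the dataset registered in a catalog?
    rows_total: int          # Trust: number of rows checked
    rows_valid: int          # Trust: rows that passed the quality rules
    hours_to_deliver: float  # Use: time from request to availability


def inventory_kpi(datasets):
    """Know your data: fraction of datasets that are cataloged."""
    if not datasets:
        return 0.0
    return sum(d.cataloged for d in datasets) / len(datasets)


def quality_kpi(datasets):
    """Trust your data: fraction of all rows that pass quality rules."""
    total = sum(d.rows_total for d in datasets)
    if total == 0:
        return 0.0
    return sum(d.rows_valid for d in datasets) / total


def flow_kpi(datasets):
    """Use your data: average hours needed to deliver a dataset."""
    if not datasets:
        return 0.0
    return sum(d.hours_to_deliver for d in datasets) / len(datasets)


# Illustrative sample data, not real measurements.
datasets = [
    Dataset("orders", True, 1000, 950, 4.0),
    Dataset("customers", True, 500, 450, 8.0),
    Dataset("clicks", False, 2000, 1600, 24.0),
    Dataset("returns", False, 500, 500, 12.0),
]

print(f"Know:  {inventory_kpi(datasets):.1%} of datasets cataloged")
print(f"Trust: {quality_kpi(datasets):.1%} of rows valid")
print(f"Use:   {flow_kpi(datasets):.1f} hours average delivery time")
```

Recomputing these numbers on a schedule and comparing them with earlier runs is one simple way to see whether each dimension is improving over time.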

There is a maturity model that can be measured using these KPIs:

KPI	No DataOps	Foundational	Developed	Advanced
Know	Spreadsheets	Departmental/LOB catalog	Enterprise Catalog	Enforced and enriched catalog
Trust	Emails	Data Quality Program	Data governance program with data stewardship and business glossary	Compliance, business ontology and automated classification
Use	Hand-coding	Data Virtualization, Data Integration and Data Replication	Self-service data prep and test data management	DataOps for all pipelines
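
The maturity table above can also be expressed as a small lookup structure, which makes a self-assessment easy to script. The sketch below is only an illustration: the function names are hypothetical, and the rule that overall maturity equals the weakest of the three dimensions is an assumption of this example, not something the model itself states.

```python
# Maturity levels, ordered from least to most mature.
LEVELS = ["No DataOps", "Foundational", "Developed", "Advanced"]

# One row per KPI dimension; entries line up with LEVELS.
MATURITY = {
    "Know": [
        "Spreadsheets",
        "Departmental/LOB catalog",
        "Enterprise Catalog",
        "Enforced and enriched catalog",
    ],
    "Trust": [
        "Emails",
        "Data Quality Program",
        "Data governance program with data stewardship and business glossary",
        "Compliance, business ontology and automated classification",
    ],
    "Use": [
        "Hand-coding",
        "Data Virtualization, Data Integration and Data Replication",
        "Self-service data prep and test data management",
        "DataOps for all pipelines",
    ],
}


def level_of(dimension, practice):
    """Return the maturity level that a practice corresponds to."""
    return LEVELS[MATURITY[dimension].index(practice)]


def overall_level(practices):
    """Assumed rule: overall maturity is the weakest dimension's level."""
    indices = [MATURITY[dim].index(p) for dim, p in practices.items()]
    return LEVELS[min(indices)]


# A hypothetical organization with an enterprise catalog but hand-coded flows.
current = {
    "Know": "Enterprise Catalog",
    "Trust": "Data Quality Program",
    "Use": "Hand-coding",
}

for dim, practice in current.items():
    print(f"{dim}: {level_of(dim, practice)}")
print("Overall:", overall_level(current))
```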

How is it implemented?

DevOps has a strong foundation of integrated open source tools. This is no accident; developers write open source tools to make their daily lives easier. DataOps does not have that same ecosystem. The DataOps community is not made up of developers solving general coding problems. The DataOps problems are much more business-centric:

lack of understanding of the data by business users
lack of data governance and data quality
questionable trustworthiness of the data
inability to know what data is available and how to gain access

IBM’s DataOps offering has two major components that seek to minimize the time and human effort involved. The first is the IBM Global Chief Data Office’s (GCDO) automated metadata generation (AMG) tool and the second is the IBM Watson Knowledge Catalog. Metadata is the key to delivering high-quality data fast: with accurate and available labeling, users can easily find, understand and trust the data they need. Automated metadata management is the key component in DataOps.

The GCDO’s AMG tool provides a series of deep learning models developed on ~60 TB of labeled training data. This data is based on public sources, synthetically generated data and anonymized participating client data. Starting with such a comprehensive training set expedites the process of classifying new client data. Classifying data to make it easily discoverable, while providing the data stewardship, lineage and impact analysis to assure it is trustworthy, improves both the Know and Trust KPIs. Providing self-service addresses the Use KPI, and this is where IBM Watson Knowledge Catalog comes into the picture.

IBM Watson Knowledge Catalog provides self-discovery of data through a graphical interface to access, curate and share data. This is how DataOps is put into production. The Knowledge Catalog uses machine learning for intelligent discoverability of data sources, models and notebooks. Data lineage and glossaries are provided in the language of the business thanks to the AMG tool. Security features, such as dynamic data masking of sensitive data and automated scanning and risk assessment of unstructured data using Watson Knowledge Catalog InstaScan, allow this tool to be business-facing while addressing potential compliance issues.
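
To make the idea of dynamic data masking more concrete, here is a minimal Python sketch of the general technique: sensitive values are masked when they are read, depending on who is asking, while the stored data stays unchanged. This is not the Watson Knowledge Catalog API. The column names, the "steward" role and the keep-last-four-characters rule are all assumptions made for illustration.

```python
# Columns treated as sensitive in this example (an assumption).
SENSITIVE_COLUMNS = {"email", "ssn"}

# Roles allowed to see unmasked values (an assumption).
UNMASKED_ROLES = {"steward"}


def mask_value(value: str) -> str:
    """Replace all but the last four characters with asterisks."""
    if len(value) <= 4:
        return "*" * len(value)
    return "*" * (len(value) - 4) + value[-4:]


def read_row(row: dict, role: str) -> dict:
    """Return a copy of the row, masking sensitive columns at read time."""
    if role in UNMASKED_ROLES:
        return dict(row)
    return {
        column: mask_value(str(value)) if column in SENSITIVE_COLUMNS else value
        for column, value in row.items()
    }


# Illustrative record, not real data.
row = {"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"}

print(read_row(row, role="analyst"))  # sensitive columns masked
print(read_row(row, role="steward"))  # full values visible
```

Because the masking happens on read, the same catalog entry can serve business users and data stewards without maintaining separate copies of the data.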

Where do I go next?

The first step is to identify and quantify the need for DataOps in your organization. What would be the practical impact of getting high quality data fast? Next, you need to be realistic about how many resources can be allocated to DataOps. Maintaining high quality data is not a project; it’s a commitment. Is continuous data classification, cleansing and management something that is better done by resources in your organization or by an AI? The path to quality is the difference between DevOps and DataOps.

by David Callaghan on June 12th, 2020

David Callaghan, Senior Solutions Architect
