
DataOps with IBM


DataOps seeks to deliver high-quality data fast in the same way that DevOps delivers high-quality code fast. The names are similar; the goals are similar; the implementation is very different. Code quality can be measured with similar tools across multiple projects, but data quality is a mission-critical, enterprise-wide effort. That effort has consistently proven too much for most enterprises, and most enterprise data quality and governance initiatives end up as science projects. DataOps is a call to action, but the path is not yet clearly defined. IBM DataOps offers both a way of thinking about DataOps and tooling for implementation, using the IBM Global Chief Data Office’s (GCDO) automated metadata generation (AMG) tool and the IBM Watson Knowledge Catalog.

What is it?

DataOps is still a relatively new concept, so it makes sense to define the term clearly using a platform-agnostic source.

DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization.

— Gartner

DataOps, then, is concerned with the steps between data collection and analysis. For IBM, this falls into the domain of catalog and metadata management. The ordered process below is the data pipeline, and it is the bottleneck to delivering high-quality data to end users (a minimal code sketch follows the list):

  1. Data curation and governance
  2. Data quality and master data management
  3. Data integration, replication and virtualization
  4. Self-service data preparation and testing
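
As a rough illustration of how these four stages chain together, here is a minimal sketch in Python. The stage functions, field names and record structure are hypothetical placeholders for this article, not IBM tooling; in practice each stage is backed by dedicated products rather than inline code.

```python
# A minimal, illustrative sketch of the four-stage data pipeline above.
# Every function and field name here is a hypothetical placeholder.

def curate(records):
    """Stage 1: tag each record with ownership and governance metadata."""
    return [{**r, "owner": "finance", "classification": "internal"} for r in records]

def enforce_quality(records):
    """Stage 2: drop records that fail basic quality rules (e.g. missing keys)."""
    return [r for r in records if r.get("customer_id") is not None]

def integrate(records):
    """Stage 3: merge replicated/virtualized sources into a single ordered view."""
    return sorted(records, key=lambda r: r["customer_id"])

def prepare(records):
    """Stage 4: shape the data for self-service consumption."""
    return [{"id": r["customer_id"], "owner": r["owner"]} for r in records]

raw = [{"customer_id": 42}, {"customer_id": None}, {"customer_id": 7}]
ready = prepare(integrate(enforce_quality(curate(raw))))
print(ready)  # [{'id': 7, 'owner': 'finance'}, {'id': 42, 'owner': 'finance'}]
```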

DataOps is a collaboration among data consumers, data engineers and subject matter experts. Implement key performance indicators (KPIs) in order to track improvement over time (a sketch of how such metrics might be computed follows the list):

  • Know your data: Data Inventory KPIs
  • Trust your data: Data Quality KPIs
  • Use your data: Data Flow KPIs
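
Here is one hedged way such KPIs might be computed. The metric definitions below (catalog coverage, field completeness, self-service fulfillment rate) are illustrative assumptions for this article, not IBM's formulas.

```python
# Illustrative KPI calculations for the three families above.
# The metric definitions are assumed examples, not IBM's formulas.

def inventory_kpi(cataloged_assets, total_assets):
    """Know your data: share of known data assets registered in the catalog."""
    return cataloged_assets / total_assets

def quality_kpi(records, required_fields):
    """Trust your data: share of records with every required field populated."""
    complete = sum(
        1 for r in records if all(r.get(f) is not None for f in required_fields)
    )
    return complete / len(records)

def flow_kpi(requests_fulfilled, requests_total):
    """Use your data: share of data-access requests fulfilled via self-service."""
    return requests_fulfilled / requests_total

records = [{"id": 1, "email": "a@x.com"}, {"id": 2, "email": None}]
print(f"Know: {inventory_kpi(812, 1000):.0%}")                # Know: 81%
print(f"Trust: {quality_kpi(records, ['id', 'email']):.0%}")  # Trust: 50%
print(f"Use: {flow_kpi(45, 60):.0%}")                         # Use: 75%
```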

Progress can be tracked against a maturity model measured with these KPIs:

|       | No DataOps   | Foundational                                               | Developed                                                            | Advanced                                                  |
| ----- | ------------ | ---------------------------------------------------------- | -------------------------------------------------------------------- | --------------------------------------------------------- |
| Know  | Spreadsheets | Departmental/LOB catalog                                   | Enterprise catalog                                                   | Enforced and enriched catalog                             |
| Trust | Emails       | Data quality program                                       | Data governance program with data stewardship and business glossary | Compliance, business ontology and automated classification |
| Use   | Hand-coding  | Data virtualization, data integration and data replication | Self-service data prep and test data management                      | DataOps for all pipelines                                 |

How is it implemented?

DevOps has a strong foundation of integrated open source tools. This is no accident; developers write open source tools to make their daily lives easier. DataOps does not have that same ecosystem. The DataOps community is not made up of developers solving general coding problems. The DataOps problems are much more business-centric:

  • lack of understanding of the data by business users
  • lack of data governance and data quality
  • questionable trustworthiness of the data
  • inability to know what data is available and how to gain access

IBM's DataOps offering has two major components that seek to minimize the time and human effort involved. The first is the IBM Global Chief Data Office’s (GCDO) automated metadata generation (AMG) tool, and the second is the IBM Watson Knowledge Catalog. Metadata is the key to delivering high-quality data fast because accurate, available labeling lets users easily find, understand and trust the data they need. Automated metadata management is the key component in DataOps.

The GCDO’s AMG tool provides a series of deep learning models developed on ~60TB of labeled training data. This data is based on public sources, synthetically generated data and anonymized data from participating clients. Starting with such a comprehensive training set expedites the process of classifying new client data. Classifying data to make it easily discoverable, while providing the data stewardship, lineage and impact analysis to assure it is trustworthy, increases both the Know and Trust KPIs. Providing self-service addresses the Use KPI, and this is where IBM Watson Knowledge Catalog comes into the picture.
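
To show the shape of the problem AMG solves, here is a toy rule-based column classifier. AMG itself uses deep learning models trained on labeled data; the regex patterns, labels and threshold below are purely illustrative stand-ins for "label columns automatically."

```python
import re

# Toy rule-based column classifier, standing in for the AMG concept.
# The patterns, labels and threshold are illustrative assumptions only.

PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]{2,}$", re.IGNORECASE),
    "phone": re.compile(r"^\+?[\d\-\s()]{7,15}$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def classify_column(values, threshold=0.8):
    """Return the metadata label whose pattern matches most non-null values."""
    non_null = [v for v in values if v]
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in non_null if pattern.match(v))
        if non_null and hits / len(non_null) >= threshold:
            return label
    return "unclassified"

column = ["alice@example.com", "bob@example.org", None, "carol@example.net"]
print(classify_column(column))  # email
```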

IBM Watson Knowledge Catalog enables self-discovery of data by providing a graphical interface to access, curate and share data. This is how DataOps is productionized. The Knowledge Catalog uses machine learning for intelligent discoverability of data sources, models and notebooks. Data lineage and glossaries are provided in the language of the business thanks to the AMG tool. Security features such as dynamic data masking of sensitive data, along with automated scanning and risk assessment of unstructured data using Watson Knowledge Catalog InstaScan, allow this tool to be business-facing while addressing potential compliance issues.
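
Dynamic data masking can be pictured as redaction applied at read time based on the caller's role. The sketch below is a generic illustration of that policy idea, not the Watson Knowledge Catalog API; the role names, field list and masking rule are assumptions.

```python
# Generic illustration of dynamic data masking: sensitive fields are
# redacted at read time based on the caller's role. This is NOT the
# Watson Knowledge Catalog API, just the policy idea in miniature.

SENSITIVE_FIELDS = {"ssn", "email"}  # assumed field list

def mask(value):
    """Replace all but the last two characters with asterisks."""
    return "*" * (len(value) - 2) + value[-2:]

def read_record(record, role):
    """Data stewards see raw values; everyone else sees masked ones."""
    if role == "data_steward":
        return dict(record)
    return {
        k: mask(v) if k in SENSITIVE_FIELDS else v
        for k, v in record.items()
    }

row = {"name": "Alice", "ssn": "123-45-6789", "email": "alice@example.com"}
print(read_record(row, "analyst"))
# {'name': 'Alice', 'ssn': '*********89', 'email': '***************om'}
```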

Where do I go next?

The first step is to identify and quantify the need for DataOps in your organization. What would be the practical impact of getting high quality data fast? Next, you need to be realistic about how many resources can be allocated to DataOps. Maintaining high quality data is not a project; it’s a commitment. Is continuous data classification, cleansing and management something that is better done by resources in your organization or by an AI? The path to quality is the difference between DevOps and DataOps.


David Callaghan, Solutions Architect

As a solutions architect with Perficient, I bring twenty years of development experience and I'm currently hands-on with Hadoop/Spark, blockchain and cloud, coding in Java, Scala and Go. I'm certified in and work extensively with Hadoop, Cassandra, Spark, AWS, MongoDB and Pentaho. Most recently, I've been bringing integrated blockchain (particularly Hyperledger and Ethereum) and big data solutions to the cloud with an emphasis on integrating modern data products such as HBase, Cassandra and Neo4j as the off-blockchain repository.
