DataOps seeks to deliver high quality data fast in the same way that DevOps delivers high quality code fast. The names are similar; the goals are similar; the implementation is very different. Code quality can be measured using similar tools across multiple projects. Data quality is a mission-critical, enterprise-wide effort. That effort has consistently proven too much for most enterprises, and most enterprise data quality and governance initiatives end up as science projects. DataOps is a call to action, but the path is not yet clearly defined. IBM DataOps offers both a way of thinking about DataOps and tooling for implementation, using the IBM Global Chief Data Office’s (GCDO) automated metadata generation (AMG) tool and the IBM Watson Knowledge Catalog.
What is it?
DataOps is still a relatively new concept, so it makes sense to define the term clearly using a platform-agnostic source.
DataOps is a collaborative data management practice focused on improving the communication, integration and automation of data flows between data managers and data consumers across an organization.
— Gartner
DataOps, then, is concerned with the steps between data collection and analysis. For IBM, this falls under catalog and metadata management. These steps form an ordered process, a data pipeline, which is a bottleneck to delivering high quality data to end users:
- Data curation and governance
- Data Quality and Master Data management
- Data integration, replication and visualization
- Self-service data preparation and testing
DataOps is a collaboration among data consumers, data engineers and subject matter experts. Implement key performance indicators (KPIs) to track improvement over time:
- Know your data: Data Inventory KPIs
- Trust your data: Data Quality KPIs
- Use your data: Data Flow KPIs
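As a rough illustration of what the three KPI families might look like in practice, the sketch below computes one hypothetical metric for each. The record fields (`cataloged`, `null_rate`, `hours_to_deliver`), the sample values and the metric definitions are all assumptions made for this example; they are not IBM definitions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DatasetRecord:
    """Hypothetical per-dataset facts gathered for KPI tracking."""
    name: str
    cataloged: bool          # is the dataset registered in a catalog?
    null_rate: float         # fraction of missing values, 0.0 to 1.0
    hours_to_deliver: float  # time from access request to usable data


def inventory_kpi(records):
    """Know your data: share of datasets that are cataloged."""
    return sum(r.cataloged for r in records) / len(records)


def quality_kpi(records):
    """Trust your data: average completeness (1 - null rate)."""
    return mean(1.0 - r.null_rate for r in records)


def flow_kpi(records):
    """Use your data: average hours from request to delivery."""
    return mean(r.hours_to_deliver for r in records)


# Made-up sample data for illustration only.
records = [
    DatasetRecord("customers", True, 0.02, 4.0),
    DatasetRecord("orders", True, 0.10, 12.0),
    DatasetRecord("web_logs", False, 0.30, 72.0),
    DatasetRecord("suppliers", False, 0.18, 40.0),
]

print(f"Inventory KPI: {inventory_kpi(records):.0%}")
print(f"Quality KPI:   {quality_kpi(records):.0%}")
print(f"Flow KPI:      {flow_kpi(records):.1f} hours")
```

Recomputing the same three numbers on a schedule and comparing them over time is one simple way to track improvement; a real program would likely use several metrics per family.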
There is a maturity model that can be measured using these KPIs:
KPI | No DataOps | Foundational | Developed | Advanced |
Know | Spreadsheets | Departmental/LOB catalog | Enterprise Catalog | Enforced and enriched catalog |
Trust | Emails | Data Quality Program | Data governance program with data stewardship and business glossary | Compliance, business ontology and automated classification |
Use | Hand-coding | Data Virtualization, Data Integration and Data Replication | Self-service data prep and test data management | DataOps for all pipelines |
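The maturity model can also be written down as a small lookup structure, which makes it possible to score an organization's current practices. The sketch below is one possible encoding. The function names are hypothetical, and the rule that overall maturity equals the weakest KPI family is an assumption of this example, not part of the model.

```python
# Maturity levels, ordered from least to most mature.
LEVELS = ["No DataOps", "Foundational", "Developed", "Advanced"]

# One list of practices per KPI family, in the same order as LEVELS.
MATURITY_MODEL = {
    "Know": [
        "Spreadsheets",
        "Departmental/LOB catalog",
        "Enterprise Catalog",
        "Enforced and enriched catalog",
    ],
    "Trust": [
        "Emails",
        "Data Quality Program",
        "Data governance program with data stewardship and business glossary",
        "Compliance, business ontology and automated classification",
    ],
    "Use": [
        "Hand-coding",
        "Data Virtualization, Data Integration and Data Replication",
        "Self-service data prep and test data management",
        "DataOps for all pipelines",
    ],
}


def maturity_level(kpi, practice):
    """Return the maturity level name for a practice within a KPI family."""
    return LEVELS[MATURITY_MODEL[kpi].index(practice)]


def overall_level(practices):
    """Return the lowest level across KPI families (weakest-link assumption)."""
    indexes = [MATURITY_MODEL[kpi].index(p) for kpi, p in practices.items()]
    return LEVELS[min(indexes)]


current = {
    "Know": "Enterprise Catalog",
    "Trust": "Data Quality Program",
    "Use": "Hand-coding",
}
for kpi, practice in current.items():
    print(f"{kpi}: {maturity_level(kpi, practice)}")
print("Overall:", overall_level(current))
```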
How is it implemented?
DevOps has a strong foundation of integrated open source tools. This is no accident; developers write open source tools to make their daily lives easier. DataOps does not have that same ecosystem. The DataOps community is not made up of developers solving general coding problems. The DataOps problems are much more business-centric:
- lack of understanding of the data by business users
- lack of data governance and data quality
- questionable trustworthiness of the data
- inability to know what data is available and how to gain access
The IBM DataOps offering has two major components that seek to minimize the time and human effort involved. The first is the IBM Global Chief Data Office’s (GCDO) automated metadata generation (AMG) tool and the second is the IBM Watson Knowledge Catalog. Metadata is the key to delivering high quality data fast: with accurate and available labeling, users can easily find, understand and trust the data they need. Automated metadata management is therefore the key component in DataOps.
The GCDO’s AMG tool provides a series of deep learning models developed on ~60TB of labeled training data. This data is drawn from public sources, synthetically generated data and anonymized data from participating clients. Starting with such a comprehensive training set expedites the process of classifying new client data. Classifying data to make it easily discoverable, while providing the data stewardship, lineage, and impact analysis that assure it is trustworthy, improves both the Know and Trust KPIs. Providing self-service addresses the Use KPI, and this is where IBM Watson Knowledge Catalog comes into the picture.
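To make the idea of automated classification concrete, here is a deliberately simple stand-in that labels a column by matching its values against regular expressions. This is not how the AMG tool works (it uses deep learning models); the patterns, labels and the 80% threshold are assumptions chosen for illustration.

```python
import re

# Hypothetical labels and patterns; a real classifier would cover far more.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def classify_column(values, threshold=0.8):
    """Label a column of strings by the first pattern most values match.

    Returns "unclassified" when the column is empty or no pattern
    matches at least `threshold` of the non-empty values.
    """
    samples = [v.strip() for v in values if v and v.strip()]
    if not samples:
        return "unclassified"
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in samples if pattern.match(v))
        if hits / len(samples) >= threshold:
            return label
    return "unclassified"


columns = {
    "contact": ["ada@example.com", "grace@example.org", ""],
    "phone": ["(555) 123-4567", "555-987-6543"],
    "signup": ["2020-01-31", "2021-12-01"],
    "notes": ["called twice", "prefers email"],
}
for name, values in columns.items():
    print(f"{name}: {classify_column(values)}")
```

Labels produced this way become searchable metadata, which is what ties classification back to the Know KPI.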
IBM Watson Knowledge Catalog enables self-service discovery of data through a graphical interface to access, curate and share data. This is how DataOps is put into production. The Knowledge Catalog uses machine learning for intelligent discoverability of data sources, models and notebooks. Data lineage and glossaries are provided in the language of the business thanks to the AMG tool. Security features, such as dynamic data masking of sensitive data and automated scanning and risk assessment of unstructured data using Watson Knowledge Catalog InstaScan, allow this tool to be business facing while addressing potential compliance issues.
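Dynamic data masking means that sensitive values are obscured at read time, depending on who is asking, while the stored data stays unchanged. The sketch below shows the general idea only; it does not use the Watson Knowledge Catalog API, and the role name, column names and masking rule are hypothetical.

```python
# Columns treated as sensitive in this example (an assumption).
SENSITIVE_COLUMNS = {"email", "ssn"}


def mask_value(value):
    """Keep the last four characters and replace the rest with asterisks."""
    text = str(value)
    if len(text) <= 4:
        return "*" * len(text)
    return "*" * (len(text) - 4) + text[-4:]


def read_row(row, role):
    """Return the row as the given role may see it.

    The stored row is never modified; masking happens on a copy.
    """
    if role == "data_steward":  # hypothetical privileged role
        return dict(row)
    return {
        key: mask_value(val) if key in SENSITIVE_COLUMNS else val
        for key, val in row.items()
    }


row = {"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"}
print(read_row(row, "analyst"))
print(read_row(row, "data_steward"))
```

Because masking is applied per request, the same catalog entry can be exposed to business users and to data stewards without keeping separate redacted copies.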
Where do I go next?
The first step is to identify and quantify the need for DataOps in your organization. What would be the practical impact of getting high quality data fast? Next, you need to be realistic about how many resources can be allocated to DataOps. Maintaining high quality data is not a project; it’s a commitment. Is continuous data classification, cleansing and management something that is better done by resources in your organization or by an AI? The path to quality is the difference between DevOps and DataOps.