Recently, at Informatica World 2019, I heard about the importance of a data platform in building AI capabilities for an organization. What is interesting is that Informatica, known for products that deliver the “Switzerland of Data”, is now using AI to enhance its own product suite through CLAIRE. While exploring other articles on the importance of data, I also came across Monica Rogati’s Data Science Hierarchy of Needs and was impressed by the way she relates the AI structure to Maslow’s Hierarchy of Needs.
In a way, the “self-actualization” that Maslow defines as “achieving one’s full potential” is the AI capability. To get there, however, you need the basics of a data platform foundation. An important distinction between Monica Rogati’s Data Science Hierarchy and my pyramid structure is that I assume you will use software products such as Informatica, whose GUI-based capabilities let you spend more time on governance, analysis, and quality and less time writing custom code. Please keep that assumption in mind as you read this article.
Data Platform Path
FIND
It’s paramount to identify and clearly define the “use case” that the AI team is going after. Without a meaningful use case, building machine learning and automation just for the sake of exploration doesn’t provide any value. Once the use case is defined, find where the data resides, whether inside the enterprise or outside it (benchmark, 3rd party, etc.).
COLLECT
With commercial and open-source tools available in the data marketplace, you can quickly build data integrations that collect real-time or batch data into a data lake. Don’t overthink the quality of the data at this point.
UNDERSTAND
Once you collect data into a data lake, understand what you collected by profiling the datasets and mapping them back to your use case. You can also define tags on your data to give your datasets business context. In addition, make an effort to classify the data into categories that make business sense.
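As a rough illustration of what profiling and tagging can look like, here is a minimal Python sketch. The sample rows, column names, and tags are all hypothetical, and a commercial tool would do far more than this. It only counts rows, missing values, and distinct values per column, then prints each column next to its business tag.

```python
from collections import Counter


def profile(rows):
    """Return per-column row count, missing count, distinct count, and top value."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing values.
        present = [v for v in values if v not in (None, "")]
        counts = Counter(present)
        report[col] = {
            "rows": len(values),
            "nulls": len(values) - len(present),
            "distinct": len(counts),
            "top": counts.most_common(1)[0][0] if counts else None,
        }
    return report


# Hypothetical sample standing in for a dataset landed in the data lake.
sample_rows = [
    {"customer_id": "C1", "country": "US", "email": "a@example.com"},
    {"customer_id": "C2", "country": "US", "email": ""},
    {"customer_id": "C3", "country": "DE", "email": None},
]

# Hypothetical business tags that give each column some context.
tags = {"customer_id": "identifier", "country": "geography", "email": "contact"}

for column, stats in profile(sample_rows).items():
    print(column, tags.get(column, "untagged"), stats)
```

A high missing count on a column your use case depends on is a signal to revisit the source before moving on to integration.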
INTEGRATE & TRANSFORM
Once you tag and classify your datasets, integrate data from multiple sources into one data model that can support your defined use cases. In some cases, this can instead be an enhancement of your existing data model so that it supports multiple use cases.
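To make the idea of integrating sources into one data model concrete, here is a small Python sketch. It assumes two hypothetical source systems (a CRM export and a billing export) that name the customer key differently. The field names and records are invented for illustration.

```python
def integrate(crm_rows, billing_rows):
    """Merge two source systems into one record per customer id."""
    model = {}
    for row in crm_rows:
        model[row["customer_id"]] = {
            "customer_id": row["customer_id"],
            "name": row["full_name"],
            "total_billed": 0.0,
        }
    for row in billing_rows:
        # Billing calls the key "cust"; create a stub record if the CRM has no match.
        record = model.setdefault(
            row["cust"],
            {"customer_id": row["cust"], "name": None, "total_billed": 0.0},
        )
        record["total_billed"] += row["amount"]
    return model


# Hypothetical extracts from the two source systems.
crm = [
    {"customer_id": "C1", "full_name": "Ada"},
    {"customer_id": "C2", "full_name": "Grace"},
]
billing = [
    {"cust": "C1", "amount": 120.0},
    {"cust": "C1", "amount": 30.0},
    {"cust": "C3", "amount": 75.5},
]

unified = integrate(crm, billing)
for record in unified.values():
    print(record)
```

Records with a missing name (customer C3 here) show where the sources disagree, which is useful input for the data quality work that follows.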
ENRICH
Integration should also include data enrichment. Many open datasets, such as weather, traffic patterns, currency, disaster, and health conditions, are available for the public to consume. In addition, third-party datasets such as Dun & Bradstreet can help validate customer addresses.
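Enrichment is essentially a lookup against an outside dataset. The Python sketch below assumes hypothetical sales records and a hypothetical open weather dataset keyed by city and date. None of the field names or values come from a real provider.

```python
def enrich(sales_rows, weather_by_key):
    """Attach weather attributes to each sales row, matched on (city, date)."""
    enriched = []
    for row in sales_rows:
        weather = weather_by_key.get((row["city"], row["date"]), {})
        # Rows without a weather match keep None so the gap stays visible.
        enriched.append(
            {**row, "temp_c": weather.get("temp_c"), "rain": weather.get("rain")}
        )
    return enriched


# Hypothetical internal sales data.
sales = [
    {"city": "Austin", "date": "2019-05-20", "units": 40},
    {"city": "Boston", "date": "2019-05-20", "units": 25},
]

# Hypothetical open weather dataset, keyed by (city, date).
weather = {
    ("Austin", "2019-05-20"): {"temp_c": 31, "rain": False},
}

for row in enrich(sales, weather):
    print(row)
```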
SCALE
It’s clear that to integrate such large, disparate datasets and build data models from them, your cloud or on-premises data platform must be able to perform at scale. Use performance tuning and storage/compute techniques that deliver results on time.
EXPERIENCE
Good-quality data doesn’t mean anything unless results are shown in a format that different audience levels (line level to executives) can consume. Reporting platforms such as Power BI, Tableau, and MicroStrategy have been market leaders for a reason: they can build beautiful visualizations from streaming or large batch datasets. This is why large cloud vendors such as Salesforce have been acquiring BI companies like Tableau to enhance their visualization capabilities.
Defining Metrics
Another important factor is to define metrics and measures clearly, so that actions are taken based on facts.
MONITOR
Building the data platform is not a one-time activity. Data, like infrastructure, needs continuous monitoring and improvement based on feedback from business subject matter experts (SMEs), who also act as data SMEs. Therefore, as you build your data platform, use monitoring services and build notifications and alerts based on thresholds driven by business needs. Additionally, you can rate your datasets by their relevance to your decision-making process. This will improve the quality of the data that matters most to the organization. It will also help you prioritize critical datasets over others, much like putting tighter SLAs on important systems and their recovery procedures.
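Threshold-based alerting can be sketched in a few lines of Python. The metric names and limits below are hypothetical placeholders for whatever your business SMEs agree on, and a real platform would send these alerts through its monitoring service rather than printing them.

```python
def check_thresholds(metrics, thresholds):
    """Return an alert message for every metric that is missing or over its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f"{name}: no measurement received")
        elif value > limit:
            alerts.append(f"{name}: {value} exceeds threshold {limit}")
    return alerts


# Hypothetical limits agreed with the business.
thresholds = {
    "null_rate_email": 0.05,
    "hours_since_last_load": 24,
    "duplicate_rate": 0.01,
}

# Hypothetical measurements from the latest pipeline run.
metrics = {"null_rate_email": 0.12, "hours_since_last_load": 3}

alerts = check_thresholds(metrics, thresholds)
for alert in alerts:
    print(alert)
```

Treating a missing measurement as an alert is a deliberate choice here: a check that silently stops reporting is often worse than one that fails loudly.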
AI & DEEP LEARN
All the steps above lead to building machine learning algorithms and automation processes that create relevant opportunities and directly impact your organization’s bottom line.
The sequence above manages your data through the lifecycle of data preparation, but data security and data governance also play a key role in managing that lifecycle. In addition, DevOps brings agility to building the data platform, keeping the business moving and adapting as mergers and acquisitions dominate the current landscape.