Previously, I discussed machine learning and the traits that separate it from artificial intelligence. This post looks at how dirty, or bad, data is the enemy of machine learning.
While entirely accurate and complete data is the goal of a comprehensive data management program, many firms fall short. Siloed governance projects and the lack of an overall data strategy often result in an inconsistent data quality framework. Yet complete and accurate data has never been more important. Training ML programs requires vast quantities of clean data, as many algorithms, such as neural networks (including deep learning models), gain accuracy incrementally from each set of data points. Additional clean data, held back from training, is then required to test the ML models and assess their accuracy.
If a firm does not yet have a comprehensive data management program in place, it’s never too late to start. Given the competitive importance of leveraging ML technologies, firms will need to find ways to cleanse their existing data sets. Within financial services, this includes any data that might be used in building and training a predictive ML model, including client, portfolio, market, reference, and master data. The good news is that there are many vendor software products available to identify, and in some cases repair, suspect data elements or data records. The specific method of analyzing data quality varies by product, so evaluate each candidate against your own data before selecting one.
It may seem somewhat incongruous, but ML is now being used to cleanse the data required to train other ML predictive applications. A new breed of ML-based data quality tools is emerging that is proving to be highly effective in identifying data omissions and inconsistencies. Clustering algorithms, such as k-means, group similar records together, and plotting the resulting clusters gives a visual way to identify patterns and pockets of data quality issues.
It should be noted, however, that as with most ML applications, the technology is not simply plug-and-play. It takes the hands of talented AI practitioners to determine the model dimensions and number of clusters to be analyzed, as well as to drill into and interpret the results.
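To make the clustering idea concrete, here is a minimal sketch in plain Python. It is not how any particular vendor tool works. The records, the `(price, quantity)` fields, the deterministic seeding, and the `min_share` threshold are all assumptions made for illustration, and a real pipeline would normally scale the features first and plot the clusters for review.

```python
from collections import Counter
from math import dist


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point that is farthest from every centroid chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm. Returns one cluster label per input point."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: dist(p, centroids[i])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels


def flag_small_clusters(points, labels, min_share=0.2):
    """Return points that sit in clusters holding less than min_share
    of all records. Small clusters are candidates for manual review."""
    counts = Counter(labels)
    return [
        p for p, lab in zip(points, labels)
        if counts[lab] / len(points) < min_share
    ]


# Hypothetical (price, quantity) records. The last two prices look as if
# they were keyed in cents rather than dollars.
records = [
    (10.0, 100.0), (10.2, 102.0), (9.8, 98.0), (10.1, 101.0), (9.9, 99.0),
    (50.0, 20.0), (50.5, 21.0), (49.5, 19.0), (50.2, 20.0), (49.8, 22.0),
    (1000.0, 100.0), (1020.0, 98.0),
]

labels = kmeans(records, k=3)
suspects = flag_small_clusters(records, labels)
print(suspects)
```

With these made-up records, the two mis-keyed prices end up in a small cluster of their own and are the only rows flagged. The choice of `k=3` and of which fields to cluster on is exactly the kind of judgment call that still needs a practitioner: a different `k` or a different threshold would flag different records.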
Deploying ML-based solutions, or even using ML tools to prepare the associated training data, is not for the uninitiated or faint-of-heart. There is a multitude of different ML algorithms to be aware of, and a key factor in the successful deployment of ML is the ability to select the appropriate algorithm for each situation. Even once a suitable algorithm is chosen, there are numerous parameters that must be tuned for a successful model. Training can often get “stuck” in so-called “local minima” (a consequence of a non-convex error surface) and produce suboptimal results, or no usable results at all.
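The local-minimum problem is easy to see on a toy example. The sketch below runs plain gradient descent on a made-up one-dimensional non-convex function. The function, the learning rate, and the starting points are all assumptions chosen for illustration and do not represent any real model's error surface.

```python
def loss(x):
    """A toy non-convex 'error surface' with two valleys."""
    return x**4 - 3 * x**2 + x


def grad(x):
    """Derivative of loss with respect to x."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(start, learning_rate=0.01, steps=2000):
    """Repeatedly step downhill from start and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# Same algorithm, same settings, two different starting points.
x_from_right = gradient_descent(2.0)
x_from_left = gradient_descent(-2.0)

print(f"start at  2.0 -> x = {x_from_right:.3f}, loss = {loss(x_from_right):.3f}")
print(f"start at -2.0 -> x = {x_from_left:.3f}, loss = {loss(x_from_left):.3f}")
```

Starting from 2.0, the descent settles in the shallower valley near x ≈ 1.13, while starting from -2.0 it reaches the deeper one near x ≈ -1.30. Nothing in the first run signals that a better answer exists, which is why practitioners commonly try several starting points or parameter settings and compare the results.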