What is broke? If I drive a pickup truck around that has a small, unobtrusive crack in the windshield and a few dings in the paint, it will still pull a boat and haul a bunch of lumber from Home Depot. Is the pickup broke if it still meets my needs?
So, when is data broke? In our legacy data integration practices, we would profile data and identify everything that was wrong with it. Orphan keys, inappropriate values, and incomplete data (to name a few) would be identified before data was moved. In the more stringent organizations, data would need to be near perfect before it could be used in a data warehouse. This ideal of perfect data was strived for, but rarely attained. It was too expensive, required too much business buy-in, and lengthened BI and DW projects.
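To make those profiling checks concrete, here is a minimal sketch in plain Python. The tables and column names are hypothetical, but the three checks mirror the issues named above: orphan foreign keys, inappropriate values, and incomplete rows.

```python
# Hypothetical customer and order tables with the classic defects a
# legacy profiling pass would flag before any data moved.
customers = [
    {"id": 1, "name": "Acme"},
    {"id": 2, "name": None},  # incomplete data: missing name
]
orders = [
    {"order_id": 10, "customer_id": 1, "qty": 5},
    {"order_id": 11, "customer_id": 99, "qty": 5},   # orphan key: no customer 99
    {"order_id": 12, "customer_id": 2, "qty": -3},   # inappropriate value: negative qty
]

customer_ids = {c["id"] for c in customers}

# Orphan keys: orders pointing at customers that don't exist.
orphans = [o for o in orders if o["customer_id"] not in customer_ids]

# Inappropriate values: quantities that can't be valid.
bad_values = [o for o in orders if o["qty"] <= 0]

# Incomplete data: rows with missing fields.
incomplete = [c for c in customers if any(v is None for v in c.values())]

print(len(orphans), len(bad_values), len(incomplete))  # → 1 1 1
```

In the legacy approach, every one of these findings would have to be resolved before the load; the point of this post is that, increasingly, they don't.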
In the world of Big Data, things have changed. We move data first, and some of it we may never fix. Why? To understand this, we need to look at the analytical process. When performing an analytical project, data scientists will usually select a subset of data and split it into halves. One half of the data is used to build a model; the other is used to test the model. If the model tests OK (that is, the standard error is within an acceptable range), do we need to fix the data? Fixing the data at this point would not change the outcome, so the data has served its purpose.
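The workflow above can be sketched in a few lines of plain Python. The dataset, the linear model, and the acceptance threshold are all assumptions for illustration; the point is the shape of the process: split, build on one half, test on the other, and only worry about data quality if the error is unacceptable.

```python
import random

# Hypothetical dataset: (x, y) pairs with a known linear trend plus noise.
random.seed(42)
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.5)) for x in range(100)]

# Split the data into halves: one to build the model, one to test it.
random.shuffle(data)
train, test = data[:50], data[50:]

# Build a simple linear model (ordinary least squares) on the training half.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

# Test the model on the held-out half: root-mean-square prediction error.
sq_errors = [(y - (slope * x + intercept)) ** 2 for x, y in test]
std_error = (sum(sq_errors) / len(test)) ** 0.5

ACCEPTABLE = 1.0  # assumed threshold; in practice the business sets this
if std_error <= ACCEPTABLE:
    print("model is acceptable; cleansing the rest of the data wouldn't change the call")
else:
    print("error too high; time to profile and fix the underlying data")
```

If the held-out error clears the threshold, fixing the remaining defects buys nothing for this analysis, which is exactly the argument for deferring cleanup.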
Moving the data first, into a data lake for processing, gives us a unique opportunity to test-drive the data. Data scientists and business users can benefit from using the data to make better decisions. When the data quality does not meet the needs, address the issues within the data then. So, don't fix it if it ain't broke.
Follow Bill on Twitter @bigdata73
Connect with Perficient on LinkedIn here.