In the Hadoop space we have a number of terms for the Hadoop File System used for data management. Data Lake is probably the most popular. I have heard it called a Data Refinery, as well as some other not-so-mentionable names. The one that has stuck with me is the Data Reservoir, mainly because it is the most accurate water analogy for what actually happens in a Hadoop implementation that is used for data storage and integration.
Consider that data is first landed in the Hadoop file system. This is the unprocessed data, just like water running into a reservoir from different sources. Data in this form is only fit for limited use, such as analytics by trained power users. The data is then processed, just as water is processed. Process water and you end up with water that is consumable. Go one step further and distill it, and you have water that is suitable for medical applications. Data works the same way in a Big Data environment. Process it enough and you end up with conformed dimensions and fact tables. Process it even more, and you have data that is suitable for basing bonuses on, or even for publishing to government regulators.
Now, why spend a blog post on naming something? Actually, it is to make a point. The point addresses the fundamental difference between traditional data integration and data integration in the Next Generation Data Architecture. At its heart, data is data and data has value. Hadoop allows one to centralize not only all analytically significant data, but also all data integration activities. This contrasts with traditional data integration, which transforms data on ETL servers and stores it in data warehouses. In the Big Data world, you load the data once, then use the processing power of the cluster to continually improve it. If certain data elements are only used by analytical power users, they do not require much processing. If other data elements are required for reporting, they need standardization and cleaning. Lastly, if data elements are required for regulatory purposes, they need to be scored against data quality indicators. Regardless of the quality of data required, Big Data and Hadoop offer one cluster to rule them all.
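
To make the "load once, keep refining" idea a little more concrete, here is a minimal sketch in plain Python. It is only an illustration, not how any particular Hadoop job is written: the sample records, the `standardize` and `quality_score` functions, and the "fraction of populated fields" indicator are all assumptions made up for this example. On a real cluster each tier would typically be a dataset in the file system and each step a distributed job.

```python
# Hypothetical raw records, as they might land in the reservoir untouched.
RAW_RECORDS = [
    {"customer": "  Acme Corp ", "amount": "100.50"},
    {"customer": "acme corp", "amount": "n/a"},
    {"customer": "", "amount": "20"},
]


def standardize(record: dict) -> dict:
    """Reporting tier: trim and normalize names, parse amounts where possible."""
    name = record["customer"].strip().title()
    try:
        amount = float(record["amount"])
    except ValueError:
        amount = None  # keep the row, but mark the value as unusable
    return {"customer": name, "amount": amount}


def quality_score(record: dict) -> float:
    """Regulatory tier: fraction of populated fields (an illustrative indicator)."""
    checks = [bool(record["customer"]), record["amount"] is not None]
    return sum(checks) / len(checks)


# Load once; each tier is derived from the one before it, and the raw tier
# stays available unchanged for power users.
raw_tier = RAW_RECORDS
reporting_tier = [standardize(r) for r in raw_tier]
regulatory_tier = [{**r, "quality": quality_score(r)} for r in reporting_tier]

for row in regulatory_tier:
    print(row)
```

The point of the sketch is that nothing is copied out to a separate ETL server: the raw data stays where it landed, and each more refined tier is just another pass over what is already there.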