One Cluster To Rule Them All!

In the Hadoop space we have a number of terms for the Hadoop file system when it is used for data management. Data Lake is probably the most popular. I have heard it called a Data Refinery, as well as some other not-so-mentionable names. The one that has stuck with me is the Data Reservoir, mainly because it is the most accurate water analogy for what actually happens in a Hadoop implementation used for data storage and integration.

Consider that data is first landed in the Hadoop file system. This is the unprocessed data, just like water running into a reservoir from different sources. Data in this form is only fit for limited use, such as analytics by trained power users. The data is then processed, just as water is processed. Treat the water and you end up with water that is consumable. Go one step further and distill it, and you have water that is suitable for medical applications. Data is the same way in a Big Data environment. Process it enough and you end up with conformed dimensions and fact tables. Process it even more, and you have data that is suitable for calculating bonuses or even publishing to government regulators.
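
To make the staged refinement concrete, here is a minimal PySpark sketch of the reservoir idea: land raw data as-is, clean it into a consumable form, then conform it into a fact table. The paths, column names, and transformations are hypothetical illustrations, not a prescribed design.

# A minimal PySpark sketch of the "reservoir" stages described above.
# Paths, column names, and rules are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-reservoir-stages").getOrCreate()

# Stage 1: land the raw data as-is (the unprocessed "reservoir" water).
raw = spark.read.option("header", "true").csv("/data/landing/orders/")

# Stage 2: standardize and clean (the "consumable" water).
clean = (raw
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .dropDuplicates(["order_id"]))
clean.write.mode("overwrite").parquet("/data/refined/orders/")

# Stage 3: conform into a fact table (the "distilled" water).
fact_orders = (clean
    .groupBy("order_date", "customer_id")
    .agg(F.sum("amount").alias("total_amount")))
fact_orders.write.mode("overwrite").parquet("/data/conformed/fact_orders/")

The point is that every stage reads from and writes back to the same cluster; the data never leaves the reservoir to be refined.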


Now why spend a blog post on naming something? It is to make a point. The point is the fundamental difference between traditional data integration and data integration in the next-generation data architecture. At its heart, data is data and data has value. Hadoop allows one to centralize not only all analytically significant data, but also all data integration activities. This contrasts with traditional data integration, which transforms data on ETL servers and stores it in data warehouses. In the Big Data world, load the data once, then use the processing power of the cluster to continually improve it. If certain data elements are only used by analytical power users, they do not require much processing. If other data elements are required for reporting, they need standardization and cleansing. Lastly, if data elements are required for regulatory purposes, they need to be scored against data quality indicators. Regardless of the quality of data required, Big Data and Hadoop offer one cluster to rule them all.
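
For the regulatory end of that spectrum, a sketch of what "scoring against data quality indicators" might look like on the cluster is shown below. The checks, threshold, and zone paths are assumptions for illustration only, not a standard set of indicators.

# A hypothetical PySpark sketch of scoring refined data against simple
# data quality indicators before publishing it for regulatory reporting.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-scoring").getOrCreate()

orders = spark.read.parquet("/data/refined/orders/")

# Score each row: the fraction of quality checks it passes.
scored = orders.withColumn(
    "dq_score",
    (F.when(F.col("order_id").isNotNull(), 1).otherwise(0)
     + F.when(F.col("customer_id").isNotNull(), 1).otherwise(0)
     + F.when(F.col("amount") >= 0, 1).otherwise(0)) / F.lit(3.0))

# Only rows that pass every check move to the regulatory zone;
# the rest stay in the reservoir for remediation.
(scored
    .filter(F.col("dq_score") >= 1.0)
    .write.mode("overwrite")
    .parquet("/data/regulatory/orders/"))

Again, the same cluster that serves power users also produces the regulator-grade output; only the amount of processing differs.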


Bill Busch

Bill is a Director and Senior Data Strategist leading Perficient's Big Data Team. Over his 27 years of professional experience he has helped organizations transform their data management, analytics, and governance tools and practices. As a veteran in analytics, Big Data, data architecture, and information governance, he advises executives and enterprise architects on the latest pragmatic information management strategies. He is keenly aware of how to advise and lead companies through developing data strategies, formulating actionable roadmaps, and delivering high-impact solutions. As one of Perficient's prime thought leaders for Big Data, he provides the visionary direction for Perficient's Big Data capability development and has led many of our clients' largest Data and Cloud transformation programs. Bill is an active blogger and can be followed on Twitter @bigdata73.
