Skip to main content

Data & Intelligence

The New Data Integration Paradigm

Data integration has changed.  The old way of extracting data, moving it to a new server, A little stuffed animal called Hadooptransforming it, and then loading into a new system for reporting and analytics is now looking quite arcane. It’s expensive, time consuming, and does not scale to handle the volumes we are now seeing in the digitally transformed enterprise.

We saw this coming, with push down optimization and the early incarnations of Extract Load and Transform (ELT). Both of these architectural solutions were used to address scalability.

Hadoop has taken this to the next step where the whole basis of Hadoop is to process the data where it is stored.  Actually, this is bigger than Hadoop. The movement to cloud data integration will require the processing to be completed where the data is stored as well.

To understand how a solution may scale in a Hadoop or cloud centric architecture, one will need to understand where processing happens with regards to where the data is stored. To do this, one needs to ask vendors three questions:

  1. When is data moved off of the cluster? — Clearly understand when is data required to be moved off of the cluster. In general, the only time we should be moving data is to move to a downstream operational system to be consumed. Another way to put this, data should not be moved to an ETL or Data Integration server for processing then moved back to the Hadoop cluster.
  2. When is data moved from the data node? — Evaluate which functions require data to be moved off of the data node to a name or resource manager.   Tools that utilize Hive are of particular concern since anything that is pushed to Hive for processing will inherit the limitations of Hive. Earlier versions of Hive required data to be moved through the name node for processing.   Although Hive has made great strides pushing processing to the data nodes, there still are limitations.
  3. On the data node, when is data moved into memory?  — Within a Hadoop data node disk I/O is still a limiting factor.  Technologies that require data to be written to disk after a task is completed can quickly become I/O bound on the data node. Other solutions load data into all data into memory before processing may not scale to higher volumes.

Of course there is much more to be evaluated, however, choosing technologies that keep processing close to the data, instead of moving data to the processing will smooth the transition to the next generation architecture.   Follow Bill on Twitter @bigdata73

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Bill Busch

Bill is a Director and Senior Data Strategist leading Perficient's Big Data Team. Over his 27 years of professional experience he has helped organizations transform their data management, analytics, and governance tools and practices. As a veteran in analytics, Big Data, data architecture and information governance, he advises executives and enterprise architects on the latest pragmatic information management strategies. He is keenly aware of how to advise and lead companies through developing data strategies, formulating actionable roadmaps, and delivering high-impact solutions. As one of Perficient’s prime thought leaders for Big Data, he provides the visionary direction for Perficient’s Big Data capability development and has led many of our clients largest Data and Cloud transformation programs. Bill is an active blogger and can be followed on Twitter @bigdata73.

More from this Author

Follow Us