Data integration has changed. The old way of extracting data, moving it to a new server, transforming it, and then loading it into a new system for reporting and analytics now looks quite arcane. It’s expensive, time-consuming, and does not scale to handle the volumes we are now seeing in the digitally transformed enterprise.
We saw this coming with pushdown optimization and the early incarnations of Extract, Load, and Transform (ELT). Both of these architectural approaches were used to address scalability by running the transformation inside the system that already holds the data.
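To make the pushdown idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for the data store. The `sales` table, its columns, and the function names are illustrative assumptions, not part of any particular vendor's product. The first function pulls every row out and aggregates in the client, which is the old ETL pattern; the second pushes the aggregation down as SQL so only the results leave the store.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory table (hypothetical 'sales' data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", 1.0)],
    )
    conn.commit()
    return conn


def totals_etl_style(conn):
    """Move every row to the client, then aggregate there."""
    totals = {}
    rows_moved = 0
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        rows_moved += 1
        totals[region] = totals.get(region, 0.0) + amount
    return totals, rows_moved


def totals_pushdown(conn):
    """Let the data store aggregate; only one row per region is moved."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


if __name__ == "__main__":
    conn = build_demo_db()
    print(totals_etl_style(conn))
    print(totals_pushdown(conn))
```

Both functions return the same totals, but the number of rows crossing the boundary differs. On a toy table the gap is trivial; at enterprise volumes it is the difference the rest of this post is concerned with.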
Hadoop takes this a step further: its whole basis is to process the data where it is stored. In fact, this is bigger than Hadoop. The move to cloud data integration will also require processing to be completed where the data is stored.
To understand how a solution may scale in a Hadoop or cloud-centric architecture, one needs to understand where processing happens relative to where the data is stored. To do this, ask vendors three questions:
- When is data moved off of the cluster? Understand clearly when data is required to leave the cluster. In general, the only time we should be moving data is to deliver it to a downstream operational system for consumption. Put another way, data should not be moved to an ETL or data integration server for processing and then moved back to the Hadoop cluster.
- When is data moved from the data node? Evaluate which functions require data to be moved off of the data node to a name node or resource manager. Tools that utilize Hive are of particular concern, since anything that is pushed to Hive for processing will inherit the limitations of Hive. Earlier versions of Hive required data to be moved through the name node for processing. Although Hive has made great strides in pushing processing to the data nodes, there are still limitations.
- On the data node, when is data moved into memory? Within a Hadoop data node, disk I/O is still a limiting factor. Technologies that require data to be written to disk after each task is completed can quickly become I/O bound on the data node. At the other extreme, solutions that load all of the data into memory before processing may not scale to higher volumes.
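The memory question in the last bullet can be illustrated with a small, hedged sketch in plain Python. It is not how any Hadoop engine is implemented; the CSV layout, the `amount` column, and the function names are assumptions made for the example. One function materializes every value in memory before processing, while the other streams records and keeps only a running total.

```python
import csv
import io
from typing import Iterable, Iterator


def read_amounts(lines: Iterable[str]) -> Iterator[float]:
    """Yield the hypothetical 'amount' column one record at a time."""
    for row in csv.DictReader(lines):
        yield float(row["amount"])


def total_load_all(lines: Iterable[str]) -> float:
    """Materialize the whole column first; memory grows with input size."""
    amounts = list(read_amounts(lines))
    return sum(amounts)


def total_streaming(lines: Iterable[str]) -> float:
    """Keep only a running total; memory use stays roughly constant."""
    total = 0.0
    for amount in read_amounts(lines):
        total += amount
    return total


if __name__ == "__main__":
    sample = "region,amount\neast,10.0\nwest,5.0\neast,2.5\n"
    print(total_load_all(io.StringIO(sample)))
    print(total_streaming(io.StringIO(sample)))
```

Both approaches give the same answer on small inputs. The difference shows up only as volumes grow, which is why it is worth asking a vendor which of the two patterns a given function follows.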
Of course, there is much more to evaluate. However, choosing technologies that keep processing close to the data, instead of moving data to the processing, will smooth the transition to the next-generation architecture. Follow Bill on Twitter @bigdata73