Traditionally, our information architectures have included a number of staging or intermediate data storage systems. These have taken different forms over the years: publish directories on source systems, staging areas in data warehouses, data vaults, and, most commonly, data file hubs. In general, these staging solutions have suffered from two limitations:
- Because storage was costly, data retention was usually limited to a few months.
- Since these systems were intended to publish data for system-integration purposes, end users generally did not have access to staging data for analytics or data discovery.
Hadoop reduces the cost per terabyte of storage by two orders of magnitude: data that once consumed $100 worth of storage now costs $1 to store on a Hadoop system. This radical cost reduction enables enterprises to replace sourcing hubs with Hadoop-based data lakes, which can house years of data rather than only a few months.
Next, once data lands in the Hadoop Distributed File System (HDFS), it can be published either directly to tools that consume HDFS data or through Hive (or another SQL-like interface). This enables end users to apply analytical, data-discovery, and visualization tools to derive value from data within Hadoop.
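As a sketch of the Hive path: an external table can be layered over files already sitting in HDFS without moving or copying them, making the raw staging data queryable with ordinary SQL. The table name, columns, and HDFS path below are hypothetical, chosen only for illustration:

```sql
-- Hypothetical example: expose raw delimited files in the data lake
-- to SQL users via a Hive external table. "EXTERNAL" means Hive does
-- not take ownership of the files; dropping the table leaves them intact.
CREATE EXTERNAL TABLE orders_staging (
  order_id    BIGINT,
  customer_id BIGINT,
  order_ts    STRING,
  amount      DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/lake/orders/';

-- End users can then run analytics directly against the lake:
SELECT customer_id, SUM(amount) AS total_spend
FROM orders_staging
GROUP BY customer_id;
```

Because the table is only metadata over the existing files, years of retained history become queryable without a separate load step.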
The simple fact that these data lakes can retain historical data and provide scalable access for analytics also has a profound effect on the data warehouse. That effect will be the subject of my next few blog posts.