Operational systems frequently need data archived (i.e., backed up and purged from the database) in order to maintain performance SLAs. Historical data is often retained for a much longer period in a data warehouse, but because a data warehouse is intended to store integrated, standardized, historical (and perhaps current) data to support cross-functional analysis and reporting, it can't truly be considered an archive platform where the original data is stored exactly as it appeared in the operational source system.
Record retention policies and laws often specify that business records be retained in exactly the format in which they were originally recorded. We can, of course, archive our data off to tape and consider our archiving complete.
However, an alternative worth considering is to use a Big Data cluster/grid to store all of our archived business records, whether structured or unstructured in nature (database records, contract images, etc.). There are a couple of reasons to consider this approach:
1) Online retention of all data – if the data can legally be retained indefinitely (sometimes you can't retain records past a certain point, e.g., juvenile court records), you can have rapid access to all of your historical data in its original format. As soon as you purge data from your system, someone will think of a reason why they want or need to look at or analyze it. Keeping it online minimizes the time required to make that information re-accessible. For example, legal discovery can be performed more rapidly if all the historical data is retained. Your data scientists and analysts will also be very happy to have all of this historical data available for data mining and investigation.
2) Big Data platforms such as Hadoop and Cassandra can store any kind of file in a distributed, redundant fashion, so fault tolerance is built into your cluster (homogeneous compute nodes) or grid (heterogeneous compute nodes). By default in Hadoop, each data block is replicated onto three nodes, though this is configurable (see the sketch after this list). Cassandra can replicate across multiple data centers for even greater fault tolerance. Because a Big Data platform runs on low-cost commodity servers with DAS (Direct Attached Storage), the cost of retaining all of this information can be driven down considerably. As a result, archival off to tape might not be a mandatory step in the archive process.
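To make the replication point concrete, here is a minimal sketch, in Java, of copying an archived file into HDFS and adjusting its replication factor. The path names and the replication factor of 5 are illustrative assumptions for this example; the FileSystem calls are standard Hadoop API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchiveToHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical source and target paths for an archived contract image.
        Path local = new Path("/archive/staging/contract-1234.tif");
        Path hdfs  = new Path("/archive/contracts/2012/contract-1234.tif");

        // Copy the file into the cluster; each block is replicated
        // across data nodes per dfs.replication (3 by default).
        fs.copyFromLocalFile(local, hdfs);

        // Optionally raise the replication factor for critical records
        // (assumed here to be 5); HDFS re-replicates in the background.
        fs.setReplication(hdfs, (short) 5);
    }
}
```

The point of the sketch is that redundancy is a property of the storage platform itself, not something the archiving job has to build.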
It should go without saying that retaining the metadata describing your business records is essential, so that structural changes over time can be accounted for. One option for representing structured business records from a database in the Big Data archival platform is to convert the transactions to XML, so that the structure and relationships are encapsulated and can be leveraged for the long term without having to reload the data back into the operational system. This will of course increase the amount of space required to store the data, but given the benefits it can be a reasonable tradeoff. And be sure that the XSDs describing your XML files are retained and secured!
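As an illustration of that conversion, here is a minimal sketch that reads rows from a database over JDBC and writes each row as a self-describing XML element. The connection URL, the ORDERS table, and the element names are all hypothetical assumptions; the JDBC and StAX APIs used are standard JDK/driver calls.

```java
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class RecordsToXml {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC URL and table; substitute your operational source.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:ops", "sa", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM ORDERS");
             FileWriter out = new FileWriter("orders-archive.xml")) {

            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            ResultSetMetaData meta = rs.getMetaData();

            xml.writeStartDocument();
            xml.writeStartElement("orders");
            while (rs.next()) {
                xml.writeStartElement("order");
                // One child element per column, named after the column label,
                // so the record's structure travels with the data.
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    xml.writeStartElement(meta.getColumnLabel(i).toLowerCase());
                    String value = rs.getString(i);
                    xml.writeCharacters(value == null ? "" : value);
                    xml.writeEndElement();
                }
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        }
    }
}
```

A file like this would be archived alongside the XSD generated from the source schema, per the retention point above, so the records remain interpretable even after the source system changes or is retired.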
Data Archival appears in the Big Data stack in the Data Sourcing component (existing archives might serve as a data source for the Big Data platform), though archival might also fall under the Operations component to support business record retention, legal discovery, and other types of applications.
Figure 1 – Perficient’s Big Data Stack
I am interested to hear your feedback on, and experience with, using Big Data clusters/grids as an archival platform. You can reach me at pete.stiglich@perficient.com, on Twitter at @pstiglich, or in the comments section below.