Big Data presents a lot of new opportunities – from analytics on petabytes of data to Complex Event Processing (CEP) of many large streams of data in real time. However, Big Data also presents big challenges for Data Management. These challenges include security/privacy, governance, metadata management, data modeling, and backups.
Security – Security on many NoSQL-based Big Data platforms is not as mature as on their relational counterparts, and much of it seems to take place at the firewall level. Privacy is especially problematic, particularly for protected health information (PHI). Solutions such as IBM Guardium Encryption Expert or Gazzang look promising because they perform encryption/decryption at the OS level, below the filesystem, and so can be transparent to the application. Tokenization is another alternative: sensitive values are replaced with surrogate tokens, so referential integrity in the data can be maintained without exposing private information.
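To make the tokenization idea concrete, here is a minimal sketch in Python. It assumes a keyed hash (HMAC-SHA256) as the tokenization scheme, which is only one possible approach; many commercial products use a token vault with random tokens instead. The key, the `tok_` prefix, the field names, and the sample records are all hypothetical and exist only for illustration.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real deployment would fetch it
# from a key management system rather than hard-coding it.
SECRET_KEY = b"demo-key-not-for-production"


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic surrogate token for a sensitive value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability in this sketch; keep the full digest
    # if collisions are a concern.
    return "tok_" + digest[:16]


def tokenize_records(records, field):
    """Copy each record, replacing the sensitive field with its token."""
    return [{**record, field: tokenize(record[field])} for record in records]


# Two hypothetical data sets that share a sensitive key (an SSN).
patients = [
    {"ssn": "123-45-6789", "region": "NE"},
    {"ssn": "987-65-4321", "region": "SW"},
]
claims = [
    {"ssn": "123-45-6789", "amount": 120.0},
    {"ssn": "987-65-4321", "amount": 45.0},
    {"ssn": "123-45-6789", "amount": 80.5},
]

safe_patients = tokenize_records(patients, "ssn")
safe_claims = tokenize_records(claims, "ssn")

# The same input always yields the same token, so the two tokenized
# data sets can still be joined on the token column.
claims_by_token = {}
for claim in safe_claims:
    claims_by_token.setdefault(claim["ssn"], []).append(claim["amount"])

for patient in safe_patients:
    print(patient["region"], claims_by_token[patient["ssn"]])
```

Because the token is derived deterministically from the value, the join between the two data sets still works, yet neither tokenized data set contains the original identifier. The trade-off is that anyone holding the key can recompute tokens for guessed values, so the key needs the same protection as the data it replaces.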
Data Governance – Another complex twist is that, with the massive amount of data available, privacy becomes more difficult to protect, even beyond PHI or credit card numbers. Intellectual property or potentially damaging information that has been stored in unstructured data sources can be exposed more easily. With a massive number of data points, inferences can be made (e.g., using Semantic Web technologies) to uncover knowledge that was previously much more difficult to piece together. Data Governance and Data Stewardship must be involved when exploiting Big Data technologies to ensure that information is properly secured and used according to business requirements.
Metadata Management – Many metadata management solutions have not yet developed scanners or buses to import or exchange metadata from Big Data platforms. That metadata is needed for the data lineage and impact analysis that support Data Governance and change management. However, more and more data management tools (e.g., Informatica for data integration, Composite for data virtualization) are starting to support Big Data technologies such as Hadoop, and so metadata may be exposed through these tools, though probably not with the same ease as with relational databases, given the “schema-less” nature of many NoSQL databases.
Data Modeling – Given the “schema-less” nature of NoSQL databases, the data modeling task will morph. The change falls primarily on Physical Data Modeling, as DDL doesn’t need to be generated. However, Conceptual and Logical Data Modeling will remain important. With the advent of the Semantic Web, ontology modeling will become much more prevalent. I call ontology modeling Conceptual Data Modeling on steroids, as it can inform machine learning and inferencing.
Backups – Performing backups on a Big Data platform can be problematic, as backup solutions for these platforms are in their infancy. One can easily imagine the problems of performing backups when many terabytes are loaded every hour or day. However, many Big Data platforms have built-in availability. With Hadoop HDFS, for example, each data block is replicated on three nodes (by default), and placement is rack aware so that a block is stored on at least two different racks. With Cassandra, data can be replicated across data centers. Replication does not, of course, protect against data being deleted, since a delete is applied to every replica.
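The rack-aware placement described above can be illustrated with a small simulation. This is a simplified sketch of HDFS's documented default policy for a replication factor of three (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack). It is not HDFS code: the `place_replicas` function and the rack and node names are hypothetical, and the sketch ignores the fallbacks HDFS applies when nodes are full or unavailable.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick three (rack, node) locations for one block.

    topology maps a rack name to the list of nodes in that rack.
    writer is the node that is writing the block.
    """
    local_rack = next(
        rack for rack, nodes in topology.items() if writer in nodes
    )
    # Replica 1: the node that is writing the block.
    first = (local_rack, writer)

    # Replicas 2 and 3: two different nodes in a single remote rack.
    remote_racks = [
        rack
        for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second_node, third_node = rng.sample(topology[remote_rack], 2)

    return [first, (remote_rack, second_node), (remote_rack, third_node)]


# Hypothetical three-rack cluster.
topology = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
    "rack3": ["n7", "n8"],
}

replicas = place_replicas("n1", topology)
for rack, node in replicas:
    print(rack, node)
```

Every placement spans exactly two racks, so losing a whole rack still leaves at least one copy of the block. All three copies are live replicas of the same data, though, which is why this protects against hardware failure but not against an accidental delete.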
Big Data presents a paradigm shift and will require adjustment in data management practices. Not every new system will require a Big Data solution or NoSQL platform, but where required, careful consideration of data management requirements is necessary.