Welcome to “Big Data and You (the enterprise IT leader),” the Enterprise Content Intelligence group’s demystification of “Big Data.” The often-missing piece of the infrastructure-as-code movement emerging from the DevOps space is what we think of as DataOps. Big Data technologies are uniquely poised to fill this gap because they are designed for flexibility and pipeline automation from the start. At the same time, there are more options than ever for cell-level security, data lineage tracking, and end-to-end governance with record disposition and destruction.
There are really two sides to DataOps tooling: the tools for exploring and examining the data, and the tools for securing and tracking the data. These operational prerogatives are often at odds. Where DevOps is a strategy to increase collaboration between development and IT operations in support of Agile management, DataOps is a strategy to increase collaboration between data consumers, data controllers, and data producers toward the same end. Improved collaboration is where the technology can help.
We can approach this in three areas:
- Pipeline development and automation (data producers)
- Data access and exploration (data consumers)
- Data security and tracking (data controllers)
Pipeline development in the world of Big Data revolves around a few different tools for different stages of the process. For retrieving data and bringing it into the data lake, you can use tools like NiFi and Gremlin for batch work, and Storm and Kafka for transactional loads. These tools tend to handle orchestration internally and are highly dedicated to their purpose.

Once the data is in the data lake, you start to work with tools like Pig, Hive, and native MapReduce. These tools are great for detailed data manipulation and batch analysis, but they need outside help from tools like Falcon and Oozie to handle pipeline and workflow orchestration, respectively. As always, once the data is in the data lake, we no longer think of it as transactional. When there is a need for even higher speed or real-time pipeline activities, you call in Spark. Spark has greater flexibility when it comes to language support (Python, Scala, R) and does most of its work in memory, shaving time off the run.

You may even wish to use Kafka as an event queue so the same data lands in multiple stores as each system gets around to processing it. This way you can keep your transactional master up to date in real time, while letting the data lake and Lucene systems lag by a few minutes in a reliable way. For transactional systems that need to remain consistent, you’ll want to use Cassandra or HBase for transactional storage and pull data off into the data lake as needed for batch jobs. The right combination of these tools, with a creatively designed control system or rules engine, can lead to impressive flexibility while maintaining automation at runtime.
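To make the Spark side of such a pipeline concrete, here is a minimal sketch of a batch curation step, assuming PySpark and hypothetical data lake paths (/data/raw/events and /data/curated/events). In practice a scheduler like Oozie or Falcon would kick this off, and the transformation rules would come from your control system rather than being hard-coded.

```python
# A sketch of a batch curation step in the data lake (hypothetical paths).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("event-curation")
    .getOrCreate()
)

# Read the raw drop zone that the ingest layer (NiFi, Kafka consumers, etc.) writes to.
raw = spark.read.json("hdfs:///data/raw/events")

# Basic cleanup and derived columns; a real pipeline would pull these rules
# from a control system or rules engine rather than hard-coding them here.
curated = (
    raw
    .filter(F.col("event_type").isNotNull())
    .withColumn("event_date", F.to_date(F.col("event_ts")))
    .dropDuplicates(["event_id"])
)

# Write partitioned output for downstream Hive and analytics jobs.
(
    curated.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("hdfs:///data/curated/events")
)

spark.stop()
```

The same DataFrame code can be lifted into a notebook for exploration or wired into a streaming job, which is part of what makes Spark so useful for DataOps.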
For data access and exploration we rely heavily on our friend Spark. Support for multiple easy-to-use, high-level languages makes Spark an excellent tool for data exploration. With the notebook applications Jupyter and Zeppelin, a data scientist can come into a Spark environment and access all the richness stored in the data lake with familiar R- or Python-based visualization techniques, all within a web browser. For application-based data access for reporting or fixed analytics there are even more options. For some use cases, a JDBC connection to Hive will suffice (where speed is of little concern and data freshness is not critical). When we care about analytic performance, we might want to create Lucene indexes out of our data using Elasticsearch or Solr. These indexes allow you to interrogate any facet of the data in real time for reporting and analytics purposes. Data can be added on a schedule or in real time, and results will update as new data becomes available.
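As an illustration, here is a minimal notebook-style sketch, assuming a Hive-enabled PySpark session and a hypothetical curated.events table; the same pattern works in either Jupyter or Zeppelin.

```python
# A notebook-style exploration sketch (Jupyter or Zeppelin) over a hypothetical Hive table.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("event-exploration")
    .enableHiveSupport()
    .getOrCreate()
)

events = spark.table("curated.events")  # hypothetical table name

# Quick profile: daily event counts by type, computed on the cluster.
daily = (
    events
    .groupBy("event_date", "event_type")
    .count()
    .orderBy("event_date")
)

# Pull only the small summary back to the driver for familiar pandas/matplotlib work.
summary = daily.toPandas()
summary.pivot(index="event_date", columns="event_type", values="count").plot()
```

The heavy lifting stays on the cluster; only the aggregated result comes back to the notebook for visualization.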
Data security and governance can be accomplished using a combination of strategies and technologies, depending on the requirement. Security starts with authentication and authorization. Most of the systems we’ve discussed have some native capabilities: they generally rely on the Kerberos protocol or LDAP integration for authentication and expose some sort of extensible interface for authorization. Tools like Ranger allow us to control access to a variety of systems (HDFS, Hive, HBase, Solr, Kafka, etc.) from one place. This degree of central control allows us to explore new kinds of access control rules, like Ranger’s new classification-, time-, and geo-based policies. There is a movement in the Hadoop community toward better solutions for governance as well. These efforts are embodied in the Atlas project (and its integration with Hive, Falcon, Storm, Ranger, and other components). The combination of Atlas and Ranger is the first step toward true records management on the data lake.
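Central policy management also means policy becomes something you can inspect and automate. Below is a hedged sketch that lists the policies for a Hive service through Ranger's public REST API; the endpoint path, service name, and credentials shown here are assumptions and may differ by Ranger version and deployment.

```python
# A sketch of auditing access policies through Ranger's public REST API.
# The endpoint, service name, and credentials are assumptions and may
# differ by Ranger version and deployment (use Kerberos/SSL in practice).
import requests

RANGER_URL = "https://ranger.example.com:6182"  # hypothetical Ranger admin host
SERVICE = "hive_cluster"                        # hypothetical Hive service name
AUTH = ("admin", "changeme")                    # placeholder credentials

resp = requests.get(
    f"{RANGER_URL}/service/public/v2/api/service/{SERVICE}/policy",
    auth=AUTH,
    timeout=30,
)
resp.raise_for_status()

# Print which users each policy grants access to, as a quick review report.
for policy in resp.json():
    users = sorted({
        user
        for item in policy.get("policyItems", [])
        for user in item.get("users", [])
    })
    print(policy.get("name"), "->", users)
```

A script like this, run on a schedule, gives data controllers a simple way to review who can touch what without clicking through the admin UI.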
Many of these technologies (Atlas, Spark, Zeppelin, etc.) are fairly new and still evolving. They are built on a strong foundation in Lucene, Hadoop, and their friends, but there is still a long way to go.