In my last blog post, I introduced Perficient’s Big Data Stack to provide a framework for a Big Data environment. The stack diagram below has two key vertical tiers – the application tier and the infrastructure tier. I will discuss the application tier in this article.
Figure 1 – Perficient’s Big Data Stack
As with any data project, it is of course important to understand how it will be used (the functional requirements), and equally important to identify non-functional requirements (requirements for performance, scalability, availability, etc.). Requirements drive design, and to a large degree should drive infrastructure decisions. Your Data Scientists need to be actively engaged in identifying requirements for a Big Data application and platform, as they will usually be the primary users.
Most Big Data platforms are geared towards analytics – processing, storing, and analyzing massive amounts of data (hundreds of terabytes, petabytes, and beyond) – but Big Data is also being used for operational applications, such as patient monitoring, real-time medical device data integration, and web intrusion detection.
Analytics on a Big Data platform is typically not a replacement for your Business Intelligence capabilities. A powerful analytics use case for Big Data is supporting ad-hoc analyses that may never need to be repeated: Data Scientists formulate hypotheses and use the Big Data platform to investigate them. Some hold the opinion that Big Data means unstructured data. Unstructured data certainly works well on a Big Data platform, but massive amounts of structured data (such as medical device readings, RFID data, or ordinary tabular data) can take advantage of one as well, letting you perform Data Mining inexpensively, even with open source data mining tools such as R and Mahout.
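To make the ad-hoc, hypothesis-driven workflow concrete, here is a minimal sketch in plain Python. The data and the hypothesis (that heart rate rises with activity level) are fabricated for illustration; in practice a Data Scientist would run this kind of analysis at scale with tools such as R or Mahout.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy stand-in for rows of a massive structured table of device readings
heart_rate = [72, 75, 71, 88, 95, 102, 110]
activity   = [1.0, 1.2, 0.9, 2.5, 3.1, 3.8, 4.4]

r = pearson(heart_rate, activity)
print(f"Pearson r = {r:.2f}")
if r > 0.7:
    print("Hypothesis supported: strong positive correlation")
```

The point is the workflow shape – formulate a hypothesis, test it against the data, move on – rather than the specific statistic used.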
Most of the time you will be sourcing data for your Big Data application, and so the Data Sourcing component sits at the top of the application tier in the Big Data Stack. Many tools are available for sourcing, such as ETL tools, log scanners, and streaming message queues. Due to the massive scalability and reliability of Big Data platforms such as Hadoop and Cassandra (NoSQL technologies), such platforms may be an ideal place to archive ALL of your data online. With these technologies, each data block is automatically replicated on multiple machines (or even across multiple data centers, in the case of Cassandra). Failure of nodes in the cluster or grid is expected, so node failures can be handled gracefully and automatically. Having all of your archived data available online in a single environment can provide a rich environment for your data scientists.
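The fault tolerance described above can be sketched as a toy simulation. This is not the actual HDFS or Cassandra placement algorithm (both use more sophisticated rack- and datacenter-aware strategies); it only illustrates why a replication factor of three lets a cluster lose a node without losing data.

```python
import itertools

REPLICATION_FACTOR = 3

def place_blocks(blocks, nodes, rf=REPLICATION_FACTOR):
    """Assign each block to rf distinct nodes, round-robin style."""
    ring = itertools.cycle(range(len(nodes)))
    placement = {}
    for block in blocks:
        start = next(ring)
        placement[block] = {nodes[(start + i) % len(nodes)] for i in range(rf)}
    return placement

nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = [f"blk_{i}" for i in range(10)]
placement = place_blocks(blocks, nodes)

# Simulate a node failure: every block still has surviving replicas,
# so the platform can re-replicate them in the background.
failed = "node3"
survivors = {b: reps - {failed} for b, reps in placement.items()}
assert all(len(reps) >= 2 for reps in survivors.values())
print(f"All {len(blocks)} blocks remain available after losing {failed}")
```

Because each block lives on three nodes, any single failure removes at most one replica per block, which is exactly why node failures can be handled gracefully and automatically.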
We’ve talked briefly about operational applications for a Big Data platform – but much more can be said. Transaction processing can be supported on Big Data, though usually not ACID-compliant transaction processing (HBase, built on top of Hadoop, does provide ACID guarantees, but only at the row level). Not all types of transactions require ACID compliance – e.g., if a user updates his/her Facebook status and the status is lost due to a failure, it’s not the end of the world. Some operational applications might not persist all of the data at all – they might just distribute the data across the nodes to utilize computing capacity and distributed memory.
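This relaxed-consistency style is often implemented with last-write-wins conflict resolution, sketched below in plain Python. The class and scenario are hypothetical illustrations; real stores such as Cassandra resolve conflicts with per-cell timestamps rather than a single in-memory dictionary.

```python
import time

class LastWriteWinsStore:
    """Toy key-value store: each write carries a timestamp, and the store
    keeps whichever value has the newest one. No locks and no multi-key
    transactions -- a late or lost update is simply superseded."""

    def __init__(self):
        self._data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts=None):
        ts = ts if ts is not None else time.time()
        current = self._data.get(key)
        if current is None or ts >= current[0]:
            self._data[key] = (ts, value)

    def read(self, key):
        entry = self._data.get(key)
        return entry[1] if entry else None

store = LastWriteWinsStore()
store.write("status", "at the gym", ts=100)
store.write("status", "home now", ts=200)
store.write("status", "stale replica echo", ts=150)  # older write: ignored
print(store.read("status"))  # -> "home now"
```

Losing the occasional status update is the price paid here; in exchange, every write succeeds without coordination across nodes, which is what makes this model scale.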
I will delve into these components of Perficient’s Big Data Stack in more detail in future articles. I’d be interested to hear your feedback on the Big Data Stack and how this compares with your experience. You can reach me at pete.stiglich@perficient.com or by leaving a comment in the section below.