In a previous blog post, I introduced Perficient’s Big Data Stack to provide a framework for a Big Data architecture and environment. This article is the second in a series, each focusing on one component of the Big Data Stack (click here for the first article, on the application tier). The stack diagram below has two key vertical tiers: the application tier and the infrastructure tier. This article discusses the infrastructure tier.
Figure 1 – Perficient’s Big Data Stack
As mentioned previously, with any data project it is important to identify non-functional requirements (requirements for performance, scalability, availability, backups, etc.). Requirements drive design, and to a large degree should drive infrastructure decisions. You might need more than one Big Data environment to meet the needs of different applications. For example, you might have a mission-critical environment that requires extensive controls, and a separate environment for data scientists who are exploring hypotheses and need more flexibility.
For the infrastructure tier, there is a dizzying array of technologies and architectures to choose from. First, you need to choose a processing platform: Are we going to deploy on a public or private cloud? Are we going to use proprietary appliances or commodity hardware? How many servers do we need in our Big Data MPP (grid or cluster) environment? You also have to choose how you are going to persist the data. You might not persist all of it; in stream processing, for example, the data is processed only in memory and only the information needed long term is persisted to disk.
You also need to decide whether to use a relational database or a NoSQL technology. If you want to go with a relational database, can it scale into the petabyte range? Will it handle unstructured or semi-structured data efficiently? If you are going to use a NoSQL platform, does it have the necessary security and data management controls? Will you have PHI or other sensitive data in the Big Data environment? Security controls for NoSQL databases are generally not as robust as those of relational databases, so you might need an I/O-level encryption technology, such as IBM InfoSphere Guardium Encryption Expert or Gazzang.
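To make the stream-processing case above concrete, here is a minimal Python sketch. It assumes a hypothetical stream of sensor readings; the names `sensor_events`, `summarize`, and `persist` are invented for illustration and do not come from any particular Big Data product. Raw events are consumed in memory and discarded, and only a small per-sensor summary is written to disk.

```python
import json
import os
import tempfile
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, float]  # (sensor_id, reading); a hypothetical event shape


def sensor_events() -> Iterator[Event]:
    """Stand-in for a live stream; a real source would be a queue or socket."""
    yield from [("a", 10.0), ("b", 4.0), ("a", 14.0), ("b", 6.0), ("a", 12.0)]


def summarize(events: Iterable[Event]) -> Dict[str, Dict[str, float]]:
    """Consume events one at a time, keeping only running totals in memory."""
    counts: Dict[str, int] = defaultdict(int)
    totals: Dict[str, float] = defaultdict(float)
    for sensor_id, reading in events:
        counts[sensor_id] += 1
        totals[sensor_id] += reading
        # The raw event is not stored anywhere after this point.
    return {
        sensor_id: {"count": counts[sensor_id],
                    "mean": totals[sensor_id] / counts[sensor_id]}
        for sensor_id in counts
    }


def persist(summary: Dict[str, Dict[str, float]], path: str) -> None:
    """Write only the long-term information (the summary) to disk."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(summary, handle, indent=2, sort_keys=True)


if __name__ == "__main__":
    out_path = os.path.join(tempfile.gettempdir(), "sensor_summary.json")
    persist(summarize(sensor_events()), out_path)
    print("summary written to", out_path)
```

The design point is that storage requirements scale with the size of the summary rather than the volume of the stream, which is one reason persistence choices should follow from the requirements rather than precede them.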
Are you going to use open source or COTS (commercial off-the-shelf) software? Big Data technologies like Hadoop and Cassandra are open source, but you might want to buy a tool that masks much of the underlying complexity of tying open source software together and that provides analytics and data management capabilities to streamline Big Data efforts. Because Hadoop and Cassandra can run on commodity hardware, you can significantly drive down the cost of your Big Data environment, as long as you have the in-house skills to implement open source software.
There are over 100 NoSQL databases to choose from, covering different types of data representation (wide column, key-value, graph, XML, object, document). Figure 2 below shows these types of data representation against the types of processing that might be a good fit for each (not limited to Big Data).
Figure 2 – Data Representation by Processing Type
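As a rough illustration of how these representation types differ, the sketch below models the same hypothetical facts (a customer named Ada who placed two orders) in four of the shapes listed above, using plain Python structures. The data, names, and layouts are assumptions made for illustration only; they are not taken from Figure 2 or from any specific NoSQL product.

```python
import json

# Key-value: one opaque value fetched by a single key.
key_value = {
    "customer:42": json.dumps(
        {"name": "Ada", "orders": {"1001": 25.0, "1002": 40.0}}
    ),
}

# Document: a nested, self-describing record.
document = {
    "_id": 42,
    "name": "Ada",
    "orders": [
        {"order_id": 1001, "total": 25.0},
        {"order_id": 1002, "total": 40.0},
    ],
}

# Wide column: row key -> column family -> arbitrary columns.
wide_column = {
    "42": {
        "profile": {"name": "Ada"},
        "orders": {"1001": 25.0, "1002": 40.0},
    },
}

# Graph: nodes plus labeled edges between them.
graph_nodes = {
    "customer:42": {"name": "Ada"},
    "order:1001": {"total": 25.0},
    "order:1002": {"total": 40.0},
}
graph_edges = [
    ("customer:42", "PLACED", "order:1001"),
    ("customer:42", "PLACED", "order:1002"),
]


def total_key_value(store, key):
    """The whole value must be fetched and decoded before it can be used."""
    return sum(json.loads(store[key])["orders"].values())


def total_document(doc):
    """Nested fields can be addressed directly."""
    return sum(order["total"] for order in doc["orders"])


def total_wide_column(table, row_key):
    """Read one column family of one row."""
    return sum(table[row_key]["orders"].values())


def total_graph(nodes, edges, customer):
    """Follow PLACED edges from the customer to each order node."""
    return sum(
        nodes[dst]["total"]
        for src, label, dst in edges
        if src == customer and label == "PLACED"
    )
```

All four functions answer the same question (the customer's total order value), but each representation makes a different access pattern cheap, which is why the fit between representation and processing type matters.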
These are just a few of the questions you will need to ask as you consider the infrastructure needed for your Big Data environment. You will want to perform a strategy, roadmap, and readiness assessment for your Big Data program before undertaking infrastructure decisions and acquiring technologies.
I will go into more detail on these components of Perficient’s Big Data Stack in future articles. I’d be interested to hear your feedback on the Big Data Stack and how it compares with your experience. You can reach me at email@example.com, on Twitter at @pstiglich, or in the comments section below.