Big Data has generated a lot of interest in the media and in industry, which can give the impression that every data problem is a “Big Data” problem. The interest is nonetheless justified, given the performance and scalability gains possible and the economic feasibility of Big Data platforms built on commodity hardware clusters/grids and open source databases, processing platforms, and technologies (e.g., Hadoop, Cassandra, HBase, and MongoDB, to name a few).
To give you a better understanding of what is involved in Big Data and to provide a framework for your Big Data initiatives, I’d like to present Perficient’s Big Data Stack. I hope it will help you peel back the layers of complexity. Big Data is very powerful but not necessarily easy, and if you use a NoSQL database for your Big Data platform, it represents a significant paradigm shift from the traditional relational database.
The stack diagram below is divided horizontally into two categories: Technology and Roles/Organization. I will say more about the technology component later. A key role in Big Data is that of “Data Scientist.” These are people with statistics, data modeling, data mining, and programming experience. Data Scientists look for gold in massive amounts of data and can present their findings in a comprehensible manner to management and others. Governance, both Data Governance and Application Governance, needs to be involved in Big Data. Other roles are involved as well, such as data and system architects, developers, and system administrators.
Figure 1 – Perficient’s Big Data Stack
The Big Data Stack is also divided vertically between Application and Infrastructure. There is a significant infrastructure component to Big Data platforms, and it is just as important to identify, develop, and sustain applications that are good candidates for a Big Data solution.
Below I will provide a high-level overview of each of the technical stack components:
Category | Component | Description |
Application | Data Sourcing | Most Big Data applications will require data that is sourced from other databases and interfaces, and so this is the first core component in the stack |
Application / Processing Type | Analytics | Advanced analytics is a common application that may indicate the need for a Big Data solution. |
Application / Processing Type | Operations | Big Data can support operational needs as well, e.g., real-time Complex Event Processing for patient monitoring, risk management, etc. |
Application | Distributed Processing | Distributed processing, in which a task is executed across many computers in a cluster/grid, is at the core of Big Data processing. |
Infrastructure | Representation | There are many ways that data can be represented in a Big Data platform, e.g., wide column stores, key value pairs, graph, relational, etc. |
Infrastructure | Persistence | Indicates how data will be persisted, or whether it will be persisted in the Big Data platform at all. When processing massive streams of real-time data, you might not need to persist all the data; you can just grab and process what you need. You can use a NoSQL database, a distributed filesystem, or an MPP RDBMS for Big Data persistence. |
Infrastructure | Platform | You can use open-source or proprietary software & databases, commodity or proprietary hardware. Leveraging a public / private cloud is an option as well. |
Management | Security | Security is of course important, especially in a healthcare setting. If you are going to put protected health information (PHI) in a Big Data platform, you will want to look at low-level encryption capabilities that perform encryption at the IO level (as data is written to or read from disk). Security is not as mature and robust in NoSQL platforms. |
Management | Development and Management | There is a wide array of open source and proprietary development and management technologies that will be part of the Big Data equation, e.g., schedulers, load balancers, etc. |
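To make the Distributed Processing component a little more concrete, here is a minimal Python sketch of the split, map, and reduce pattern that underlies many Big Data processing platforms. It is an illustration only, not how Hadoop or any particular product is implemented: local worker threads stand in for the computers in a cluster/grid, and the function names (`map_partition`, `merge_counts`, `distributed_word_count`) and the sample data are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(lines):
    """Map step: count the words in one partition of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def distributed_word_count(lines, workers=4):
    """Split the input into partitions, map them in parallel, then reduce.

    Assumption: threads on one machine play the part of cluster nodes.
    """
    partitions = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample input, made up for this sketch.
sample = [
    "big data is powerful",
    "big data is not necessarily easy",
    "data scientists look for gold in data",
]
totals = distributed_word_count(sample, workers=2)
print(totals.most_common(3))
```

The point of the pattern is that each partition can be processed independently, so adding more workers (or, on a real platform, more commodity machines) lets the same task scale to larger data sets. Only the small partial results need to be brought together in the final step.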
I will go into more detail on these components of Perficient’s Big Data Stack in future articles. I’d be interested to hear your feedback on the Big Data Stack and how it compares with your experience. You can reach me at pete.stiglich@perficient.com or in the comments section below.