In continuation of my previous post on Big Data Trends, I would like to reference a couple of stream processing solutions that can process huge volumes of data with low latency. These solutions address the shortcomings of the traditional ETL delivery mechanism.
ETL shortcomings
ETL and message queue systems address offline reporting and batch processing pretty well. The trouble kicks in when we try to scale them to real-time needs: for huge volumes of data that must be processed with low latency, ETL and messaging systems are often too slow and time consuming.
Stream processing
Kafka and Storm are two solutions that handle both offline and online data processing well. What sets stream processing apart from traditional data processing solutions is how data is partitioned and distributed across a cluster for faster processing.
These technologies are discussed briefly below.
Kafka originated at LinkedIn, where it was predominantly used for activity streams and operational data processing pipelines with multiple producers and multiple consumers of data.
“Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) are a key ingredient in many of the social feature on the modern web. This data is typically handled by “logging” and ad hoc log aggregation solutions due to the throughput requirements. This kind of ad hoc solution is a viable solution to providing logging data to an offline analysis system like Hadoop, but is very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.”
Referenced from: https://cwiki.apache.org/confluence/display/KAFKA/About+Kafka
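The key idea in the quote above is that one partitioned log can serve both offline (Hadoop-style batch) and online (real-time) consumers. The real Kafka API needs a running broker, so here is a toy in-memory model, just to show the mechanics: messages with the same key land in the same partition, and each consumer reads a partition at its own offset. The class and method names (`MiniLog`, `publish`, `consume`) are illustrative, not Kafka's actual API.

```python
import zlib
from collections import defaultdict

class MiniLog:
    """Toy in-memory model of Kafka-style partitioned publish-subscribe."""

    def __init__(self, num_partitions=3):
        self.num_partitions = num_partitions
        # topic -> list of partitions, each an append-only list of messages
        self.topics = defaultdict(lambda: [[] for _ in range(num_partitions)])

    def publish(self, topic, key, value):
        # Hash the key to pick a partition; same key -> same partition,
        # which preserves per-key ordering (crc32 keeps it deterministic).
        p = zlib.crc32(key.encode()) % self.num_partitions
        self.topics[topic][p].append((key, value))
        return p

    def consume(self, topic, partition, offset=0):
        # Each consumer tracks its own offset, so a slow batch reader and
        # a fast real-time reader can share the same log independently.
        return self.topics[topic][partition][offset:]

log = MiniLog()
p = log.publish("page_views", "user42", "/home")
log.publish("page_views", "user42", "/search")
views = [value for _, value in log.consume("page_views", p)]
```

Because consumption is per-partition, real-time processing can be spread over a cluster of machines, one consumer per partition, which is exactly the parallelism the quote describes.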
Storm is a real-time computation system for processing high-volume, unbounded data streams. Numerous use cases fit Storm: real-time analytics, online machine learning, and ETL.
“Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.”
A Storm cluster works much like a Hadoop cluster, but instead of running "MapReduce jobs" it runs "topologies". A Hadoop job eventually completes; a topology never stops running.
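A topology wires together spouts (stream sources) and bolts (processing steps). Storm's native API is Java and runs on a cluster, but the data flow can be sketched as chained Python generators in a single process; this is a conceptual model only, and the names `sentence_spout`, `split_bolt`, and `count_bolt` are illustrative, not Storm's API.

```python
def sentence_spout():
    # A spout is the source of a stream; a real one would tail a queue
    # forever, but here it replays a fixed list so the example terminates.
    for sentence in ["storm processes streams", "streams never end"]:
        yield sentence

def split_bolt(stream):
    # A bolt consumes one stream and emits another; this one splits
    # each sentence tuple into word tuples (repartitioning the stream).
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # A stateful bolt: accumulates a running word count.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

word_counts = count_bolt(split_bolt(sentence_spout()))
```

In a real topology each bolt would run as many parallel tasks across the cluster, with Storm routing (repartitioning) tuples between stages; the chained generators capture only the logical wiring.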
Conclusion
These technologies are critical because they become the IT backbone, integrating extensively with the whole message processing landscape. They address both offline and online real-time data processing, managing high volumes with ease. Since traditional ETL and messaging solutions don't operate well in high-volume use cases, enterprises might prefer stream solutions over ETL in the future.
In my next blog, I will cover some interesting new ways that open source statistical programming has evolved and how a company is bringing crowd-sourced data mining solutions to the forefront.