Big Data Trends - Part 2 ( Stream Processing ) / Blogs / Perficient

In continuation with my previous post on Big Data Trends, I would like to reference couple of Stream processing solutions that are able to process huge volumes of data with low latency. These solutions address the shortcomings of the traditional ETL delivery mechanism.

ETL shortcomings

ETL and message queue systems address offline reporting and batch processing pretty well. Real time data processing needs are handled well with ETL and Message queue systems. The trouble kicks in when we want to scale ETL solutions. For huge volumes of data processing needs with low latency, ETL and messaging systems are often slow and time consuming.

Stream processing

Kafka and Storm are 2 solutions that handle offline and online data processing pretty well. The difference between traditional data processing solutions and stream processing is how they handle and cluster data for faster processing.

These technologies are discussed briefly below.

Build an AI-First Enterprise

From early pilots to enterprise-wide deployment, our award-winning AI consulting and technical services help you build the right foundation, scale responsibly, and deliver meaningful business outcomes.

Learn More

Kafka

Kafka has its origin in LinkedIn where it was predominantly used for its activity streams and operational data processing pipelines which have multiple producers and multiple consumers of data.

“Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) are a key ingredient in many of the social feature on the modern web. This data is typically handled by “logging” and ad hoc log aggregation solutions due to the throughput requirements. This kind of ad hoc solution is a viable solution to providing logging data to an offline analysis system like Hadoop, but is very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.”

Referenced from : (https://cwiki.apache.org/confluence/display/KAFKA/About+Kafka)

Storm

Storm is a real time computational system processing high volume unbound data streams. There are numerous use cases that can use Storm: realtime analytics online machine learning, ETL.

“Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.”

A storm cluster works very similar to Hadoop but instead of “MapReduce jobs” we would be using “topologies”. Hadoop jobs completes, where topologies never ceases running.

Conclusion

The reason these technologies are critical is because they become the IT backbone and integrate extensively with of the whole message processing systems. These address both offline and online real time data processing managing high volumes with ease. However since traditional ETL solutions and messaging solutions don’t operate well in high volume use cases, enterprises might prefer stream solutions over ETL in the future.

In my next blog, I would be covering some interesting new ways that open source statistical programming has evolved and how a company is bringing crowd sourced data mining solutions to the forefront.

Big Data Trends – Part 2 ( Stream Processing )

by Arunkalyan Nagarajan on November 4th, 2012 | ~ minute read

Build an AI-First Enterprise

Tags

Leave a Reply

Arunkalyan Nagarajan

Categories

Follow Us