In the cloud computing wars, AWS is currently the undisputed leader, but Microsoft Azure seems to be chipping away at AWS’ lead with every passing quarter. While Google Cloud offers impressive tools as well, it is struggling to keep pace with Azure. This could be traced to Microsoft’s longer tenure in the enterprise market. To better understand why enterprises prefer Azure over AWS & GCP, I decided to take a deeper dive into the various technologies available in Azure.
It is clear from the merger of Cloudera & Hortonworks that the days of Hadoop & its related technologies such as MapReduce, Hive etc. are numbered. Ingesting HUGE amounts of data on the Cloud can now be accomplished without writing a lot of code by using Azure Data Factory Version 2. Storing HUGE amounts of data that can be queried using SQL in a matter of seconds no longer requires setting up a cluster of machines that run Cassandra, HBase or some such tool. Azure SQL Data Warehouse (ASDW) makes this quite easy. (Although, I must say the speed of ASDW pales in comparison to Google’s BigQuery, but I will save that topic for some other day!) It appears that these two Azure tools will enable enterprises to port most of their ETL use cases onto Azure.
My main focus, though, was on running Machine Learning algorithms available in Microsoft Azure Machine Learning Studio in real-time. For my experiment I decided to run ‘Sentiment Analysis on the Twitter Stream’. I was impressed with the speed at which I was able to set this up on Azure. There was very little coding I had to do, as the ‘Sentiment Analysis’ API was already available in the Studio. I was able to do most of the setup using the GUI.
If you would like to run this experiment on your own, you can follow the steps given in this README.md: https://github.com/ajaychitre/azure-tools/blob/master/README.md
The source code is available on GitHub at: https://github.com/ajaychitre/azure-tools
Here’s a summary of the steps:
- Create an Event Hub & send tweets to it.
- Create a Web Service. The ML Studio makes this easy. Of course, if you want to write your own algorithm you need to allocate extra time for this.
- Create a Stream Analytics Job. You can write the output to many different stores. I chose ‘Azure Cosmos DB’. (Surprisingly, you cannot write to ASDW at this time!) The ‘Input’ query for this Stream Analytics job looks like this:
Note: sentiment(text) calls the Web Service. This could become a bottleneck if data arrives at a very high rate. This needs to be stress tested!
- Once the Stream Analytics Job is started, you can start querying Cosmos DB using queries such as this:
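Returning to the note on sentiment(text): before running a real stress test, a back-of-the-envelope estimate can show whether the Web Service is likely to keep up. The sketch below is not part of the original experiment. It applies Little's law under the assumption of a steady arrival rate and a fixed call latency. The function names, the `batch_size` parameter and all the numbers are hypothetical and only there for illustration.

```python
import math


def max_events_per_second(latency_s: float, concurrent_calls: int,
                          batch_size: int = 1) -> float:
    """Rough upper bound on events/second the scoring service can absorb.

    Assumes every call takes `latency_s` seconds, carries `batch_size`
    events, and at most `concurrent_calls` calls are in flight at once.
    """
    if latency_s <= 0 or concurrent_calls < 1 or batch_size < 1:
        raise ValueError("latency must be > 0; calls and batch size >= 1")
    return concurrent_calls * batch_size / latency_s


def required_concurrency(events_per_second: float, latency_s: float,
                         batch_size: int = 1) -> int:
    """Little's law: calls in flight = call arrival rate x call latency."""
    if events_per_second < 0 or latency_s <= 0 or batch_size < 1:
        raise ValueError("rate must be >= 0; latency > 0; batch size >= 1")
    calls_per_second = events_per_second / batch_size
    return math.ceil(calls_per_second * latency_s)


if __name__ == "__main__":
    # Hypothetical figures: 250 ms per scoring call, 400 tweets per second.
    latency = 0.25
    tweet_rate = 400

    print("capacity with 20 parallel calls:",
          max_events_per_second(latency, concurrent_calls=20))
    print("parallel calls needed, 1 tweet per call:",
          required_concurrency(tweet_rate, latency))
    print("parallel calls needed, 10 tweets per call:",
          required_concurrency(tweet_rate, latency, batch_size=10))
```

If the required concurrency is far above what the Web Service allows, that is a sign the scoring call will throttle the stream. An estimate like this only narrows down what to measure; it does not replace the stress test.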
We live in the world of server-less computing. In this experiment, I never had to set up either an elastic or a non-elastic cluster. Whatever I needed I was able to request via the GUI & acquire within minutes. There was no need to set up a cluster of Kafka brokers because Azure Event Hub took care of that need. On the receiving end, I didn’t need to set up a Spark cluster & write a Spark Streaming job. The Azure Stream Analytics job handled most of the intricacies related to that. The server-less nature of computing allows Data Engineers to spend more time solving critical business problems and less time fixing environment-related issues!
All in all, I am quite impressed with the tools available in Azure. Of course, there’s a lot more one can do. One can send the output of the Stream Analytics job to Power BI, either directly or indirectly, to create cool graphs in real-time. The possibilities are endless!