Edit: Part 2 (setup) : Part 3 (Mahout)
The internet is becoming increasingly personalized. It has transitioned from indexing massive wells of information to delivering personalized information and recommendations based on complex searches. Evidence of this is seen in Google’s Knowledge Graph, Amazon, the Bing engine, Facebook friend suggestions, and Twitter recommending people you may be interested in following. Recommendations are everywhere on the web, and with the introduction of HDInsight on Windows Azure the personalized web will grow even larger. HDInsight is an implementation of Apache Hadoop running natively within Windows Server. Hadoop is a very powerful distributed computing solution that can process massive quantities of data.
Baking “non-Microsoft” technologies into Microsoft-based services and products is a newer development, and the benefits to the IT professional are substantial. Let us take HDInsight as an example. For those not familiar with Linux, installing Hadoop on a cluster of nodes can be frustrating and time-consuming (to say the least). There are many guides online, and each guide pertains to its own flavor of Linux (Gentoo vs. Red Hat vs. Ubuntu vs. CentOS, etc.). The process has gotten better over the years but is still quite cumbersome. To create a Hadoop cluster within Windows Azure, simply create an HDInsight cluster from the dashboard. In a few minutes you have a fully functional Hadoop cluster ready for processing.
You may be asking yourself, “Hadoop is a distributed computing system; what does it have to do with recommendations?” Mahout is the answer. Mahout is an open source machine learning library that is also maintained by the Apache Software Foundation. It contains many different types of algorithms and features, but one of its most prominent is its recommendation engine. The installation process is trivial, so you will have Mahout up and running in an HDInsight cluster in no time. To install Mahout on your cluster, download the latest release in zip file format from the Mahout website. Copy the zip file to one of your cluster nodes and extract the contents to C:\apps\dist. That’s it! Not only have you just installed Mahout, but you have also deployed it to your Hadoop cluster.
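To give a feel for what a recommendation engine computes, here is a minimal sketch of user-based collaborative filtering in plain Python. It is not Mahout's implementation or API. The ratings data, the function names, and the choice of cosine similarity are all assumptions made for illustration. The idea is to score items a user has not rated by taking a similarity-weighted average of the ratings given by other users.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}. Not from any real dataset.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 3.0, "C": 5.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by a similarity-weighted mean rating."""
    weighted_sums = {}
    similarity_totals = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only score items the user has not seen yet
            weighted_sums[item] = weighted_sums.get(item, 0.0) + sim * rating
            similarity_totals[item] = similarity_totals.get(item, 0.0) + sim
    predicted = {
        item: weighted_sums[item] / similarity_totals[item]
        for item in weighted_sums
    }
    return sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    print(recommend("alice", ratings))
```

On a laptop this toy version is fine for a handful of users. The reason to run a recommender on Hadoop is that the same pairwise comparisons become very expensive once the ratings table holds millions of users and items.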
Next, I will walk through the installation process and use Mahout to process data. Update: Part 2 is here.