What are the key differentiators when choosing a Hadoop distribution for Big Data analysis on AWS? We have two choices: Amazon EMR or a third-party Hadoop distribution (e.g., core Apache Hadoop, Cloudera, MapR).
Yes, cost is important. But aside from cost, other things to look for include ease of operation, control, management, performance and features.
Let’s take an example: configuring a 4-node Hadoop cluster in AWS and comparing the cost.
EMR costs $0.070/hour per machine (m3.xlarge), which comes to $2,452.80 per year for a 4-node cluster (4 EC2 instances: 1 master + 3 core nodes). The same size cluster on plain Amazon EC2 costs $0.266/hour per machine, which comes to $9,320.64 per year. On these figures, EMR is much cheaper than a plain EC2 cluster, and we haven’t yet added the cost of owning a commercial Hadoop distribution (like Cloudera).
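The yearly totals above follow from multiplying the hourly rate by the node count and by the 8,760 hours in a (non-leap) year. A minimal sketch of that arithmetic, using only the hourly rates quoted in this article; the function name `annual_cost` is purely illustrative:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_cost(hourly_rate_per_node, node_count):
    """Yearly cost of running node_count machines around the clock."""
    return hourly_rate_per_node * node_count * HOURS_PER_YEAR


NODES = 4  # 1 master + 3 core nodes

# Hourly rates are the m3.xlarge figures quoted in the article,
# not current AWS list prices.
emr_yearly = annual_cost(0.070, NODES)
ec2_yearly = annual_cost(0.266, NODES)

print(f"EMR: ${emr_yearly:,.2f}")  # EMR: $2,452.80
print(f"EC2: ${ec2_yearly:,.2f}")  # EC2: $9,320.64
```

Changing `NODES` or the hourly rates lets you rerun the comparison for a different cluster size or instance type.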
Also, Amazon EMR acts as SaaS (Hadoop managed by Amazon) and comes in two flavours: Amazon's own Hadoop distribution or the MapR Hadoop distribution. Hadoop on EC2 instances, by contrast, needs to be managed and maintained by the customer.
Straight math: Amazon EMR is a clear winner here. But certain things can be done to control the cost of “Hadoop” EC2 instances.
2. Design considerations
With Amazon EMR, the storage option is limited to S3. With EC2 instances, the storage options can be expanded to true HDFS. “Instance store” volumes can be used to build EC2 Hadoop clusters because HDFS always keeps redundant copies of the data. Hadoop performance is directly tied to the number of disk spindles, so it can be increased by adding disks. HDFS is cost efficient for frequent interactive workloads, whereas S3 charges customers based on the number of requests. S3 is therefore not cost efficient for frequent interactive workloads or near-real-time big data analysis.
Hadoop cannot work directly on S3 storage because S3 stores data as blob objects. First, “Hadoop” copies the data into a temporary space using a multipart upload and an MD5 hash algorithm. Once the job is over, the results are uploaded back to S3 using a multipart upload.
Alternatively, the “instance store” disks are physically attached to the host computer. Remember that the data in an instance store persists only during the lifetime of its associated instance. This is not an issue, because HDFS always keeps redundant copies of the data. The big advantage of local disk is that I/O can be random and does not go over a network.
Use HVM instances: HVM uses hardware virtualization extensions, which integrate well with the host system. HVM instances are capable of using a low-latency 10 Gbps network.
The EC2 instances should preferably be created in the same “placement group.” A placement group guarantees that the EC2 instances are in the same availability zone and hosted within a low-latency 10 Gbps network.
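As a rough illustration of the placement-group advice, here is a sketch of how the cluster nodes might be launched with boto3. The helper name `launch_in_placement_group`, the group name, the AMI ID and the instance type are all assumptions made for this example; `create_placement_group` and `run_instances` are the boto3 EC2 client calls it relies on. Check that your chosen instance type supports cluster placement groups before using it.

```python
def launch_in_placement_group(ec2, group_name, image_id, instance_type, node_count):
    """Create a cluster placement group and launch node_count instances in it.

    `ec2` is expected to behave like a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    # A "cluster" strategy packs the instances close together in one
    # availability zone for low-latency networking.
    ec2.create_placement_group(GroupName=group_name, Strategy="cluster")

    # Launch all nodes in a single request so they land in the same group.
    return ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=node_count,
        MaxCount=node_count,
        Placement={"GroupName": group_name},
    )


# Hypothetical usage (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   launch_in_placement_group(ec2, "hadoop-pg", "<your-hvm-ami-id>", "c3.8xlarge", 4)
```

Passing the client in as a parameter keeps the helper easy to try against a stub before pointing it at a real account.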
Commercial Hadoop distributors like Cloudera provide simple installation, configuration and add-on services, e.g., HBase, Flume, Impala and ZooKeeper. Cloudera's distribution also comes with Cloudera Manager, which is one of the key differentiators in the market. It manages clusters, software patches across the whole cluster, etc.
Conclusion: AWS EMR and Hadoop on EC2 are both promising options in the market. Hadoop on EC2 instances gives a little more flexibility for tuning and control, according to need. Cloudera comes with “Cloudera Manager”, which makes operations easy and transparent, but it comes at a cost. EMR is simple and managed by Amazon.