What are the key differentiators when choosing a Hadoop distribution for Big Data analysis on AWS? We have two choices: Amazon EMR or a third-party Hadoop distribution (e.g., core Apache Hadoop, Cloudera, MapR).
Yes, cost is important. But aside from cost, other things to look for include ease of operation, control, management, performance, and features.
Let’s take an example: configure a 4-node Hadoop cluster in AWS and compare the costs.
EMR costs $0.070/hour per machine (m3.xlarge), which comes to $2,452.80 per year for a 4-node cluster (4 EC2 instances: 1 master + 3 core nodes). The same-size Amazon EC2 instance costs $0.266/hour, which comes to $9,320.64 per year for four instances. On these figures, EMR is much cheaper than a plain EC2 cluster, and we haven’t yet added the cost of owning a commercial Hadoop distribution (like Cloudera).
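The yearly figures follow directly from the hourly rates. As a minimal sketch, assuming every node runs around the clock for a 365-day year (the function name `yearly_cost` is made up for illustration):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def yearly_cost(hourly_rate: float, nodes: int) -> float:
    """Yearly cost in USD for `nodes` machines that never shut down."""
    return round(hourly_rate * nodes * HOURS_PER_YEAR, 2)


# Hourly rates quoted in the text for m3.xlarge.
emr = yearly_cost(0.070, 4)
ec2 = yearly_cost(0.266, 4)

print(f"EMR: ${emr:,.2f} per year")
print(f"EC2: ${ec2:,.2f} per year")
```

Changing the node count or the hourly rate is enough to redo the comparison for another instance type.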
Also, Amazon EMR acts as a SaaS (Hadoop managed by Amazon), and it comes in two flavours: Amazon's own Hadoop distribution or the MapR Hadoop distribution. Hadoop on EC2 instances, by contrast, has to be managed and maintained by the customer.
Straight math: Amazon EMR is a clear winner here. But certain things can be done to control the cost of Hadoop on EC2 instances:
- Use Reserved EC2 instances: https://aws.amazon.com/ec2/purchasing-options/reserved-instances/
- Mix in Spot EC2 instances: https://aws.amazon.com/ec2/spot/
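To get a feel for how much a mix of purchasing options could save, here is a rough sketch. The discount percentages are hypothetical placeholders, not AWS prices; real Reserved and Spot rates vary by region, term, and time, so check the pages linked above. Only the $0.266/hour on-demand rate comes from the comparison earlier.

```python
HOURS_PER_YEAR = 24 * 365

ON_DEMAND_RATE = 0.266  # USD/hour per EC2 instance, as quoted earlier

# Hypothetical discounts relative to on-demand: placeholders, NOT real AWS prices.
DISCOUNT = {"on_demand": 0.0, "reserved": 0.40, "spot": 0.70}


def blended_yearly_cost(node_counts: dict) -> float:
    """Yearly cost in USD for a cluster that mixes purchasing options.

    `node_counts` maps a purchasing option to the number of nodes bought that way.
    """
    total_hourly = sum(
        count * ON_DEMAND_RATE * (1.0 - DISCOUNT[kind])
        for kind, count in node_counts.items()
    )
    return total_hourly * HOURS_PER_YEAR


all_on_demand = blended_yearly_cost({"on_demand": 4})
mixed = blended_yearly_cost({"reserved": 2, "spot": 2})

print(f"All on-demand: ${all_on_demand:,.2f} per year")
print(f"2 reserved + 2 spot: ${mixed:,.2f} per year")
```

Keep in mind that Spot instances can be reclaimed by AWS, so they suit nodes whose loss the cluster can tolerate.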
2. Design considerations
With Amazon EMR, the storage option is limited to S3. With EC2 instances, storage can be expanded to true HDFS. “Instance store” volumes can be used to build the EC2 Hadoop cluster because HDFS always keeps redundant copies of the data. Hadoop performance is directly tied to the number of disk spindles, so it can be increased by adding more disks. HDFS is cost-efficient for frequent interactive workloads, whereas S3 charges the customer based on the number of requests. That makes S3 a poor fit, cost-wise, for frequent interactive workloads or near-real-time Big Data analysis.
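Because S3 bills per request, the request charge grows linearly with how chatty the workload is. A back-of-the-envelope sketch follows; the per-1,000-request price is a made-up placeholder, so substitute the current figure from the S3 pricing page:

```python
# Placeholder price in USD per 1,000 requests: NOT a real AWS price.
PLACEHOLDER_PRICE_PER_1000 = 0.001


def s3_request_cost(requests_per_second: float, hours: float,
                    price_per_1000: float = PLACEHOLDER_PRICE_PER_1000) -> float:
    """Request charges in USD for a steady request rate sustained for `hours`."""
    total_requests = requests_per_second * 3600 * hours
    return total_requests / 1000 * price_per_1000


# An interactive workload issuing 100 requests/second, around the clock for a year.
yearly = s3_request_cost(requests_per_second=100, hours=24 * 365)
print(f"Request charges: ${yearly:,.2f} per year (placeholder price)")
```

This counts request charges only. Storage and data transfer are billed separately, and reads from local HDFS disks carry no per-request fee at all.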
Hadoop cannot work directly with S3 storage because S3 stores data as blob objects. First, Hadoop copies the data into temporary space using multipart transfer and the MD5 hash algorithm. Once the job is over, the results are uploaded back to S3 using multipart upload.
On the other side, “instance store” disks are physically attached to the host computer. Remember that data in an instance store persists only during the lifetime of its associated instance. But this is not an issue, because HDFS always keeps redundant copies of the data. The big advantage of local disk is that, in practice, I/O can be random and does not go over a network.
Use HVM instances: HVM uses hardware virtualization extensions, which integrate well with the host system. HVM instances are capable of using the low-latency 10 Gbps network.
EC2 instances should preferably be created in the same “placement group”. A placement group guarantees that its EC2 instances are in the same Availability Zone and hosted on a low-latency 10 Gbps network.
A commercial Hadoop distributor like Cloudera provides simple installation and configuration plus add-on services, e.g., HBase, Flume, Impala, ZooKeeper. It also comes with Cloudera Manager, which is one of the key differentiators in the market: it manages clusters, software patches across the whole cluster, and so on.
Conclusion: AWS EMR and Hadoop on EC2 are both promising options. Hadoop on EC2 instances gives a little more flexibility in tuning and control, according to need. Cloudera comes with Cloudera Manager, which makes operations easy and transparent, but it comes at a cost. EMR is simple and managed by Amazon.