2 Choices for Big Data Analysis on AWS: Amazon EMR or Hadoop on EC2

What are the key differentiators when choosing a Hadoop distribution for Big Data analysis on AWS? There are two choices: Amazon EMR or a third-party Hadoop distribution (e.g., core Apache Hadoop, Cloudera, MapR).

Cost is important, of course. But beyond cost, other things to look at include ease of operation, control, manageability, performance and features.

1. Cost

Let’s take the example of configuring a 4-node Hadoop cluster in AWS and compare the costs.

EMR costs $0.070/hour per machine (m3.xlarge), which comes to $2,452.80 per year for a 4-node cluster (4 EC2 instances: 1 master + 3 core nodes). The same-size Amazon EC2 instance costs $0.266/hour, which comes to $9,320.64 per year for 4 nodes. On these listed rates, the EMR charge is far lower than the cost of a plain EC2 cluster. And we haven’t yet added the cost of a commercial Hadoop distribution (like Cloudera).
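
The arithmetic behind these figures can be reproduced with a short sketch. The function name and the 365-day, always-on year are assumptions for illustration; the hourly rates are the ones quoted above and are taken as given, so check current AWS pricing (including how the EMR charge combines with the EC2 instance charge) before relying on them.

```python
HOURS_PER_YEAR = 24 * 365  # assumes a non-leap year with the cluster running 24/7


def annual_cost(hourly_rate_per_node: float, nodes: int) -> float:
    """Yearly cost in USD for a cluster billed per node-hour."""
    return round(hourly_rate_per_node * nodes * HOURS_PER_YEAR, 2)


emr_cost = annual_cost(0.070, 4)  # EMR rate quoted in the article
ec2_cost = annual_cost(0.266, 4)  # EC2 rate quoted in the article
print(f"EMR: ${emr_cost:,.2f}  EC2: ${ec2_cost:,.2f}")
```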

Also, Amazon EMR works like a managed service (Hadoop managed by Amazon) and comes in two flavours: the Amazon Hadoop distribution or the MapR Hadoop distribution. Hadoop on EC2 instances, by contrast, has to be managed and maintained by the customer.

Straight math: Amazon EMR is the clear winner here. But a few things can be done to control the cost of Hadoop EC2 instances:

Use Reserved EC2 Instances
Mix in Spot EC2 Instances
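
As a rough sketch of how such a mix changes the bill, the following computes a blended annual cost. The discount fractions and the node split are hypothetical placeholders, not AWS prices; only the $0.266 on-demand rate comes from the example above.

```python
HOURS_PER_YEAR = 24 * 365  # cluster assumed to run all year


def blended_annual_cost(on_demand_rate, node_counts, discounts):
    """Annual USD cost for a cluster that mixes purchase options.

    node_counts: nodes per purchase option, e.g. {"reserved": 2, "spot": 2}
    discounts:   fraction saved versus on-demand for each option (hypothetical)
    """
    total = 0.0
    for option, count in node_counts.items():
        rate = on_demand_rate * (1.0 - discounts[option])
        total += rate * count * HOURS_PER_YEAR
    return round(total, 2)


# Hypothetical discounts, for illustration only.
ASSUMED_DISCOUNTS = {"on_demand": 0.0, "reserved": 0.30, "spot": 0.60}

all_on_demand = blended_annual_cost(0.266, {"on_demand": 4}, ASSUMED_DISCOUNTS)
mixed = blended_annual_cost(0.266, {"reserved": 2, "spot": 2}, ASSUMED_DISCOUNTS)
print(all_on_demand, mixed)
```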

2. Design considerations

With Amazon EMR, the storage option is limited to S3. With EC2 instances, the storage options extend to true HDFS. Instance store volumes can be used to build EC2 Hadoop clusters because HDFS always keeps redundant copies of the data. Hadoop performance is closely tied to the number of disk spindles, so it can be improved by adding disks. HDFS is also the more cost-efficient choice for frequent, interactive workloads: S3 charges customers based on the number of requests, so it is not cost efficient for frequent interactive workloads or near-real-time big data analysis.

Hadoop cannot work directly with S3 storage because S3 stores data as blob objects. Hadoop first copies the data into a temporary space using a multipart upload and an MD5 hash algorithm. Once the job is over, the results are uploaded back to S3 using a multipart upload.
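
To make the multipart-plus-MD5 idea concrete, here is a minimal sketch that splits a payload into fixed-size parts and computes an MD5 digest for each one, so that every part can be verified independently. It uses only Python's standard library and does not call S3; the function name, the sample payload and the tiny part size are assumptions for illustration (real multipart parts are far larger).

```python
import hashlib


def split_into_parts(data: bytes, part_size: int):
    """Yield (part_number, part_bytes, md5_hex) for each part of the payload."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    for offset in range(0, len(data), part_size):
        part = data[offset:offset + part_size]
        part_number = offset // part_size + 1  # part numbers start at 1
        yield part_number, part, hashlib.md5(part).hexdigest()


payload = b"hadoop job output " * 10  # 180 bytes of sample data
parts = list(split_into_parts(payload, part_size=64))

# Joining the parts in order must reproduce the original payload.
reassembled = b"".join(part for _, part, _ in parts)
print(len(parts), "parts, intact:", reassembled == payload)
```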

Alternatively, instance store disks are physically attached to the host computer. Remember that data in an instance store persists only for the lifetime of its associated instance. This is not an issue, though, because HDFS always keeps redundant copies of the data. The big advantage of local disk is that it handles random I/O well and is not accessed over a network.
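
The redundancy argument can be illustrated with a toy model. The sketch below is not HDFS's actual placement policy: it simply spreads each block's replicas across distinct nodes round-robin and then checks that every block survives the loss of one instance. The replication factor of 3 matches the HDFS default, while the node names and function names are made up for the example.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Map block id -> set of nodes holding a replica (round-robin toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = {
            nodes[(block + i) % len(nodes)] for i in range(replication)
        }
    return placement


def surviving_blocks(placement, failed_node):
    """Return the blocks that still have at least one replica after a failure."""
    return [b for b, holders in placement.items() if holders - {failed_node}]


nodes = ["core-1", "core-2", "core-3", "core-4"]  # hypothetical instance names
placement = place_blocks(num_blocks=8, nodes=nodes)
survivors = surviving_blocks(placement, failed_node="core-2")
print(f"{len(survivors)} of {len(placement)} blocks still readable")
```

In a real cluster the NameNode would also re-replicate the lost copies onto the remaining nodes, which this toy model leaves out.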

Use HVM instances: HVM (hardware virtual machine) virtualization uses hardware extensions that integrate well with the host system. HVM instances are capable of using a low-latency 10 Gbps network.

The EC2 instances should preferably be created in the same placement group. A placement group guarantees that the EC2 instances are in the same Availability Zone and hosted within a low-latency 10 Gbps network.

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html
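
A minimal sketch of launching the cluster nodes into one placement group might look like the following. It assumes the boto3 SDK is installed and AWS credentials are configured; the group name, the AMI ID placeholder and the instance type are hypothetical values chosen for illustration, and the AMI is assumed to be an HVM image.

```python
def build_launch_request(group_name, ami_id, instance_type, count):
    """Build keyword arguments for the EC2 RunInstances call."""
    return {
        "ImageId": ami_id,  # assumed to be an HVM AMI
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
        "Placement": {"GroupName": group_name},
    }


def launch_cluster(group_name, ami_id, instance_type, count):
    """Create a cluster placement group and launch instances into it."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2")
    ec2.create_placement_group(GroupName=group_name, Strategy="cluster")
    return ec2.run_instances(
        **build_launch_request(group_name, ami_id, instance_type, count)
    )


# Placeholder values; launch_cluster() is not called here because it needs
# real credentials and a real AMI ID.
request = build_launch_request("hadoop-pg", "ami-EXAMPLE", "c4.8xlarge", 4)
print(request["Placement"])
```

Keeping the request builder separate from the API call makes the parameters easy to inspect before anything is launched.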

Commercial Hadoop distributors like Cloudera provide simple installation, configuration and add-on services, e.g., HBase, Flume, Impala and ZooKeeper. Cloudera also comes with Cloudera Manager, one of the key differentiators in the market: it manages clusters and rolls out software patches across the whole cluster.

Conclusion: Amazon EMR and Hadoop on EC2 are both promising options. Hadoop on EC2 instances gives a little more flexibility to tune and control the cluster as needed. Cloudera comes with Cloudera Manager, which makes operations easy and transparent, but it comes at a cost. EMR is simple and managed by Amazon.

by Milan Das on May 19th, 2016
