What is EMR:
- EMR is an expandable, low-configuration service that provides an alternative to running on-premises cluster computing.
- Amazon EMR is based on Apache Hadoop, a Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
- With MapReduce, a core component of the Hadoop software framework, developers can write programs that process massive amounts of unstructured data across a distributed cluster of processors or standalone computers.
- The Elastic in EMR’s name refers to its dynamic resizing ability, which enables administrators to increase or reduce resources, depending on their current needs.
Uses of EMR:
- Amazon EMR is a web service that makes it easy to quickly and effectively process vast amounts of data using Hadoop.
- Amazon EMR distributes the data and processing across a resizable cluster of Amazon EC2 instances.
- With Amazon EMR, we can launch a persistent cluster that stays up indefinitely or a temporary cluster that terminates after the analysis is completed.
- Amazon EMR supports a variety of Amazon EC2 instance types.
- When launching an Amazon EMR cluster (also called a “job flow”), we can choose how many and what type of Amazon EC2 instances to provision.
- The Amazon EMR price is in addition to the Amazon EC2 price.
How MapReduce Organizes Work
Hadoop divides the job into tasks. There are two types of tasks:
- 1. Map tasks (Splits & Mapping)
- 2. Reduce tasks (Shuffling, Reducing)
- The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
- 1. Job Tracker: acts like a master, responsible for the complete execution of a submitted job
- 2. Multiple Task Trackers: act like slaves, each performing part of the job
- For every job submitted for execution, there is one Job Tracker, which resides on the NameNode, and multiple Task Trackers, which reside on the DataNodes.
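The split, map, shuffle, and reduce stages listed above can be sketched in plain Python. This is a single-process illustration of the idea, not Hadoop code: the function names and the sample input are assumptions made for the example, and on a real cluster the map and reduce tasks would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(split):
    """Map task: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce task: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits):
    # Each element of `splits` stands in for one input split.
    mapped = []
    for split in splits:
        mapped.extend(map_phase(split))
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
    print(word_count(splits))
```

Each string in `splits` plays the role of one input split handled by a separate map task. The shuffle step is what guarantees that every count for a given word reaches the same reducer.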
Benefits of Using EMR
Easy to use:
- Clusters launch in minutes, so we do not have to handle infrastructure setup, node provisioning, cluster tuning, or Hadoop configuration ourselves.
- All these tasks are taken care of by EMR so that we can concentrate on analysis. It also allows teams and individuals to interactively explore, visualize, and process the data.
Low cost:
- The pricing of EMR is simple as well as predictable.
- We can launch a 10-node EMR cluster running applications like Apache Hive and Apache Spark for as little as $0.15 per hour.
- If we use services like Amazon S3, DynamoDB, or Amazon Kinesis along with our EMR cluster, they are charged separately from the Amazon EMR usage.
Elasticity
- EMR allows the provisioning of not one but thousands of compute instances for processing data at any scale.
- Auto Scaling can manage the size of clusters based on utilization.
- EMR decouples persistent storage from compute, which gives us the ability to scale each of them independently.
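As a rough illustration of sizing a cluster by utilization, the sketch below picks a new node count from a utilization ratio. It is a toy rule, not the algorithm EMR uses: the function name, the 0.75 and 0.25 thresholds, and the size limits are all hypothetical.

```python
def desired_node_count(current_nodes, utilization, min_nodes=2, max_nodes=20):
    """Return a new node count given a utilization ratio between 0.0 and 1.0.

    The 0.75 and 0.25 thresholds are hypothetical values chosen for illustration.
    """
    if utilization > 0.75:
        target = current_nodes + 1  # scale out under heavy load
    elif utilization < 0.25:
        target = current_nodes - 1  # scale in when mostly idle
    else:
        target = current_nodes      # leave the cluster size unchanged
    # Clamp so the cluster never leaves the configured size limits.
    return max(min_nodes, min(max_nodes, target))
```

The clamp at the end mirrors the idea of minimum and maximum capacity limits, so a scaling rule cannot shrink the cluster to nothing or grow it without bound.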
Reliability
- EMR monitors the cluster constantly. It retries failed tasks and automatically replaces poorly performing instances.
- With multiple master nodes, clusters are highly available and fail over automatically if a node fails.
- With Amazon EMR, we have a configuration option for controlling whether the cluster is terminated manually or automatically. If we choose automatic termination, the cluster is terminated once its steps are completed. This is known as a transient cluster.
- If we choose the manual option, the cluster continues to run even after processing is completed, and we have to terminate it manually when we no longer need it. The other option is creating a cluster, interacting directly with the installed applications, and then manually terminating the cluster. These are known as long-running clusters.
- There is an option of configuring the termination protection to prevent the clusters’ instances from being terminated due to issues and errors during processing. This allows the recovery of instances’ data before they are terminated.
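The difference between transient and long-running clusters can be sketched as two request dictionaries. The field names follow the `Instances` section of the boto3 EMR `run_job_flow` request, but this is a partial request that is only built locally and never sent to AWS. The helper name, cluster names, release label, and instance types are example values chosen for illustration.

```python
def cluster_request(name, keep_alive, termination_protected=False):
    """Build a partial request dict shaped like the input to run_job_flow.

    Nothing is sent to AWS; the dict is only constructed and returned.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",            # example value
        "Instances": {
            "MasterInstanceType": "m5.xlarge",   # example value
            "SlaveInstanceType": "m5.xlarge",    # example value
            "InstanceCount": 3,
            # False: terminate after the last step (transient cluster).
            # True: keep running until terminated manually (long-running).
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
            "TerminationProtected": termination_protected,
        },
    }


transient = cluster_request("nightly-report", keep_alive=False)
long_running = cluster_request(
    "adhoc-analysis", keep_alive=True, termination_protected=True
)
```

A complete request would also need the service role and instance profile, among other settings, before it could be submitted.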
Security
- EMR automatically configures the EC2 firewall settings that control network access to the instances, and it launches clusters in an Amazon VPC.
- For encryption, we can use either the AWS Key Management Service or our own customer-managed keys. EMR also makes it easy to enable encryption at rest and in transit.
IAM:
- The permissions defined in IAM policies determine the actions that users or members of a group can perform and the resources they can access.
Security Groups:
- Security groups are used by Amazon EMR for controlling outbound and inbound traffic to the EC2 instances.
- There is an option for configuring additional security groups and assigning them to the master as well as task/core instances for advanced rules.
Encryption:
- Amazon EMR supports Amazon S3 server-side and client-side encryption with EMRFS.
- We can use the AWS Key Management Service to manage the master key for the client-side encryption.
Amazon VPC:
- A VPC is an isolated virtual network within AWS that gives us control over network access and advanced aspects of network configuration.
AWS CloudTrail:
- Amazon EMR integrates with AWS CloudTrail, which logs the requests made to EMR. This information can be used to track who is accessing the cluster and when, including the IP address from which a request was made.
Amazon EC2 Key Pairs:
- A secure connection is needed between the master node and your remote computer to monitor and interact with the cluster. You can use the Secure Shell (SSH) network protocol for the connection, authenticated with an Amazon EC2 key pair, or use Kerberos for authentication.
Flexibility
- Flexibility here means easy installation of additional applications, root access to every instance, and the ability to customize every cluster with bootstrap actions.
- You can also scale your clusters up or down according to your computing needs. By resizing a cluster, you can add instances for peak workloads and remove them to control costs when the peak subsides.
- Amazon EMR also allows running multiple instance groups, so that On-Demand instances in one group provide guaranteed processing power while Spot instances in another group add capacity. This helps jobs complete faster and at a lower price.
- Amazon EMR offers the flexibility of using different file systems for your input, intermediate, and output data.
- Hadoop Distributed File System (HDFS), which runs on the master and core nodes of the cluster, for processing data that does not need to persist after the lifecycle of the cluster.
- EMR File System (EMRFS), which uses Amazon S3 as a data layer for applications running on the cluster, to separate storage from compute and persist data after the lifecycle of the cluster.
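The mix of On-Demand and Spot instance groups mentioned above can be sketched as a list shaped like the `InstanceGroups` field of the boto3 EMR `run_job_flow` request. The dictionaries are only built locally. The helper names, group names, and the instance type are assumptions made for the example.

```python
def instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Return a list shaped like the InstanceGroups field of run_job_flow."""
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
        # Task nodes hold no HDFS data, so they are the group placed on Spot here.
        {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": instance_type, "InstanceCount": task_count},
    ]


def capacity_by_market(groups):
    """Sum instance counts per purchasing option (ON_DEMAND or SPOT)."""
    totals = {}
    for group in groups:
        market = group["Market"]
        totals[market] = totals.get(market, 0) + group["InstanceCount"]
    return totals
```

For example, `capacity_by_market(instance_groups(2, 6))` reports three On-Demand instances (one master, two core) and six Spot instances.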
AWS Integration
- Integrating Amazon EMR with other services offered by AWS can help in providing functionalities and capabilities of networking, security, storage, and many more.
- Examples include Amazon EC2, Amazon VPC, Amazon S3, IAM, AWS CloudTrail, AWS Data Pipeline, and Amazon CloudWatch.
Deployment
- When we launch a cluster, Amazon EMR configures the instances with applications like Apache Spark or Apache Hadoop.
- EMR can also install several MapR distributions. Because the cluster runs Amazon Linux, software can be installed manually on the cluster with the yum package manager.
Monitoring
- EMR can archive log files in Amazon S3, so logs remain available for troubleshooting issues even after the cluster has been terminated.
- CloudWatch is integrated with Amazon EMR for tracking performance metrics for the cluster as well as the jobs within the cluster.
EMR use cases:
Amazon EMR deployment options:
EMR Pricing
- You pay a per-second rate for each node you use, with a one-minute minimum.
- EMR pricing is in addition to the EC2 price (the price for the underlying servers) and the EBS price (if attaching EBS volumes).
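The per-second billing and the one-minute minimum can be worked through with a small calculation. This is a sketch under stated assumptions: the function name is hypothetical, all rates are supplied by the caller rather than taken from a price list, and the EBS charge is simplified to an hourly-equivalent figure.

```python
def emr_cost(seconds, nodes, emr_rate_per_hour, ec2_rate_per_hour,
             ebs_rate_per_hour=0.0):
    """Estimate the cost of running `nodes` nodes for `seconds` seconds.

    Rates are per node per hour and are supplied by the caller.
    """
    billed_seconds = max(seconds, 60)  # one-minute minimum
    # The EMR charge is added on top of the EC2 and EBS charges.
    hourly_rate = emr_rate_per_hour + ec2_rate_per_hour + ebs_rate_per_hour
    return nodes * billed_seconds * hourly_rate / 3600
```

With hypothetical rates of 0.06 for EMR and 0.24 for EC2 per node-hour, ten nodes running for one hour come to about 3.00, and a single node running for 30 seconds is billed the same as one running for 60 seconds.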
Amazon EMR Security
- EMR integrates with IAM to manage permissions. You define permissions using IAM policies, which you attach to IAM users or IAM groups. The permissions that you define in the policy determine the actions that those users or members of the group can perform and the resources that they can access.
- EMR uses IAM roles for the EMR service itself and the EC2 instance profile for the instances. These roles grant permissions for the service and instances to access other AWS services on your behalf. There is a default role for the EMR service and a default role for the EC2 instance profile.
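An IAM policy of the kind described above is a JSON document. The sketch below builds one as a Python dictionary that grants read-only access to a few EMR actions. The function name and the selection of actions are assumptions chosen for illustration, and a real policy would normally restrict `Resource` rather than use `*`.

```python
import json


def emr_read_only_policy():
    """Return an IAM policy document allowing a few read-only EMR actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:DescribeCluster",
                    "elasticmapreduce:ListClusters",
                    "elasticmapreduce:ListSteps",
                ],
                "Resource": "*",  # illustrative; narrow this in practice
            }
        ],
    }


# Serialize to the JSON text that would be attached to a user or group.
policy_json = json.dumps(emr_read_only_policy(), indent=2)
```

Attaching this document to an IAM group would let its members inspect clusters and steps without being able to create or terminate them.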
EMR Notebooks
- A serverless Jupyter notebook.
- An EMR cluster is required to execute the code and queries within an EMR notebook, but the notebook is not locked to the cluster.
- Runs Apache Spark.
Difference between AWS EMR and Cloudera: