EMR Cluster Configuration
Amazon EMR is an analytics service offered by AWS. Gathr provides an interface to create EMR cluster configurations, which can then be used to run applications or data assets on EMR clusters as per the saved configuration.
Prerequisites
The following prerequisites must be fulfilled before creating EMR Cluster configurations via Gathr.
An active AWS account with necessary permissions to create EMR clusters.
A compute environment should be set up in Gathr using the VPC endpoint service.
An EMR cluster should be available in the AWS account that you associated with Gathr during the compute environment setup.
(Optional) Make sure that you have an Amazon EC2 key pair that you can use to authenticate to the EMR cluster over the SSH protocol.
Create EMR Cluster Configuration
The steps to create an EMR cluster configuration are the same for data ingestion, CDC, ETL applications, and data assets.
The option to add a new EMR cluster configuration can be accessed in the following ways:
During the design phase of an application or a data asset, the last step is to save the job. The EMR Cluster Config option is available along with other fields when the application/data asset deployment preference is set to registered cluster.
From the applications/data assets listing page, you can use the Change Cluster option.
Either way, you will reach the Add New Config for EMR Cluster page.
The configuration options for adding a new EMR cluster are explained below:
Cluster Configuration ID: Provide a unique name to identify the EMR cluster configuration.
Subnet ID: Select the subnet ID from the drop-down list.
EMR Master Security Group: Select the master security group option(s) for the EMR cluster configuration.
EMR Slave Security Group: Select the slave security group option(s) for the EMR cluster configuration.
EMR Role: Select the EMR role to be assigned to the cluster.
EC2 Instance Profile: Select the EC2 instance profile from the drop-down list.
Auto Scaling Role: Select the auto scaling role to be assigned to the cluster.
Tags: Customized tags can be added for the EMR cluster.
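For reference, these general fields correspond roughly to parameters of the Amazon EMR RunJobFlow API. The sketch below is illustrative only (Python, with placeholder names, IDs, and roles that are not supplied by Gathr); it simply shows where each UI field ends up in the underlying cluster request.

import json

# Illustrative mapping of the general configuration fields to EMR RunJobFlow
# parameters. Every identifier below is a placeholder for this example.
cluster_config = {
    "Name": "my-emr-cluster-config",                    # Cluster Configuration ID
    "ServiceRole": "EMR_DefaultRole",                   # EMR Role
    "JobFlowRole": "EMR_EC2_DefaultRole",               # EC2 Instance Profile
    "AutoScalingRole": "EMR_AutoScaling_DefaultRole",   # Auto Scaling Role
    "Instances": {
        "Ec2SubnetId": "subnet-0123456789abcdef0",                # Subnet ID
        "EmrManagedMasterSecurityGroup": "sg-0aaaaaaaaaaaaaaaa",  # EMR Master Security Group
        "EmrManagedSlaveSecurityGroup": "sg-0bbbbbbbbbbbbbbbb",   # EMR Slave Security Group
    },
    "Tags": [{"Key": "team", "Value": "analytics"}],    # Tags
}

print(json.dumps(cluster_config, indent=2))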
Custom Cluster Specific Configuration
The fields described below are specific to custom cluster creation:
Root EBS Volume (GB): Option to attach additional storage space to the server in the form of virtual hard disks. Specify the required storage space in GB.
To know more, refer to the link: Specifying Root EBS Volume
EMR Managed Scaling
Option to adjust the number of Amazon EC2 instances available to an EMR cluster.
EMR will automatically increase and decrease the number of instances in core and task nodes based on the workload. Master nodes do not scale.
Minimum Units: Provide the minimum number of core or task units allowed in a cluster. Minimum value is 1.
Maximum Units: Provide the maximum number of core or task units allowed in a cluster. Minimum value is 1.
On-Demand Limit: Provide the maximum allowed core or task units for the On-Demand market type in a cluster. If this parameter is not specified, it defaults to the maximum units value. Minimum value is 0.
Maximum Core Units: Provide the maximum allowed core nodes in a cluster. If this parameter is not specified, it defaults to the maximum units value. Minimum value is 1.
To know more, refer to the link: EMR Managed Scaling
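For reference, these four limits correspond to the ComputeLimits block of an EMR managed scaling policy. The sketch below is illustrative only (Python with boto3, assuming configured AWS credentials; the cluster ID and limit values are placeholders).

import boto3

emr = boto3.client("emr")  # assumes AWS credentials and region are configured

# Illustrative managed scaling policy; the cluster ID and limits are placeholders.
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 1,           # Minimum Units
            "MaximumCapacityUnits": 10,          # Maximum Units
            "MaximumOnDemandCapacityUnits": 5,   # On-Demand Limit
            "MaximumCoreCapacityUnits": 3,       # Maximum Core Units
        }
    },
)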
Configuration Tab
Configuration: You can provide a customized configuration to override the default configurations for an application. The configuration should be provided in JSON format and should contain a classification, properties, and optional nested configurations.
Configuration Example:
[{"classification":"yarn-site","properties":{"yarn.resourcemanager.scheduler.class":"org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"},"configurations":[]}]
To know more, refer to the link: Configure Applications
Master Nodes, Core Nodes and Task Nodes
All three tabs have similar fields as described below:
Instance Type: Select the instance type, which comes with predetermined hardware configurations, for each node type based on your computing load.
To know more, refer to the link: Configure Amazon EC2 Instances
Instance Count: Set the number of instances that you require for core and task nodes.
EBS Volume: Specify the EBS storage required per instance.
Volumes Per Instance: Specify the number of block-level storage volumes to be used with each EC2 instance.
Node Type: Choose the purchasing option: On-Demand or Spot.
To know more about when to use On-demand instances, refer to the link: On-Demand Instances
To know more about when to use Spot instances, refer to the link: Spot Instances
Spot Bid Price: Set the bidding price per instance in dollars ($). The value provided is used as the maximum price per Spot instance per hour.
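For reference, the node settings above roughly correspond to an instance group definition in the EMR API. The sketch below is illustrative only (Python, with placeholder instance type, counts, volume size, and bid price).

import json

# Illustrative core node instance group; all values are placeholders.
core_instance_group = {
    "InstanceRole": "CORE",          # tab: Master, Core, or Task Nodes
    "InstanceType": "m5.xlarge",     # Instance Type
    "InstanceCount": 2,              # Instance Count
    "Market": "SPOT",                # Node Type: On-Demand or Spot
    "BidPrice": "0.10",              # Spot Bid Price (dollars per instance-hour)
    "EbsConfiguration": {
        "EbsBlockDeviceConfigs": [
            {
                "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 32},  # EBS Volume
                "VolumesPerInstance": 1,                                       # Volumes Per Instance
            }
        ]
    },
}

print(json.dumps(core_instance_group, indent=2))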
The fields described below are for custom automatic scaling and will be available on the Core and Task Nodes tabs if you have not opted for EMR managed scaling.
To know more, refer to the link: Using Custom Automatic Scaling
If autoscaling is enabled on the Core or Task Nodes tab, additional fields will appear as described below:
Minimum Nodes: Provide the minimum possible number of EC2 instances for your instance group.
Maximum Nodes: Provide the maximum possible number of EC2 instances for your instance group.
Please provide at least one of the Scale Out or Scale In rules, or both.
Scale Out Rules
Rule Name: A name for the scale out rule should be provided.
Add: Provide the number of EC2 instances to be added each time the autoscaling rule is triggered.
if: Choose the AWS CloudWatch metric that should be used to trigger autoscaling.
is: Enter the threshold value and condition for the CloudWatch metric selected above.
for: Enter the number of consecutive five-minute periods over which the metric data will be compared to the threshold. Autoscaling will be triggered if the condition is met for each consecutive period.
Cooldown period: Specify the time to wait after a scaling activity completes before the next scaling activity can start.
ADD RULE: Click to add additional scale out rules.
Scale In Rules
Rule Name: A name for the scale in rule should be provided.
Terminate: Provide the number of EC2 instances to be terminated each time the autoscaling rule is triggered.
if: Choose the AWS CloudWatch metric that should be used to trigger autoscaling.
is: Enter the threshold value and condition for the CloudWatch metric selected above.
for: Enter the number of consecutive five-minute periods over which the metric data will be compared to the threshold. Autoscaling will be triggered if the condition is met for each consecutive period.
Cooldown period: Specify the time to wait after a scaling activity completes before the next scaling activity can start.
ADD RULE: Click to add additional scale in rules.
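For reference, the constraints and rules above correspond to an EMR automatic scaling policy attached to an instance group. The sketch below is illustrative only (Python, showing one scale out and one scale in rule with placeholder names, metrics, thresholds, and cooldowns).

import json

# Illustrative custom automatic scaling policy; every name, metric, and
# threshold below is a placeholder chosen for the example.
auto_scaling_policy = {
    "Constraints": {"MinCapacity": 1, "MaxCapacity": 10},  # Minimum Nodes / Maximum Nodes
    "Rules": [
        {
            # Scale out rule
            "Name": "scale-out-on-low-memory",                       # Rule Name
            "Action": {
                "SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": 1,                          # Add
                    "CoolDown": 300,                                 # Cooldown period (seconds)
                }
            },
            "Trigger": {
                "CloudWatchAlarmDefinition": {
                    "MetricName": "YARNMemoryAvailablePercentage",   # if
                    "ComparisonOperator": "LESS_THAN",               # is (condition)
                    "Threshold": 15.0,                               # is (threshold)
                    "EvaluationPeriods": 1,                          # for (consecutive five-minute periods)
                    "Period": 300,
                    "Statistic": "AVERAGE",
                    "Unit": "PERCENT",
                }
            },
        },
        {
            # Scale in rule
            "Name": "scale-in-on-high-memory",                       # Rule Name
            "Action": {
                "SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": -1,                         # Terminate
                    "CoolDown": 300,                                 # Cooldown period (seconds)
                }
            },
            "Trigger": {
                "CloudWatchAlarmDefinition": {
                    "MetricName": "YARNMemoryAvailablePercentage",   # if
                    "ComparisonOperator": "GREATER_THAN",            # is (condition)
                    "Threshold": 75.0,                               # is (threshold)
                    "EvaluationPeriods": 1,                          # for (consecutive five-minute periods)
                    "Period": 300,
                    "Statistic": "AVERAGE",
                    "Unit": "PERCENT",
                }
            },
        },
    ],
}

print(json.dumps(auto_scaling_policy, indent=2))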
SSH Tab
EC2 Key Pair Name: Select the EC2 key pair that enables users to securely access the EMR cluster over SSH.
Bootstrap Actions Tab
S3 Path: Provide the path of the script file that needs to run while bootstrapping the cluster.
Example: s3://GathrSaaS/bootstrapscripts/xyz.sh
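For reference, a bootstrap action in the EMR API is simply the script path plus optional arguments. The sketch below is illustrative only (Python, with a placeholder action name and the example path from above).

import json

# Illustrative bootstrap action; the name and arguments are placeholders.
bootstrap_actions = [
    {
        "Name": "install-dependencies",
        "ScriptBootstrapAction": {
            "Path": "s3://GathrSaaS/bootstrapscripts/xyz.sh",  # S3 Path
            "Args": [],                                        # optional script arguments
        },
    }
]

print(json.dumps(bootstrap_actions, indent=2))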
Once all the required fields are set, continue to create the EMR cluster.
Edit EMR Cluster Configuration
The cluster configurations that you have saved in Gathr will appear in the drop-down list for the EMR cluster config field.
Click on the edit icon to open the Edit EMR Configuration page, make the necessary changes, and update the EMR cluster configuration.
Delete EMR Cluster Configuration
The cluster configurations that you have saved in Gathr will appear in the drop-down list for the EMR cluster config field.
Click on the edit icon to open the Edit EMR Configuration page.
The option to delete the cluster configuration will be available at the bottom of the page.