EMR Cluster Configuration

Amazon EMR is an analytics service offered by AWS. Gathr provides an interface to create EMR cluster configurations, which can then be used to run applications or data assets on EMR clusters with the specified settings.

Prerequisites

The following prerequisites must be fulfilled before creating EMR Cluster configurations via Gathr.

  • An active AWS account with necessary permissions to create EMR clusters.

  • A compute environment should be set up in Gathr using the VPC endpoint service.

  • An EMR cluster should be available in the AWS account that you associated with Gathr during compute environment setup (see the sketch after this list).

  • (Optional) Make sure that you have an Amazon EC2 key pair that you can use to authenticate to the EMR cluster over SSH.
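
As a quick sanity check for the cluster prerequisite, you can list the EMR clusters visible to your AWS account with the boto3 SDK. This is a minimal sketch; the region name is an assumption and should match the region used during compute environment setup.

import boto3

# List EMR clusters that are up and reachable in the associated AWS account.
# The region name below is an assumption; use your own region.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])
for cluster in response["Clusters"]:
    print(cluster["Id"], cluster["Name"], cluster["Status"]["State"])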

Create EMR Cluster Configuration

The steps to create an EMR cluster configuration are the same for data ingestion, CDC, and ETL applications, as well as data assets.

The option to add a new EMR cluster configuration can be accessed in the following ways:

  • During the design phase of an application or a data asset, the last step is to save the job. The EMR Cluster Config option is available along with other fields when the application/data asset deployment preference is set to registered cluster.

  • From the applications/data assets listing page, you can use the Change Cluster option.

Either way takes you to the Add New Config for EMR Cluster page.

The configuration options for adding a new EMR cluster are explained below:

Cluster Configuration ID: Provide a unique name to identify the EMR cluster configuration.

Subnet ID: Select the subnet ID from the drop-down list.

EMR Master Security Group: Select the master security group option(s) for the EMR cluster configuration.

EMR Slave Security Group: Select the slave security group option(s) for the EMR cluster configuration.

EMR Role: Select the EMR role to be assigned to the cluster.

EC2 Instance Profile: Select the EC2 instance profile from the drop-down.

Auto Scaling Role: Select the auto scaling role to be assigned to the cluster.

Tags: Add customized tags for the EMR cluster, if required.

Custom Cluster Specific Configuration

The fields described below are specific to custom cluster creation:

Root EBS Volume (GB): Option to attach additional storage space to the server in the form of virtual hard disks. Specify the required storage space in GB.

To know more, refer to the link: Specifying Root EBS Volume
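
For reference, this field corresponds to the EbsRootVolumeSize parameter of the EMR API. A minimal boto3 sketch, assuming placeholder names and the default EMR roles:

import boto3

emr = boto3.client("emr")

# Minimal cluster request showing where the root EBS volume size (GB) plugs in.
# The name, release label, and instance settings are illustrative placeholders.
emr.run_job_flow(
    Name="gathr-custom-cluster",
    ReleaseLabel="emr-6.9.0",
    EbsRootVolumeSize=30,  # Root EBS Volume (GB)
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)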

EMR Managed Scaling

Option to adjust the number of Amazon EC2 instances available to an EMR cluster.

EMR will automatically increase and decrease the number of instances in core and task nodes based on the workload. Master nodes do not scale.

Minimum Units: Provide the minimum number of core or task units allowed in a cluster. Minimum value is 1.

Maximum Units: Provide the maximum number of core or task units allowed in a cluster. Minimum value is 1.

On-Demand Limit: Provide the maximum allowed core or task units for the On-Demand market type in a cluster. If this parameter is not specified, it defaults to the Maximum Units value. Minimum value is 0.

Maximum Core Units: Provide the maximum allowed core nodes in a cluster. If this parameter is not specified, it defaults to the Maximum Units value. Minimum value is 1.

To know more, refer to the link: EMR Managed Scaling
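
The four fields above map one-to-one to the compute limits of an EMR managed scaling policy. A minimal boto3 sketch, with a placeholder cluster ID and illustrative limits:

import boto3

emr = boto3.client("emr")

# Attach a managed scaling policy; the cluster ID is a placeholder.
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 1,          # Minimum Units
            "MaximumCapacityUnits": 10,         # Maximum Units
            "MaximumOnDemandCapacityUnits": 5,  # On-Demand Limit
            "MaximumCoreCapacityUnits": 3,      # Maximum Core Units
        }
    },
)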

Configuration Tab

Configuration: You can provide a customized configuration to override the default configurations for an application. The configuration should be in JSON format and should contain a classification, properties, and optional nested configurations.

Configuration Example:

[{"classification":"yarn-site","properties":{"yarn.resourcemanager.scheduler.class":"org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"},"configurations":[]}]

To know more, refer to the link: Configure Applications
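
The same classification list can also be supplied programmatically when a cluster is created outside Gathr; in the boto3 EMR API the key names are capitalized. A minimal sketch:

# boto3 equivalent of the JSON above, passed as the Configurations
# argument of run_job_flow (note the capitalized key names).
configurations = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.resourcemanager.scheduler.class":
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
        },
        "Configurations": [],
    }
]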

Master Nodes, Core Nodes and Task Nodes

All three tabs have similar fields, as described below:

Instance Type: Select the instance type for each node type based on your computing load. The instance type determines the hardware configuration of the node.

To know more, refer to the link: Configure Amazon EC2 Instances

Instance Count: Set the number of instances that you require for core and task nodes.

EBS Volume: Specify the EBS storage required per instance.

Volumes Per Instance: Specify the number of block-level storage volumes to use with each EC2 instance.

Node Type: Choose the purchasing option: On-Demand or Spot.

To know more about when to use On-demand instances, refer to the link: On-Demand Instances

To know more about when to use Spot instances, refer to the link: Spot Instances

Spot Bid Price: Set the bid price per instance in dollars ($). The value provided is used as the maximum price per Spot instance per hour.
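
Taken together, the fields on these tabs describe EMR instance groups. A hedged sketch of how they map to the InstanceGroups structure accepted by boto3's run_job_flow; the instance types, counts, volume sizes, and bid price below are illustrative:

# Illustrative instance groups; values mirror the fields described above.
instance_groups = [
    {
        "Name": "Master",
        "InstanceRole": "MASTER",
        "InstanceType": "m5.xlarge",   # Instance Type
        "InstanceCount": 1,
        "Market": "ON_DEMAND",         # Node Type
    },
    {
        "Name": "Core",
        "InstanceRole": "CORE",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,            # Instance Count
        "Market": "ON_DEMAND",
        "EbsConfiguration": {
            "EbsBlockDeviceConfigs": [
                {
                    "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 64},  # EBS Volume
                    "VolumesPerInstance": 1,  # Volumes Per Instance
                }
            ]
        },
    },
    {
        "Name": "Task",
        "InstanceRole": "TASK",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,
        "Market": "SPOT",
        "BidPrice": "0.10",            # Spot Bid Price ($ per instance-hour)
    },
]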

The fields described below are for custom automatic scaling and will be available on the Core Nodes and Task Nodes tabs if you have not opted for EMR managed scaling.

To know more, refer to the link: Using Custom Automatic Scaling

If autoscaling is enabled on the Core Nodes or Task Nodes tab, additional fields will appear as described below:

Minimum Nodes: Provide the minimum possible number of EC2 instances for your instance group.

Maximum Nodes: Provide the maximum possible number of EC2 instances for your instance group.

Provide at least one Scale Out rule, one Scale In rule, or both.

Scale Out Rules

Rule Name: A name for the scale out rule should be provided.

Add: Provide the number of EC2 instances to be added each time the autoscaling rule is triggered.

if: Choose the AWS CloudWatch metric that should be used to trigger autoscaling.

is: Enter the threshold value and condition for the CloudWatch metric selected above.

for: Enter the number of consecutive five-minute periods over which the metric data will be compared to the threshold. Autoscaling will be triggered if the condition is met for each consecutive period.

Cooldown period: The amount of time to wait after a scaling activity completes before the next scaling activity can start.

ADD RULE: Click to add additional scale out rules.

Scale In Rules

Rule Name: A name for the scale in rule should be provided.

Terminate: Provide the number of EC2 instances to be terminated each time the autoscaling rule is triggered.

if: Choose the AWS CloudWatch metric that should be used to trigger autoscaling.

is: Enter the threshold value and condition for the CloudWatch metric selected above.

for: Enter the number of consecutive five-minute periods over which the metric data will be compared to the threshold. Autoscaling will be triggered if the condition is met for each consecutive period.

Cooldown period: The amount of time to wait after a scaling activity completes before the next scaling activity can start.

ADD RULE: Click to add additional scale in rules.
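
The rule fields above map to an EMR automatic scaling policy. A hedged boto3 sketch of a single scale out rule; the cluster and instance group IDs, metric, and threshold values are placeholders:

import boto3

emr = boto3.client("emr")

# Custom automatic scaling for a core or task instance group.
# IDs, metric, and threshold values below are placeholders.
emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroupId="ig-XXXXXXXXXXXX",
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},  # Minimum/Maximum Nodes
        "Rules": [
            {
                "Name": "ScaleOutOnLowMemory",  # Rule Name
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 1,  # Add: instances per trigger
                        "CoolDown": 300,         # Cooldown period (seconds)
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "MetricName": "YARNMemoryAvailablePercentage",  # if: CloudWatch metric
                        "ComparisonOperator": "LESS_THAN",              # is: condition
                        "Threshold": 15.0,                              # is: threshold value
                        "EvaluationPeriods": 1,                         # for: five-minute periods
                        "Period": 300,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    },
)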

SSH Tab

EC2 Key Pair Name: Select the EC2 key pair that will be used to securely access the EMR cluster over SSH.

Bootstrap Actions Tab

S3 Path: Provide the path of the script file to run while bootstrapping the cluster.

Example: s3://GathrSaaS/bootstrapscripts/xyz.sh
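
For reference, the key pair and bootstrap script from the last two tabs correspond to the Ec2KeyName and BootstrapActions parameters of the EMR API. A minimal boto3 sketch with placeholder names:

import boto3

emr = boto3.client("emr")

# Where the SSH key pair and bootstrap script plug in; names are placeholders.
emr.run_job_flow(
    Name="gathr-custom-cluster",
    ReleaseLabel="emr-6.9.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",  # EC2 Key Pair Name (placeholder)
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "run-bootstrap-script",  # placeholder name
            "ScriptBootstrapAction": {"Path": "s3://GathrSaaS/bootstrapscripts/xyz.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)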

Once all the required fields are set, continue to create the EMR cluster.

Edit EMR Cluster Configuration

The cluster configurations that you have saved in Gathr will appear in the drop-down list for the EMR cluster config field.

Click the edit icon to open the Edit EMR Configuration page, make the necessary changes, and update the EMR cluster configuration.

Delete EMR Cluster Configuration

The cluster configurations that you have saved in Gathr will appear in the drop-down list for the EMR cluster config field.

Click the edit icon to open the Edit EMR Configuration page.

The option to delete the cluster configuration will be available at the bottom of the page.
