Configure ETL Application

On the Pipeline Definition page, you can configure the settings for your ETL application.

These settings determine how the ETL application behaves at runtime.

Pipeline Name

Provide a unique name for the ETL application. This name is used to save and identify your pipeline.


Application Deployment

Choose where your ETL application will run: either on a Gathr cluster or on a registered EMR cluster that you manage.

  • Gathr Cluster: This is the default option. Choosing this means your application will run on a Gathr-managed cluster. Gathr takes care of the cluster infrastructure, ensuring seamless execution of your applications.

  • Registered Cluster: Select this option to run your applications on clusters that you manage.

    Before you can run applications on registered clusters, establish a virtual private connection from the User Settings > Compute Setup tab.

    For the steps to set up PrivateLink connections, see Compute Setup →


Cluster Size

Choose from a range of cluster sizes to match your computing needs for deploying applications.

Each cluster size consumes credit points (cp) at the following rate:

  • Extra Small: 1 credit/min

  • Small: 2 credits/min

  • Medium: 4 credits/min

  • Large: 8 credits/min
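The rates above can be turned into a rough cost estimate. The following is a minimal sketch, assuming credits accrue linearly per minute with no rounding or minimum charge (the source does not state this). The helper name `estimate_credits` is hypothetical and is not part of Gathr.

```python
# Credit rates per minute, taken from the list above.
CREDITS_PER_MINUTE = {
    "Extra Small": 1,
    "Small": 2,
    "Medium": 4,
    "Large": 8,
}


def estimate_credits(cluster_size: str, runtime_minutes: int) -> int:
    """Estimate credits for a run (hypothetical helper, linear billing assumed)."""
    if runtime_minutes < 0:
        raise ValueError("runtime_minutes must be non-negative")
    # A KeyError here means the cluster size is not one of the four listed sizes.
    return CREDITS_PER_MINUTE[cluster_size] * runtime_minutes


if __name__ == "__main__":
    # 4 credits/min * 90 min = 360
    print(estimate_credits("Medium", 90))
```

Under this assumption, each step up in cluster size doubles the cost of a run of the same length.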

Utilize micro cluster if available

The micro cluster option applies to the Extra Small cluster size. It optimizes submission of small-scale applications by using free slots available on Gathr Compute.


Additional configuration fields for Registered Clusters:

AWS Region

Select the preferred region associated with the compute environment.

AWS Account

Select the registered AWS account ID associated with the compute environment.

DNS Name

Select the DNS name linked to the VPC endpoint for Gathr.

EMR Cluster Config

Select a saved EMR cluster configuration from the list, or create one with the Add New Config for EMR Cluster option.

For more details on how to save EMR cluster configurations in Gathr, see EMR Cluster Configuration →

The application is deployed on the EMR cluster using the configuration selected in this field.


After providing the deployment preferences, continue with the rest of the pipeline definition.

Auto Restart on Failure

Enable or disable automatic restart of failed streaming ETL applications.

If Auto Restart on Failure is enabled, the following additional fields are displayed:

Max Restart Count

Specify the maximum number of times the streaming ETL application is restarted if it fails to run.

Wait time before Restart

Specify the wait time, in minutes, before the pipeline attempts an automatic restart.

Pending Restart Attempts

Specify the total number of pending restart attempts.
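The way Max Restart Count and Wait time before Restart interact can be pictured as a simple retry loop. The sketch below is illustrative only and does not describe Gathr's internal implementation. The function `run_with_auto_restart` and the `run_app` callable (returning `True` on a clean stop and `False` on failure) are hypothetical names introduced for this example.

```python
import time
from typing import Callable


def run_with_auto_restart(
    run_app: Callable[[], bool],
    max_restart_count: int,
    wait_minutes: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the app and restart it after each failure, up to max_restart_count times.

    Returns the number of restarts that were used.
    Raises RuntimeError when the app still fails after the last allowed restart.
    """
    restarts_used = 0
    while True:
        if run_app():
            return restarts_used
        if restarts_used >= max_restart_count:
            raise RuntimeError(
                f"application failed after {restarts_used} restart(s)"
            )
        # Wait for the configured duration (minutes converted to seconds).
        sleep(wait_minutes * 60)
        restarts_used += 1


if __name__ == "__main__":
    # Simulated app: fails twice, then stops cleanly. Sleep is stubbed out.
    outcomes = iter([False, False, True])
    used = run_with_auto_restart(lambda: next(outcomes), 3, 5, sleep=lambda s: None)
    print(used)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.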

If Auto Restart on Failure is disabled, then proceed by updating the following fields.


Store Raw Data in Error Logs

Enable this option to record the raw data of corrupt records in the error logs, along with the error message.


Error Handler

If this option is disabled, the error monitoring graphs will not be visible.


Skip validating connections

Enable this option to skip connection validation before the application starts.


Create Version

This option is visible only when an existing ETL application is edited and updated. It creates a new version of the pipeline. The current version is called the Working Copy, and the remaining versions are numbered incrementally (n+1).


Extra Spark Submit Options

The configuration provided here is passed to Spark as additional options when the job runs. Provide each setting strictly in the format given below:

--conf <key>=<value>
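As an illustration of the `--conf key=value` form that spark-submit accepts, the sketch below builds one option string per Spark property. The helper `to_spark_submit_options` is hypothetical. The two properties shown are standard Spark settings, but their values are examples only. How multiple options should be separated in this field is not stated in the source, so the helper returns them as a list.

```python
from typing import Dict, List


def to_spark_submit_options(conf: Dict[str, str]) -> List[str]:
    """Format Spark properties as '--conf key=value' strings (hypothetical helper)."""
    options = []
    for key, value in conf.items():
        # A property name must be non-empty and free of '=' and whitespace,
        # otherwise the key=value pair would be ambiguous.
        if not key or "=" in key or any(ch.isspace() for ch in key):
            raise ValueError(f"invalid Spark property name: {key!r}")
        options.append(f"--conf {key}={value}")
    return options


if __name__ == "__main__":
    example = {
        "spark.executor.memory": "4g",          # example value only
        "spark.sql.shuffle.partitions": "200",  # example value only
    }
    for option in to_spark_submit_options(example):
        print(option)
```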


Description

Write notes specific to the ETL application.


Add Detailed Notes

Opens a modal window in which you can add detailed notes.


Save and exit

Once the pipeline deployment configurations are set, save and exit the Pipeline Definition page.
