Configure ETL Application
On the Pipeline Definition page, you can configure the settings that define how your ETL application behaves at runtime.
Pipeline Name
Provide a unique name for the ETL application. This name is used to save and identify your pipeline. The name must begin with a letter and can contain alphanumeric characters and special characters such as !@$-;:()-_?=~/*<>’.
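The naming rule above can be checked before you type a name into the form. The following is a minimal sketch, not Gathr's own validation: the function name is hypothetical, spaces are assumed to be disallowed, and the trailing quote mark in the character list is left out because it is unclear whether it belongs to the set.

```python
import re

# Special characters copied from the list in this section.
# Assumption: the trailing quote mark is excluded because its role is ambiguous.
ALLOWED_SPECIALS = "!@$-;:()_?=~/*<>"

# First character must be a letter; the rest may be letters, digits,
# or any of the allowed special characters.
NAME_PATTERN = re.compile(
    r"[A-Za-z][A-Za-z0-9" + re.escape(ALLOWED_SPECIALS) + r"]*"
)


def is_valid_pipeline_name(name: str) -> bool:
    """Return True if the name follows the rule described above (hypothetical helper)."""
    return NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_pipeline_name("orders_etl-v2")` passes, while a name that starts with a digit, such as `"2fast"`, does not.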
Application Deployment
Choose where your ETL application will run: on a Gathr cluster or on a registered cluster (for example, an EMR cluster that you manage).
Gathr Cluster: This is the default option. Choosing this means your application will run on a Gathr-managed cluster. Gathr takes care of the cluster infrastructure, ensuring seamless execution of your applications.
Registered Cluster: If you prefer to run your applications on clusters managed by you, you can select this option.
To run applications on registered clusters, you must first establish a virtual private connection from the User Settings > Compute Setup tab.
For the steps to set up PrivateLink connections, see Compute Setup.
Cluster Size
Choose from a range of cluster sizes to match your computing needs for deploying applications.
Here’s a breakdown of credit point (cp) utilization for each cluster size:
Extra Small: 1 credit/min
Small: 2 credits/min
Medium: 4 credits/min
Large: 8 credits/min
GPU - Powered by NVIDIA RAPIDS: 10 credits/min
A custom cluster can be used only with a registered compute environment, which is available in a Business Plan.
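To get a rough idea of what a run costs, you can multiply the per-minute rate by the expected runtime. The sketch below assumes simple linear billing with no rounding or minimum charge, which this page does not confirm; the function name and the short "GPU" key are illustrative.

```python
# Credit rates per minute, taken from the list above.
CREDITS_PER_MINUTE = {
    "Extra Small": 1,
    "Small": 2,
    "Medium": 4,
    "Large": 8,
    "GPU": 10,  # GPU - Powered by NVIDIA RAPIDS
}


def estimate_credits(cluster_size: str, runtime_minutes: float) -> float:
    """Estimate credits used, assuming rate x minutes (hypothetical helper)."""
    return CREDITS_PER_MINUTE[cluster_size] * runtime_minutes


# A Medium cluster running for 30 minutes: 4 credits/min * 30 min
print(estimate_credits("Medium", 30))  # 120
```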
Deploy on Reserved Cluster
Prioritize using a reserved cluster to run your ETL pipeline.
Use Reserved Cluster
Available options are:
Always: Only leverage the reserved cluster to run the job. If the maximum resources are already utilized, the pipeline deployment will wait for the next available slot.
When Free Slots are Available: Utilize the reserved cluster only if a slot is available. Otherwise, launch an extra small cluster.
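The difference between the two options can be sketched as a small decision function. This only illustrates the behavior described above and is not Gathr code; the enum, the function, and the returned labels are hypothetical names.

```python
from enum import Enum


class ReservedPolicy(Enum):
    ALWAYS = "Always"
    WHEN_FREE = "When Free Slots are Available"


def choose_deployment(policy: ReservedPolicy, free_slots: int) -> str:
    """Return where the pipeline would run under the given option."""
    if free_slots > 0:
        # Both options use the reserved cluster when a slot is free.
        return "reserved"
    if policy is ReservedPolicy.ALWAYS:
        # "Always": deployment waits for the next available slot.
        return "wait"
    # "When Free Slots are Available": fall back to an extra small cluster.
    return "extra_small"
```

With no free slots, `ReservedPolicy.ALWAYS` yields `"wait"`, while `ReservedPolicy.WHEN_FREE` yields `"extra_small"`.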
Additional configuration fields for Registered Clusters with AWS compute setup:
AWS Region
Option to select the preferred region associated with the compute environment.
AWS Account
Option to select the registered AWS account ID associated with the compute environment.
DNS Name
Option to select the DNS name linked to the VPC endpoint for Gathr.
EMR Cluster Config
Select a saved EMR cluster configuration from the list, or create one with the Add New Config for EMR Cluster option.
For more details on how to save EMR cluster configurations in Gathr, see EMR Cluster Configuration.
The application will be deployed on the EMR cluster using the custom configuration that is selected from this field.
Continue with the pipeline definition configuration after providing the deployment preferences.
Skip validating connections
Enable this option to skip connection validation before the application starts.
Auto Restart on Failure
Enable/disable restarting of failed streaming ETL applications.
If Auto Restart on Failure is enabled for the ETL application deployment, additional fields will be displayed as given below:
Max Restart Count
Specify the maximum number of times the streaming ETL application is restarted if it fails to run. This field is required.
Wait time before Restart
Specify the wait time (in minutes) before the pipeline attempts to auto-restart.
Pending Restart Attempts
Specify the total number of pending restart attempts.
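How Max Restart Count and Wait time before Restart interact can be sketched as a retry loop. This is a simplified model under stated assumptions, not Gathr's scheduler: the class, the function, and the injected `run_once` and `sleep` callables are hypothetical, and Pending Restart Attempts is not modeled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RestartPolicy:
    max_restart_count: int  # Max Restart Count
    wait_minutes: float     # Wait time before Restart (minutes)


def run_with_restarts(
    run_once: Callable[[], bool],
    policy: RestartPolicy,
    sleep: Callable[[float], None],
) -> int:
    """Run until success; return the number of restarts that were used.

    Raises RuntimeError if the application still fails after the
    maximum number of restarts.
    """
    restarts = 0
    while True:
        if run_once():
            return restarts
        if restarts >= policy.max_restart_count:
            raise RuntimeError("application failed; restart limit reached")
        # Wait for the configured duration, then try again.
        sleep(policy.wait_minutes * 60)
        restarts += 1
```

In real use, `sleep` could be `time.sleep`; passing it in keeps the sketch easy to try without waiting.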
If Auto Restart on Failure is disabled, proceed to the following fields.
Store Raw Data in Error Logs
Enable this option to capture raw data coming from corrupt records in error logs along with the error message.
Error Handler
If this option is disabled, the error monitoring graphs will not be visible.
Create Version
This option is visible only when an existing ETL application is edited and updated. It creates a new version of the pipeline. The current version is called the Working Copy, and the remaining versions are numbered incrementally (n+1).
Description
Option to write notes specific to the ETL application.
Add Detailed Notes
A modal window opens for the user to add notes.
Extra Spark Submit Options
The configuration provided here is additionally submitted to Spark when the job runs. Provide the configuration strictly in the format given below:
--conf
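As an illustration, the sketch below builds a string of options in the standard spark-submit form `--conf key=value`. It assumes this field accepts that form, which the format line above does not fully spell out; the helper name is hypothetical, and the two Spark properties are only examples.

```python
from typing import Mapping


def format_spark_conf(options: Mapping[str, str]) -> str:
    """Join Spark properties as space-separated `--conf key=value` pairs."""
    return " ".join(f"--conf {key}={value}" for key, value in options.items())


extra_options = format_spark_conf(
    {
        "spark.executor.memory": "4g",
        "spark.sql.shuffle.partitions": "200",
    }
)
print(extra_options)
# --conf spark.executor.memory=4g --conf spark.sql.shuffle.partitions=200
```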
Save and exit
Once the pipeline deployment configurations are set, save and exit the Pipeline Definition page.