Glue ETL Source

AWS Glue is supported as a data source in Gathr.

An ETL pipeline with a Glue data source can run only on registered clusters, not on Gathr clusters.

To learn how to register a cluster with Gathr by establishing PrivateLink, see Compute Setup →

Prerequisites

When you establish a PrivateLink connection, make sure that the Consumer Role created for Gathr has the following permissions to access the AWS Glue Catalog:

  • glue:GetSchema

  • glue:GetTables

  • glue:ListSchemas

  • glue:GetDatabases

  • glue:GetTable

  • glue:GetDatabase
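
The permission list above can be checked against an IAM policy document before the pipeline is run. The following is a minimal sketch, not an official Gathr or AWS utility: the function name `missing_glue_actions` is hypothetical, and the check only looks at `Allow` statements and their `Action` patterns. It ignores `Deny` statements, conditions, and resource scoping, so an empty result does not guarantee access.

```python
from fnmatch import fnmatchcase

# Actions listed in the prerequisites above.
REQUIRED_GLUE_ACTIONS = [
    "glue:GetSchema",
    "glue:GetTables",
    "glue:ListSchemas",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetDatabase",
]


def missing_glue_actions(policy_document):
    """Return the required actions that no Allow statement in the policy grants.

    Hypothetical helper: inspects only Effect and Action, nothing else.
    """
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a plain object
        statements = [statements]

    allowed_patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        # IAM action names are case-insensitive, so compare in lower case.
        allowed_patterns.extend(action.lower() for action in actions)

    return [
        required
        for required in REQUIRED_GLUE_ACTIONS
        if not any(fnmatchcase(required.lower(), pattern) for pattern in allowed_patterns)
    ]
```

For example, a policy that allows only `glue:Get*` would be reported as missing `glue:ListSchemas`.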

Limitations

  • XML format tables are not supported.

  • JSON format tables must use the SerDe class org.apache.hive.hcatalog.data.JsonSerDe
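
These limitations can be checked against a table definition ahead of time. The sketch below assumes the table is passed as a dictionary shaped like the `Table` object returned by the AWS Glue `GetTable` API, and that the table format is recorded in the `classification` table parameter (typically set by crawlers, and possibly absent on manually created tables). The function name `check_table_supported` is hypothetical.

```python
SUPPORTED_JSON_SERDE = "org.apache.hive.hcatalog.data.JsonSerDe"


def check_table_supported(table):
    """Return a list of detected limitations; an empty list means none were found.

    Assumption: `table` follows the shape of the Glue GetTable `Table` object.
    """
    serde = (
        table.get("StorageDescriptor", {})
        .get("SerdeInfo", {})
        .get("SerializationLibrary", "")
    )
    # Assumption: the format is stored in the "classification" parameter.
    classification = table.get("Parameters", {}).get("classification", "").lower()

    problems = []
    if classification == "xml":
        problems.append("XML format tables are not supported")
    if classification == "json" and serde != SUPPORTED_JSON_SERDE:
        problems.append(
            f"JSON table uses SerDe {serde!r}; expected {SUPPORTED_JSON_SERDE}"
        )
    return problems
```

A JSON table defined with a different SerDe, such as `org.openx.data.jsonserde.JsonSerDe`, would be flagged by this check.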

Schema Type

To learn how schema details can be provided for data sources, see Provide Schema for ETL Source →

After providing schema type details, the next step is to configure the data source.

Data Source Configuration

Glue Catalog Data: Gathr supports two methods to fetch data from the Glue Catalog at runtime.

Fetch from Glue table: You can select this option to provide the database and table details separately. There is also an option to see the metadata of the Glue table during application design time.

Fetch from Glue Table options

Fetch Metadata: Select this option to see the metadata of the Glue Catalog table.

As a prerequisite for Fetch Metadata to work in Gathr, the Glue Catalog Resource Settings must be updated.

If you choose to fetch metadata, additional fields are displayed:

Region: Provide the region where the AWS Glue Catalog table exists.

AWS Account ID: Provide the ID of the AWS account where the AWS Glue Catalog table exists.

If you choose not to fetch metadata, proceed by updating the following fields.

Database: Provide the name of the database that contains the Glue table.

If Fetch Metadata is enabled, then the database can be selected from the drop-down. Otherwise, the database name should be entered.

Table: Provide the name of the Glue table from which data is to be read.

If Fetch Metadata is enabled, then the table can be selected from the drop-down. Otherwise, the table name should be entered.

Query: SQL query that is to be executed on the AWS Glue Catalog.

Fetch using query: You can select this option and provide the requirements in an SQL query by specifying database and table names.

Fetch using query options

Query: Provide the SQL query to be executed on the AWS Glue Catalog. The query must specify the database and table names.
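
As an illustration of a query that names both the database and the table, the sketch below assembles a simple `SELECT` statement. The helper `build_glue_query` and the names `sales_db`, `orders`, `order_id`, and `amount` are hypothetical, and the exact SQL dialect accepted by the Query field is not stated here, so treat the output as a starting point rather than a guaranteed-valid query.

```python
import re

# Conservative identifier pattern: letters, digits, and underscores only.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_glue_query(database, table, columns=None):
    """Build a SELECT statement that qualifies the table with its database."""
    for name in [database, table, *(columns or [])]:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    select_list = ", ".join(columns) if columns else "*"
    return f"SELECT {select_list} FROM {database}.{table}"


# Hypothetical database, table, and column names.
query = build_glue_query("sales_db", "orders", ["order_id", "amount"])
print(query)
```

The resulting string can be pasted into the Query field and adjusted as needed, for example by adding a filter clause.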

Glue Catalog Resource Settings

For Gathr to fetch metadata during application design, update the Glue Catalog resource settings as follows:

1. Log in to your AWS account and navigate to AWS Glue > Settings.

2. In the Permissions section, provide the Gathr principal and other permissions.

To learn how to get your Gathr Principal (Role ARN), see Gathr Principal.

Provide the following permissions:

  • glue:GetSchema

  • glue:GetTables

  • glue:ListSchemas

  • glue:GetDatabases

  • glue:GetTable

  • glue:GetDatabase

Specify the region and allow Gathr to access the catalog, database, and table resources in that region.
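
The settings described above could be expressed as a Glue Catalog resource policy along the following lines. This is a sketch under stated assumptions, not the exact policy Gathr requires: the function name `build_catalog_resource_policy`, the example role ARN, region, and account ID are placeholders, and the wildcard database and table resources are an assumption that can be narrowed to specific names.

```python
import json

# Permissions listed in the steps above.
GLUE_READ_ACTIONS = [
    "glue:GetSchema",
    "glue:GetTables",
    "glue:ListSchemas",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetDatabase",
]


def build_catalog_resource_policy(gathr_principal_arn, region, account_id):
    """Build a resource policy granting read access to the Gathr principal."""
    arn_prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": gathr_principal_arn},
                "Action": GLUE_READ_ACTIONS,
                # Assumption: all databases and tables in the region's catalog.
                "Resource": [
                    f"{arn_prefix}:catalog",
                    f"{arn_prefix}:database/*",
                    f"{arn_prefix}:table/*/*",
                ],
            }
        ],
    }


# Placeholder values; replace with your Gathr Principal, region, and account ID.
policy = build_catalog_resource_policy(
    "arn:aws:iam::111122223333:role/example-gathr-role",
    "us-east-1",
    "123456789012",
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The printed JSON can be reviewed and adapted before it is entered in the Permissions section of the AWS Glue settings.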

Detect Schema

Check the populated schema details. For more details, see Schema Preview →

Pre Action

To understand how to provide SQL queries or stored procedures that will be executed during the pipeline run, see Pre-Actions →

Notes

Optionally, enter notes in the Notes → tab and save the configuration.