Glue ETL Source
AWS Glue is supported as a data source in Gathr.
An ETL pipeline with a Glue data source can run only on registered clusters, not on Gathr clusters.
To learn how to register a cluster with Gathr by establishing a PrivateLink connection, see Compute Setup.
Prerequisites
When you establish a PrivateLink connection, make sure that the Consumer Role created for Gathr has the following permissions to access the AWS Glue Catalog:
glue:GetSchema
glue:GetTables
glue:ListSchemas
glue:GetDatabases
glue:GetTable
glue:GetDatabase
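The permissions above could be attached to the Consumer Role as an IAM policy document. The following is a minimal sketch in Python that only builds the JSON; the statement ID `GathrGlueCatalogRead` and the wildcard `Resource` are assumptions for illustration, not values prescribed by Gathr. Narrow the resource to your own catalog ARNs where possible.

```python
import json

# The six Glue read actions listed in the prerequisites.
GLUE_READ_ACTIONS = [
    "glue:GetSchema",
    "glue:GetTables",
    "glue:ListSchemas",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetDatabase",
]


def build_consumer_role_policy(resource="*"):
    """Return an IAM policy document that allows the Glue read actions.

    `resource` defaults to "*" purely as a placeholder (assumption).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GathrGlueCatalogRead",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": list(GLUE_READ_ACTIONS),
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy so it can be pasted into the IAM console.
    print(json.dumps(build_consumer_role_policy(), indent=2))
```

The printed JSON can be reviewed and attached to the role through whichever IAM workflow you already use.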
Limitations
XML format tables are not supported.
JSON format tables must use the following SerDe class:
org.apache.hive.hcatalog.data.JsonSerDe
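One way to check this limitation before building a pipeline is to inspect the table description returned by the Glue GetTable API, where the SerDe class appears under `StorageDescriptor.SerdeInfo.SerializationLibrary`. The sketch below assumes you already have that `Table` structure as a Python dictionary; the function name and the sample tables are hypothetical.

```python
# SerDe class required by Gathr for JSON format Glue tables.
REQUIRED_JSON_SERDE = "org.apache.hive.hcatalog.data.JsonSerDe"


def json_serde_is_supported(table):
    """Return True if the table description uses the required JSON SerDe.

    `table` is expected to be shaped like the `Table` element of a
    Glue GetTable response.
    """
    serde = (
        table.get("StorageDescriptor", {})
        .get("SerdeInfo", {})
        .get("SerializationLibrary")
    )
    return serde == REQUIRED_JSON_SERDE


# Hypothetical table descriptions for illustration only.
supported_table = {
    "Name": "events",
    "StorageDescriptor": {
        "SerdeInfo": {"SerializationLibrary": REQUIRED_JSON_SERDE}
    },
}
other_table = {
    "Name": "events_openx",
    "StorageDescriptor": {
        "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"}
    },
}

if __name__ == "__main__":
    for t in (supported_table, other_table):
        print(t["Name"], json_serde_is_supported(t))
```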
Schema Type
To learn how schema details can be provided for data sources, see the topic Provide Schema for ETL Source.
After providing schema type details, the next step is to configure the data source.
Data Source Configuration
Glue Catalog Data: Gathr supports two methods to fetch data from the Glue Catalog at runtime.
Fetch from Glue table: You can select this option to provide the database and table details separately. There is also an option to see the metadata of the Glue table during application design time.
Fetch from Glue Table options
Fetch Metadata: Select this option to see the metadata of the Glue Catalog table.
As a prerequisite for Fetch Metadata to work in Gathr, update the Glue Catalog Resource Settings.
If you choose to fetch metadata, the following additional fields are displayed:
Region: The AWS region where the Glue Catalog table exists.
AWS Account ID: The ID of the AWS account where the Glue Catalog table exists.
If you choose not to fetch metadata, proceed by updating the following fields.
Database: The name of the database that contains the Glue table.
If Fetch Metadata is enabled, then the database can be selected from the drop-down. Otherwise, the database name should be entered.
Table: The name of the Glue table from which data is to be read.
If Fetch Metadata is enabled, then the table can be selected from the drop-down. Otherwise, the table name should be entered.
Query: The SQL query to be executed on the AWS Glue Catalog.
Fetch using query: You can select this option and provide the requirements in an SQL query by specifying database and table names.
Fetch using query options
Query: The SQL query to be executed on the AWS Glue Catalog. The query must include the database and table names.
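As an illustration of a query that names both the database and the table, the sketch below composes a simple SELECT statement in Python. The helper name, the database `sales_db`, the table `orders`, and the column names are all hypothetical, and the exact SQL dialect accepted by Gathr is not stated here, so treat the generated text as a starting point only.

```python
def build_glue_query(database, table, columns=("*",), limit=None):
    """Compose a SELECT statement that qualifies the table with its database."""
    query = f"SELECT {', '.join(columns)} FROM {database}.{table}"
    if limit is not None:
        # Cast to int so that only a numeric limit reaches the query text.
        query += f" LIMIT {int(limit)}"
    return query


if __name__ == "__main__":
    # Hypothetical database, table, and column names.
    print(build_glue_query("sales_db", "orders", ["order_id", "amount"], limit=100))
```

The resulting string can be pasted into the Query field.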
Glue Catalog Resource Settings
For Gathr to fetch metadata during application design, update the Glue Catalog resource settings as follows:
1. Log in to your AWS account and navigate to AWS Glue > Settings.
2. In the Permissions section, provide the Gathr principal and other permissions.
To learn how to get your Gathr Principal (Role ARN), see Gathr Principal.
Provide the following permissions:
glue:GetSchema
glue:GetTables
glue:ListSchemas
glue:GetDatabases
glue:GetTable
glue:GetDatabase
Specify the region and allow Gathr to access the catalog, database, and table resources in that region.
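The settings above can be expressed as a Glue Data Catalog resource policy. The following Python sketch builds such a policy as JSON, assuming the standard Glue ARN patterns for catalog, database, and table resources. The role ARN, region, and account ID shown are placeholders, and the wildcard database and table patterns are an assumption; restrict them to specific names if you prefer.

```python
import json


def build_catalog_resource_policy(gathr_principal_arn, region, account_id):
    """Return a Glue resource policy granting read access to the Gathr principal."""
    base = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": gathr_principal_arn},
                "Action": [
                    "glue:GetSchema",
                    "glue:GetTables",
                    "glue:ListSchemas",
                    "glue:GetDatabases",
                    "glue:GetTable",
                    "glue:GetDatabase",
                ],
                # Catalog, all databases, and all tables in the given region.
                "Resource": [
                    f"{base}:catalog",
                    f"{base}:database/*",
                    f"{base}:table/*/*",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; replace with your Gathr Principal, region, and account ID.
    policy = build_catalog_resource_policy(
        "arn:aws:iam::111122223333:role/ExampleGathrRole",
        "us-east-1",
        "123456789012",
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Permissions section under AWS Glue > Settings.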
Detect Schema
Check the populated schema details. For more details, see Schema Preview.
Pre Action
To understand how to provide SQL queries or stored procedures that will be executed during the pipeline run, see Pre-Actions.
Notes
Optionally, enter notes in the Notes tab and save the configuration.