Register as Table Processor
The Register as Table processor fetches historical data from any streaming or batch source and registers that data as a table. The registered table can then be referenced when the user needs to run queries on the registered data sources.
Tables can be fetched from ADLS, Cassandra, Couchbase, Redshift, HDFS, Hive, JDBC, Snowflake, File, Incoming Message, GCS, BigQuery, and S3.
For example, suppose the user has historical data on HDFS containing information about the departments of an organization. The user can register that data with the Register as Table processor, and the registered table can then be used in the SQL processor to fetch the number of employees in the organization.
Register as Table Processor Configuration
To add a Register as Table processor to your pipeline, drag the processor onto the canvas and click on it to configure.
If ADLS is selected as the data source, the following fields are displayed:
Field | Description |
---|---|
Data Source | Select the ADLS source from which historical data is to be read. Available options are: ADLS, Cassandra, Couchbase, Redshift, HDFS, Hive, JDBC, Snowflake, File, Incoming Message, GCS, BigQuery, and S3. If Incoming Message is selected, the output of the sources connected before this processor acts as the incoming message for the Register as Table processor. If File is selected, the file data is registered as a table; internally, the file is uploaded to the default HDFS. |
Table Name | Name with which the table is to be registered. |
Connection Name | All available connections are listed here. Select a connection from where data is to be read. |
Override Credential | Check this option to override credentials for user-specific actions. The options below become available once this option is checked. |
Authentication Type | Azure ADLS authentication type. |
Account Name | Provide a valid Azure ADLS account name. |
Account key | Provide a valid account key. |
Client ID | Provide a valid client ID. |
Client Secret Password | Provide a valid client secret password. |
Directory ID | Provide a valid directory ID. Click the TEST CONNECTION button to test the connection. |
Data Format | Format of the historical data. Available options are: CSV, JSON, Parquet, ORC, Avro, and Text. |
Delimiter | Select the delimiter used in the historical data. For example, if your CSV data is separated by a comma (,), select (,) Comma. |
Container Name | ADLS container name from which the data should be read. |
ADLS Path | Provide directory path for ADLS file system. |
Cache Table | Enable caching of the registered table in application memory. If this option is selected, the table is read only once after it is registered. Caching is available for all data sources and is helpful in batch pipelines where the same data is used in multiple places. |
Refresh Cache | Option to enable periodic refreshing of the cached table. |
Refresh Interval | The time interval after which the cached table is reloaded or refreshed. |
Max no of rows | Provide the maximum number of rows to include. |
Is Header Included | Select this field if the first row of the file is a header; otherwise leave it unmarked. |
Post Query | Provide a post query to apply to the registered table. Example: where column=value order by column desc limit 2 |
Additional tables can be registered by clicking the +Register Table button. The user can also add environment parameters by clicking the +ADD PARAM button.
If HDFS is selected as the data source, the following fields are displayed:
Field | Description |
---|---|
Data Source | Select the HDFS source from which historical data is to be read. If Incoming Message is selected, the output of the sources connected before this processor acts as the incoming message for the Register as Table processor. If File is selected, the file data is registered as a table; internally, the file is uploaded to the default HDFS. |
Table Name | Specify a name for the table which is to be registered. |
Connection Name | All available connections are listed here. Select a connection from where data is to be read. |
Data Format | Historical data format. |
Delimiter | Select the delimiter used in the historical data. For example, if your CSV data is separated by a comma (,), select (,) Comma. |
HDFS Path | HDFS path where data is stored. |
Cache Table | If this option is selected, the table is read only once after it is registered. Caching is available for all data sources and is helpful in batch pipelines where the same data is used in multiple places. |
Is Header Included | Select the checkbox if the first row of the data file is a header; otherwise leave it unchecked. |
If you select the Data Source as HIVE or JDBC, there will be two additional fields:
- Database Table Name
- Execute Query
Field | Description |
---|---|
Database Table Name | If the Database Table Name option is selected, specify the name of the table from which data will be fetched. |
Execute Query | If the Execute Query option is selected, write a custom query. The output of this query is stored in the current Spark session. |
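The difference between the two modes can be sketched with a stdlib sqlite3 stand-in for a JDBC source (illustrative only; in the product the query runs against the configured Hive/JDBC connection, and table and column names here are made up):

```python
import sqlite3

# sqlite3 stands in for a JDBC source here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (name TEXT, dept TEXT)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [("alice", "Sales"), ("bob", "HR"), ("carol", "Sales")])

# Database Table Name mode: the whole table is fetched and registered.
whole = con.execute("SELECT * FROM emp").fetchall()

# Execute Query mode: a custom query runs first; only its result is registered.
sales = con.execute("SELECT name FROM emp WHERE dept = 'Sales' ORDER BY name").fetchall()
print(len(whole), sales)  # 3 [('alice',), ('carol',)]
```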
If the user selects Snowflake, the following additional fields are displayed:
Field | Description |
---|---|
Connection Name | Provide the connection name for creating the connection. |
Warehouse Name | Provide the warehouse name against this column. |
Schema Name | Provide the schema name against this column. Note: The user can provide either the database table name or a query. |
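The Warehouse Name and Schema Name fields map onto connector options when Spark reads from Snowflake. A hedged sketch (option names follow the Spark Snowflake connector; all values are placeholders):

```python
# Placeholder values throughout; only the option names are meaningful.
snowflake_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfDatabase": "MY_DB",
    "sfWarehouse": "MY_WH",   # Warehouse Name field
    "sfSchema": "PUBLIC",     # Schema Name field
    # Per the note above, supply either a table name ...
    "dbtable": "EMPLOYEES",
    # ... or a custom query instead:
    # "query": "SELECT * FROM EMPLOYEES WHERE DEPT = 'SALES'",
}
```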
If S3 is selected as the data source, there will be one additional field:
- Bucket Name
Field | Description |
---|---|
Bucket Name | S3 bucket name from where data will be read. |
If Incoming Message is selected, the output of the sources connected before this processor acts as the incoming message for the Register as Table processor.
After fetching the data, specify the name with which the table is to be registered.
If you select the Data Source as Cassandra, there will be two additional fields:
- KeySpace Name
- Cassandra Table Name
Field | Description |
---|---|
KeySpace Name | The Cassandra keyspace name. |
Cassandra Table Name | Name of the table inside the keyspace from which data is read. |
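These two fields correspond to the options the Spark Cassandra connector expects. A hedged sketch (keyspace and table names are placeholders):

```python
# Placeholder values; only the option names are meaningful.
cassandra_options = {
    "keyspace": "company",    # KeySpace Name field
    "table": "employees",     # Cassandra Table Name field
}
# e.g. spark.read.format("org.apache.spark.sql.cassandra").options(**cassandra_options)
```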
If you select the Data Source as Couchbase, there will be an additional field:
- Bucket Name
Field | Description |
---|---|
Bucket Name | Couchbase Bucket Name. |
If you select the Data Source as Redshift, there will be a few additional fields, depending on which of the two following options is selected:
- Database Table Name
Field | Description |
---|---|
Database Table Name | Name of the table from which data is to be fetched. |
Max no. of Rows | Specify the maximum number of rows. |
- Execute Query
Field | Description |
---|---|
Execute Query | Write a custom query. The output of this query is stored in the current Spark session. |
Max no. of Rows | Specify the maximum number of rows. |
If you have any feedback on Gathr documentation, please email us!