Register as Table Processor
The Register as Table processor fetches historical data from any streaming or batch source and registers that data as a table. The registered table can then be referenced when the user needs to run queries on the registered data sources.
Tables can be fetched from ADLS, Cassandra, Couchbase, Redshift, HDFS, Hive, JDBC, Snowflake, File, Incoming Message, GCS, BigQuery, and S3.
For example, suppose the user has historical data on HDFS containing information about the departments of an organization. The user can register that data with the Register as Table processor, and the registered table can then be used in the SQL processor to fetch the number of employees in the organization.
Register as Table Processor Configuration
To add a Register as Table processor to your pipeline, drag the processor onto the canvas and click on it to configure.
If ADLS is selected as the data source, the following fields are displayed:
Field | Description |
---|---|
Data Source | Select the ADLS source from which historical data is to be read. Available options are: ADLS, Cassandra, Couchbase, Redshift, HDFS, Hive, JDBC, Snowflake, File, Incoming Message, GCS, BigQuery, and S3. If Incoming Message is selected, the output of the sources connected before this processor acts as the incoming message for the Register as Table processor. If File is selected, the file data is registered as a table; internally, the file is uploaded to the default HDFS. |
Table Name | Name with which the table is to be registered. |
Connection Name | All available connections are listed here. Select a connection from where data is to be read. |
Override Credential | Check this option to override credentials for user-specific actions. The options below become available once this option is checked. |
Authentication Type | Azure ADLS authentication type. |
Account Name | Provide a valid Azure ADLS account name. |
Account key | Provide a valid account key. |
Client ID | Provide a valid client ID. |
Client Secret Password | Provide a valid client secret password. |
Directory ID | Provide a valid directory ID. Click the TEST CONNECTION button to test the connection. |
Data Format | Format of the historical data. Available options are: CSV, JSON, Parquet, ORC, Avro, and Text. |
Delimiter | Select the delimiter used in the historical data. For example, if your CSV data is separated by a comma (,), select (,) Comma. |
Container Name | ADLS container name from which the data should be read. |
ADLS Path | Provide directory path for ADLS file system. |
Cache Table | Enable caching of the registered table in application memory. If this option is selected, the table is read only once after it is registered. Caching is available for all data sources and is helpful in batch pipelines where the same data is used in multiple places. |
Refresh Cache | Option to enable periodic refreshing of the cached table. |
Refresh Interval | The time interval after which the cached table is reloaded or refreshed. |
Max no of rows | Provide the maximum number of rows to include. |
Is Header Included | Select this field if the first row of the file is a header; otherwise leave it unmarked. |
Post Query | Provide a post query to apply to the registered table. Example: where column=value order by column desc limit 2 |
Additional tables can be registered by clicking the +Register Table button. The user can also add environment parameters by clicking the +ADD PARAM button.
If HDFS is selected as the data source, the following fields are displayed:
Field | Description |
---|---|
Data Source | Select the HDFS source from which historical data is to be read. If Incoming Message is selected, the output of the sources connected before this processor acts as the incoming message for the Register as Table processor. If File is selected, the file data is registered as a table; internally, the file is uploaded to the default HDFS. |
Table Name | Specify a name for the table which is to be registered. |
Connection Name | All available connections are listed here. Select a connection from where data is to be read. |
Data Format | Historical data format. |
Delimiter | Select the delimiter used in the historical data. For example, if your CSV data is separated by a comma (,), select (,) Comma. |
HDFS Path | HDFS path where data is stored. |
Cache Table | If this option is selected, the table is read only once after it is registered. Caching is available for all data sources and is helpful in batch pipelines where the same data is used in multiple places. |
Is Header Included | Select the checkbox if the first row of the data file is a header; otherwise leave it unchecked. |
If you select the Data Source as HIVE or JDBC, there will be two additional fields:
- Database Table Name
- Execute Query
Field | Description |
---|---|
Database Table Name | If the Database Table Name option is selected, specify the name of the table from which data will be fetched. |
Execute Query | If the Execute Query option is selected, write a custom query. The output of this query is stored in the current Spark session. |
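The difference between the two modes can be sketched with a stdlib sqlite3 stand-in for a JDBC source (illustrative only; in the product the query runs against the configured Hive/JDBC connection, and table and column names here are made up):

```python
import sqlite3

# sqlite3 stands in for a JDBC source here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (name TEXT, dept TEXT)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [("alice", "Sales"), ("bob", "HR"), ("carol", "Sales")])

# Database Table Name mode: the whole table is fetched and registered.
whole = con.execute("SELECT * FROM emp").fetchall()

# Execute Query mode: a custom query runs first; only its result is registered.
sales = con.execute("SELECT name FROM emp WHERE dept = 'Sales' ORDER BY name").fetchall()
print(len(whole), sales)  # 3 [('alice',), ('carol',)]
```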
If the user selects Snowflake, the following additional fields are displayed:
Field | Description |
---|---|
Connection Name | Provide the connection name for creating the connection. |
Warehouse Name | Provide the warehouse name against this column. |
Schema Name | Provide the schema name against this column. Note: The user can provide either the database table name or a query. |
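The Warehouse Name and Schema Name fields map onto connector options when Spark reads from Snowflake. A hedged sketch (option names follow the Spark Snowflake connector; all values are placeholders):

```python
# Placeholder values throughout; only the option names are meaningful.
snowflake_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfDatabase": "MY_DB",
    "sfWarehouse": "MY_WH",   # Warehouse Name field
    "sfSchema": "PUBLIC",     # Schema Name field
    # Per the note above, supply either a table name ...
    "dbtable": "EMPLOYEES",
    # ... or a custom query instead:
    # "query": "SELECT * FROM EMPLOYEES WHERE DEPT = 'SALES'",
}
```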
If S3 is selected as the data source, there will be one additional field:
- Bucket Name
Field | Description |
---|---|
Bucket Name | S3 bucket name from where data will be read. |
If Incoming Message is selected, the output of the sources connected before this processor acts as the incoming message for the Register as Table processor.
After fetching the data, specify the name with which the table is to be registered.
If you select the Data Source as Cassandra, there will be two additional fields:
- KeySpace Name
- Cassandra Table Name
Field | Description |
---|---|
KeySpace Name | The Cassandra keyspace name. |
Cassandra Table Name | Name of the table inside the keyspace from which data is read. |
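These two fields correspond to the options the Spark Cassandra connector expects. A hedged sketch (keyspace and table names are placeholders):

```python
# Placeholder values; only the option names are meaningful.
cassandra_options = {
    "keyspace": "company",    # KeySpace Name field
    "table": "employees",     # Cassandra Table Name field
}
# e.g. spark.read.format("org.apache.spark.sql.cassandra").options(**cassandra_options)
```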
If you select the Data Source as Couchbase, there will be an additional field:
- Bucket Name
Field | Description |
---|---|
Bucket Name | Couchbase Bucket Name. |
If you select the Data Source as Redshift, there will be a few additional fields, depending on which of the two following options is selected:
- Database Table Name
Field | Description |
---|---|
Database Table Name | Name of the table from which data is to be fetched. |
Max no. of Rows | Specify the maximum number of rows. |
- Execute Query
Field | Description |
---|---|
Execute Query | Write a custom query. The output of this query is stored in the current Spark session. |
Max no. of Rows | Specify the maximum number of rows. |
If you have any feedback on Gathr documentation, please email us!