Register as Table Processor

This processor fetches historical data from any streaming or batch source and registers that data as a table. The registered table can then be referenced whenever the user needs to run queries on the registered data sources.

Users can fetch tables from ADLS, Cassandra, Couchbase, Redshift, HDFS, Hive, JDBC, Snowflake, File, Incoming Message, GCS, BigQuery, and S3.

For example, suppose the user has historical data on HDFS that contains information about the different departments of an organization. The user can register that data using the Register as Table processor, and the registered table can then be used in a SQL processor to fetch the number of employees in the organization.
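Conceptually, registering data as a table corresponds to creating a temporary view in the Spark session that downstream SQL can query. The following is a minimal PySpark sketch of that idea, not the processor's actual implementation; the path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("register-as-table-example").getOrCreate()

# Read the historical department data stored on HDFS (CSV with a header row).
departments = (
    spark.read
    .option("header", "true")
    .option("delimiter", ",")
    .csv("hdfs:///data/org/departments.csv")   # hypothetical path
)

# Register the DataFrame as a table in the current Spark session.
departments.createOrReplaceTempView("departments")

# A downstream SQL processor can now query the registered table,
# e.g. to count employees across the organization.
spark.sql("SELECT COUNT(employee_id) AS employee_count FROM departments").show()
```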

Register as Table Processor Configuration

To add a Register as Table processor to your pipeline, drag the processor onto the canvas and click on it to configure.

If ADLS is selected as the data source, the following fields are displayed:

Data Source: Select the ADLS source from where historical data is to be read. Available options to fetch tables from are ADLS, Cassandra, Couchbase, Redshift, HDFS, Hive, JDBC, Snowflake, File, Incoming Message, GCS, BigQuery, and S3. If Incoming Message is selected, the output of the sources connected before this processor acts as the incoming message for the Register as Table processor. In the case of the File data source, the file data is registered as a table; internally, the file is uploaded to the default HDFS.

Table Name: Name with which the table is to be registered.

Connection Name: All available connections are listed here. Select the connection from which data is to be read.

Override Credential: Check this option to override credentials for user-specific actions. The options below become available once this option is checked.

Authentication Type: Azure ADLS authentication type.

Account Name: Provide a valid Azure ADLS account name.

Account Key: Provide a valid account key.

Client ID: Provide a valid client ID.

Client Secret Password: Provide a valid client secret password.

Directory ID: Provide a valid directory ID. Click the TEST CONNECTION button to test the connection.
Data Format: Historical data format. Available options are CSV, JSON, Parquet, ORC, Avro, and Text.

Delimiter: Select the delimiter of the historical data. For example, if your CSV data is separated by a comma (,), select the (,) Comma delimiter.

Container Name: ADLS container name from which the data should be read.

ADLS Path: Directory path on the ADLS file system.
Cache Table: Enable caching of the registered table in application memory. If this option is selected, the table is read only once after it is registered. Caching can be used with all available data sources and is helpful in batch pipelines where the same data is used at multiple places.

Refresh Cache: Enable this option to refresh the cached table.

Refresh Interval: Time interval after which the cached table is refreshed.

Max no of rows: Provide the maximum number of rows to be included.

Is Header Included: Mark this field if the first row of the file is a header; otherwise leave it unmarked.

Post Query: Provide a post query, for example: where column=value order by column desc limit 2. The sketch below illustrates how such a clause might be applied.
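One way to picture the Post Query option is as a clause appended to a SELECT over the registered table before the result flows downstream. A minimal PySpark sketch of that reading, with hypothetical table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assume "departments" was already registered as a table by this processor.
table_name = "departments"
post_query = "where department = 'Sales' order by hire_date desc limit 2"

# The post query clause is appended to a plain SELECT over the registered table.
result = spark.sql(f"select * from {table_name} {post_query}")
result.show()
```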

Additional tables can be registered by clicking the +Register Table button. The user can also add Environment Params by clicking the +ADD PARAM button.

If HDFS is selected as the data source, the following fields are displayed:

Data Source: Select the HDFS source from where historical data is to be read. If Incoming Message is selected, the output of the sources connected before this processor acts as the incoming message for the Register as Table processor. In the case of the File data source, the file data is registered as a table; internally, the file is uploaded to the default HDFS.

Table Name: Specify a name for the table that is to be registered.

Connection Name: All available connections are listed here. Select the connection from which data is to be read.

Data Format: Historical data format.

Delimiter: Select the delimiter of the historical data. For example, if your CSV data is separated by a comma (,), select the (,) Comma delimiter.

HDFS Path: HDFS path where the data is stored.
Cache Table: If this option is selected, the table is read only once after it is registered. Caching can be used with all available data sources and is helpful in batch pipelines where the same data is used at multiple places, as shown in the sketch after this table.

Is Header Included: Select the checkbox if the first row of the data file is a header; otherwise leave it unchecked.
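To make the caching behavior concrete, here is a minimal PySpark sketch with a hypothetical HDFS path: once the registered table is cached, repeated queries reuse the in-memory copy instead of re-reading HDFS.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("header", "true")
      .csv("hdfs:///data/org/departments.csv"))   # hypothetical path
df.createOrReplaceTempView("departments")

# With Cache Table enabled, the registered table is materialized in memory
# once, so repeated references in a batch pipeline do not re-read HDFS.
spark.catalog.cacheTable("departments")

# Both queries below reuse the cached data instead of scanning HDFS again.
spark.sql("SELECT COUNT(*) FROM departments").show()
spark.sql("SELECT department, COUNT(*) AS cnt FROM departments GROUP BY department").show()
```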

If you select the Data Source as Hive or JDBC, there will be two additional fields:

  • Database Table Name

  • Execute Query

Database Table Name: Table from which data will be fetched. If the Database Table Name option is selected, specify the name of the table.

Execute Query: If the Execute Query option is selected, you can write a custom query. The output of this query is stored in the current Spark session.
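As an illustration of how the two options differ on a JDBC source, here is a hedged PySpark sketch: Spark's JDBC reader distinguishes a whole-table read (the dbtable option) from a pushed-down custom query (the query option). All connection details are placeholders, and the processor's actual implementation may differ.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_opts = {
    "url": "jdbc:mysql://db-host:3306/orgdb",   # placeholder connection
    "user": "reader",
    "password": "secret",
}

# Database Table Name: fetch a whole table.
whole_table = (spark.read.format("jdbc")
               .options(dbtable="employees", **jdbc_opts)
               .load())

# Execute Query: push a custom query down to the database; once registered,
# its output lives in the current Spark session.
custom = (spark.read.format("jdbc")
          .options(query="SELECT department, COUNT(*) AS cnt "
                         "FROM employees GROUP BY department",
                   **jdbc_opts)
          .load())
custom.createOrReplaceTempView("dept_counts")
```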

If Snowflake is selected as the data source, the following additional fields are displayed:

Connection Name: Provide the connection name used to create the connection.

Warehouse Name: Provide the Snowflake warehouse name.

Schema Name: Provide the schema name.

Note: The user can provide either a database table name or a query.
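For intuition, here is a hedged sketch using the Snowflake Spark connector (spark-snowflake); whether the processor uses this connector internally is an assumption. The option keys follow the connector's documented names, and all values are placeholders. Note how either a table name or a query can be supplied, matching the note above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sf_opts = {
    "sfURL": "myaccount.snowflakecomputing.com",  # placeholder account
    "sfUser": "reader",
    "sfPassword": "secret",
    "sfDatabase": "ORG_DB",
    "sfSchema": "PUBLIC",          # Schema Name field
    "sfWarehouse": "ANALYTICS_WH", # Warehouse Name field
}

# Either provide a database table name...
df = (spark.read.format("net.snowflake.spark.snowflake")
      .options(**sf_opts)
      .option("dbtable", "EMPLOYEES")
      .load())

# ...or provide a query instead.
df = (spark.read.format("net.snowflake.spark.snowflake")
      .options(**sf_opts)
      .option("query", "SELECT * FROM EMPLOYEES WHERE ACTIVE = TRUE")
      .load())

df.createOrReplaceTempView("employees")
```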

If S3 is selected as the data source, there will be one additional field:

  • Bucket Name

Bucket Name: S3 bucket name from which data will be read.

If Incoming Message is selected, the output of the sources connected before this processor acts as the incoming message for the Register as Table processor.

After fetching the data, specify the name with which the table is to be registered.
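A minimal sketch of registering S3 data, assuming the s3a:// filesystem connector is configured in the cluster; the bucket and path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read historical data from the configured S3 bucket and register it.
df = (spark.read
      .option("header", "true")
      .csv("s3a://my-bucket/historical/departments/"))  # placeholder bucket/path
df.createOrReplaceTempView("departments")
```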

If you select the Data Source as Cassandra, there will be two additional fields:

  • KeySpace Name

  • Cassandra Table Name

KeySpace Name: Cassandra keyspace name.

Cassandra Table Name: Name of the table inside the keyspace from which data is read.
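For intuition, the two fields map directly onto the keyspace and table options of the DataStax Spark Cassandra Connector; whether the processor uses this connector is an assumption, and the host and names below are placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder
         .getOrCreate())

# KeySpace Name and Cassandra Table Name correspond to the connector's
# keyspace and table options.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="org_keyspace", table="employees")
      .load())
df.createOrReplaceTempView("employees")
```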

If you select the Data Source as Couchbase, there will be an additional field:

  • Bucket Name
Bucket Name: Couchbase bucket name.

If you select the Data Source as Redshift, a few additional fields are displayed, depending on which of the two following options is chosen:

  • Database Table Name

Database Table Name: Name of the Redshift table from which data is to be fetched.

Max no. of Rows: Specify the maximum number of rows to be fetched.

  • Execute Query

Execute Query: Write a custom query. The output of this query is stored in the existing Spark session.
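As a hedged sketch of the two Redshift modes over plain JDBC (the connector the processor actually uses may differ), the connection details below are placeholders; the Max no. of Rows cap is mimicked with a LIMIT subquery.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

opts = {
    "url": "jdbc:redshift://redshift-host:5439/orgdb",  # placeholder connection
    "user": "reader",
    "password": "secret",
    "driver": "com.amazon.redshift.jdbc42.Driver",
}

# Database Table Name mode: fetch rows from a named table,
# capped here to mirror Max no. of Rows.
by_table = (spark.read.format("jdbc")
            .options(dbtable="(select * from employees limit 1000) t", **opts)
            .load())

# Execute Query mode: run a custom query; once registered, its result
# is available in the existing Spark session.
by_query = (spark.read.format("jdbc")
            .options(query="select department, count(*) as cnt "
                           "from employees group by department",
                     **opts)
            .load())
by_query.createOrReplaceTempView("dept_counts")
```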
