RegisterAsTable Processor

This processor fetches historical data from any streaming or batch source and registers it as a table. The registered table can then be referenced whenever you wish to run queries against the registered data.

You can fetch tables from HDFS, JDBC, Snowflake, S3, Redshift, File, and Incoming Message sources.

For example, suppose you have historical data on HDFS containing information about the different departments of an organization. You can register that data using the Register as Table processor, and the registered table can then be used in a SQL processor to fetch the number of employees in the organization.

Processor Configuration

If the data source is HDFS, the following fields are displayed:

Data Source: Select the source from which historical data is to be read. Available options are HDFS, JDBC, Snowflake, S3, Redshift, File, and Incoming Message.

If the option selected is Incoming Message, the output of the sources connected before this processor acts as the incoming message for the Register As Table processor.

In case of the File data source, the file's data is registered as a table; internally, the file is uploaded to the default HDFS.

Table Name: Specify a name for the table to be registered.

Connection Name: All available connections are listed here. Select the connection from which data is to be read, or add a new connection.

Data Format: Historical data format.

Delimiter: Select the delimiter used in the historical data.

For example, if your CSV data is separated by commas, select Comma (,) as the delimiter.

HDFS Path: HDFS path where data is stored.

Cache Table: If this option is selected, the table is read only once after it is registered.

This option is used to cache the registered table data and can be used for all available data sources. It is helpful in batch pipelines where the data is used at multiple places.

Is Header Included: Select the checkbox if the first row of the data file is a header; otherwise, leave it unchecked.

Post Query: A query applied to the registered data, for example:
where column=value order by column desc limit 2

If you select the Data Source as JDBC, there will be two additional fields:

  • Database Table Name

  • Execute Query

Database Table Name: If this option is selected, specify the name of the database table from which data will be fetched.

Execute Query: If this option is selected, you can write a custom query. The output of this query is stored in the current Spark session.

If you select Snowflake, the following additional fields appear:

Connection Name: Provide the connection name used to create the connection.

Warehouse Name: Provide the warehouse name against this column.

Schema Name: Provide the schema name against this column.

If the Data Source selected is S3, there will be one additional field:

  • Bucket Name

Bucket Name: S3 bucket name from where data will be read.

If the option selected is Incoming Message, the output of the sources connected before this processor acts as the incoming message for the Register As Table processor. You need to specify the name with which the table is to be registered after the data is fetched.

If you select the Data Source as Redshift, there will be a few additional fields, depending on the two following options:

  • Database Table Name

Database Table Name: Name of the table from which data is to be fetched.

Max no. of Rows: Specify the maximum number of rows.

  • Execute Query

Execute Query: Write a custom query. The output of this query is stored in the existing Spark session.

Max no. of Rows: Specify the maximum number of rows.
