Databricks ETL Source

The Databricks Data Source in Gathr facilitates data extraction from Databricks for analysis and transformation. It offers an easy configuration process to define extraction parameters.

Schema Type

See the topic Provide Schema for ETL Source → to learn how to provide schema details for data sources.

After providing schema type details, the next step is to configure the data source.

Data Source Configuration

Configure the data source parameters as explained below.

Connection Name

Connections are the service identifiers. Select a connection name from the list if you have already created and saved connection details for Databricks, or create one as explained in the topic Databricks Connection →

Use the Test Connection option to make sure that the connection with Databricks is established successfully.

A success message states that the connection is available. If the test connection reports an error, edit the connection to resolve the issue before proceeding.

💡

If the Databricks SQL Warehouse is not running, a warning is displayed asking you to initialize the SQL Warehouse.

Use the Start SQL Warehouse option to initialize the Databricks SQL Warehouse if it is not running.

💡

When you are connected to the Databricks Engine, you are not asked for connection details while configuring Databricks channels and emitters. Instead, your Databricks instance session is used to retrieve the required metadata.

Is Unity Catalog Enabled?

Check this option if your Databricks workspace is enabled for Unity Catalog.

To know how to verify if your workspace is already enabled for Unity Catalog, click here.

Catalog Name

All catalogs associated with your Databricks workspace will be listed when Unity Catalog is enabled.

Choose a catalog to list its schema names in the subsequent configuration.

Schema Name

Choose a schema to list its table names in the subsequent configuration.

Table Name

Table names are listed based on the configured connection details.

Select the table from which to fetch the data.
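The catalog, schema, and table selections together identify one table. As a minimal sketch, assuming the usual Databricks convention of a three-level `catalog.schema.table` name when Unity Catalog is enabled and a two-level `schema.table` name otherwise, the reference could be assembled like this. The function name and the sample names `main`, `sales`, and `orders` are hypothetical and are not part of Gathr.

```python
def qualified_table_name(schema, table, catalog=None):
    """Build a backtick-quoted table reference.

    With a catalog (Unity Catalog enabled): catalog.schema.table
    Without a catalog:                      schema.table
    """
    parts = [p for p in (catalog, schema, table) if p]
    # A backtick inside an identifier is escaped by doubling it.
    return ".".join("`" + p.replace("`", "``") + "`" for p in parts)


# Hypothetical names, used only for illustration.
with_catalog = qualified_table_name("sales", "orders", catalog="main")
without_catalog = qualified_table_name("sales", "orders")
```

Quoting each part separately keeps names that contain dots or spaces from being misread as extra namespace levels.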

Query

SQL query to be executed in the component.

For example, select * from schema_name.table_name.

Design Time Query

Query used to fetch a limited number of records during application design. It is used only during schema detection and inspection.

For example, select * from ( select * from schema_name.table_name ) alias limit 100
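The design-time query in the example is the main query wrapped in a subquery with a row limit. A minimal sketch of that wrapping, assuming the main query is a single SELECT statement, might look like the following. The helper name `design_time_query` is hypothetical; Gathr expects you to type the query yourself.

```python
def design_time_query(query, limit=100, alias="alias"):
    """Wrap a SELECT statement so that it returns at most `limit` rows."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Drop surrounding whitespace and any trailing semicolon so the
    # statement can be nested inside parentheses.
    inner = query.strip().rstrip(";").strip()
    return f"select * from ( {inner} ) {alias} limit {limit}"


wrapped = design_time_query("select * from schema_name.table_name;")
```

Wrapping rather than appending `limit` directly keeps the sketch valid even when the main query already ends with its own ORDER BY or LIMIT clause.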

Enable Query Partitioning

Enables parallel reading of data from the table. This option is disabled by default. When it is enabled, the table is partitioned for reading.

💡

This field will not appear when you are connected to the Databricks Engine.

Type-in Partition Column

Select this option if the Partition Column list is empty or does not show the required column.

Partition on Column

Specify the column to be used for partitioning the query data. If the list is empty or does not show the required column, you can type in the column name.

Data Type

If you typed in the partitioning column, specify the data type of that column here.

Autodetect Bounds

Check this option to auto-detect the partition boundaries.

Row Count in Single Query

To load a large volume of data from Databricks, the system runs multiple parallel queries, where each query loads a subset of the data. With this property, you can specify the number of rows to be read in a single query.
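To get a feel for how this setting relates to the number of queries, the arithmetic can be sketched as below. This is an illustration only: it assumes the rows are split into equal-sized chunks, which may differ from how Gathr actually divides the work, and the function name `query_count` is hypothetical.

```python
def query_count(total_rows, rows_per_query):
    """Number of queries needed if each one reads at most `rows_per_query` rows."""
    if rows_per_query <= 0:
        raise ValueError("rows_per_query must be a positive integer")
    if total_rows < 0:
        raise ValueError("total_rows must not be negative")
    # Ceiling division using integers only, so large counts stay exact.
    return -(-total_rows // rows_per_query)


exact_fit = query_count(1_000_000, 250_000)
one_extra_row = query_count(1_000_001, 250_000)
```

A smaller row count per query means more, lighter queries; a larger one means fewer, heavier queries.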

Column has Unique Values

Enable this if the column specified for partitioning has unique values.

No. of Partitions

Specifies the number of parallel threads to be invoked to partition the table while reading the data.

Lower Bound

Specify the lower bound value for partitioning rows. It is used to decide the partition boundaries. The entire dataset is distributed into multiple chunks depending on the input value.

Upper Bound

Specify the upper bound value for partitioning rows. It is used to decide the partition boundaries. The entire dataset is distributed into multiple chunks depending on the input value.
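The way a lower bound, an upper bound, and a partition count can turn into chunk boundaries is sketched below. It uses a fixed-stride scheme similar in spirit to range partitioning in JDBC-style readers; whether Gathr computes its boundaries exactly this way is an assumption. The function name and the sample column `id` are hypothetical. In this sketch the bounds only shape the boundaries and do not filter rows: the first chunk also takes values below the lower bound and NULLs, and the last chunk takes values above the upper bound.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Return one WHERE predicate per partition for an integer column."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower bound must be less than upper bound")
    if num_partitions == 1:
        return ["1 = 1"]  # a single partition reads every row

    # Width of each chunk; never smaller than 1.
    stride = max((upper - lower) // num_partitions, 1)
    predicates = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lo + stride
        if i == 0:
            # First chunk is open at the bottom and also collects NULLs.
            predicates.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last chunk is open at the top.
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates


chunks = partition_predicates("id", 0, 100, 4)
```

Bounds that are far from the real minimum and maximum of the column would leave most rows in the first or last chunk, which is why bounds close to the actual value range give more even partitions.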

Fetch Size

The fetch size determines the number of rows to be fetched per round trip. The default value is 1000.

💡

This field will not appear when you are connected to the Databricks Engine.

Detect Schema

Check the populated schema details. For more details, see Schema Preview →

Incremental Read

Optionally, you can enable incremental read. For more details, see Databricks Incremental Configuration →

Pre Action

To understand how to provide SQL queries or stored procedures that are executed during a pipeline run, see Pre-Actions →

Notes

Optionally, enter notes in the Notes → tab and save the configuration.
