Databricks Ingestion Source
The Databricks Data Source in Gathr facilitates data extraction from Databricks for analysis and transformation. It offers an easy configuration process to define extraction parameters.
Data Source Configuration
Configure the data source parameters as explained below.
Fetch From Source/Upload Data File
To design the application, you can either fetch the sample data from the Databricks source by providing the data source connection details or upload a sample data file in one of the supported formats to see the schema details during the application design phase.
Upload Data File
To design the application, please upload a data file containing sample records in a format supported by Gathr.
The sample data provided for application design should match the data source schema from which data will be fetched during runtime.
If the Upload Data File method is selected to design the application, provide the details below.
File Format
Select the format of the sample file depending on the file type.
Gathr-supported file formats for Databricks data sources are CSV, JSON, XML, and Fixed Length.
For CSV file format, select its corresponding delimiter.
Header Included
Enable this option to read the first row as a header if your Databricks sample data file is in CSV format.
Upload
Please upload the sample file as per the file format selected above.
Fetch From Source
If the Fetch From Source method is selected to design the application, the data source connection details will be used to get sample data.
Continue to configure the data source.
Connection Name
Connections are the service identifiers. A connection name can be selected from the list if you have created and saved connection details for Databricks earlier. Or create one as explained in the topic - Databricks Connection →
Use the Test Connection option to make sure that the connection with Databricks is established successfully.
A success message states that the connection is available. In case of any error in the test connection, edit the connection to resolve the issue before proceeding further.
Use the Start Query Endpoint option to initialize the Databricks Query Endpoint if it is not running.
Is Unity Catalog Enabled?
You can check this option if your Databricks workspace is enabled for Unity Catalog.
To know how to verify if your workspace is already enabled for Unity Catalog, click here.
Catalog Name
All catalogs associated with your Databricks workspace will be listed when Unity Catalog is enabled.
Choose a catalog to list its schema names in the subsequent configuration.
Schema Name
Choose a schema to list its table names in the subsequent configuration.
Table Name
Table names will get listed as per the connection details configured.
Select the table name to fetch the data.
Query
SQL query to be executed in the component.
For example, select * from schema_name.table_name.
Design Time Query
Query used to fetch a limited number of records during application design. It is used only during schema detection and inspection.
For example, select * from ( select * from schema_name.table_name ) alias limit 100
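The design-time query simply wraps the runtime query with a row limit, as the example above shows. A minimal sketch of that pattern (the helper name is illustrative, not part of the Gathr API):

```python
def design_time_query(base_query: str, limit: int = 100) -> str:
    """Wrap a runtime SQL query so only a sample of rows is fetched.

    Mirrors the documented pattern:
        select * from ( <base_query> ) alias limit <limit>
    Hypothetical helper for illustration only.
    """
    return f"select * from ( {base_query} ) alias limit {limit}"

print(design_time_query("select * from schema_name.table_name"))
# select * from ( select * from schema_name.table_name ) alias limit 100
```

Wrapping the query in a subselect keeps the limit valid regardless of clauses (such as ORDER BY) in the base query.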
Enable Query Partitioning
This enables parallel reading of data from the table. It is disabled by default; when enabled, the table is read in partitions.
Partition on Column
Specify the column to be used for partitioning the query data. If the list is empty or you do not see the required column, you can type in the column name.
Autodetect Bounds
Check this option to auto-detect the partition boundaries.
If Autodetect Bounds is enabled, additional fields will appear:
Row Count in Single Query
To load large volumes of data from Databricks, the system runs multiple parallel queries, where each query loads a subset of the data. With this property, you can specify the number of rows to be read in a single query.
Column has Unique Values
Enable this if the column specified for partitioning has unique values.
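The row-count property above determines how many parallel queries are issued: roughly the total row count divided by the rows read per query, with each query covering one range. A rough sketch of that split (a hypothetical helper; Gathr's internal partitioning logic may differ):

```python
import math

def split_into_queries(total_rows: int, rows_per_query: int):
    """Illustrate how a rows-per-query setting yields parallel range queries.

    Returns (start, end) row ranges, one per query.
    Illustrative only; not the actual Gathr implementation.
    """
    n = math.ceil(total_rows / rows_per_query)
    return [
        (i * rows_per_query, min((i + 1) * rows_per_query, total_rows))
        for i in range(n)
    ]

# 2,500 rows read 1,000 at a time -> three parallel queries
print(split_into_queries(2500, 1000))
# [(0, 1000), (1000, 2000), (2000, 2500)]
```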
If Autodetect Bounds is disabled, additional fields will appear:
No. of Partitions
Specifies the number of parallel threads to be invoked to partition the table while reading the data.
Lower Bound
Specify the lower bound value for the partitioning column. Along with the upper bound, it is used to decide the partition boundaries; the entire dataset is distributed into multiple chunks depending on the input values.
Upper Bound
Specify the upper bound value for the partitioning column. Along with the lower bound, it is used to decide the partition boundaries; the entire dataset is distributed into multiple chunks depending on the input values.
Fetch Size
The fetch size determines the number of rows to be fetched per round trip. The default value is 1000.
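When bounds are supplied manually, JDBC-style readers typically split the range between the lower and upper bound into evenly strided chunks, one per partition. A sketch of that boundary math (illustrative only; Gathr's internal partitioning may differ):

```python
def partition_bounds(lower: int, upper: int, num_partitions: int):
    """Compute per-partition (start, end) ranges using an even stride
    between the lower and upper bound, one range per parallel reader.

    Illustrative helper, not the actual Gathr implementation.
    """
    stride = (upper - lower) / num_partitions
    bounds = []
    for i in range(num_partitions):
        lo = lower + round(i * stride)
        # Last partition always ends exactly at the upper bound.
        hi = lower + round((i + 1) * stride) if i < num_partitions - 1 else upper
        bounds.append((lo, hi))
    return bounds

# A column ranging 0..10000 read with 4 partitions
print(partition_bounds(0, 10000, 4))
# [(0, 2500), (2500, 5000), (5000, 7500), (7500, 10000)]
```

Note that rows with partition-column values outside the configured bounds are not excluded by such readers; the bounds only control how the range is divided among parallel queries.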
Schema
Check the populated schema details. For more details, see Schema Preview →
Advanced Configuration
Optionally, you can enable incremental read. For more details, see Databricks Incremental Configuration →
If you have any feedback on Gathr documentation, please email us!