Delta Lake Batch ETL Source
The Delta Lake batch source reads data from Delta tables stored on S3, DBFS, ADLS, or GCS. Delta Lake provides a structured, transactional approach to reading and processing data, which makes it well suited to batch ETL workloads that extract and process historical data from these platforms.
Schema Type
See the topic Provide Schema for ETL Source → to learn how schema details can be provided for data sources.
After providing schema type details, the next step is to configure the data source.
Data Source Configuration
Configure the data source parameters explained below.
Source
Select the source from which the Delta table will be read: S3, DBFS, ADLS, or GCS.
Connection Name
Provide connection details for S3, DBFS, ADLS, or GCS, depending on the chosen source.
Connections are service identifiers. If you have created and saved connection details earlier, you can select a connection name from the list.
Create a connection for S3 as explained in the topic - Amazon S3 Connection →
Create a connection for DBFS as explained in the topic - DBFS Connection →
Create a connection for ADLS as explained in the topic - ADLS Connection →
Create a connection for GCS as explained in the topic - GCS Connection →
For S3 and GCS sources, provide the details below:
Bucket Name
Specify the name of the storage bucket where your Delta Lake data is stored. The bucket name helps direct the data source to the correct storage location within the chosen cloud platform (S3 or GCS).
Path
Define the path to the specific location within the storage bucket where your Delta Lake data is stored. This path directs the data source to the precise directory or folder containing the data you want to access.
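Outside the Gathr UI, this kind of read can be expressed with a plain Spark Delta reader. The sketch below is only illustrative, not Gathr's internal implementation: it assumes a Spark session with the Delta Lake extensions available and uses hypothetical bucket and path values.

```python
from pyspark.sql import SparkSession

# Illustrative Delta-enabled Spark session; in Gathr the session and the
# cloud credentials come from the selected connection, so nothing here
# reflects the product's actual wiring. Requires the delta-spark package.
spark = (
    SparkSession.builder
    .appName("delta-batch-source-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Bucket Name and Path combine into a single object-store URI (hypothetical names):
#   S3  -> s3a://<bucket-name>/<path>
#   GCS -> gs://<bucket-name>/<path>
df = spark.read.format("delta").load("s3a://my-bucket/warehouse/sales_delta")
df.show(5)
```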
For a DBFS source, provide the details below:
DBFS file path
File path within the DBFS file system.
For an ADLS source, provide the details below:
Container Name
ADLS container name from which the data should be read.
ADLS file path
File path within the ADLS container.
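For DBFS and ADLS sources, the read itself looks the same; only the path scheme changes. A brief sketch with hypothetical mount, container, and storage-account names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session, as in the earlier sketch

# DBFS file path: a dbfs:/ URI (hypothetical mount point and table path).
dbfs_df = spark.read.format("delta").load("dbfs:/mnt/lake/warehouse/sales_delta")

# ADLS Gen2: Container Name plus ADLS file path form an abfss:// URI against
# the storage account tied to the ADLS connection (hypothetical names).
adls_df = spark.read.format("delta").load(
    "abfss://my-container@mystorageaccount.dfs.core.windows.net/warehouse/sales_delta"
)
```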
Time Travel Option
Choose how you want to access data history within your Delta Lake tables. The sketch after the Add Configuration field shows how these choices can map onto a plain Delta read.
None: Select “None” if you don’t need historical data access. This option provides the most recent data only.
Version: Select “Version” to access data at specific versions in the table’s history. You can specify a version number in the Version field.
Timestamp: Select “Timestamp” to access data as it existed at a particular point in time. In the Timestamp field, specify the last modified time of a file, and all files with a last modified time greater than this value will be read.
Add Configuration: To add additional custom Delta Lake properties as key-value pairs.
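In plain Delta Lake, version- and timestamp-based time travel are expressed with the versionAsOf and timestampAsOf reader options. The sketch below shows that mapping as an assumption only: the Timestamp field above is described in terms of file modification times, so Gathr's exact behaviour may differ, and the extra key-value property is a placeholder rather than a real Delta option.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session, as above
path = "s3a://my-bucket/warehouse/sales_delta"  # hypothetical table location

# None: read the latest snapshot of the table.
latest_df = spark.read.format("delta").load(path)

# Version: read the table as it existed at a specific version number.
v3_df = spark.read.format("delta").option("versionAsOf", 3).load(path)

# Timestamp: read the table as it existed at a point in time.
ts_df = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-06-01 00:00:00")
    .load(path)
)

# Add Configuration: additional key-value properties could be supplied as
# extra reader options; "my.custom.property" is a placeholder, not a real
# Delta Lake option.
custom_df = (
    spark.read.format("delta")
    .option("my.custom.property", "value")
    .load(path)
)
```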
Detect Schema
Verify the populated schema details. For more information, see Schema Preview →
Pre Action
To understand how to provide SQL queries or stored procedures that will be executed during the pipeline run, see Pre-Actions →
Notes
Optionally, enter notes in the Notes → tab and save the configuration.