Delta (Batch and Streaming) Data Source
Using the Delta Lake channel, you can read data from a Delta Lake table on S3, HDFS, GCS, ADLS, or DBFS. Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. All data in Delta Lake is stored in Apache Parquet format. Delta Lake lets you specify and enforce a schema, and supports time travel to earlier snapshots of a table by version or timestamp.
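For orientation, here is a minimal PySpark sketch of the kind of read this channel performs under the hood. The two session settings are the standard open-source Delta Lake configuration; the table path is a hypothetical example, not Gathr configuration.

```python
from pyspark.sql import SparkSession

# Standard Delta Lake session setup (assumes the delta-spark package is on
# the classpath); these settings enable Delta's SQL extension and catalog.
spark = (
    SparkSession.builder
    .appName("delta-source-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Batch read: Delta stores data as Parquet plus a transaction log, and the
# reader enforces the schema recorded in that log.
df = spark.read.format("delta").load("/data/events_delta")  # hypothetical path
df.printSchema()
```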
Configuring Delta Data Source
To add a Delta Data Source to your pipeline, drag the Data Source onto the canvas and click it to configure.
Under the Schema Type tab, you can either Upload Data File or Fetch From Source. Below are the configuration details of the Delta source (Batch and Streaming):
Field | Description |
---|---|
Source | Select the source for reading the Delta file from the available options in the drop-down list: HDFS, S3, GCS, DBFS, and ADLS. |
Provide the below fields if you select HDFS as the source for reading the data:

Field | Description |
---|---|
Connection Name | Select the connection name from the available list of connections, from which you want to read the data. |
Override Credentials | Unchecked by default. Select the checkbox to override credentials for user-specific actions. |
Username | Once the Override Credentials option is checked, provide the username under which the Hadoop service is running. |
HDFS File Path | Provide the file path on the HDFS file system. |
Time Travel Option | Select one of the time travel options (see the sketch after this table): - None: Do not use time travel. - Version: Specify the version of the Delta table to fetch the older snapshot with the given version number. - Timestamp: Specify a last-modified time; all files whose last modified time is greater than the given value are read. The Time Travel option is not available for the Streaming Delta source. |
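The open-source Delta Lake reader exposes the two time-travel modes above as the `versionAsOf` and `timestampAsOf` options. A minimal sketch, assuming a Delta-enabled session as in the first example and a hypothetical HDFS path:

```python
from pyspark.sql import SparkSession

# Assumes a Delta-enabled session as configured in the first sketch.
spark = SparkSession.builder.getOrCreate()

hdfs_path = "hdfs://namenode:8020/warehouse/events_delta"  # hypothetical

# Version: fetch the older snapshot of the table with the given version number.
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load(hdfs_path)

# Timestamp: fetch the snapshot for the given point in time. Time travel
# applies to batch reads only; it is not available for streaming.
df_ts = (
    spark.read.format("delta")
    .option("timestampAsOf", "2023-01-15 00:00:00")
    .load(hdfs_path)
)
```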
Provide the below fields if you select S3 as the source for reading the data:

Field | Description |
---|---|
Connection Name | Select the connection name from the available list of connections, from which you want to read the data. |
Override Credentials | Unchecked by default. Select the checkbox to override credentials for user-specific actions. |
AWS Key Id | Provide the S3 account access key. |
Secret Access Key | Provide the S3 account secret key. Once the AWS Key Id and Secret Access Key are provided, you have the option to test the connection. |
S3 Protocol | Select the S3 protocol from the drop-down list. The following protocols are supported for the various connection types: - For HDP versions, the S3a protocol is supported. - For CDH versions, the S3a protocol is supported. - For Apache versions, the S3n protocol is supported. - For GCP, the S3n and S3a protocols are supported. - For Azure, the S3n protocol is supported. Read/Write to the Mumbai and Ohio regions is not supported. - For EMR, the S3, S3n, and S3a protocols are supported. - For AWS Databricks, the S3a protocol is supported. |
Bucket Name | Provide the S3 bucket name. |
Path | Provide the sub-directories of the bucket from which the data is to be read. |
Time Travel Option | Select one of the time travel options (see the sketch after this table): - None: Do not use time travel. - Version: Specify the version of the Delta table to fetch the older snapshot with the given version number. - Timestamp: Specify a last-modified time; all files whose last modified time is greater than the given value are read. The Time Travel option is not available for the Streaming Delta source. |
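As a sketch of what the S3 fields map to at the Spark level: the access key, secret key, and S3a protocol correspond to the standard hadoop-aws properties below. The bucket and path are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes a Delta-enabled session as configured in the first sketch.
spark = SparkSession.builder.getOrCreate()

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "<AWS Key Id>")         # placeholder
hconf.set("fs.s3a.secret.key", "<Secret Access Key>")  # placeholder

# Bucket Name + Path combine into an s3a URI for the Delta read.
df = spark.read.format("delta").load("s3a://my-bucket/deltalake/events")
```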
Provide the below fields if you select GCS as the source for reading the data:

Field | Description |
---|---|
Connection Name | Select the connection name from the available list of connections, from which you want to read the data. |
Override Credentials | Unchecked by default. Select the checkbox to override credentials for user-specific actions. |
Service Account Key | Upload the service account key file. You have the option to TEST CONNECTION. |
Bucket Name | Provide the GCS bucket name without any prefix. |
Path | Provide the sub-directories of the bucket from which the data is to be read. |
Time Travel Option | Select one of the time travel options (see the sketch after this table): - None: Do not use time travel. - Version: Specify the version of the Delta table to fetch the older snapshot with the given version number. - Timestamp: Specify a last-modified time; all files whose last modified time is greater than the given value are read. The Time Travel option is not available for the Streaming Delta source. |
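A comparable sketch for GCS, assuming the GCS Hadoop connector is on the classpath; the service-account properties are the connector's standard settings, and the keyfile path and bucket names are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes a Delta-enabled session as configured in the first sketch.
spark = SparkSession.builder.getOrCreate()

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("google.cloud.auth.service.account.enable", "true")
hconf.set("google.cloud.auth.service.account.json.keyfile",
          "/secrets/service-account.json")  # the uploaded Service Account Key

# Bucket Name (entered without a prefix in the UI) + Path form the gs:// URI.
df = spark.read.format("delta").load("gs://my-bucket/deltalake/events")
```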
Provide the below fields if you select DBFS as the source for reading the data:

Field | Description |
---|---|
Connection Name | Select the connection name from the available list of connections, from which you want to read the data. |
Override Credentials | Unchecked by default. Select the checkbox to override credentials for user-specific actions. |
Directory Path | Provide the DBFS parent path for checkpointing (see the streaming sketch after this table). |
DBFS File Path | Provide the DBFS file path. |
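For the streaming variant on Databricks, the DBFS File Path would feed `readStream` and the Directory Path would supply the checkpoint location where Structured Streaming tracks progress. A minimal sketch with hypothetical paths:

```python
from pyspark.sql import SparkSession

# Assumes a Delta-enabled session as configured in the first sketch.
spark = SparkSession.builder.getOrCreate()

# Streaming read of the Delta table at the DBFS File Path.
stream_df = spark.readStream.format("delta").load("dbfs:/delta/events")

query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "dbfs:/checkpoints/events")  # Directory Path
    .start("dbfs:/delta/events_out")  # hypothetical downstream sink
)
```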
Provide the below fields if you select ADLS as the source for reading the data:

Field | Description |
---|---|
Connection Name | Select the connection name from the available list of connections, from which you want to read the data. |
Container Name | Provide the container name for Azure Data Lake Storage. |
ADLS File Path | Provide the directory path for the Azure Data Lake Storage file system. |
Time Travel Option | Select one of the time travel options (see the sketch after this table): - None: Do not use time travel. - Version: Specify the version of the Delta table to fetch the older snapshot with the given version number. - Timestamp: Specify a last-modified time; all files whose last modified time is greater than the given value are read. The Time Travel option is not available for the Streaming Delta source. |
ADD CONFIGURATION | Add additional custom properties in key-value pairs. |
Environment Params | Optionally, add further environment parameters. |
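A final sketch for ADLS Gen2 over the abfss scheme; the account-key property is the standard hadoop-azure setting, and the storage account, container, and key are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes a Delta-enabled session as configured in the first sketch.
spark = SparkSession.builder.getOrCreate()

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.azure.account.key.mystorageacct.dfs.core.windows.net",
          "<storage-account-key>")  # placeholder

# Container Name + ADLS File Path form the abfss URI for the Delta read.
df = spark.read.format("delta").load(
    "abfss://mycontainer@mystorageacct.dfs.core.windows.net/deltalake/events"
)
```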
Configure Pre-Action in Source →