Delta (Batch and Streaming) Data Source

Using the Delta Lake channel, you can read data from a Delta Lake table stored on S3, HDFS, GCS, ADLS, or DBFS. Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. All data in Delta Lake is stored in Apache Parquet format. Delta Lake also lets you specify a schema and enforce it, and it tracks table changes with timestamps.

Configuring Delta Data Source

To add a Delta Data Source to your pipeline, drag the Data Source to the canvas and click it to configure it.

Under the Schema Type tab, you can choose Upload Data File or Fetch From Source. The configuration details of the Delta Source (Batch and Streaming) are listed below:

Field: Description
Source: Select the source for reading the Delta file from the options available in the drop-down list: HDFS, S3, GCS, DBFS, and ADLS.

Provide the following fields if HDFS is selected as the source:

Connection Name: Select the connection from which to read the data from the list of available connections.
Override Credentials: Unchecked by default. Select the checkbox to override credentials for user-specific actions.
Username: Once the Override Credentials option is selected, provide the user name under which the Hadoop service is running.
HDFS File Path: Provide the file path on the HDFS file system.
Time Travel Option: Select one of the following time travel options:

- None: Time travel is not applied.

- Version: Specify the version number of the Delta table to fetch the older snapshot of the table at that version.

- Timestamp: Specify a last-modified time. All files whose last-modified time is later than the specified value are read.
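As a rough illustration of how the three choices differ, the sketch below maps a Time Travel selection to Delta Lake reader options. `versionAsOf` and `timestampAsOf` are Delta Lake's own reader option names; the helper name `time_travel_options` and the idea that this product forwards the selection to those options are assumptions made for illustration, not something this page confirms. Note also that Delta Lake's `timestampAsOf` returns the table snapshot as of the given time, which may not match the file-level behavior described for the Timestamp option above.

```python
from typing import Dict, Optional


def time_travel_options(mode: str, value: Optional[str] = None) -> Dict[str, str]:
    """Translate a Time Travel selection into Delta Lake reader options.

    Hypothetical helper: the mode names mirror the UI choices on this page.
    """
    mode = mode.strip().lower()
    if mode == "none":
        # No time travel: no extra reader options are set.
        return {}
    if value is None or not str(value).strip():
        raise ValueError(f"Time Travel mode '{mode}' requires a value")
    if mode == "version":
        # Delta table versions are non-negative integers.
        version = int(value)
        if version < 0:
            raise ValueError("version must be >= 0")
        return {"versionAsOf": str(version)}
    if mode == "timestamp":
        # Assumption: the timestamp string is passed through unchanged.
        return {"timestampAsOf": str(value).strip()}
    raise ValueError(f"Unknown Time Travel mode: {mode}")
```

With PySpark and the Delta Lake package available, the returned dictionary could be passed to a read such as `spark.read.format("delta").options(**opts).load(path)`.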

Provide the following fields if S3 is selected as the source:

Connection Name: Select the connection from which to read the data from the list of available connections.
Override Credentials: Unchecked by default. Select the checkbox to override credentials for user-specific actions.
AWS Key Id: Provide the S3 account access key.
Secret Access Key: Provide the S3 account secret key.

S3 Protocol: Select the S3 protocol from the drop-down list. The protocols supported for an S3 connection depend on the platform:

- For HDP versions, the S3a protocol is supported.

- For CDH versions, the S3a protocol is supported.

- For Apache versions, the S3n protocol is supported.

- For GCP, the S3n and S3a protocols are supported.

- For Azure, the S3n protocol is supported. Reading from and writing to the Mumbai and Ohio regions is not supported.

- For EMR, the S3, S3n, and S3a protocols are supported.

- For AWS Databricks, the S3a protocol is supported.

Bucket Name: Provide the S3 bucket name.
Path: Provide the sub-directories within the bucket from which the data is to be read.
Time Travel Option: Select one of the following time travel options:

- None: Time travel is not applied.

- Version: Specify the version number of the Delta table to fetch the older snapshot of the table at that version.

- Timestamp: Specify a last-modified time. All files whose last-modified time is later than the specified value are read.
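To make the relationship between the S3 protocol, bucket name, and path fields concrete, the following sketch assembles them into a table location. The platform-to-protocol table is copied from the list on this page; the helper name `s3_table_uri`, the `platform` parameter, and the assumption that the three fields are joined as `<protocol>://<bucket>/<path>` are illustrative and not confirmed by this page.

```python
from typing import Optional

# Protocols listed on this page for each platform (as lower-case URI schemes).
SUPPORTED_S3_PROTOCOLS = {
    "HDP": {"s3a"},
    "CDH": {"s3a"},
    "Apache": {"s3n"},
    "GCP": {"s3n", "s3a"},
    "Azure": {"s3n"},
    "EMR": {"s3", "s3n", "s3a"},
    "AWS Databricks": {"s3a"},
}


def s3_table_uri(protocol: str, bucket: str, path: str = "",
                 platform: Optional[str] = None) -> str:
    """Build a location such as s3a://bucket/sub/dir (hypothetical helper)."""
    scheme = protocol.strip().lower()
    if scheme not in {"s3", "s3n", "s3a"}:
        raise ValueError(f"Unsupported S3 protocol: {protocol!r}")
    if platform is not None:
        allowed = SUPPORTED_S3_PROTOCOLS.get(platform)
        if allowed is None:
            raise ValueError(f"Unknown platform: {platform!r}")
        if scheme not in allowed:
            raise ValueError(f"{scheme} is not listed for {platform}")
    bucket_name = bucket.strip().strip("/")
    if not bucket_name:
        raise ValueError("Bucket name must not be empty")
    # Drop leading/trailing slashes so the parts join with exactly one "/".
    key = path.strip().strip("/")
    return f"{scheme}://{bucket_name}/{key}" if key else f"{scheme}://{bucket_name}"
```

For example, `s3_table_uri("S3a", "my-bucket", "/delta/events/")` yields `s3a://my-bucket/delta/events`, where the bucket and path are placeholder values.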

Provide the following fields if GCS is selected as the source:

Connection Name: Select the connection from which to read the data from the list of available connections.
Override Credentials: Unchecked by default. Select the checkbox to override credentials for user-specific actions.
Service Account Key: Upload the service account file. You can use the TEST CONNECTION option to test the connection.

Bucket Name: Provide the GCS bucket name without any prefix.
Path: Provide the sub-directories within the bucket from which the data is to be read.
Time Travel Option: Select one of the following time travel options:

- None: Time travel is not applied.

- Version: Specify the version number of the Delta table to fetch the older snapshot of the table at that version.

- Timestamp: Specify a last-modified time. All files whose last-modified time is later than the specified value are read.
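The requirement that the GCS bucket name be entered without any prefix can be illustrated with a small check. This is only a sketch: the helper name `gcs_table_uri` is hypothetical, and it assumes the bucket and path are combined into a `gs://` location, which is the scheme used by the Google Cloud Storage connector for Hadoop and Spark but is not stated on this page.

```python
def gcs_table_uri(bucket: str, path: str = "") -> str:
    """Build a gs:// location from a bare bucket name (hypothetical helper)."""
    name = bucket.strip()
    if "://" in name:
        # The field expects "my-bucket", not "gs://my-bucket".
        raise ValueError("Enter the bucket name without a prefix such as gs://")
    name = name.strip("/")
    if not name:
        raise ValueError("Bucket name must not be empty")
    # Normalise the sub-directory path so it joins with exactly one "/".
    key = path.strip().strip("/")
    return f"gs://{name}/{key}" if key else f"gs://{name}"
```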

Provide the following fields if DBFS is selected as the source:

Connection Name: Select the connection from which to read the data from the list of available connections.
Override Credentials: Unchecked by default. Select the checkbox to override credentials for user-specific actions.
Directory Path: Provide the DBFS parent path for checkpointing.
DBFS File Path: Provide the DBFS file path.

Provide the following fields if ADLS is selected as the source:

Connection Name: Select the connection from which to read the data from the list of available connections.
Container Name: Provide the container name for Azure Data Lake Storage.
ADLS File Path: Provide the directory path on the Azure Data Lake Storage file system.
Time Travel Option: Select one of the following time travel options:

- None: Time travel is not applied.

- Version: Specify the version number of the Delta table to fetch the older snapshot of the table at that version.

- Timestamp: Specify a last-modified time. All files whose last-modified time is later than the specified value are read.
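For ADLS, the container name and file path typically end up in a single storage location. The sketch below assumes ADLS Gen2 and the ABFS URI form `abfss://<container>@<account>.dfs.core.windows.net/<path>`. The storage account name is not a field on this page and is assumed to come from the selected connection; the helper name `adls_table_uri` is hypothetical, and the product may use a different scheme internally.

```python
def adls_table_uri(container: str, account: str, path: str = "") -> str:
    """Build an ABFS (ADLS Gen2) location (hypothetical helper).

    `account` is assumed to be supplied by the selected connection.
    """
    container_name = container.strip().strip("/")
    account_name = account.strip()
    if not container_name or not account_name:
        raise ValueError("Container name and storage account are required")
    # Normalise the directory path so it joins with exactly one "/".
    key = path.strip().strip("/")
    return f"abfss://{container_name}@{account_name}.dfs.core.windows.net/{key}"
```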

ADD CONFIGURATION: Add additional custom properties as key-value pairs.
Environment Params: (Optional) Add further environment parameters.

Configure Pre-Action in Source →
