Delta Emitter

Using a Delta Lake Emitter, you can emit data in Delta Lake format to HDFS, S3, DBFS, ADLS, or GCS.

All data in Delta Lake is stored in Apache Parquet format, which enables Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.

Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.

Delta Emitter Configuration

To add a Delta Emitter to your pipeline, drag the emitter onto the canvas, connect it to a Data Source or processor, and right-click on it to configure.

Field: Description
Emitter Type: Data lake to which the data is emitted. The options available in the drop-down list for emitting the Delta file are HDFS, S3, DBFS, ADLS, and GCS.
Connection Name: Name of the connection to be used.

Provide the following fields if the HDFS emitter type is selected:

Connection Name: Select the connection to which the data is to be written from the list of available connections.
Override Credentials: Unchecked by default. Select the check box to override credentials for user-specific actions.
Username: Once the Override Credentials option is checked, provide the user name under which the Hadoop service is running.
HDFS File Path: Provide the file path on the HDFS file system.
Output Fields: Select the fields that need to be included in the output data.
Partitioning Required: Select the check box if the data is to be partitioned.
Partitioned Column: If the Partitioning Required field is checked, select the fields on which the data will be partitioned.
Save Mode: Specifies the expected behavior when saving data to a data sink. The available modes are:

ErrorifExist: When persisting data, if the data already exists, an exception is expected to be thrown.

Append: When persisting data, if the data/table already exists, the new data is expected to be appended to the existing data.

Overwrite: When persisting data, if the data/table already exists, the existing data is expected to be overwritten by the new data.

Ignore: When persisting data, if the data/table already exists, the save operation is expected to neither save the new data nor change the existing data.

This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
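
The partitioning and save mode fields above resemble options on Spark's DataFrameWriter. The following is a minimal sketch, assuming that the emitter settings correspond to a standard Spark write in Delta format; this page does not state how the emitter is implemented. The helper name write_delta and the SAVE_MODES mapping are hypothetical and exist only for illustration, while format, mode, partitionBy, and save are standard DataFrameWriter methods.

```python
# Hypothetical mapping from the Save Mode labels on this page to Spark's
# save-mode strings. The mapping itself is an assumption for illustration.
SAVE_MODES = {
    "ErrorifExist": "errorifexists",
    "Append": "append",
    "Overwrite": "overwrite",
    "Ignore": "ignore",
}


def write_delta(df, path, save_mode="Append", partition_columns=None):
    """Write a Spark DataFrame as a Delta table at `path`.

    `df` is expected to be a pyspark.sql.DataFrame. Only the standard
    DataFrameWriter methods format, mode, partitionBy, and save are used.
    """
    if save_mode not in SAVE_MODES:
        raise ValueError(f"Unknown save mode: {save_mode!r}")
    writer = df.write.format("delta").mode(SAVE_MODES[save_mode])
    if partition_columns:
        # Mirrors Partitioning Required together with Partitioned Column.
        writer = writer.partitionBy(*partition_columns)
    writer.save(path)
```

With a Spark session that has Delta Lake available, a call could look like write_delta(df, "hdfs:///data/events", "Append", ["country"]), where the path and the column name are placeholders.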

Provide the following fields if the S3 emitter type is selected:

Connection Name: Select the connection to which the data is to be written from the list of available connections.
Override Credentials: Unchecked by default. Select the check box to override credentials for user-specific actions.
AWS Key Id: Provide the S3 account access key.
Secret Access Key: Provide the S3 account secret key.
S3 Protocol: Select the S3 protocol from the drop-down list. The following protocols are supported for the various versions when the S3 connection type is selected:

- For HDP versions, the S3a protocol is supported.

- For CDH versions, the S3a protocol is supported.

- For Apache versions, the S3n protocol is supported.

- For GCP, the S3n and S3a protocols are supported.

- For Azure, the S3n protocol is supported. Read/Write to the Mumbai and Ohio regions is not supported.

- For EMR, the S3, S3n, and S3a protocols are supported.

- For AWS Databricks, the S3a protocol is supported.

Bucket Name: Provide the S3 bucket name.
Path: Provide the sub-directories of the bucket to which the data is to be written.
Output Fields: Select the fields that need to be included in the output data.
Partitioning Required: Select the check box if the data is to be partitioned.
Partitioned Column: If the Partitioning Required field is checked, select the fields on which the data will be partitioned.
Save Mode: Specifies the expected behavior when saving data to a data sink. The available modes are:

ErrorifExist: When persisting data, if the data already exists, an exception is expected to be thrown.

Append: When persisting data, if the data/table already exists, the new data is expected to be appended to the existing data.

Overwrite: When persisting data, if the data/table already exists, the existing data is expected to be overwritten by the new data.

Ignore: When persisting data, if the data/table already exists, the save operation is expected to neither save the new data nor change the existing data.

This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
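
The S3 Protocol, Bucket Name, and Path fields together identify the target location. The sketch below shows one plausible way to combine them into a Hadoop-style URI. How the emitter actually assembles the location is not documented on this page, so the function name build_s3_uri and its validation rules are assumptions for illustration.

```python
SUPPORTED_S3_PROTOCOLS = ("s3", "s3n", "s3a")


def build_s3_uri(protocol, bucket_name, path=""):
    """Combine protocol, bucket, and sub-directory path into one URI."""
    scheme = protocol.lower()
    if scheme not in SUPPORTED_S3_PROTOCOLS:
        raise ValueError(f"Unsupported S3 protocol: {protocol!r}")
    if not bucket_name or "/" in bucket_name:
        raise ValueError("Bucket name must be a bare bucket name")
    # Trim slashes so that bucket and path are joined by exactly one separator.
    sub_path = path.strip("/")
    base = f"{scheme}://{bucket_name}"
    return f"{base}/{sub_path}" if sub_path else base
```

For example, build_s3_uri("S3a", "my-bucket", "/delta/events/") yields "s3a://my-bucket/delta/events", where the bucket and path are placeholders.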

Provide the following fields if the DBFS emitter type is selected:

Connection Name: Select the connection to which the data is to be written from the list of available connections.
Override Credentials: Unchecked by default. Select the check box to override credentials for user-specific actions.
Directory Path: Provide the DBFS parent path for checkpointing.
DBFS File Path: Provide the DBFS file path.
Output Fields: Select the fields that need to be included in the output data.
Partitioning Required: Select the check box if the data is to be partitioned.
Partitioned Column: If the Partitioning Required field is checked, select the fields on which the data will be partitioned.
Save Mode: Specifies the expected behavior when saving data to a data sink. The available modes are:

ErrorifExist: When persisting data, if the data already exists, an exception is expected to be thrown.

Append: When persisting data, if the data/table already exists, the new data is expected to be appended to the existing data.

Overwrite: When persisting data, if the data/table already exists, the existing data is expected to be overwritten by the new data.

Ignore: When persisting data, if the data/table already exists, the save operation is expected to neither save the new data nor change the existing data.

This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
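
To make the four save modes concrete, the following small in-memory model applies each mode to a list of rows. It is only an illustration of the behavior described above, not the emitter's implementation, and the function name apply_save_mode is hypothetical.

```python
VALID_SAVE_MODES = ("ErrorifExist", "Append", "Overwrite", "Ignore")


def apply_save_mode(existing, incoming, save_mode):
    """Return the rows a sink would hold after a save.

    `existing` is None when nothing is stored at the target yet,
    otherwise the list of rows already there. `incoming` is the list
    of rows being saved.
    """
    if save_mode not in VALID_SAVE_MODES:
        raise ValueError(f"Unknown save mode: {save_mode!r}")
    if existing is None:
        # Nothing at the target yet: every mode simply writes the new rows.
        return list(incoming)
    if save_mode == "ErrorifExist":
        raise FileExistsError("Target already contains data")
    if save_mode == "Append":
        return list(existing) + list(incoming)
    if save_mode == "Overwrite":
        return list(incoming)
    # Ignore: keep the existing rows and discard the new ones.
    return list(existing)
```

Starting from existing rows [1] and new rows [2], Append gives [1, 2], Overwrite gives [2], Ignore gives [1], and ErrorifExist raises an exception.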

Provide the following fields if the ADLS emitter type is selected:

Connection Name: Select the connection to which the data is to be written from the list of available connections.
Container Name: Provide the container name for Azure Data Lake Storage.
ADLS File Path: Provide the directory path for the Azure Data Lake Storage file system.
Output Fields: Select the fields that need to be included in the output data.
Partitioning Required: Select the check box if the data is to be partitioned.
Partitioned Column: If the Partitioning Required field is checked, select the fields on which the data will be partitioned.
Save Mode: Specifies the expected behavior when saving data to a data sink. The available modes are:

ErrorifExist: When persisting data, if the data already exists, an exception is expected to be thrown.

Append: When persisting data, if the data/table already exists, the new data is expected to be appended to the existing data.

Overwrite: When persisting data, if the data/table already exists, the existing data is expected to be overwritten by the new data.

Ignore: When persisting data, if the data/table already exists, the save operation is expected to neither save the new data nor change the existing data.

This is similar to a CREATE TABLE IF NOT EXISTS in SQL.

ADD CONFIGURATION: Add additional custom properties as key-value pairs.
ADD PARAM: (Optional) Add further environment parameters.
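
As an illustration of how the Container Name and ADLS File Path fields could identify a target location, the sketch below builds a URI in the abfss:// form used by the Hadoop ABFS driver. It assumes ADLS Gen2 and assumes that the storage account name comes from the selected connection; neither is stated on this page. The function name build_adls_uri and the storage_account parameter are hypothetical.

```python
def build_adls_uri(container_name, storage_account, file_path=""):
    """Build an abfss:// URI for a container and directory path.

    `storage_account` is not a field on this page; it is assumed to be
    supplied by the selected connection.
    """
    if not container_name or not storage_account:
        raise ValueError("Container name and storage account are required")
    # Trim slashes so that the host and path are joined by one separator.
    sub_path = file_path.strip("/")
    base = f"abfss://{container_name}@{storage_account}.dfs.core.windows.net"
    return f"{base}/{sub_path}" if sub_path else base
```

For example, build_adls_uri("raw", "myaccount", "delta/events") yields "abfss://raw@myaccount.dfs.core.windows.net/delta/events", where all three values are placeholders.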

Provide the following fields if the GCS emitter type is selected:

Connection Name: Select the connection to which the data is to be written from the list of available connections.
Override Credentials: Unchecked by default. Select the check box to override credentials for user-specific actions.
Service Account Key: Upload the GCP service account key file to create the connection. Click the TEST CONNECTION button to test the created connection.
Bucket Name: Provide the bucket name without any prefix.
Path: Provide the sub-directories of the bucket mentioned above to which the data is to be written.
Output Fields: Select the fields that need to be included in the output data.
Partitioning Required: Select the check box if the data is to be partitioned.
Partitioned Column: If the Partitioning Required field is checked, select the fields on which the data will be partitioned.
Save Mode: Specifies the expected behavior when saving data to a data sink. The available modes are:

ErrorifExist: When persisting data, if the data already exists, an exception is expected to be thrown.

Append: When persisting data, if the data/table already exists, the new data is expected to be appended to the existing data.

Overwrite: When persisting data, if the data/table already exists, the existing data is expected to be overwritten by the new data.

Ignore: When persisting data, if the data/table already exists, the save operation is expected to neither save the new data nor change the existing data.

This is similar to a CREATE TABLE IF NOT EXISTS in SQL.

Priority: Defines the execution order for the emitters.
ADD CONFIGURATION: Add additional custom properties as key-value pairs.
ADD PARAM: (Optional) Add further environment parameters.
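
The Bucket Name field expects the bare bucket name, without a prefix such as gs://. The sketch below shows one way such input could be checked and combined with the Path field into a gs:// URI. It is an illustration under that assumption, not the emitter's own validation, and the function name build_gcs_uri is hypothetical.

```python
def build_gcs_uri(bucket_name, path=""):
    """Combine a bare bucket name and a sub-directory path into a gs:// URI."""
    if not bucket_name:
        raise ValueError("Bucket name is required")
    if ":" in bucket_name or "/" in bucket_name:
        # Rejects inputs such as "gs://my-bucket" or "my-bucket/data".
        raise ValueError("Provide the bucket name without any prefix or path")
    # Trim slashes so that bucket and path are joined by exactly one separator.
    sub_path = path.strip("/")
    base = f"gs://{bucket_name}"
    return f"{base}/{sub_path}" if sub_path else base
```

For example, build_gcs_uri("my-bucket", "delta/events") yields "gs://my-bucket/delta/events", while build_gcs_uri("gs://my-bucket") raises a ValueError. The bucket and path are placeholders.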