Delta Emitter

Using a Delta Lake Emitter, you can emit data to HDFS, S3, DBFS, ADLS, or GCS in Delta Lake format.

All data in Delta Lake is stored in Apache Parquet format, which lets Delta Lake leverage the efficient compression and encoding schemes that are native to Parquet.

Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.

Delta Emitter Configuration

To add a Delta Emitter to your pipeline, drag the emitter onto the canvas, connect it to a Data Source or processor, and right-click on it to configure.

👉 If the data source in the pipeline has a streaming component, the emitter shows four additional properties: Checkpoint Storage Location, Checkpoint Connections, Checkpoint Directory, and Time-Based checkpoint.
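
In Spark terms, a checkpoint directory for a streaming write is usually passed as the `checkpointLocation` option of a Structured Streaming query. The sketch below illustrates that idea only; it is an assumption about how such settings map to Spark, not a description of Gathr internals. The helper names, the `stream_df` argument (assumed to be a streaming PySpark DataFrame), and any paths are hypothetical.

```python
def checkpoint_options(checkpoint_dir):
    """Build the writer options for a streaming Delta write (hypothetical helper)."""
    if not checkpoint_dir:
        raise ValueError("a streaming write needs a checkpoint directory")
    # Spark Structured Streaming reads the checkpoint path from this option.
    return {"checkpointLocation": checkpoint_dir}


def start_delta_stream(stream_df, target_path, checkpoint_dir, output_mode="append"):
    """Start a streaming write to a Delta table.

    Assumes stream_df is a streaming pyspark.sql.DataFrame; nothing here
    imports Spark, so the function only runs inside a Spark session.
    """
    writer = stream_df.writeStream.format("delta").outputMode(output_mode)
    for key, value in checkpoint_options(checkpoint_dir).items():
        writer = writer.option(key, value)
    # start() returns the handle of the running streaming query.
    return writer.start(target_path)
```

With a real streaming DataFrame, a call such as `start_delta_stream(events, "/delta/events", "/checkpoints/events")` would return the running query handle.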

Field	Description
Emitter Type	Data lake to which the data is emitted. The drop-down list offers the following options for emitting the Delta file: HDFS, S3, DBFS, ADLS, and GCS.
Connection Name	Name of the connection used to reach the selected data lake.

Provide the following fields if the HDFS emitter type is selected:


Connection Name	Select, from the list of available connections, the connection to which the data is to be written.
Override Credentials	Unchecked by default. Select the check box to override credentials for user-specific actions.
Username	Once the Override Credentials option is checked, provide the user name under which the Hadoop service is running.
HDFS File Path	Provide the file path on the HDFS file system.
Output Fields	Select the fields that need to be included in the output data.
Partitioning Required	Select the check box if the data is to be partitioned. 👉 If a streaming data source is used in the pipeline along with an aggregation without a watermark, it is recommended not to use Append as the output mode.
Partitioned Column	If the Partitioning Required field is checked, select the fields on which the data will be partitioned.
Save Mode	Specifies the expected behavior when saving data to a data sink. ErrorifExist: If data already exists at the target, an exception is thrown. Append: If the data/table already exists, the new data is appended to the existing data. Overwrite: If the data/table already exists, the existing data is overwritten by the new data. Ignore: If the data/table already exists, the save operation neither saves the new data nor changes the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL.
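
The Save Mode, Output Fields, and partitioning settings above resemble the options of a Spark batch write to Delta. The following is a minimal sketch under that assumption, not the emitter's actual implementation. The mode strings are the ones Spark's `DataFrameWriter.mode` accepts; the function names and the `df` argument (assumed to be a PySpark DataFrame) are hypothetical.

```python
# Save Mode labels from the table mapped to Spark DataFrameWriter mode strings.
SAVE_MODES = {
    "ErrorifExist": "errorifexists",
    "Append": "append",
    "Overwrite": "overwrite",
    "Ignore": "ignore",
}


def plan_save(save_mode, target_exists):
    """Describe what a write does: 'create', 'append', 'overwrite', or 'skip'."""
    mode = SAVE_MODES[save_mode]
    if not target_exists:
        return "create"  # every mode writes when nothing exists yet
    if mode == "errorifexists":
        raise FileExistsError("target already holds data")
    if mode == "ignore":
        return "skip"  # existing data stays untouched, new data is dropped
    return mode  # "append" or "overwrite"


def write_delta(df, path, save_mode, output_fields=None, partition_columns=()):
    """Batch write to a Delta table; assumes df is a pyspark.sql.DataFrame."""
    if output_fields:
        df = df.select(*output_fields)  # keep only the selected output fields
    writer = df.write.format("delta").mode(SAVE_MODES[save_mode])
    if partition_columns:
        writer = writer.partitionBy(*partition_columns)
    writer.save(path)
```

`plan_save` only models the documented behavior so it can be checked without a cluster; `write_delta` shows the equivalent Spark call chain.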

Provide the following fields if the S3 emitter type is selected:


Connection Name	Select, from the list of available connections, the connection to which the data is to be written.
Override Credentials	Unchecked by default. Select the check box to override credentials for user-specific actions.
AWS Key Id	Provide the S3 account access key.
Secret Access Key	Provide the S3 account secret key. 👉 Once the AWS Key Id and Secret Access Key are provided, you have the option to test the connection.
S3 Protocol	Select the S3 protocol from the drop-down list. The following protocols are supported when the S3 connection type is selected: - For HDP versions, the S3a protocol is supported. - For CDH versions, the S3a protocol is supported. - For Apache versions, the S3n protocol is supported. - For GCP, the S3n and S3a protocols are supported. - For Azure, the S3n protocol is supported. Read/Write to the Mumbai and Ohio regions is not supported. - For EMR, the S3, S3n, and S3a protocols are supported. - For AWS Databricks, the S3a protocol is supported.
Bucket Name	Provide the S3 bucket name.
Path	Provide the sub-directories of the bucket to which the data is to be written.
Output Fields	Select the fields that need to be included in the output data.
Partitioning Required	Select the check box if the data is to be partitioned. 👉 If a streaming data source is used in the pipeline along with an aggregation without a watermark, it is recommended not to use Append as the output mode.
Partitioned Column	If the Partitioning Required field is checked, select the fields on which the data will be partitioned.
Save Mode	Specifies the expected behavior when saving data to a data sink. ErrorifExist: If data already exists at the target, an exception is thrown. Append: If the data/table already exists, the new data is appended to the existing data. Overwrite: If the data/table already exists, the existing data is overwritten by the new data. Ignore: If the data/table already exists, the save operation neither saves the new data nor changes the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL.
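
The S3 Protocol, Bucket Name, and Path fields above together identify the write target. As a rough illustration, they can be combined into a URI of the form `protocol://bucket/path`. The sketch below assumes that layout; the function name and the example bucket are hypothetical, and the emitter may assemble its target differently.

```python
SUPPORTED_S3_PROTOCOLS = ("s3", "s3n", "s3a")


def s3_target_uri(protocol, bucket, path=""):
    """Combine protocol, bucket, and sub-directories into one target URI."""
    scheme = protocol.lower()
    if scheme not in SUPPORTED_S3_PROTOCOLS:
        raise ValueError(f"unsupported S3 protocol: {protocol}")
    if "://" in bucket or "/" in bucket:
        raise ValueError("bucket must be a bare bucket name")
    # Drop leading and trailing slashes so the parts join cleanly.
    sub_path = path.strip("/")
    return f"{scheme}://{bucket}/{sub_path}" if sub_path else f"{scheme}://{bucket}"
```

For example, `s3_target_uri("S3a", "sales-data", "/delta/orders/")` yields `s3a://sales-data/delta/orders`.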

Provide the following fields if the DBFS emitter type is selected:


Connection Name	Select, from the list of available connections, the connection to which the data is to be written.
Override Credentials	Unchecked by default. Select the check box to override credentials for user-specific actions.
Directory Path	Provide the DBFS parent path for checkpointing.
DBFS File Path	Provide the DBFS file path.
Output Fields	Select the fields that need to be included in the output data.
Partitioning Required	Select the check box if the data is to be partitioned. 👉 If a streaming data source is used in the pipeline along with an aggregation without a watermark, it is recommended not to use Append as the output mode.
Partitioned Column	If the Partitioning Required field is checked, select the fields on which the data will be partitioned.
Save Mode	Specifies the expected behavior when saving data to a data sink. ErrorifExist: If data already exists at the target, an exception is thrown. Append: If the data/table already exists, the new data is appended to the existing data. Overwrite: If the data/table already exists, the existing data is overwritten by the new data. Ignore: If the data/table already exists, the save operation neither saves the new data nor changes the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL.
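
To make the effect of the Partitioned Column field more concrete: Spark-based writers, Delta Lake included, typically place partitioned data in one `column=value` directory per distinct combination of partition values. The sketch below imitates that layout in plain Python; the function names and sample rows are hypothetical and only illustrate the grouping.

```python
from collections import defaultdict


def partition_directory(row, partition_columns):
    """Return the relative directory a row falls into, e.g. 'country=IN/year=2023'."""
    return "/".join(f"{col}={row[col]}" for col in partition_columns)


def group_by_partition(rows, partition_columns):
    """Group rows by the partition directory they would be written to."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_directory(row, partition_columns)].append(row)
    return dict(groups)


# Hypothetical sample rows, partitioned on country and year.
sample_rows = [
    {"country": "IN", "year": 2023, "amount": 10},
    {"country": "US", "year": 2023, "amount": 5},
    {"country": "IN", "year": 2023, "amount": 7},
]
layout = group_by_partition(sample_rows, ["country", "year"])
```

Here `layout` has two keys, `country=IN/year=2023` with two rows and `country=US/year=2023` with one, which is why columns with a small number of distinct values usually make better partition columns.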

Provide the following fields if the ADLS emitter type is selected:


Connection Name	Select, from the list of available connections, the connection to which the data is to be written.
Container Name	Provide the container name for Azure Data Lake Storage.
ADLS File Path	Provide the directory path on the Azure Data Lake Storage file system.
Output Fields	Select the fields that need to be included in the output data.
Partitioning Required	Select the check box if the data is to be partitioned. 👉 If a streaming data source is used in the pipeline along with an aggregation without a watermark, it is recommended not to use Append as the output mode.
Partitioned Column	If the Partitioning Required field is checked, select the fields on which the data will be partitioned.
Save Mode	Specifies the expected behavior when saving data to a data sink. ErrorifExist: If data already exists at the target, an exception is thrown. Append: If the data/table already exists, the new data is appended to the existing data. Overwrite: If the data/table already exists, the existing data is overwritten by the new data. Ignore: If the data/table already exists, the save operation neither saves the new data nor changes the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL.
ADD CONFIGURATION	Add further custom properties as key-value pairs.
ADD PARAM	Add further environment parameters. (Optional)

Provide the following fields if the GCS emitter type is selected:


Connection Name	Select, from the list of available connections, the connection to which the data is to be written.
Override Credentials	Unchecked by default. Select the check box to override credentials for user-specific actions.
Service Account Key	Upload the GCP service account key file to create the connection. Click the TEST CONNECTION button to test the created connection.
Bucket Name	Provide the bucket name without any prefix.
Path	Provide the sub-directories of the bucket named above to which the data is to be written.
Output Fields	Select the fields that need to be included in the output data.
Partitioning Required	Select the check box if the data is to be partitioned. 👉 If a streaming data source is used in the pipeline along with an aggregation without a watermark, it is recommended not to use Append as the output mode.
Partitioned Column	If the Partitioning Required field is checked, select the fields on which the data will be partitioned.
Save Mode	Specifies the expected behavior when saving data to a data sink. ErrorifExist: If data already exists at the target, an exception is thrown. Append: If the data/table already exists, the new data is appended to the existing data. Overwrite: If the data/table already exists, the existing data is overwritten by the new data. Ignore: If the data/table already exists, the save operation neither saves the new data nor changes the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL.
Priority	Priority defines the execution order for the emitters.
ADD CONFIGURATION	Add further custom properties as key-value pairs.
ADD PARAM	Add further environment parameters. (Optional)
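
Before uploading a file in the Service Account Key field, it can help to confirm that the file looks like a GCP service account key. Such key files are JSON documents whose `type` is `service_account` and which carry fields such as `project_id`, `private_key`, and `client_email`. The check below is a minimal sketch based on that format; the function name is hypothetical, and it does not replace the TEST CONNECTION button, which verifies the credentials themselves.

```python
import json

# Fields a service account key file is expected to contain.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(key_json):
    """Return a list of problems found in the key file content (empty if none)."""
    data = json.loads(key_json)
    if not isinstance(data, dict):
        return ["key file is not a JSON object"]
    problems = [
        f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if not data.get(name)
    ]
    # Other credential types (for example user credentials) are not service account keys.
    if data.get("type") and data["type"] != "service_account":
        problems.append(f"unexpected type: {data['type']}")
    return problems
```

An empty result means the file has the expected shape; any returned message points at a field to inspect before the upload.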
