Delta Emitter
Using a Delta emitter, you can write data in Delta Lake format to HDFS, S3, DBFS, or ADLS.
All data in Delta Lake is stored in Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
Delta Emitter Configuration
To add a Delta Emitter to your pipeline, drag the emitter onto the canvas, connect it to a Data Source or processor, and right-click on it to configure.
Field | Description |
---|---|
Emitter Type | Data lake to which the data is emitted. For emitting the Delta files, the available options in the drop-down list are HDFS, S3, DBFS, and ADLS. |
Connection Name | Name of the connection used by the emitter. |
Provide the fields below if the user selects the HDFS emitter type:
Connection Name | Select the connection name from the available list of connections to which you would like to write the data. |
Override Credentials | Unchecked by default; select the checkbox to override the connection credentials for user-specific actions. |
Username | Once the Override Credentials option is checked, provide the username under which the Hadoop service runs. |
HDFS File Path | Provide the path on the HDFS file system to which the data is written. |
Output Fields | Select the fields that need to be included in the output data. |
Partitioning Required | Select the checkbox if the data is to be partitioned. If a streaming data source is used in the pipeline along with aggregation without a watermark, it is recommended not to use Append as the output mode. |
Partitioned Column | If the Partitioning Required checkbox is selected, select the fields on which the data will be partitioned. |
Save Mode | Save Mode specifies the expected behavior when saving data to the data sink. ErrorifExist: if the data already exists, an exception is thrown. Append: if the data/table already exists, the contents are appended to the existing data. Overwrite: if the data/table already exists, the existing data is overwritten by the new contents. Ignore: if the data/table already exists, the save operation neither writes the new contents nor changes the existing data, similar to CREATE TABLE IF NOT EXISTS in SQL. |
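The four Save Mode labels correspond to the mode strings accepted by Spark's `DataFrameWriter`. As an illustrative sketch only (the helper below is hypothetical, not the product's actual code), the selection could map to the mode passed to a Delta write:

```python
# Hypothetical helper: maps the emitter's Save Mode labels to the
# mode strings accepted by Spark's DataFrameWriter.mode().
SAVE_MODES = {
    "ErrorifExist": "errorifexists",  # raise if data already exists
    "Append": "append",               # add new rows to existing data
    "Overwrite": "overwrite",         # replace existing data
    "Ignore": "ignore",               # no-op if data already exists
}

def spark_save_mode(ui_label: str) -> str:
    """Return the Spark writer mode for a Save Mode label chosen in the UI."""
    try:
        return SAVE_MODES[ui_label]
    except KeyError:
        raise ValueError(f"Unknown Save Mode: {ui_label!r}")

# The resulting batch write would look like this sketch (requires a
# SparkSession with Delta Lake configured; path and column are assumptions):
#   df.write.format("delta") \
#       .mode(spark_save_mode("Append")) \
#       .partitionBy("country") \
#       .save("hdfs://namenode:8020/delta/events")
```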
Provide the fields below if the user selects the S3 emitter type:
Connection Name | Select the connection name from the available list of connections to which you would like to write the data. |
Override Credentials | Unchecked by default; select the checkbox to override the connection credentials for user-specific actions. |
AWS Key Id | Provide the S3 account access key. |
Secret Access Key | Provide the S3 account secret key. Once the AWS Key Id and Secret Access Key are provided, you have the option to test the connection. |
S3 Protocol | Select the S3 protocol from the drop-down list. The following protocols are supported for the various platforms: for HDP versions, s3a; for CDH versions, s3a; for Apache versions, s3n; for GCP, s3n and s3a; for Azure, s3n (read/write to the Mumbai and Ohio regions is not supported); for EMR, s3, s3n, and s3a; for AWS Databricks, s3a. |
Bucket Name | Provide the S3 bucket name. |
Path | Provide the subdirectory path within the bucket to which the data is written. |
Output Fields | Select the fields that need to be included in the output data. |
Partitioning Required | Select the checkbox if the data is to be partitioned. If a streaming data source is used in the pipeline along with aggregation without a watermark, it is recommended not to use Append as the output mode. |
Partitioned Column | If the Partitioning Required checkbox is selected, select the fields on which the data will be partitioned. |
Save Mode | Save Mode specifies the expected behavior when saving data to the data sink. ErrorifExist: if the data already exists, an exception is thrown. Append: if the data/table already exists, the contents are appended to the existing data. Overwrite: if the data/table already exists, the existing data is overwritten by the new contents. Ignore: if the data/table already exists, the save operation neither writes the new contents nor changes the existing data, similar to CREATE TABLE IF NOT EXISTS in SQL. |
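The Bucket Name, Path, and S3 Protocol fields together determine the target location, typically a URI of the form `<protocol>://<bucket>/<path>`. A minimal sketch (the helper and the example bucket/path names are assumptions, not product code):

```python
# Illustrative only: compose an S3 target URI from the emitter fields.
def s3_target_uri(protocol: str, bucket: str, path: str) -> str:
    """Build the S3 URI that the Delta data is written to.

    protocol: one of "s3", "s3n", "s3a" (see the S3 Protocol field).
    """
    if protocol not in ("s3", "s3n", "s3a"):
        raise ValueError(f"Unsupported S3 protocol: {protocol!r}")
    return f"{protocol}://{bucket}/{path.strip('/')}"

# Example with an assumed bucket and subdirectory:
uri = s3_target_uri("s3a", "my-bucket", "delta/events")
# The corresponding Spark write would be a sketch like:
#   df.write.format("delta").mode("append").save(uri)
```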
Provide the fields below if the user selects the DBFS emitter type:
Connection Name | Select the connection name from the available list of connections to which you would like to write the data. |
Override Credentials | Unchecked by default; select the checkbox to override the connection credentials for user-specific actions. |
Directory Path | Provide the DBFS parent path used for checkpointing. |
DBFS File Path | Provide the DBFS file path to which the data is written. |
Output Fields | Select the fields that need to be included in the output data. |
Partitioning Required | Select the checkbox if the data is to be partitioned. If a streaming data source is used in the pipeline along with aggregation without a watermark, it is recommended not to use Append as the output mode. |
Partitioned Column | If the Partitioning Required checkbox is selected, select the fields on which the data will be partitioned. |
Save Mode | Save Mode specifies the expected behavior when saving data to the data sink. ErrorifExist: if the data already exists, an exception is thrown. Append: if the data/table already exists, the contents are appended to the existing data. Overwrite: if the data/table already exists, the existing data is overwritten by the new contents. Ignore: if the data/table already exists, the save operation neither writes the new contents nor changes the existing data, similar to CREATE TABLE IF NOT EXISTS in SQL. |
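Because the DBFS fields include a checkpointing directory, a streaming write is the natural sketch here. The helper below is hypothetical (not the emitter's actual code) and shows how the Directory Path and DBFS File Path fields could feed a Delta structured-streaming write:

```python
# Illustrative only: assemble streaming-write options from the DBFS fields.
def dbfs_stream_options(checkpoint_dir: str, file_path: str) -> dict:
    """Return the option map for a Delta structured-streaming write to DBFS."""
    return {
        "checkpointLocation": checkpoint_dir,  # Directory Path field
        "path": file_path,                     # DBFS File Path field
    }

# Example with assumed DBFS paths:
opts = dbfs_stream_options("dbfs:/checkpoints/events", "dbfs:/delta/events")
# The corresponding Spark call would be a sketch like (requires a
# streaming DataFrame on a Databricks/Spark runtime with Delta Lake):
#   df.writeStream.format("delta") \
#       .option("checkpointLocation", opts["checkpointLocation"]) \
#       .start(opts["path"])
```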
Provide the fields below if the user selects the ADLS emitter type:
Connection Name | Select the connection name from the available list of connections to which you would like to write the data. |
Container Name | Provide the container name for Azure Data Lake Storage. |
ADLS File Path | Provide the directory path on the Azure Data Lake Storage file system. |
Output Fields | Select the fields that need to be included in the output data. |
Partitioning Required | Select the checkbox if the data is to be partitioned. If a streaming data source is used in the pipeline along with aggregation without a watermark, it is recommended not to use Append as the output mode. |
Partitioned Column | If the Partitioning Required checkbox is selected, select the fields on which the data will be partitioned. |
Save Mode | Save Mode specifies the expected behavior when saving data to the data sink. ErrorifExist: if the data already exists, an exception is thrown. Append: if the data/table already exists, the contents are appended to the existing data. Overwrite: if the data/table already exists, the existing data is overwritten by the new contents. Ignore: if the data/table already exists, the save operation neither writes the new contents nor changes the existing data, similar to CREATE TABLE IF NOT EXISTS in SQL. |
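For ADLS Gen2, the Container Name and ADLS File Path typically resolve to a URI of the form `abfss://<container>@<storage-account>.dfs.core.windows.net/<path>`. A hypothetical sketch (the storage-account name would come from the connection; all names below are assumptions):

```python
# Illustrative only: compose an ADLS Gen2 target URI from the emitter fields.
def adls_target_uri(container: str, account: str, path: str) -> str:
    """Build the abfss:// URI for a Delta write to ADLS Gen2."""
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{path.strip('/')}"
    )

# Example with assumed container, account, and path names:
uri = adls_target_uri("data", "mystorageacct", "delta/events")
# The corresponding Spark write would be a sketch like:
#   df.write.format("delta").mode("overwrite").save(uri)
```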
ADD CONFIGURATION | Add additional custom properties as key-value pairs. |
ADD PARAM | Optionally add further environment parameters. |
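Custom key-value properties added via ADD CONFIGURATION typically end up as writer options. As a hypothetical sketch (the helper is illustrative, not product code), a Delta writer option such as `mergeSchema` could be passed through like this:

```python
# Illustrative only: apply custom key-value properties as writer options.
def apply_custom_options(base_options: dict, custom: dict) -> dict:
    """Merge user-supplied key-value pairs into the writer's option map.

    Custom keys override base keys, mirroring how added configuration
    usually takes precedence over defaults.
    """
    merged = dict(base_options)
    merged.update(custom)
    return merged

# Example: assumed base path plus a real Delta writer option, mergeSchema.
opts = apply_custom_options(
    {"path": "hdfs://namenode:8020/delta/events"},
    {"mergeSchema": "true"},
)
# Each entry would map to a .option(key, value) call on the Spark writer.
```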
If you have any feedback on Gathr documentation, please email us!