Hive Emitter

The Hive emitter allows you to store streaming or batch data into HDFS. Hive queries can then be written to retrieve the stored data.

To configure a Hive emitter, provide the database name, the table name, and the list of schema fields to be stored. The data rows are stored in the Hive table, in the specified format, inside the provided database.
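For illustration, here is a minimal PySpark sketch of what the emitter automates: writing rows into a Hive table and reading them back with a Hive query. The database, table, and field names are hypothetical.

```python
from pyspark.sql import SparkSession

# Hive support must be enabled on the session (assumes a configured Hive metastore).
spark = (SparkSession.builder
    .appName("hive-emitter-sketch")
    .enableHiveSupport()
    .getOrCreate())

# Hypothetical database, table, and fields.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.mode("append").saveAsTable("demo_db.users")

# Retrieve the stored data with a Hive query.
spark.sql("SELECT id, name FROM demo_db.users").show()
```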

You must have the necessary permissions to create table partitions and to write to partitioned tables.

Hive Emitter Configuration

To add a Hive emitter to your pipeline, drag it onto the canvas, connect it to a Data Source or processor, and right-click on it to configure.

Save as Dataset: Select the checkbox to save the schema as a Dataset.

Connection Name: All Hive connections are listed here. Select a connection for connecting to Hive.

Checkpoint Storage Location: Select the checkpointing storage location. Available options are HDFS, S3, and EFS.

Checkpoint Connections: Select the connection. Connections are listed corresponding to the selected storage location.
Checkpoint Directory: The path where the Spark application stores the checkpointing data.

For HDFS and EFS, enter a relative path such as /user/hadoop/checkpointingDir; the system will add a suitable prefix by itself.

For S3, enter an absolute path such as s3://BucketName/checkpointingDir.
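As a sketch, this is how a checkpoint location is typically passed to a Spark structured-streaming writer; the paths below are hypothetical examples of the relative (HDFS/EFS) and absolute (S3) forms described above.

```python
# Toy streaming source standing in for the pipeline's connected Data Source;
# assumes an active SparkSession `spark`.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

query = (stream_df.writeStream
    .format("parquet")
    .option("path", "/user/hadoop/output")                          # hypothetical output path
    .option("checkpointLocation", "/user/hadoop/checkpointingDir")  # HDFS/EFS: relative path
    # .option("checkpointLocation", "s3://BucketName/checkpointingDir")  # S3: absolute path
    .start())
```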

Time-Based Check Point: Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., on each run the checkpoint location provided above is appended with the current time in milliseconds (see the sketch below).
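A minimal sketch of what a time-based checkpoint amounts to, assuming the platform simply suffixes the configured location with the current epoch time in milliseconds:

```python
import time

base = "/user/hadoop/checkpointingDir"            # the checkpoint location configured above
checkpoint = f"{base}/{int(time.time() * 1000)}"  # e.g. .../1700000000000, unique per run
```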
Database Name: Hive database name.

Table Name: Hive table name.

Output Fields: Fields in the schema that need to be part of the output data.

Lower Case: Converts all the selected partition columns to lower case while writing data into Hive.
Format: Format in which the data is stored in Hive (see the sketch after this list).

TEXT: Stores information as plain text. The space (' ') delimiter is not supported in TEXT format.

ORC: ORC stands for Optimized Row Columnar, a format that stores data more efficiently than other file formats.

AVRO: AVRO stores the data definition in JSON format, making it easy to read and interpret.

Parquet: Parquet stores nested data structures in a flat columnar format.
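For reference, the same format choices expressed as plain PySpark writes might look like this; the table names are hypothetical, and AVRO additionally requires the spark-avro package on the classpath.

```python
# Assumes an active SparkSession `spark`; `df` stands in for the emitted data.
df = spark.createDataFrame([(1, "alice")], ["id", "name"])

df.write.format("orc").saveAsTable("demo_db.events_orc")          # ORC
df.write.format("parquet").saveAsTable("demo_db.events_parquet")  # Parquet
df.write.format("avro").saveAsTable("demo_db.events_avro")        # AVRO (needs spark-avro)
# TEXT via the Hive source, which accepts a Hive fileFormat option:
df.write.format("hive").option("fileFormat", "textfile").saveAsTable("demo_db.events_text")
```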

Delimiter: Message field separator.
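For a TEXT-format table, the delimiter corresponds to the field terminator in the Hive table definition; a hypothetical example via spark.sql:

```python
# Assumes an active SparkSession `spark` with Hive support.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_db.events_text (id INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
""")
```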
Output Mode: Output mode to be used while writing the data to the streaming sink (see the sketch after this list).

Append: Only the new rows in the streaming data are written to the sink.

Complete: All the rows in the streaming data are written to the sink every time there are updates.

Update: Only the rows that were updated in the streaming data are written to the sink every time there are updates.
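In plain Spark Structured Streaming these choices correspond to the outputMode setting, sketched below with a stand-in console sink:

```python
# Assumes an active SparkSession `spark`; toy source standing in for the pipeline's Data Source.
stream_df = spark.readStream.format("rate").load()

query = (stream_df.writeStream
    .outputMode("append")   # or "complete" / "update"
    .format("console")      # stand-in sink for illustration
    .start())
```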

Save Mode: Specifies the expected behavior when saving data to the data sink (see the sketch after this list).

ErrorIfExists: When persisting data, if the data already exists, an exception is thrown.

Append: When persisting data, if the data/table already exists, the contents of the schema are appended to the existing data.

Overwrite: When persisting data, if the data/table already exists, the existing data is overwritten by the new contents.

Ignore: When persisting data, if the data/table already exists, the save operation does not save the new contents and does not change the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL.
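These map directly onto Spark's save modes; a sketch with a hypothetical table name:

```python
# Assumes an active SparkSession `spark`; `df` stands in for the emitted data.
df = spark.createDataFrame([(1, "alice")], ["id", "name"])

df.write.mode("errorifexists").saveAsTable("demo_db.events")  # fail if the table exists
df.write.mode("append").saveAsTable("demo_db.events")         # append to existing data
df.write.mode("overwrite").saveAsTable("demo_db.events")      # replace existing data
df.write.mode("ignore").saveAsTable("demo_db.events")         # no-op if the table exists
```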

Replication: Number of copies of your data to keep on the underlying Hadoop file system. For example, if you specify "2" as Replication, two copies are created on HDFS.
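A replication factor of 2, for instance, corresponds to the Hadoop property dfs.replication. One way to pass such a property through Spark (a sketch, not necessarily how the platform wires it) is:

```python
from pyspark.sql import SparkSession

# spark.hadoop.* properties are forwarded to the underlying Hadoop configuration.
spark = (SparkSession.builder
    .appName("replication-sketch")
    .config("spark.hadoop.dfs.replication", "2")  # keep two copies of each block on HDFS
    .getOrCreate())
```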
Enable Trigger: A trigger defines how frequently a streaming query should be executed.
Processing Time: Appears only when the Enable Trigger checkbox is selected. Processing Time is the trigger time interval, in minutes or seconds.
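In Spark terms this is a processing-time trigger; a sketch with a hypothetical 10-second interval:

```python
# Assumes an active SparkSession `spark`; toy source standing in for the pipeline's Data Source.
stream_df = spark.readStream.format("rate").load()

query = (stream_df.writeStream
    .trigger(processingTime="10 seconds")  # hypothetical interval
    .format("console")                     # stand-in sink for illustration
    .start())
```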

Add Configuration: Enables you to configure custom properties.
Schema Results
Column Name: Name of the column populated from the selected table.

Mapping Value: Map a corresponding value to the column.

Data Type: Data type of the mapped value.
Ignore All: Select the Ignore All checkbox to ignore all the Schema Results, or select the checkbox adjacent to a column to ignore that column from the Schema Results. Ignore All or the selected fields are applied while pushing data to the emitter.

Add Partition Column: Adds the selected field as part of the partition fields while creating the table.
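Adding a partition column corresponds to partitioning the table on that field; a minimal sketch with a hypothetical event_date column:

```python
# Assumes an active SparkSession `spark`; `df` contains the chosen partition field.
df = spark.createDataFrame([(1, "2024-01-01")], ["id", "event_date"])

(df.write
    .partitionBy("event_date")
    .format("orc")
    .mode("append")
    .saveAsTable("demo_db.events_partitioned"))
```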
Auto Fill: Automatically populates and maps all incoming schema fields to the fetched table columns. The left side shows the table columns and the right side shows the incoming schema fields. If a field with the same name as a table column is not found in the incoming schema, the first field is selected by default.

Download Mapping: Downloads the mappings of schema fields and table columns to a file.

Upload Mapping: Uploading the mapping file automatically populates the table columns and schema fields.

Click on the Next button and enter notes in the space provided.

Click on the DONE button after entering all the details.
