Hive Emitter
Hive emitter allows you to store streaming or batch data into HDFS as Hive tables. Hive queries can then be used to retrieve the stored data.
To configure a Hive emitter, provide the database name, the table name, and the list of schema fields to be stored. The data rows are stored in the Hive table, in the specified format, inside the provided database.
You must have the necessary permissions to create table partitions and to write to the partitioned tables.
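For orientation, the snippet below is a minimal Spark (Scala) sketch of the kind of write such a configuration amounts to, assuming a Hive-enabled Spark session and hypothetical names (salesdb, sales_orders, staging.incoming_orders, and the selected columns); the actual emitter is configured entirely through the UI fields described below.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hive support is required so that saveAsTable writes into the Hive metastore.
val spark = SparkSession.builder()
  .appName("hive-emitter-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Stand-in for the incoming data; the selected columns play the role of Output Fields.
val batchDF = spark.table("staging.incoming_orders")
  .select("order_id", "customer_id", "amount", "order_date")

batchDF.write
  .mode(SaveMode.Append)               // Save Mode
  .format("orc")                       // Format
  .partitionBy("order_date")           // Add partition Column
  .saveAsTable("salesdb.sales_orders") // Database Name + Table Name
```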
Hive Emitter Configuration
To add a Hive emitter to your pipeline, drag it onto the canvas, connect it to a Data Source or processor, and right-click on it to configure.
Field | Description |
---|---|
Save as Dataset | Select the checkbox to save the schema as a Dataset. |
Connection Name | All Hive connections will be listed here. Select a connection for connecting to Hive. |
Checkpoint Storage Location | Select the checkpointing storage location. Available options are HDFS, S3, and EFS. |
Checkpoint Connections | Select the connection. Connections are listed corresponding to the selected storage location. |
Checkpoint Directory | The path where the Spark application stores the checkpointing data. For HDFS and EFS, enter a relative path such as /user/hadoop/checkpointingDir; the system adds a suitable prefix by itself. For S3, enter an absolute path such as: S3://BucketName/checkpointingDir |
Time-Based Check Point | Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., on every run the checkpoint location provided above is appended with the current time in milliseconds. |
Database Name | Hive database name. |
Table Name | Hive table name. |
Output Fields | Fields in the schema that need to be a part of the output data. |
Lower Case | Converts all the selected partition columns to lower case while writing data into Hive. |
Format | TEXT: Stores information as plain text; the space (' ') delimiter is not supported in TEXT format. ORC: Optimized Row Columnar, which stores data in a more optimized way than the other file formats. AVRO: Stores the data definition in JSON format, making it easy to read and interpret. PARQUET: Stores nested data structures in a flat columnar format. On HDP 3.1.0, the table is created as a managed table for the ORC format, while an external table is created for the other formats (Avro, Parquet, and Text). |
Delimiter | Message field separator. |
Output Mode | Output mode to be used while writing the data to the streaming sink (see the streaming sketch after this table). Append: only the new rows in the streaming data are written to the sink. Complete Mode: all the rows in the streaming data are written to the sink every time there are updates. Update Mode: only the rows that were updated in the streaming data are written to the sink every time there are updates. |
Save Mode | Save Mode specifies the expected behavior of saving data to a data sink. ErrorifExist: when persisting data, if the data already exists, an exception is expected to be thrown. Append: when persisting data, if the data/table already exists, the contents of the schema are expected to be appended to the existing data. Overwrite: when persisting data, if the data/table already exists, the existing data is expected to be overwritten by the new contents. Ignore: when persisting data, if the data/table already exists, the save operation neither saves the new contents nor changes the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL. |
Replication | Sets the number of copies of your data on the underlying Hadoop file system. For example, if you specify “2” as Replication, two copies are created on HDFS. |
Enable Trigger | Trigger defines how frequently a streaming query should be executed. |
Processing Time | Appears only when the Enable Trigger checkbox is selected. Processing Time is the trigger time interval in minutes or seconds. |
Add Configuration | Enables you to configure additional custom properties. |
Schema Results | |
Column Name | Name of the column populated from the selected Table. |
Mapping Value | Map a corresponding value to the column. |
Data Type | Data type of the Mapped Value. |
Ignore All | Select the Ignore All checkbox to ignore all the Schema Results, or select the checkbox adjacent to a column to ignore that column from the Schema Results. Use Ignore All or the selected fields while pushing data to the emitter. |
Add partition Column | Adds the selected field as part of the partition fields while creating the table. |
Auto Fill | Auto Fill automatically populates and maps all incoming schema fields to the fetched table columns. The left side shows the table columns and the right side shows the incoming schema fields. If a field with the same name as a table column is not found in the incoming schema, the first field is selected by default. |
Download Mapping | Downloads the mappings of schema fields and table columns as a file. |
Upload Mapping | Uploading the mapping file automatically populates the table columns and schema fields. |
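The streaming-related fields above (Checkpoint Directory, Output Mode, Enable Trigger, Processing Time) correspond to standard Spark Structured Streaming options. Continuing the earlier sketch, the snippet below shows one possible way such a streaming write could look; `streamDF`, the rate source, and the foreachBatch approach are illustrative assumptions, not necessarily how the emitter is implemented.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// `streamDF` stands in for the pipeline's incoming stream; a rate source with
// renamed columns is used here only so the sketch is self-contained.
val streamDF = spark.readStream.format("rate").load()
  .selectExpr(
    "value AS order_id",
    "CAST(value AS STRING) AS customer_id",
    "CAST(value AS DOUBLE) AS amount",
    "CAST(timestamp AS DATE) AS order_date")

val query = streamDF.writeStream
  .outputMode("append")                                          // Output Mode
  .trigger(Trigger.ProcessingTime("30 seconds"))                 // Enable Trigger / Processing Time
  .option("checkpointLocation", "/user/hadoop/checkpointingDir") // Checkpoint Directory
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Persist each micro-batch into the Hive table used above.
    batch.write
      .mode("append")
      .format("orc")
      .partitionBy("order_date")
      .saveAsTable("salesdb.sales_orders")
  }
  .start()

query.awaitTermination()
```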
Click on the Next button. Enter the notes in the space provided.
Click on the DONE button after entering all the details.