Hive Emitter
Hive emitter allows you to store streaming or batch data into HDFS as Hive tables. Hive queries can then be used to retrieve the stored data.
To configure a Hive emitter, provide the database name, the table name, and the list of schema fields to be stored. The data rows are stored in the Hive table, in the specified format, inside the provided database.
You must have the necessary permissions to create table partitions and to write to the partitioned tables.
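For orientation, the snippet below is a minimal Spark (Scala) sketch of the kind of write such a configuration amounts to, assuming a Hive-enabled Spark session and hypothetical names (salesdb, sales_orders, staging.incoming_orders, and the selected columns); the actual emitter is configured entirely through the UI fields described below.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hive support is required so that saveAsTable writes into the Hive metastore.
val spark = SparkSession.builder()
  .appName("hive-emitter-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Stand-in for the incoming data; the selected columns play the role of Output Fields.
val batchDF = spark.table("staging.incoming_orders")
  .select("order_id", "customer_id", "amount", "order_date")

batchDF.write
  .mode(SaveMode.Append)               // Save Mode
  .format("orc")                       // Format
  .partitionBy("order_date")           // Add partition Column
  .saveAsTable("salesdb.sales_orders") // Database Name + Table Name
```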
Hive Emitter Configuration
To add a Hive emitter to your pipeline, drag it onto the canvas, connect it to a Data Source or processor, and right-click on it to configure.
Field | Description |
---|---|
Save as Dataset | Select the checkbox to save the schema as a Dataset. |
Connection Name | All Hive connections will be listed here. Select a connection for connecting to Hive. |
Checkpoint Storage Location | Select the checkpointing storage location. Available options are HDFS, S3, and EFS. |
Checkpoint Connections | Select the connection. Connections are listed corresponding to the selected storage location. |
Checkpoint Directory | The path where the Spark application stores the checkpointing data. For HDFS and EFS, enter a relative path such as /user/hadoop/checkpointingDir; the system adds a suitable prefix by itself. For S3, enter an absolute path such as: S3://BucketName/checkpointingDir |
Time-Based Check Point | Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., on every run the checkpoint location provided above is appended with the current time in milliseconds. |
Database Name | Hive database name. |
Table Name | Hive table name. |
Output Fields | Fields in the schema that need to be a part of the output data. |
Lower Case | Converts all the selected partition columns to lower case while writing data into Hive. |
Format | TEXT: Stores information as plain text; the space (' ') delimiter is not supported in TEXT format. ORC: Optimized Row Columnar, which stores data in a more optimized way than the other file formats. AVRO: Stores the data definition in JSON format, making it easy to read and interpret. PARQUET: Stores nested data structures in a flat columnar format. On HDP 3.1.0, the table is created as a managed table for the ORC format, while an external table is created for the other formats (Avro, Parquet, and Text). |
Delimiter | Message field separator. |
Output Mode | Output mode to be used while writing the data to the streaming sink (see the streaming sketch after this table). Append: only the new rows in the streaming data are written to the sink. Complete Mode: all the rows in the streaming data are written to the sink every time there are updates. Update Mode: only the rows that were updated in the streaming data are written to the sink every time there are updates. |
Save Mode | Save Mode specifies the expected behavior of saving data to a data sink. ErrorifExist: when persisting data, if the data already exists, an exception is expected to be thrown. Append: when persisting data, if the data/table already exists, the contents of the schema are expected to be appended to the existing data. Overwrite: when persisting data, if the data/table already exists, the existing data is expected to be overwritten by the new contents. Ignore: when persisting data, if the data/table already exists, the save operation neither saves the new contents nor changes the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL. |
Replication | Sets the number of copies of your data on the underlying Hadoop file system. For example, if you specify “2” as Replication, two copies are created on HDFS. |
Enable Trigger | Trigger defines how frequently a streaming query should be executed. |
Processing Time | Appears only when the Enable Trigger checkbox is selected. Processing Time is the trigger time interval in minutes or seconds. |
Add Configuration | Enables you to configure additional custom properties. |
Schema Results | |
Column Name | Name of the column populated from the selected Table. |
Mapping Value | Map a corresponding value to the column. |
Data Type | Data type of the Mapped Value. |
Ignore All | Select the Ignore All checkbox to ignore all the Schema Results, or select the checkbox adjacent to a column to ignore that column from the Schema Results. Use Ignore All or the selected fields while pushing data to the emitter. |
Add partition Column | Adds the selected field as part of the partition fields while creating the table. |
Auto Fill | Auto Fill automatically populates and maps all incoming schema fields to the fetched table columns. The left side shows the table columns and the right side shows the incoming schema fields. If a field with the same name as a table column is not found in the incoming schema, the first field is selected by default. |
Download Mapping | Downloads the mappings of schema fields and table columns as a file. |
Upload Mapping | Uploading the mapping file automatically populates the table columns and schema fields. |
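The streaming-related fields above (Checkpoint Directory, Output Mode, Enable Trigger, Processing Time) correspond to standard Spark Structured Streaming options. Continuing the earlier sketch, the snippet below shows one possible way such a streaming write could look; `streamDF`, the rate source, and the foreachBatch approach are illustrative assumptions, not necessarily how the emitter is implemented.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// `streamDF` stands in for the pipeline's incoming stream; a rate source with
// renamed columns is used here only so the sketch is self-contained.
val streamDF = spark.readStream.format("rate").load()
  .selectExpr(
    "value AS order_id",
    "CAST(value AS STRING) AS customer_id",
    "CAST(value AS DOUBLE) AS amount",
    "CAST(timestamp AS DATE) AS order_date")

val query = streamDF.writeStream
  .outputMode("append")                                          // Output Mode
  .trigger(Trigger.ProcessingTime("30 seconds"))                 // Enable Trigger / Processing Time
  .option("checkpointLocation", "/user/hadoop/checkpointingDir") // Checkpoint Directory
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Persist each micro-batch into the Hive table used above.
    batch.write
      .mode("append")
      .format("orc")
      .partitionBy("order_date")
      .saveAsTable("salesdb.sales_orders")
  }
  .start()

query.awaitTermination()
```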
Click on the Next button. Enter the notes in the space provided.
Click on the DONE button after entering all the details.