HBase Emitter
The HBase emitter stores streaming data into HBase, which provides quick random access to huge amounts of structured data.
HBase Emitter Configuration
To add HBase emitter to your pipeline, drag it onto the canvas, connect it to a Data Source or processor, and right-click on it to configure.
Field | Description |
---|---|
Connection Name | All HBase connections will be listed here. Select a connection for connecting to HBase. |
Batch Size | If you want to index records in batches, specify the batch size here. |
Table Name Expression | JavaScript expression used to evaluate the table name. The keyspace will be formed as ns_+{tenantId}, for example, ns_1. |
Compression | Provides the facility to compress the message before storing it; the algorithm used is Snappy. When set to true, compression is enabled on the data. |
Region Splitting Definition | Defines how the HBase table should be pre-split. The default value is 'No pre-split'. The supported options are: Default (No Pre-Split): only one region is created initially. Based on Region Boundaries: regions are created based on the given key boundaries. For example, if your key is hexadecimal and you provide the value '4, 8, d', four regions are created: the 1st region for keys less than 4, the 2nd for keys from 4 up to (but not including) 8, the 3rd for keys from 8 up to (but not including) d, and the 4th for keys greater than or equal to d. A sketch of pre-splitting a table with the HBase Java client is shown after this table. |
Encoding | Data encoding type: either UTF-8 (base encoding) or BASE64 encoding. |
Row Key Generator Type | Enables generation of a custom row key. The following types of key generators are available: UUID: universally unique identifier. Key Based: the key is generated by appending the values of the selected fields. An additional field, "Key Fields", is displayed where you can select the fields you want to combine; the keys are appended in the same order as selected on the user interface. Custom: write your own logic to create the row key. For example, if you want to use a UUID key but prefix it with HSBC, you can write that logic in a Java class. If you select this option, an additional field, "Class Name", is displayed on the UI where you must enter the fully qualified class name of your Java class. You can download the sample project from the "Data Pipeline" landing page and refer to the Java class "com.yourcompany.custom.keygen.SampleKeyGenerator" to write the custom code. A sketch of such prefix-plus-UUID key logic is shown after this table. |
Column Family | Specify the name of the column family that will be used while saving your data in an HBase table. |
Emitter Output Fields | Select the emitter output fields. |
Output Fields | Fields in the message that need to be a part of the output message. |
Replication | Number of copies of your data kept on the underlying Hadoop file system. For example, if you specify "2" as Replication, two copies will be created on HDFS. |
Ignore Missing Values | Ignore or persist empty or null values of message fields in the emitter. When set to true, null values of message fields are ignored. |
Connection Retries | The number of retries for the component connection. Possible values are -1, 0, or a positive number, where -1 denotes infinite retries. |
Delay Between Connection Retries | Defines the retry delay interval for the component connection, in milliseconds. |
Enable TTL | Specifies the lifetime of a record. When selected, a record persists for the duration you specify in the TTL Value field. A sketch of applying a per-record TTL through the HBase client is shown after this table. |
TTL Type | Specify the TTL type as Static or Field Value. |
TTL Value | Provide the TTL value in seconds for the Static TTL type, or the name of an integer field for the Field Value type. |
Checkpoint Storage Location | Select the checkpointing storage location. Available options are HDFS, S3, and EFS. |
Checkpoint Connections | Select the connection. Connections are listed corresponding to the selected storage location. |
Checkpoint Directory | It is the path where the Spark application stores the checkpointing data. For HDFS and EFS, enter a relative path like /user/hadoop/checkpointingDir; the system will add a suitable prefix by itself. For S3, enter an absolute path like s3://BucketName/checkpointingDir. |
Time-Based Check Point | Select the checkbox to enable a time-based checkpoint on each pipeline run, i.e., on each run the checkpoint location provided above is appended with the current time in milliseconds. |
Output Mode | Output mode to be used while writing the data to the streaming sink. Select the output mode from the three given options: Append: only the new rows in the streaming data are written to the sink. Complete: all the rows in the streaming data are written to the sink every time there are updates. Update: only the rows that were updated in the streaming data are written to the sink every time there are updates. |
Enable Trigger | Trigger defines how frequently a streaming query should be executed. |
ADD CONFIGURATION | Enables configuring additional properties. For example, index_field and store_field support can be added using ADD CONFIGURATION. |
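
To illustrate what the Region Splitting Definition and Compression settings correspond to in HBase itself, here is a minimal sketch of creating a pre-split table with a Snappy-compressed column family through the standard HBase Java client. The namespace, table name, column family, and split points are illustrative values taken from the example above, not names used by the emitter; the emitter performs the equivalent steps internally based on your configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Region boundaries '4, 8, d' produce four regions:
            // (<4), [4,8), [8,d), (>=d)
            byte[][] splitKeys = {
                Bytes.toBytes("4"),
                Bytes.toBytes("8"),
                Bytes.toBytes("d")
            };

            // Column family with Snappy compression, as the Compression field enables.
            // Assumes the namespace "ns_1" already exists.
            TableDescriptorBuilder table =
                TableDescriptorBuilder.newBuilder(TableName.valueOf("ns_1", "events"))
                    .setColumnFamily(
                        ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                            .setCompressionType(Compression.Algorithm.SNAPPY)
                            .build());

            admin.createTable(table.build(), splitKeys);
        }
    }
}
```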
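For the Custom row key generator, the actual interface to implement comes from the sample project mentioned above (see com.yourcompany.custom.keygen.SampleKeyGenerator). The sketch below only illustrates the kind of logic such a class would contain, i.e., prefixing a UUID with a fixed string such as HSBC; the class and method names here are placeholders, not the Gathr interface.

```java
import java.util.UUID;

// Placeholder class illustrating custom row key logic. In a real pipeline this logic
// would live in the class you register under "Class Name"; refer to the sample
// project's SampleKeyGenerator for the interface to implement.
public class PrefixedUuidKeyExample {

    // Builds a row key such as "HSBC-3f2b1c9e-...".
    public static String generateKey(String prefix) {
        return prefix + "-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(generateKey("HSBC"));
    }
}
```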
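The Enable TTL and TTL Value options limit how long an emitted record is kept. Below is a minimal sketch of how a per-record TTL is applied through the standard HBase client Put API; the table, column family, and field names are illustrative, and note that the client API expects the TTL in milliseconds while the emitter's TTL Value field takes seconds.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TtlPutExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("ns_1", "events"))) {

            long ttlSeconds = 3600; // a static TTL of one hour, as entered in "TTL Value"

            Put put = new Put(Bytes.toBytes("HSBC-" + java.util.UUID.randomUUID()));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("payload"), Bytes.toBytes("{...}"));
            put.setTTL(ttlSeconds * 1000L); // HBase client expects milliseconds

            table.put(put);
        }
    }
}
```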
Click on the Next button. Enter the notes in the space provided.
Click on the DONE button to save the configuration.