Configuring Cloudera to Support Lineage

To publish lineage to Cloudera Navigator, configure Gathr with the following properties.

Configuration Properties

To open the configuration properties, go to Superuser > Configuration > Others > Cloudera.

These properties are required to enable publishing of lineage to Cloudera Navigator:

Property: Description
Navigator URL: The HTTP URL to the Cloudera Navigator UI.
Navigator API version: The Navigator SDK API version. (read below)
Navigator Admin User-name: The Cloudera Navigator Admin user.
Navigator Admin user Password: The Cloudera Navigator Admin password.
Autocommit enabled: Specifies if auto commit of entities is required.

[Image: cloudera]
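As a rough illustration of what these five properties hold, the sketch below models them in Python and checks them for obvious mistakes before they are entered in Gathr. The class name, field names, and validation rules are hypothetical and are not part of Gathr or Cloudera Navigator; the host, port, and version values in the usage note are placeholders.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class NavigatorSettings:
    """Hypothetical container mirroring the five properties listed above."""
    navigator_url: str        # Navigator URL
    api_version: str          # Navigator API version
    admin_username: str       # Navigator Admin User-name
    admin_password: str       # Navigator Admin user Password
    autocommit_enabled: bool  # Autocommit enabled


def validate(settings: NavigatorSettings) -> list[str]:
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    parsed = urlparse(settings.navigator_url)
    # The URL must carry an explicit scheme and a host part.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("Navigator URL must be an http(s) URL with a host")
    if not settings.api_version.strip():
        problems.append("Navigator API version is empty")
    if not settings.admin_username.strip():
        problems.append("Navigator Admin User-name is empty")
    if not settings.admin_password:
        problems.append("Navigator Admin user Password is empty")
    return problems
```

For example, `validate(NavigatorSettings("http://navigator.example.com:7187", "13", "admin", "secret", True))` returns an empty list, while a URL without a scheme or an empty password is reported as a problem.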

The Navigator API version can be found in the Cloudera Navigator UI. Click the question mark in the top-right corner of the page, next to the user name, and select ‘About’.

[Image: admin]

Enter the Cloudera Navigator version displayed there as the Navigator API version, as shown in the image below:

[Image: cloud_version]

Configuring Lineage in Data Pipeline

The next step is to enable lineage publishing to Cloudera Navigator when you save the pipeline.

Select the ‘Publish Lineage to Cloudera Navigator’ check box in the Pipeline Definition window, as shown below:

[Image: pipe_def]

You can publish lineage for both batch and streaming pipelines in Gathr.

The ‘Publish lineage to Cloudera Navigator’ checkbox is enabled by default for every pipeline.

To publish lineage, update the pipeline and then start it.

Viewing Pipeline Lineage on Navigator

The lineage of a pipeline is published to Cloudera Navigator once the pipeline is started.

In this example, the RabbitMQ Data Source configuration is as follows:

The Cloudera Navigator URL is as follows: with credentials (usually admin/admin).

Once the pipeline is active, go to the Cloudera Navigator UI and search for the pipeline name ‘sales_aggregation’.

[Image: cloudera]

The components of the pipeline are listed as shown above. Note that the source and target RabbitMQ components are each shown as two entities in Navigator: Rabbitmq and Rabbitmq_dataset. Rabbitmq_dataset shows the schema of the data flowing in Gathr, and Rabbitmq shows the metadata of the actual RabbitMQ component.

Click the ‘Aggregation’ entity listed above. When the ‘Aggregation’ entity page opens, click the ‘Lineage’ tab, as shown below:

[Image: clouderat]

Similarly, you can go to the Cloudera Navigator search page, search by the RabbitMQ Data Source’s queue name or exchange name, and view the lineage by clicking the RabbitMQ entity.

[Image: sale_data]
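The searches described above can also be scripted instead of typed into the UI. The sketch below only builds the request URL. It assumes that Navigator's REST API exposes entity search at /api/<version>/entities with a `query` parameter, and that pipeline names can be matched through an `originalName` field; neither is confirmed by this guide. The host, port, version string `v13`, and the function name are placeholders, so check them against the API documentation for your Navigator release.

```python
from urllib.parse import quote, urlencode


def entity_search_url(base_url: str, api_version: str, field: str, value: str) -> str:
    """Build a search URL of the assumed form <base>/api/<version>/entities?query=<field>:<value>."""
    query = f"{field}:{value}"
    path = f"{base_url.rstrip('/')}/api/{quote(api_version)}/entities"
    # urlencode percent-encodes the query so that ':' and spaces are safe in the URL.
    return f"{path}?{urlencode({'query': query})}"


# Hypothetical host and API version; 'sales_aggregation' is the pipeline name used in this guide.
url = entity_search_url(
    "http://navigator.example.com:7187", "v13", "originalName", "sales_aggregation"
)
print(url)
```

The resulting URL can be passed to any HTTP client together with the Navigator admin credentials. The same helper could take a queue name or exchange name as the value, provided you know which field Navigator indexes it under.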

You can also view the schema of the message configured on the Data Source using the ‘Details’ tab of the RabbitMQ entity.

The schema fields are listed as the ‘Columns’ of the entity.

[Image: actios]

Publishing lineage to Cloudera Navigator helps you easily integrate with the existing entities.

Consider a pipeline that processes data from RabbitMQ or Kafka, enriches the data, and inserts the resulting data into a Hive table.

Now suppose an external job (not a Gathr pipeline) reads from this Hive table and processes the data further. If this external job also publishes its lineage, you can see a combined enterprise lineage on Cloudera Navigator.
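The idea of a combined lineage can be pictured as two sets of edges that meet at the shared Hive table. The toy model below is not how Navigator stores or computes lineage; every entity name, the edge lists, and both functions are invented for illustration.

```python
from collections import defaultdict


def merge_lineage(*edge_lists):
    """Combine several (source, target) edge lists into one adjacency map."""
    graph = defaultdict(set)
    for edges in edge_lists:
        for src, dst in edges:
            graph[src].add(dst)
    return graph


def downstream(graph, start):
    """Return every entity reachable from `start` by following lineage edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Hypothetical lineage published by the Gathr pipeline.
gathr_edges = [("rabbitmq_queue", "enricher"), ("enricher", "hive:sales_enriched")]
# Hypothetical lineage published by the external job reading the same Hive table.
external_edges = [("hive:sales_enriched", "external_job"), ("external_job", "hdfs:/reports")]

combined = merge_lineage(gathr_edges, external_edges)
print(sorted(downstream(combined, "rabbitmq_queue")))
```

With only the Gathr edges, the trace from the queue stops at the Hive table. Once the external job's edges are merged in, the same trace continues through the job to its output, because both edge lists refer to the same table entity.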

Viewing Lineage of HDFS and Hive

HDFS and Hive Data Sources and emitters appear as native entities in the Cloudera Navigator lineage if the configured HDFS path or Hive table already exists.

[Image: an_person]

The lineage of a native HDFS entity is shown in green, and the lineage of a native Hive entity is shown in yellow.

If the HDFS path or Hive table does not exist, the lineage is represented by a custom dataset, which is shown in grey.
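The color rules above can be restated as a small lookup. This is only the text expressed as code, not part of Gathr or Navigator; the function name and the `hdfs` and `hive` labels are made up for illustration.

```python
def lineage_color(entity_kind: str, already_exists: bool) -> str:
    """Return the color described above for an HDFS or Hive lineage entity."""
    native_colors = {"hdfs": "green", "hive": "yellow"}
    if entity_kind not in native_colors:
        raise ValueError(f"unsupported entity kind: {entity_kind!r}")
    # A missing path or table is represented by a custom (grey) dataset.
    if not already_exists:
        return "grey"
    return native_colors[entity_kind]
```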
