Datasets List

The Dataset homepage lists all the Datasets that have been created. The screenshot below shows this list, and the table that follows describes its properties:

Property	Description
Name	Name of the dataset.
Description	Description of the dataset.
Source Type	Data source type.
Actions	View Dataset: When you click the eye icon, a view-dataset window opens. On this page you can view the schema of the dataset, the versions of the dataset, and the other options explained below in hierarchical form.
Delete	Deletes a Dataset. Datasets that are not used in any pipeline or validation job can be deleted with this option. 👉 An existing Dataset that is consumed by a pipeline or validation job cannot be deleted unless it is first removed from the associated pipeline or validation job.

View Dataset

Click the view dataset (eye) icon under Actions on the Dataset homepage. A window opens with the details of the dataset, organized into two tabs: Summary and Explore.

👉 You can view both existing external Datasets and Datasets created from the pipeline canvas.

Summary

The Summary tab shows a summary of your Dataset. The right panel is the description window, where the description can be edited and saved. The left panel has the Data Source or Emitter Dataset details, where you can view schema details, connection details and other properties, as explained in the table below.

The screenshot under View Dataset shows the Dataset details and the active version of a DFS Data Source under the About button. The table below describes all the properties of the Summary tab.

Property	Description

Dataset Name


Last Read From Source	Last time the data was read from the data source. The date and time are mentioned here.
Last Profile Run	It shows when the last Dataset profile was generated. The profile is generated when you run the dataset profile from Run Profile.
File Path	The file path and data format from which the data is read while generating the profile. For a few components, the HDFS path is replaced by the Query, Database name or Table name, and the configured query is shown here. For example, in the case of JDBC, the query shown here runs while generating the profile. In the case of HDFS, the data is read from the configured path. Note: All the above properties should be re-configured when a new dataset version is created.
Number of Columns	Number of columns from the data.
Number of Records	The total number of records when the last profile was generated in the data source.
Schema and Rules	This clickable button opens a window at the bottom of this pane that displays the versions and the corresponding rules applied to the dataset. In the Schema window, you can view the Alias and the Datatype of each field in the schema.
Profile History	Number of times the profile was generated. You can view the respective results.
Version	The latest version of the dataset.
Run Profile Status	Shows the current state of the profile, that is, whether it is executing or stopped. Two buttons, Play and Stop, let you start and stop the profile. The menu button has two options, Schedule Job and Configure Job, which are explained below.
Tags	Associate tags with the dataset. Tags can also be updated from the same window.
Description	The description provided while creating a dataset. You can edit the description within this window, and a “Description updated successfully” message will pop-up.

Dataset Lineage


Select Version	Select the version of the dataset that you want to view.
Alias	Name of the field.
Datatype	Datatype of the field (for example, Int or String).

When you hover the mouse over the Path details, the name of the connection is displayed, as shown below:

Schema

Schema opens a new window beneath the Schema panel. Select the Versions of the dataset and it will list the Aliases with their Data Types.

Profile History

Profile History opens a new window beneath the View Schema panel. A tabular form of profile history is shown with details of the Dataset profile:

Property	Description
Version	Version number of the Dataset.
Number of Columns	Number of columns in the Dataset.
Number of Records	Number of records in the Dataset.
Last Profile Run	The date and time on which the Profile was run.
Action	View the profile results.

Run Profile Status

Run Profile Status shows the current state of the profile execution: Starting, Active, or Stopped.

You can Stop and Play the profile using the respective buttons as well.

When the profile runs, a pipeline is submitted on the cluster. The pipeline name follows this pattern:

System prefix of the Pipeline_Dataset Name_Dataset Version_Timestamp

For example, SAx_DatasetProfileGenerator_IrisInputData_0_1559296535220

This pipeline will be submitted as a batch job in the cluster.
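
The naming pattern above can be taken apart programmatically, for example when matching cluster jobs back to a dataset. The following is a minimal sketch, not part of the product. It assumes that the system prefix is `SAx_DatasetProfileGenerator` (as in the example), that the trailing number is an epoch timestamp in milliseconds, and the function name `parse_profile_pipeline_name` is hypothetical.

```python
from datetime import datetime, timezone

# Assumption: the system prefix is everything before the dataset name in the
# documented example. The real prefix may differ per installation.
ASSUMED_PREFIX = "SAx_DatasetProfileGenerator"


def parse_profile_pipeline_name(name, prefix=ASSUMED_PREFIX):
    """Split a profile pipeline name into its documented parts."""
    if not name.startswith(prefix + "_"):
        raise ValueError(f"name does not start with prefix {prefix!r}")
    remainder = name[len(prefix) + 1:]
    # Split from the right so dataset names containing underscores survive.
    dataset_name, version, timestamp = remainder.rsplit("_", 2)
    return {
        "prefix": prefix,
        "dataset_name": dataset_name,
        "dataset_version": int(version),
        # Assumption: the trailing number is epoch time in milliseconds.
        "submitted_at": datetime.fromtimestamp(
            int(timestamp) / 1000, tz=timezone.utc
        ),
    }


parts = parse_profile_pipeline_name(
    "SAx_DatasetProfileGenerator_IrisInputData_0_1559296535220"
)
print(parts["dataset_name"], parts["dataset_version"], parts["submitted_at"].date())
```

Splitting from the right keeps the sketch working for dataset names that themselves contain underscores.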

The options available under Run Profile Status are:

Property	Description
Cluster Configuration	Shows the details of the cluster configuration.
Configure Job	Opens the Configure Job window. The following options are available in the Configure Job window.

Name	Description
Select Cluster	Select IBM Conductor cluster for job configuration.
Instance Group Name	Select the Instance Group used to configure the Spark job.
Spark Master Url	Master URL used to submit or view the Spark job.

SPARK CONDUCTOR EGO PROPERTIES

IBM Conductor EGO Configuration

Name	Description
Executor Maximum Slots	Maximum number of executor slots.
Executor Idle Time	Specifies the duration (in seconds) for which an executor remains alive without any workload running on it.
Maximum Slots	Specifies the maximum number of slots that an application can get in the master node.
Slots Per Task	Specifies the number of slots that are allocated to a task.
GPU Max Slots	Specifies the maximum number of slots that an application can get for GPU tasks in a master node.
Priority	Specifies the priority of driver and executor scheduling for the Spark instance group. Valid range is 1-10000. Default is 5000.
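
If job settings are prepared outside the UI, the Priority constraint from the table can be checked before the values are entered. This is a small sketch under stated assumptions: the range 1-10000 is treated as inclusive, and the helper `resolve_priority` is hypothetical and not part of the product.

```python
# Values taken from the Priority row of the table above.
PRIORITY_MIN, PRIORITY_MAX, PRIORITY_DEFAULT = 1, 10000, 5000


def resolve_priority(value=None):
    """Return a valid priority, falling back to the default when unset."""
    if value is None:
        return PRIORITY_DEFAULT
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(value, int) or isinstance(value, bool):
        raise TypeError("priority must be an integer")
    # Assumption: both ends of the documented range are valid values.
    if not PRIORITY_MIN <= value <= PRIORITY_MAX:
        raise ValueError(
            f"priority must be between {PRIORITY_MIN} and {PRIORITY_MAX}"
        )
    return value


print(resolve_priority())      # falls back to the documented default
print(resolve_priority(7500))  # accepted as-is
```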

To add an environment variable, click the ADD ENVIRONMENT VARIABLE button. An ADD CONFIGURATIONS option is also available.

Click CONFIRM once the details are provided in the Configure Job window.

Along with executing the profile, you can also configure the job and schedule the job, as explained below:

Configure Profile

You can tune the job in this window by providing driver- and executor-related parameters. To know more, see the Configure Profile field in the table Actions on Pipeline.

You can also configure error properties from this window to check the errors on this job.

Schedule Job

Schedule Job enables a dataset to run a job as per the defined cron expression.

Once you define a cron expression, you have the option to schedule the job. After the job is scheduled, the UN-SCHEDULE and RESCHEDULE buttons become available.

Dataset Lineage

The Dataset Lineage window lists all the versions of the dataset and its complete life cycle.

You can view the dataset lineage by selecting the version. Data Lineage represents the association of the dataset with pipelines. It shows the channel or emitter in which the dataset is used within the data pipeline.

An association is defined if dataset schema and rules are used in the channel. This helps to use the same entities in multiple pipeline channels, as Use Existing Dataset.

In the case of an emitter, only the schema part of a dataset is associated.

In this way, the life cycle of the dataset is shown on the lineage page. The lineage represents the flow of a dataset through the pipelines in the system.

Initially, a basic lineage is shown. You then have the option to expand the dataset or pipeline lineage to get more parent-child associations and flows.

The lineage is visible on the Summary screen. The example below shows the following lineage:

JDBC_UserDS is used as a channel in PipelineBB and PipelineAA.
HDFS_EmitterDS01 is created by Save as Dataset in an emitter of PipelineAA. It is used as a channel in PipelineCC and PipelineDD.

The dataset to pipeline arrow signifies that the dataset is used in the pipeline as a channel.

The pipeline to dataset arrow represents that the dataset is saved in the emitter of the pipeline.
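
The two arrow types can be read as edges of a directed graph. The sketch below only illustrates the example lineage described above and is not how the product stores lineage. The edge list and the `downstream` helper are hypothetical.

```python
from collections import defaultdict

# Directed edges taken from the example lineage above.
# dataset -> pipeline: the dataset is used in the pipeline as a channel.
# pipeline -> dataset: the dataset is saved in an emitter of the pipeline.
EDGES = [
    ("JDBC_UserDS", "PipelineAA"),
    ("JDBC_UserDS", "PipelineBB"),
    ("PipelineAA", "HDFS_EmitterDS01"),
    ("HDFS_EmitterDS01", "PipelineCC"),
    ("HDFS_EmitterDS01", "PipelineDD"),
]

children = defaultdict(list)
for source, target in EDGES:
    children[source].append(target)


def downstream(node):
    """Return every node reachable from `node`, in breadth-first order."""
    seen, queue = [], list(children.get(node, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(children.get(current, []))
    return seen


print(downstream("JDBC_UserDS"))
```

Expanding a node in the lineage view corresponds to following one more level of these edges from that node.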

Explore

Explore generates details about the data and shows them in the data window. Explore has two tabs:

Data
Profile

Data

Under the Data tab, you can view the rules, and the dataset with schema. This tab is explained under Rules.

Profile

The profile pane lists all the variables in your Dataset. This section also shows various statistical insights on each variable like Avg, Min, Max, Percentile, etc. You can also click on the ‘Frequency Distribution Details’ Label to see the frequency distribution corresponding to every variable.

Frequency Distribution Details:

The frequency distribution of an attribute or field is the count of each individual value of that field in the whole Dataset.

For Numeric fields, it is shown in terms of counts only.

For String, Date and Timestamp fields, you can view the counts along with their percentages.

By default, only 10 distinct values are shown. This limit can be changed by updating sax.datasets.profile.frequency.distribution.count.limit from the Superuser Configuration.
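
To make the definition concrete, the following sketch computes a frequency distribution with counts, optional percentages, and a limit on the number of distinct values. It is an illustration only, not the product's implementation. The function name is hypothetical, and keeping the most frequent values when the limit applies is an assumption.

```python
from collections import Counter

# Mirrors the documented default of 10 distinct values.
DEFAULT_LIMIT = 10


def frequency_distribution(values, limit=DEFAULT_LIMIT, with_percentage=False):
    """Count each distinct value and keep at most `limit` entries."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    # Assumption: the most frequent values are the ones kept.
    for value, count in counts.most_common(limit):
        row = {"value": value, "count": count}
        if with_percentage:
            row["percentage"] = round(100 * count / total, 2)
        rows.append(row)
    return rows


# String field: counts with percentages.
species = ["setosa", "setosa", "versicolor", "virginica"]
print(frequency_distribution(species, with_percentage=True))

# Numeric field: counts only.
print(frequency_distribution([5.1, 4.9, 5.1, 5.0]))
```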

As shown below, you can click on the bar of Frequency Distribution and it expands with a graph.

The Frequency Distribution Graph is generated for every variable in the dataset.
