Datasets List

The Dataset homepage shows a list of all the Datasets that have been created. The screenshot below shows this listing, followed by a table describing its properties:

[Screenshot: Dataset listing]

[Screenshot: Delete Dataset option]

| Property | Description |
|---|---|
| Name | Name of the dataset. |
| Description | Description of the dataset. |
| Source Type | Data source type. |
| Actions | Actions that can be performed on the dataset (View Dataset and Delete), explained below. |

View Dataset

When you click the eye icon, a View Dataset window opens. On this page you can view the schema of the dataset, its versions, and other options, explained below in hierarchical form.

Delete

The Delete option removes a Dataset. Only Datasets that are not used in any pipeline or validation job can be deleted with this option.

View Dataset

Click the View Dataset eye icon under the Actions column of the Dataset homepage. A window opens with the details of the dataset and the following two tabs: Summary and Explore.

[Screenshot: View Dataset window]

Summary

The Summary tab shows the summary of your Dataset. The right panel is the description window, which can be edited and saved in the same window. The left panel shows the Data Source or Emitter Dataset details, where you can view schema details, connection details, and other properties, explained in the table below:

The screenshot under View Dataset shows the Dataset details and the active version of a DFS Data Source under the About button. The table below describes all the properties of the Summary tab.

| Property | Description |
|---|---|
| Dataset Name | Name of the dataset. |
| Last Read From Source | The last time the data was read from the data source, shown with date and time. |
| Last Profile Run | When the last Dataset profile was generated. The profile is generated when you run the dataset profile from Run Profile. |
| File Path | The file path and data format from which the data is read while generating the profile. In a few components, the HDFS path is replaced by a Query/Database name or Table name, and the configured query is reflected here. For example, in the case of JDBC, the query mentioned here is run while generating the profile; in the case of HDFS, the configured path is where the data is read from (see the sketch below this table). Note: All the above properties must be re-configured when a new dataset version is created. |
| Number of Columns | Number of columns in the data. |
| Number of Records | The total number of records in the data source when the last profile was generated. |
| Schema and Rules | This clickable button opens a window at the bottom of the pane and displays the versions and corresponding rules applied on the dataset. In the Schema window, you can view the Alias and the Datatype of the schema. |
| Profile History | Number of times the profile was generated. You can view the respective results. |
| Version | The latest version of the dataset. |
| Run Profile Status | Shows the current state of the profile, whether it is executing or stopped. There are two buttons, Play and Stop, which start and stop the profile. The menu button has two options, Schedule Job and Configure Job, explained below. |
| Tags | Associate tags with the dataset. Tags can also be updated from the same window. |
| Description | The description provided while creating the dataset. You can edit the description within this window, and a “Description updated successfully” message pops up. |
| Dataset Lineage | The lineage of the dataset across its versions and pipelines, explained under Dataset Lineage below. |
| Select Version | Select the version of the dataset that you want to view. |
| Alias | Name of the field. |
| Datatype | Datatype of the field (for example, Int, String). |
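
As a hedged illustration of how profile generation reads data depending on the source type, the sketch below assumes a PySpark session; the JDBC URL, query, credentials, and HDFS path are placeholders for illustration, not values or code from the product.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataset-profile-read-sketch").getOrCreate()

# JDBC-type dataset: the configured query runs against the database while the
# profile is generated (URL, query, and credentials below are placeholders,
# and the JDBC driver jar must be available on the classpath).
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")  # placeholder URL
    .option("query", "SELECT * FROM user_ds")                # placeholder query
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# HDFS-type dataset: data is read from the configured path in the configured format.
hdfs_df = spark.read.csv("hdfs://namenode:8020/data/iris/", header=True, inferSchema=True)
```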

When you hover over the Path details, the name of the connection is displayed, as shown below:

[Screenshot: Connection name shown on hover]

Schema

Schema opens a new window beneath the Schema panel. Select a version of the dataset to list its Aliases with their Data Types.

[Screenshot: Version list with schema]

Profile History

Profile History opens a new window beneath the View Schema panel. The profile history is shown in a table with details of the Dataset profile:

| Property | Description |
|---|---|
| Version | Version number of the Dataset. |
| Number of Columns | Number of columns in the Dataset. |
| Number of Records | Number of records in the Dataset. |
| Last Profile Run | The date and time on which the profile was run. |
| Action | View the profile results. |

Run Profile Status

Run Profile Status shows the current state of the profile execution, that is, whether it is in Starting, Active, or Stopped mode.

You can Stop and Play the profile using the respective buttons as well.

[Screenshot: Run Profile Status buttons]

When the profile runs, a pipeline is submitted on the cluster. This pipeline is named using the following nomenclature:

System Prefix of the Pipeline_Dataset Name_Dataset Version_Timestamp

For example, SAx_DatasetProfileGenerator_IrisInputData_0_1559296535220

This pipeline will be submitted as a batch job in the cluster.
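
As a minimal sketch of this naming scheme (the prefix value and the exact assembly order are inferred from the example above, not confirmed by the product):

```python
import time

def profile_pipeline_name(dataset_name: str, dataset_version: int,
                          prefix: str = "SAx_DatasetProfileGenerator") -> str:
    """Builds a pipeline name of the assumed form
    <system prefix>_<dataset name>_<dataset version>_<epoch millis>."""
    timestamp_ms = int(time.time() * 1000)  # e.g. 1559296535220
    return f"{prefix}_{dataset_name}_{dataset_version}_{timestamp_ms}"

# Reproduces the shape of the documented example:
# SAx_DatasetProfileGenerator_IrisInputData_0_1559296535220
print(profile_pipeline_name("IrisInputData", 0))
```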

The options available under Run Profile Status are:

| Property | Description |
|---|---|
| Cluster Configuration | The details of the cluster configuration are available here. |
| Configure Job | Opens the Configure Job window. The options below are available on the Configure Job window. |

| Name | Description |
|---|---|
| Select Cluster | Select the IBM Conductor cluster for job configuration. |
| Instance Group Name | Select the Instance Group to configure the Spark job. |
| Spark Master Url | Master URL to submit or view the Spark job. |

SPARK CONDUCTOR EGO PROPERTIES

IBM Conductor EGO Configuration

| Name | Description |
|---|---|
| Executor Maximum Slots | Maximum number of executor slots. |
| Executor Idle Time | Specifies the duration (in seconds) for which an executor remains alive without any workload running on it. |
| Maximum Slots | Specifies the maximum number of slots that an application can get in the master node. |
| Slots Per Task | Specifies the number of slots that are allocated to a task. |
| GPU Max Slots | Specifies the maximum number of slots that an application can get for a GPU task in a master node. |
| Priority | Specifies the priority of driver and executor scheduling for the Spark instance group. Valid range is 1-10000. Default is 5000. |

To add an environment variable, click the ADD ENVIRONMENT VARIABLE button. An ADD CONFIGURATIONS option is also available.

Click CONFIRM once details are provided in the Configure Job window.
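
Purely as an illustration of how these Configure Job fields fit together, the sketch below groups them in a plain Python dictionary; the field names come from the tables above, but the values and the dictionary shape are assumptions, not the product's actual API or payload.

```python
# Illustrative only: groups the Configure Job and IBM Conductor EGO fields
# described in the tables above. Values are placeholders.
configure_job = {
    "Select Cluster": "ibm-conductor-cluster",        # assumed cluster name
    "Instance Group Name": "spark-instance-group-1",  # assumed instance group
    "Spark Master Url": "spark://master-host:7077",   # placeholder master URL
    "EGO Properties": {
        "Executor Maximum Slots": 4,
        "Executor Idle Time": 60,   # seconds
        "Maximum Slots": 8,
        "Slots Per Task": 1,
        "GPU Max Slots": 0,
        "Priority": 5000,           # default value per the table above
    },
    "Environment Variables": {"JAVA_HOME": "/usr/lib/jvm/java-8"},  # example only
}
```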


Along with executing the profile, you can also configure the job and schedule the job, as explained below:

Configure Profile

You can tune the job in this window by providing driver- and executor-related parameters. To know more, see the Configure Profile field in the Actions on Pipeline table.

[Screenshot: Configure Job window]

To check for errors on this job, you can also configure error properties from this window.

[Screenshot: Configure Job error properties]

Schedule Job

Schedule Job enables a dataset profile job to run as per the defined cron expression.

[Screenshot: Schedule Job]

Once you define a cron expression, you have the option to schedule the job. After the job is scheduled, UN-SCHEDULE and RESCHEDULE buttons become available.
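
As a hedged sketch of how a cron expression translates into run times (the scheduler's exact cron dialect is not specified here; this example uses a standard 5-field expression evaluated with the third-party croniter library, purely for illustration):

```python
from datetime import datetime
from croniter import croniter  # pip install croniter

# "Every day at 02:30" in standard 5-field cron:
# minute hour day-of-month month day-of-week
expression = "30 2 * * *"

it = croniter(expression, datetime(2024, 1, 1))
for _ in range(3):
    # Prints the next three times the scheduled profile job would run.
    print(it.get_next(datetime))
```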

Dataset Lineage

The Dataset Lineage window lists all the versions of the dataset and its complete life cycle.

You can view the dataset lineage by selecting a version. Dataset lineage represents the association of the dataset with pipelines; it shows the channel or emitter that uses the dataset within a data pipeline.

An association is defined if the dataset schema and rules are used in a channel. This allows the same entities to be reused in multiple pipeline channels via Use Existing Dataset.

In the case of an emitter, only the schema part of a dataset is associated.

In this way, the life cycle of the dataset is shown on the lineage page, representing the flow of the dataset through the system and its pipelines.

Initially, a basic lineage is shown. Then you have the option to expand the dataset or pipeline lineage to get more parent-child associations and flows.

The lineage is visible on the Summary screen. The example below shows the following lineage:

  • JDBC_UserDS is used in PipelineBB and PipelineAA as a channel.

  • HDFS_EmitterDS01 is created by Save As Dataset in the emitter of PipelineAA, and is used as a channel in PipelineCC and PipelineDD.

The dataset to pipeline arrow signifies that the dataset is used in the pipeline as a channel.

The pipeline to dataset arrow represents that the dataset is saved in the emitter of the pipeline.

[Screenshot: Dataset lineage]
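
A minimal sketch of the same example lineage as a directed graph, using only the names from the bullets above; the edge labels "channel" and "emitter" mirror the arrow meanings, and the data structure itself is illustrative, not how the product stores lineage.

```python
# Edges point in the direction of the arrows described above.
# dataset -> pipeline : the dataset is used in the pipeline as a channel
# pipeline -> dataset : the pipeline's emitter saves the dataset
lineage_edges = [
    ("JDBC_UserDS", "PipelineAA", "channel"),
    ("JDBC_UserDS", "PipelineBB", "channel"),
    ("PipelineAA", "HDFS_EmitterDS01", "emitter"),
    ("HDFS_EmitterDS01", "PipelineCC", "channel"),
    ("HDFS_EmitterDS01", "PipelineDD", "channel"),
]

def downstream(node: str) -> list[str]:
    """Returns the immediate children of a node, i.e. one expansion
    step of the lineage view."""
    return [dst for src, dst, _ in lineage_edges if src == node]

print(downstream("JDBC_UserDS"))   # ['PipelineAA', 'PipelineBB']
print(downstream("PipelineAA"))    # ['HDFS_EmitterDS01']
```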

Explore

Explore generates details about the data in the data window. Under Explore there are two tabs:

  • Data

  • Profile

Data

Under the Data tab, you can view the rules and the dataset with its schema. This tab is explained under Rules.

Profile

The Profile pane lists all the variables in your Dataset. This section also shows various statistical insights for each variable, such as Avg, Min, Max, and Percentile. You can also click the ‘Frequency Distribution Details’ label to see the frequency distribution corresponding to every variable.
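
As a hedged sketch of the kind of per-variable statistics the Profile pane reports, the PySpark calls below compute Avg, Min, Max, and approximate percentiles; the column name and file path are assumptions for illustration, and this is not the product's profiling implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profile-stats-sketch").getOrCreate()
# Placeholder path; assumes an Iris-like CSV with a numeric column named "sepal_length".
df = spark.read.csv("hdfs://namenode:8020/data/iris/", header=True, inferSchema=True)

# Avg, Min, Max for one numeric variable.
df.select(
    F.avg("sepal_length").alias("avg"),
    F.min("sepal_length").alias("min"),
    F.max("sepal_length").alias("max"),
).show()

# Approximate 25th/50th/75th percentiles (last argument is the relative error).
print(df.approxQuantile("sepal_length", [0.25, 0.5, 0.75], 0.01))
```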

Frequency Distribution Details:

The frequency distribution of an attribute/field is the count of each individual value of that field across the whole Dataset.

For numeric fields, it is shown in terms of counts only.

For String/Date/Timestamp fields, you can view the frequency/counts along with their percentage.

By default, only 10 distinct values are shown. This limit can be changed by updating sax.datasets.profile.frequency.distribution.count.limit in the Superuser Configuration.
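
A minimal sketch of what such a frequency distribution computes, assuming a PySpark DataFrame and an illustrative string column; the default limit of 10 mirrors the configuration property above, and nothing here is the product's own code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("frequency-distribution-sketch").getOrCreate()
# Placeholder path; assumes an Iris-like CSV with a string column named "species".
df = spark.read.csv("hdfs://namenode:8020/data/iris/", header=True, inferSchema=True)

limit = 10          # default distinct-value limit, per the configuration property above
total = df.count()

# Count per distinct value plus its percentage of all records,
# keeping only the top `limit` values (as the Profile pane does for string fields).
freq = (
    df.groupBy("species")
    .count()
    .withColumn("percentage", F.round(F.col("count") / total * 100, 2))
    .orderBy(F.col("count").desc())
    .limit(limit)
)
freq.show()
```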

As shown below, you can click on a bar of the Frequency Distribution and it expands into a graph.

The Frequency Distribution Graph is generated for every variable in the dataset.

[Screenshot: Frequency Distribution graph]
