Create Datasets
A dataset can be created while configuring a Data Source or an Emitter within a data pipeline.
While configuring a Data Source, an option to create a dataset is available under Detect Schema (as shown in the image below).
Select Save As Dataset and enter a unique name for the dataset.
While configuring an Emitter, the option to save a dataset is under Configuration, as shown:
The dataset name must be unique; otherwise the error message “Dataset name already exists” is displayed.
For both Data Sources and Emitters, you can override the credentials of the selected connection on the Configuration tab.
If you do so, provide other valid credentials for connection authentication.
Batch DFS Data Source and Batch DFS Emitter
Hive Data Source and Hive Emitter
JDBC Data Source and JDBC Emitter
S3 Data Source
Dataset Version and Disassociation
Whenever a dataset is created, a version is created for it. Editing the components that enable dataset versioning creates a new version. Within a dataset itself, the only change you can make is editing its rules.
Dataset versions are also created when the schema or rules change; the schema, rules, and lineage are then listed per version.
The following changes will create a new version of a dataset:
Modifying the rules of a dataset will create a new version.
Changing the dataset schema at a pipeline emitter will create a new version (when the dataset is created from the emitter).
Changing the configuration of a dataset at a pipeline emitter will disassociate the dataset.
When you create a dataset on an existing channel or edit the configuration of the components mentioned above, the system shows a warning so that you do not accidentally overwrite an already created dataset:
The final confirmation of dataset creation is shown in green:
Once a new version is created, you can view the version list under View Dataset > Summary. A sub-section opens in the same window with the version list and the schema details.
Schema Details consist of Alias and Datatype.
Alias is the name of the field and Datatype is the data type of the field value, as shown in the above screenshot.
- Dataset names must be unique across the workspace.
External Dataset
A dataset is a combination of schema and a set of rules on the schema. You can create and reuse a dataset so that the pipeline runs with a pre-defined schema and a set of rules that are already configured in the dataset.
You can create a dataset externally from the Dataset option within a project.
An external dataset helps you gain insights into the data and analyze each attribute of the dataset. The dataset can then be used in the desired pipelines.
Below are the steps to create and view datasets externally:
Create an External Dataset
To create a new dataset, click the plus icon at the top right of the screen.
Under General, enter the name of the dataset to be created, followed by a description, and click Next (as shown below).
- The dataset name must be unique.
Source Configuration
You can create an external dataset through the following data sources: ADLS, BigQuery, File System, GCS, HDFS, Hive, JDBC, S3, and SFTP.
Create a Dataset through ADLS
Provide values for the following properties:
Field | Description |
---|---|
Source | The source for which the dataset is to be created. Available options in the drop-down list are: ADLS, BigQuery, File System, GCS, HDFS, Hive, JDBC, S3, SFTP. Here, select ADLS. |
Connection Name | Connections are the service identifiers. Select the connection from which you would like to read the data from the available list of connections. |
Override Credentials | Check this option to override the existing connection credentials. |
Authentication Type | Azure ADLS authentication type. |
Account Name | Provide a valid Azure ADLS account name. |
Account Key | Provide a valid account key. You can test the connection by clicking the TEST CONNECTION button. |
Container | Name of the ADLS container from which the data should be read. |
ADLS Directory Path | Provide the directory path of the ADLS file system. |
File Filter | Provide a file pattern. The file filter includes only files with names matching the pattern, for example *.pdf or *emp*.csv. |
Recursive File Lookup | Check this option to retrieve files from the current folder and its sub-folders. |
Type of Data | Data can be fetched as CSV, TEXT, JSON, XML, Fixed Length, Avro, Binary, ORC, or Parquet files. |
Header Included | Check this option to include the header. |
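For orientation only, the sketch below shows how the ADLS fields above map onto a Spark-style read. It is not Gathr's implementation: the account name, container, key, and directory path are hypothetical placeholders, and the hadoop-azure connector is assumed to be on the classpath.

```python
# Hypothetical PySpark sketch of an ADLS Gen2 read (account-key authentication).
from pyspark.sql import SparkSession

account = "myaccount"          # Account Name (placeholder)
container = "mycontainer"      # Container (placeholder)
account_key = "<account-key>"  # Account Key (placeholder; do not hard-code in practice)

spark = (
    SparkSession.builder
    .appName("adls-dataset-sketch")
    # Authenticate against the storage account (Account Name + Account Key).
    .config(f"spark.hadoop.fs.azure.account.key.{account}.dfs.core.windows.net", account_key)
    .getOrCreate()
)

df = (
    spark.read
    .option("header", True)                # Header Included
    .option("pathGlobFilter", "*.csv")     # File Filter
    .option("recursiveFileLookup", True)   # Recursive File Lookup
    .csv(f"abfss://{container}@{account}.dfs.core.windows.net/input/dir")  # ADLS Directory Path
)
df.printSchema()
```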
Create a Dataset through HDFS
Enter values for the following properties.
Field | Description |
---|---|
Source | The source for which the dataset is to be created. Available options in the drop-down list are: ADLS, BigQuery, File System, GCS, HDFS, Hive, JDBC, S3, SFTP. Here, select HDFS. |
Connection Name | Connections are the service identifiers. Select the connection from which you would like to read the data from the available list of connections. |
Override Credential | Option to override the existing connection credentials. Unchecked by default. If checked, provide other valid credentials for the fields through which the HDFS data source connection is authenticated: Username, KeyTab Select Option, and KeyTab File Path. Make sure to provide valid credentials if the Override Credentials option is chosen. |
HDFS File Path | HDFS path from where the data is read. |
File Filter | Provide a file pattern. The file filter includes only files with names matching the pattern, for example *.pdf or *emp*.csv. |
Recursive File Lookup | Check this option to retrieve files from the current folder and its sub-folders. |
Type of Data | Data can be fetched as CSV, TEXT, JSON, XML, Fixed Length, Binary, Avro, or Parquet files. |
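As a rough illustration (not the product's internal code), the HDFS fields correspond to a read like the following PySpark sketch; the namenode address and path are assumptions.

```python
# Hypothetical PySpark sketch of an HDFS read using the fields above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-dataset-sketch").getOrCreate()

df = (
    spark.read
    .option("header", True)
    .option("pathGlobFilter", "*emp*.csv")       # File Filter
    .option("recursiveFileLookup", True)         # Recursive File Lookup
    .csv("hdfs://namenode:8020/data/employees")  # HDFS File Path (placeholder)
)
df.show(5)
```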
Create a Dataset through Hive
Enter values for the following properties.
Field | Description |
---|---|
Source | The source for which the dataset is to be created. Available options in the drop-down list are: ADLS, BigQuery, File System, GCS, HDFS, Hive, JDBC, S3, SFTP. Here, select Hive. |
Connection Name | Connections are the service identifiers. Select the connection from which you would like to read the data from the available list of connections. |
Override Credential | Option to override the existing connection credentials. Unchecked by default. If checked, provide other valid credentials for the fields through which the Hive data source connection is authenticated: Username, KeyTab Select Option, and KeyTab File Path. Make sure to provide valid credentials if the Override Credentials option is chosen. |
Query | Write a Hive-compatible SQL query to be executed in the component. |
Refresh Table Metadata | Check this option to refresh the table metadata or to sync updated DFS partition files with the metastore. |
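To picture the Query and Refresh Table Metadata fields, here is a hedged PySpark sketch; the database and table names are hypothetical and a configured Hive metastore is assumed.

```python
# Hypothetical sketch: run a Hive-compatible query and refresh table metadata.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-dataset-sketch")
    .enableHiveSupport()   # assumes a reachable Hive metastore
    .getOrCreate()
)

# Refresh Table Metadata: sync updated DFS partition files with the metastore.
spark.sql("MSCK REPAIR TABLE sales_db.orders")

# Query: a Hive-compatible SQL query executed in the component.
df = spark.sql("SELECT order_id, amount FROM sales_db.orders WHERE amount > 100")
df.show(5)
```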
Create a Dataset through JDBC
Create a dataset through JDBC by selecting JDBC as the source and providing the below mentioned properties.
Field | Description |
---|---|
Source | The source for which the dataset is to be created. Available options in the drop-down list are: ADLS, BigQuery, File System, GCS, HDFS, Hive, JDBC, S3, SFTP. Here, select JDBC. |
Connection Name | Connections are the service identifiers. Select the connection from which you would like to read the data from the available list of connections. |
Override Credential | Option to override the existing connection credentials. If checked, provide other valid credentials for the fields through which the JDBC data source connection is authenticated: Username and Password. Make sure to provide valid credentials if the Override Credentials option is chosen. |
Query | Write a JDBC-compatible SQL query to be executed. |
Enable Query Partitioning | Tables will be partitioned and loaded into RDDs if this checkbox is checked. This enables parallel reading of data from the table. |
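The sketch below is only an analogy for Enable Query Partitioning: Spark's JDBC reader splits a table into parallel partitions when a partition column and bounds are supplied. The URL, table, column, bounds, and credentials are hypothetical.

```python
# Hypothetical PySpark sketch of a partitioned JDBC read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-dataset-sketch").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/sales")  # from the selected connection
    .option("dbtable", "public.orders")
    .option("user", "report_user")       # Override Credential: Username
    .option("password", "<password>")    # Override Credential: Password
    # Enable Query Partitioning: read the table in parallel partitions.
    .option("partitionColumn", "order_id")
    .option("lowerBound", 1)
    .option("upperBound", 1000000)
    .option("numPartitions", 8)
    .load()
)
print(df.rdd.getNumPartitions())  # expected: 8
```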
Create a Dataset through S3
To create a dataset through the S3 source, provide values for the following properties.
Field | Description |
---|---|
Source | The source for which the dataset is to be created. Available options in the drop-down list are: ADLS, BigQuery, File System, GCS, HDFS, Hive, JDBC, S3, SFTP. Here, select S3. |
Connection Name | Connections are the service identifiers. Select the connection from which you would like to read the data from the available list of connections. |
Override Credential | Option to override the existing connection credentials. If checked, provide other valid credentials for the fields through which the S3 data source connection is authenticated: AWS KeyId and Secret Access Key. Make sure to provide valid credentials if the Override Credentials option is chosen. |
S3 Protocol | Protocols available are S3, S3n, and S3a. |
Bucket Name | Buckets are storage units used to store objects, which consist of data and metadata that describes the data. |
Path | File or directory path from where the data is to be read. |
File Filter | Provide a file pattern. The file filter includes only files with names matching the pattern, for example *.pdf or *emp*.csv. |
Recursive File Lookup | Check this option to retrieve files from the current folder and its sub-folders. |
Type of Data | Data can be fetched as CSV, TEXT, JSON, XML, Fixed Length, Binary, Parquet, ORC, or Avro files. |
Is Header Included | Check this option if a header is included in the CSV file. |
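For reference only, the S3 fields roughly map to a Spark read over the s3a protocol, as in the sketch below; the bucket, path, and keys are placeholders, and the hadoop-aws package is assumed to be available.

```python
# Hypothetical PySpark sketch of an S3 read (s3a protocol).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-dataset-sketch")
    .config("spark.hadoop.fs.s3a.access.key", "<aws-key-id>")         # AWS KeyId
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-access-key>")  # Secret Access Key
    .getOrCreate()
)

df = (
    spark.read
    .option("header", True)                # Is Header Included
    .option("pathGlobFilter", "*.csv")     # File Filter
    .option("recursiveFileLookup", True)   # Recursive File Lookup
    .csv("s3a://my-bucket/input/")         # S3 Protocol + Bucket Name + Path
)
df.printSchema()
```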
Create a Dataset through File System
Create a dataset through the File System option. Upload a file and select the Type of Data.
Create a Dataset through SFTP
Create a dataset through the SFTP option.
Field | Description |
---|---|
Source | The source for which the dataset is to be created. Available options in the drop-down list are: ADLS, BigQuery, File System, GCS, HDFS, Hive, JDBC, S3, SFTP. Here, select SFTP. |
Connection Name | Connections are the service identifiers. Select the connection from which you would like to read the data from the available list of connections. |
Override Credential | Check this option to override the existing connection credentials. |
Username | Provide the username that has access to the data. |
Password | Provide the user password. |
Select PEM Options | Specify PEM file path or upload a PEM file. |
PEM File Location | If you specify the PEM file path then provide the PEM File Location. |
PEM Password | Provide PEM password. |
File Path | Provide the file path of the SFTP file system. |
Incremental Read | Select the checkbox to read data incrementally. In case of a folder, the latest file is read. |
Parallelism | Number of parallel threads used to download files from SFTP. |
Is Compressed | Select this option for compressed file formats: *.zip, *.tar, *.tar.gz. |
File Filter | Provide a file pattern. The file filter includes only files with names matching the pattern, for example *.pdf or *emp*.csv. |
Recursive File Lookup | Check this option to retrieve files from the current folder and its sub-folders. |
Type of Data | Data can be fetched as CSV, TEXT, JSON, XML, Fixed Length, or Binary files. |
Header Included? | Option available if CSV is selected as Type of Data. |
You can optionally add further configurations as key-value pairs by clicking the ADD CONFIGURATION button.
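Purely as an illustration of what the SFTP fields control (and not Gathr's actual mechanism), the sketch below uses the paramiko library to connect with a username and password, apply a file-name filter, and download matching files; the host, paths, and credentials are hypothetical.

```python
# Hypothetical SFTP download sketch: Username/Password, File Path, File Filter.
import fnmatch
import paramiko

host, port = "sftp.example.com", 22          # from the selected connection
transport = paramiko.Transport((host, port))
transport.connect(username="data_user", password="<password>")  # Username / Password
sftp = paramiko.SFTPClient.from_transport(transport)

remote_dir = "/exports/daily"                # File Path
pattern = "*emp*.csv"                        # File Filter

for name in sftp.listdir(remote_dir):
    if fnmatch.fnmatch(name, pattern):       # include only matching file names
        sftp.get(f"{remote_dir}/{name}", f"/tmp/{name}")

sftp.close()
transport.close()
```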
Create a Dataset through GCS
Create a dataset by selecting GCS Source:
Field | Description |
---|---|
Source | The source for which the dataset is to be created. Available options in the drop-down list are: ADLS, BigQuery, File System, GCS, HDFS, Hive, JDBC, S3, SFTP. Here, select GCS. |
Connection Name | Connections are the service identifiers. Select the connection from which you would like to read the data from the available list of connections. |
Override Credential | Check this option to override the existing connection credentials. |
Service Account Key File | Upload the GCP service account key file. |
Bucket Name | Provide the Google Cloud Storage bucket name. |
Path | Select the metadata file. End the path with * in case of a directory, for example: outdir.* |
File Filter | Provide a file pattern. The file filter includes only files with names matching the pattern, for example *.pdf or *emp*.csv. |
Recursive File Lookup | Check this option to retrieve files from the current folder and its sub-folders. |
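As a hedged illustration only, the GCS fields roughly correspond to a Spark read like the sketch below. The GCS connector jar is assumed to be on the classpath, the exact connector property names may vary by connector version, and the bucket, path, and key-file location are placeholders.

```python
# Hypothetical PySpark sketch of a GCS read with a service account key file.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gcs-dataset-sketch")
    # Service Account Key File (Hadoop configs used by the GCS connector)
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/service-account.json")
    .getOrCreate()
)

df = (
    spark.read
    .option("pathGlobFilter", "*.csv")     # File Filter
    .option("recursiveFileLookup", True)   # Recursive File Lookup
    .csv("gs://my-bucket/outdir/")         # Bucket Name + Path
)
df.show(5)
```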
Create a Dataset through BigQuery
Detect Schema
After the source is configured, the data from the source is represented as a schema. This process is called Detect Schema. The schema is divided into columns with Column Name, Column Alias, Data Type, and Sample Values.
Column Alias and Data Type are editable on this page.
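Conceptually, detecting the schema resembles letting Spark infer column names and data types from the data and then listing them, as in this hedged sketch; the input file is hypothetical.

```python
# Sketch of schema detection: infer column names and types, then list them.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("detect-schema-sketch").getOrCreate()

df = (
    spark.read
    .option("header", True)       # use the file header as column names
    .option("inferSchema", True)  # detect data types from sample values
    .csv("/tmp/products.csv")     # hypothetical input file
)

# Column Name and Data Type as detected; the alias defaults to the column name.
for name, dtype in df.dtypes:
    print(name, dtype)
```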
Once the schema is detected, you can click Create at the top right corner of the UI, as shown in the image above. You will be notified that the dataset has been successfully created.
As soon as you click Create, you are navigated to the Data window under Explore (View Dataset page), where the schema is listed in the RULES window.
Rules
The Rules displayed under the DATA window show all the records of the dataset, as shown below.
The above screen is divided into these sections:
Columns of Data
Actions on Columns
Rules
Analyze
Statistics
Unique Values
Let us go through each section in detail.
Columns
A column is divided into four parts. Every column has the following sections, and you can sort the data using the operations described below.
Data Type:
Supported data types are:
Date and Timestamp
Numeric
String
Boolean
For example, in the image shown above, the String data type is denoted by abc.
Operations
This option is at the top right of the column: a gear icon that opens a slide window with the following operations, which can be performed on the respective column of the schema.
Whenever any operation is applied on a column (or columns), a set of rules is created and reflected in the Rules section on the right.
a. Filter the values
Filter the values based on any of the shown filters, such as Equals, defining a range with Range, and so on. The Custom filter can be used for a custom value.
b. Transform
The Transform filter can be applied to transform and filter the data based on Uppercase, Lowercase, Trim, and so on, as shown in the image below.
c. Missing Value Replacement:
Replace the missing or null values with either the Literal or Expression value.
Literal: Replaces Null and/or empty string values with a specified string literal
Expression: Replaces Null and/or empty string values with a specified expression value.
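The Literal and Expression modes can be pictured with ordinary Spark column expressions, as in the hedged sketch below; the columns and replacement values are hypothetical.

```python
# Sketch of Literal vs. Expression replacement on hypothetical columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("missing-value-sketch").getOrCreate()
df = spark.createDataFrame(
    [("A", None, 10.0), ("B", "", None), ("C", "Pune", 7.5)],
    ["id", "city", "price"],
)

# Literal: replace null/empty string values with a fixed string literal.
df = df.withColumn(
    "city",
    F.when(F.col("city").isNull() | (F.col("city") == ""), F.lit("UNKNOWN"))
     .otherwise(F.col("city")),
)

# Expression: replace nulls with a computed value (here, the column mean).
mean_price = df.agg(F.avg("price")).first()[0]
df = df.withColumn("price", F.coalesce(F.col("price"), F.lit(mean_price)))

df.show()
```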
d. Pivot
Pivot the columns, where PIVOT is a relational operator that converts data from row level to column level.
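To illustrate the row-to-column conversion, here is a minimal sketch with hypothetical data and column names.

```python
# Sketch of the pivot operation: monthly sales rows become one column per month.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pivot-sketch").getOrCreate()
sales = spark.createDataFrame(
    [("North", "Jan", 100), ("North", "Feb", 120), ("South", "Jan", 90)],
    ["region", "month", "amount"],
)

# Row-level 'month' values are turned into columns, one per distinct month.
pivoted = sales.groupBy("region").pivot("month").agg(F.sum("amount"))
pivoted.show()
```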
e. Group By
Use this filter to group the columns together.
f. Rename Column
Rename the column name.
g. Create New Column
Create a new column using this filter.
h. Remove Column
Remove the selected column.
Sort
Along with Filters, you can also sort the columns Alphabetically, from A to Z or Z to A.
Actions on Columns
The actions that can be performed on all the columns together are as follows:
Create Column: Create a column using this icon.
Keep/Remove Columns: Keep or remove a column in the schema.
Display Columns: Display the selected columns in the schema. (Note that the hidden columns will still be a part of the schema, just not displayed.)
Toggle Rules: Toggle the list of the displayed rules.
Search Value: Search for a value in the schema.
Rules
Rules are conditions/actions applied on the columns to modify the dataset as per your requirements. You can view the applied rules in the right navigation panel.
When a dataset is first created, its version is 0.
Each time you create a rule and save the changes, the version increments by one (n+1).
Once you have applied the changes to the rules, click the Save icon in the Rules navigation bar to save them.
To edit an existing dataset, open it and click the Explore button; you can then configure new rules or reconfigure existing ones.
For example, in the below image, you can see the two rules applied with two sub-sections: Analyze and Unique Values.
The two applied rules are:
Selling Price less than 5
Profit Percentage greater than 1
These rules can be deleted by clicking on the cross next to each rule.
All the rules can be deleted by clicking on the Delete icon above Rules.
As per the rules applied to the Columns, the Analyze window and Unique Values window also change their values.
Under Analyze, you can view the Null values of the column on which the last rule was applied.
Under Statistics, you can view the following mathematical statistics for the entire column (a computation sketch follows the list):
Minimum
Maximum
Mean
Median
Standard Deviation
Mode
Distinct
Sum
Range
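The statistics above correspond to standard aggregate computations; the hedged sketch below computes them for a hypothetical numeric column.

```python
# Sketch of the column statistics listed above on a hypothetical 'price' column.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("statistics-sketch").getOrCreate()
df = spark.createDataFrame([(1, 4.0), (2, 7.0), (3, 7.0), (4, 10.0)], ["id", "price"])

stats = df.agg(
    F.min("price").alias("minimum"),
    F.max("price").alias("maximum"),
    F.mean("price").alias("mean"),
    F.expr("percentile_approx(price, 0.5)").alias("median"),
    F.stddev("price").alias("standard_deviation"),
    F.countDistinct("price").alias("distinct"),
    F.sum("price").alias("sum"),
    (F.max("price") - F.min("price")).alias("range"),
)
stats.show()

# Mode: the most frequent value in the column.
mode = df.groupBy("price").count().orderBy(F.desc("count")).first()["price"]
print("mode:", mode)
```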
Under Unique Values, you can see the unique values in the column; hovering your mouse over a bar shows the count of that value.
You can also sort the counts in descending, ascending, or alphabetical order.
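As a rough analogy for this view, the sketch below counts each distinct value of a hypothetical column and sorts the result both ways.

```python
# Sketch of the Unique Values view: distinct values with their counts.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unique-values-sketch").getOrCreate()
df = spark.createDataFrame(
    [("Pune",), ("Delhi",), ("Pune",), ("Mumbai",)], ["city"]
)

unique_counts = df.groupBy("city").count()
unique_counts.orderBy(F.desc("count")).show()  # sort by count, descending
unique_counts.orderBy("city").show()           # sort alphabetically
```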
After applying the changed rules, a new version of the Dataset will be created.
- The data from HDFS will be fetched in all cases by using Fetch From Source (hidden property).