This topic presents the side navigation panel, referred to as the main menu, and its features, which can be used to perform several administrative tasks in Gathr. An illustration of the main menu is given below, and the tasks that can be performed with these features are explained in detail further on.
Note: The main menu is only displayed for the Superuser (System Admin) login.
Gathr provides multi-tenancy support through Workspaces.
A superuser can create any number of workspaces and add users to them. One user can be mapped to multiple workspaces.
Below are the steps to create a workspace.
1. Go to Manage Workspaces and click Create New Workspace (the plus sign in the top-right corner).
2. Enter the details in the tabs that follow.
While creating the workspace, configure the Databricks token details and credentials.
For EMR, you can configure access using AWS Keys or an Instance Profile.
Configure EMR cluster using AWS Keys: Select AWS Keys from the drop-down if you want to configure the EMR cluster using an AWS Key ID and Secret Access Key. These keys will be used for all communication with AWS.
Configure EMR cluster using the Instance Profile option:
Note: User creation also depends on Databricks and EMR access. Some users may not have access to create Databricks users; for them, the Databricks tab will not be accessible, and the same applies to EMR users.
3. Click Create to save the changes. The new workspace will be listed on the Manage Workspaces page.
To enter a workspace, click on the enter icon.
Note: There is no provision to edit or delete a workspace.
Once the user enters a workspace, the Workspace landing page displays components similar to those explained earlier in the Getting Started topic.
To know more about the Workspace menu, see Projects, Manage Workspace Connections and Manage Workspace Users.
To navigate to the Workspace Connections page, the user can click the Connections option available in the workspace menu.
The user can create new connections at the workspace level in the same manner as explained in the Manage Superuser Connections topic. To know more, see Manage Connections.
- Unique names must be used when creating new connections of the same component type inside a workspace. The user will be notified in the UI if the specified connection name already exists.
- The visibility of default connections and of connections created by the Superuser at any workspace level is controlled by the Superuser.
- The connections created in a workspace can be differentiated by the workspace name in the list. Superuser-created connections will appear in the list with the workspace name shown as Superuser.
- Connections listed in a workspace can be used to configure features like Datasets, Pipelines, Applications, Data Validations, Import Export Entities, and Register Entities inside a Project. While using connections for the features listed above, superuser connections can be differentiated from other workspace-created connections by the workspace name.
- Connections are not visible and cannot be consumed outside of the workspace in which they are created.
A developer user can be created for a Workspace.
A developer user can perform unrestricted operations within a workspace, such as the operations of a DevOps role along with pipeline creation, updating, and deletion.
To create a new Workspace User, go to Manage Users and select Create New User.
Following are the properties for the same:
- Only a Developer user can be created from within the workspace.
- Enter a username that will be used to log in to the application.
- Enter an email ID that will be used for any communication with the user.
Also, there will be options to configure AWS Databricks and EMR as explained earlier.
Click the workspace icon in the upper-right corner of the page to view a drop-down list of the workspaces. Choose the workspace you wish to enter from the list.
NOTE: There is no provision to delete any workspace.
The user can manage AWS Databricks and EMR clusters with this option.
All the existing clusters for Databricks and EMR will be listed on the Cluster List View page.
The user has options to create interactive clusters; perform actions like start, refresh, edit, and delete; view logs; and redirect to the Spark UI.
Databricks Interactive Clusters
Given below is an illustration of the Databricks Interactive Clusters page followed by the steps to create a cluster.
Click CREATE CLUSTER to create a Databricks cluster. Provide the fields below:
Given below is an illustration of the Databricks Job Clusters page.
The user has options to create long-running clusters; fetch clusters from AWS; perform actions like start, edit, and delete; view logs; and redirect to the Spark UI.
Given below is an illustration of the EMR Long Running Clusters page.
Given below is an illustration of the EMR Job Clusters page.
Click CREATE CLUSTER to create a fresh cluster, and provide the fields below for creating the new cluster.
Click the FETCH CLUSTER FROM AWS option to fetch an existing cluster by selecting its cluster ID from the drop-down list.
To navigate to the Workspace Connections page, the user can click the Connections option available in the workspace menu.
Users with the privilege to create connections can create new connections at the workspace level.
- Unique names must be used when creating new connections of the same component type inside a workspace. The user will be notified in the UI if the specified connection name already exists.
- The visibility of default connections and of connections created by the Superuser at any workspace level is controlled by the Superuser.
- The connections created in a workspace can be differentiated by the workspace and owner name in the list. Superuser-created connections will appear in the list with the workspace and owner name shown as Superuser.
- Connections listed in a workspace can be used to configure features like Datasets, Pipelines, Applications, Data Validations, Import Export Entities, and Register Entities inside a Project. While using connections for the features listed above, superuser connections can be differentiated from other workspace-created connections by the suffix “global”, which appears after the connection name.
- Connections are not visible and cannot be consumed outside of the workspace in which they are created.
This option will only be visible if the containerEnabled property is enabled in the Sandbox configuration settings.
The user can register a desired cluster by using the Register Cluster option. This can be done either by uploading a config file with a valid embedded certificate, or by uploading the config file and certificates separately during the registration process. Once registered, the cluster can be utilized across all workspaces while configuring a sandbox.
Currently, only Kubernetes clusters can be registered on Gathr.
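For reference, a Kubernetes config file with embedded certificates follows the standard kubeconfig format. The sketch below is an illustration only; the cluster name, endpoint, and (truncated) base64 certificate data are placeholders, not values prescribed by Gathr.

    apiVersion: v1
    kind: Config
    clusters:
      - name: demo-cluster                              # placeholder cluster name
        cluster:
          server: https://203.0.113.10:6443             # example API server endpoint
          certificate-authority-data: LS0tLS1CRUdJTi... # base64 CA certificate (truncated)
    users:
      - name: demo-user                                 # placeholder user entry
        user:
          client-certificate-data: LS0tLS1CRUdJTi...    # base64 client certificate (truncated)
          client-key-data: LS0tLS1CRUdJTi...            # base64 client key (truncated)
    contexts:
      - name: demo-context
        context:
          cluster: demo-cluster
          user: demo-user
    current-context: demo-context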
The existing clusters will be listed on the Cluster Configuration listing page, along with timestamp information indicating how long each cluster has been up. The user can Edit/Unregister the registered cluster(s) information.
The user can register a cluster by clicking the + icon at the top right.
Configure the cluster by providing the following details:
The user can TEST the cluster configuration and SAVE.
Upon successful registration, the registered cluster will get added in the listing page.
The option to register Container Images within Gathr is provided in the main menu as well as the workspace menu. It will only be visible if the containerEnabled property is enabled in the Sandbox configuration settings.
When a user registers a container image, it becomes visible as a drop-down option on the sandbox configuration page inside a project. These container images (sandboxes) can be launched on the preferred container platform (for example, Kubernetes) to access the integrated development environment of the user's choice (for example, Jupyter Lab, Visual Studio Code, Custom, or Default) on the sandbox.
The Default IDE option will only be visible when the Register Container Image option is accessed by the superuser via the main menu.
The container images registered from the main menu by the superuser can be utilized across all workspaces, whereas the container images registered from the workspace menu remain private to the specific workspace where they are registered.
Registered Container Images Listing
The container images that are registered will appear on the Registered Images page.
The information and actions displayed for the listed Container Images are explained below:
Image URI: the URI registered on the container registry and accessible to the cluster. The user can Edit/Unregister the registered container image(s).
Steps to Register Container Image
The user can register a container image by clicking the + icon at the top right.
Configure the container image by providing the following details:
Consider the below points for YAML file upload:
• Upload a file with the .zip extension.
• It should directly contain the valid YAML files.
• Use the below expressions to populate YAML fields at runtime during sandbox configuration:
"@{<kind>:<field path>}" - The expression used to refer to the specified field from any other YAML file.
Example: In the "@{deployment:metadata.name}" expression, the first part, "deployment", is the kind (i.e., the type of YAML) and the next part, "metadata.name", is the field that is to be fetched from the specified YAML type.
${value:"<default-value>",label:"<field label>"} - The expression used to display a dynamic field label along with a default value, which is editable.
${value:"sandbox-<<UUID>>",label:"Enter Sandbox Name"}
Field label will be: Enter Sandbox Name, and the default value will be, for example: sandbox-A123 (where A123 is the generated unique ID).
"<<UUID>>" - This expression is used to generate a unique ID for a specific field.
In the above YAML configuration snippet, the BASE_PATH will always have a unique value generated via the "/<<UUID>>" expression.
Click REGISTER to complete the process. The registered image will appear in the listing page.
The SETUP section is defined in the Installation Guide. This section defines the properties of Cluster Configuration, Gathr Settings, Database, Messaging Queue, Elasticsearch, Cassandra, and Version Control.
Under SETUP, the user has an option to manage Version Control.
The Configuration page enables configuration of Gathr properties.
Note: Some of the properties shown are not applicable to the Multi-Cloud version of Gathr. These properties are marked with **.
Each sub-category contains configuration in key-value pairs. You can update multiple property values in a single shot.
Update the values that you want, then scroll down to the bottom and click the Save button.
You will be notified with a successful update message.
The search operation finds a property key or property value. You can search by using partial words of key labels, key names, or key values.
The above figure shows the matching configuration values and count for the searched keyword “url”.
Hover the mouse over a property label to see a box with the fully qualified name of the key, and click the i button for its description.
Copy the fully qualified name of a property key by clicking on the key's label as shown below.
The key name will be copied to clipboard.
Gathr configuration settings are divided into various categories and sub-categories according to the component and technology.
Configuration properties related to the application server, i.e., Gathr Web Studio. This category is further divided into various sub-categories.
The type of database on which the Gathr database is created. Possible values are MySQL, PostgreSQL, and Oracle.
The comma-separated list of <IP>:<PORT> of all nodes in the ZooKeeper cluster where configuration will be stored.
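For example, a three-node ZooKeeper ensemble might be specified as 10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 (the addresses are illustrative; 2181 is ZooKeeper's default client port).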
Searching without specifying column names takes extra space and time. Indexed data older than the mentioned time (in seconds from the current time) will not be fetched.
Security
Configuration properties related to application processing engines come under this category. This category is further divided into two sub-categories.
Configuration properties related to messaging brokers come under this category. This category is further divided into three sub-categories.
Configuration properties related to NoSQL databases come under this category. This category is further divided into two sub-categories:
Configuration properties related to search engines come under this category. This category is further divided into two sub-categories:
Configuration properties related to metric servers come under this category. This category is further divided into various sub-categories.
Configuration properties related to Hadoop come under this category. This category is further divided into various sub-categories.
The file system URI, for example: hdfs://hostname:port, hdfs://nameservice, file://, maprfs://clustername. The name of the user through which the Hadoop service is running.
Miscellaneous configuration properties of the Web Studio. This category is further divided into various sub-categories.
You can add extra Java options for any Spark superuser pipeline in the following way:
Log in as Superuser, click Data Pipeline, and edit any pipeline.
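As an illustration only (the exact field in the pipeline editor where these values are entered is not shown here), extra Java options typically take the form of standard JVM flags; the log4j configuration path below is a placeholder:

    -XX:+UseG1GC -Duser.timezone=UTC -Dlog4j.configuration=file:/opt/conf/log4j.properties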
Once Kerberos is enabled, go to Superuser UI > Configuration > Environment > Kerberos to configure Kerberos.
Configure Kerberos in Components
Go to Superuser UI > Connections and edit the component connection settings as explained below:
By default, Kerberos security is configured for these components: Solr, Kafka, and Zookeeper. No manual configuration is required.
Note: For Solr, Kafka, and Zookeeper, security is configured by providing principals and keytab paths in keytab_login.conf. This file then needs to be placed in the StreamAnalytix/conf/common/kerberos and StreamAnalytix/conf/thirdpartylib folders.
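Assuming keytab_login.conf follows the standard JAAS login configuration format, its entries might look like the sketch below; the section names, principals, and keytab paths are placeholders to be replaced with environment-specific values.

    Client {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        storeKey=true
        keyTab="/etc/security/keytabs/zookeeper.service.keytab"
        principal="zookeeper/host.example.com@EXAMPLE.COM";
    };
    KafkaClient {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        storeKey=true
        keyTab="/etc/security/keytabs/kafka.service.keytab"
        principal="kafka/host.example.com@EXAMPLE.COM";
    };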
The HDFS connection name used to connect to HDFS (from the Gathr connection tab). The URL contains the IP address and port where the Jupyter services are running.
All default or shared configuration properties come under this category. This category is further divided into various sub-categories.
Defines the maximum number of retries for the RabbitMQ connection. Defines the RabbitMQ exchange name for real-time alert data.
The URL of the FTP service used to create the FTP directory for the logged-in user (required only for the cloud trial).
Audit
Others
Connections allow Gathr to connect to services like Elasticsearch, JDBC, Kafka, RabbitMQ, and many more. A user can create connections to various services and store them in the Gathr application. These connections can then be used while configuring features of Gathr that require service connection details, for example, Data Pipelines, Datasets, and Applications.
To navigate to the Superuser Connections page, the user can click the Connections option available in the Gathr main menu.
The default connections are available out of the box once you install the application. All the default connections except RabbitMQ are editable.
The user can use these connections or create new connections.
A superuser can create new connections using the Connections tab. To add a new connection, follow the below steps:
Select the component from the drop-down list for which you wish to create a connection.
For creating an ADLS connection, select ADLS from the Component Type drop-down list and provide connection details as explained below:
For creating an AWS IoT connection, select AWS IoT from the Component Type drop-down list and provide connection details as explained below. The component type list shows all the available connections; select the AWS IoT component type from the list. The AWS Key is the credential used to connect to the AWS console.
For creating an Azure Blob connection, select Azure Blob from the Component Type drop-down list and provide connection details as explained below:
For creating a Cassandra connection, select Cassandra from the Component Type drop-down list and provide connection details as explained below:
For creating a Cosmos connection, select Cosmos from the Component Type drop-down list and provide connection details as explained below:
For creating a Couchbase connection, select Couchbase from the Component Type drop-down list and provide connection details as explained below:
For creating an Elasticsearch connection, select Elasticsearch from the Component Type drop-down list and provide connection details as explained below.
For creating a GCP connection, select GCP from the Component Type drop-down list and provide connection details as explained below.
Note: The user can add further configuration.
Click Create to create the GCP connection.
Now, once the data pipeline is created, the user is required to configure the job.
For creating an Hbase connection, select Hbase from the Component Type drop-down list and provide connection details as explained below.
For creating an HDFS connection, select HDFS from the Component Type drop-down list and provide connection details as explained below.
For creating a HIVE Emitter connection, select HIVE Emitter from the Component Type drop-down list and provide connection details as explained below.
** Properties marked with two asterisks (**) are present only in the HDP 3.1.0 environment.
The value of the Hive Server2 URL will be the value of the HiveServer2 Interactive JDBC URL (given in the screenshot). In the HDP 3.1.0 deployment, this is an additional property:
HiveServer2 Interactive JDBC URL: The value is as mentioned below:
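The actual value comes from the Hive/HDP configuration of your environment; as a hedged example only, in HDP 3.x deployments the HiveServer2 Interactive JDBC URL commonly uses ZooKeeper service discovery and looks similar to the following (hostnames are placeholders):

    jdbc:hive2://zk-host1:2181,zk-host2:2181,zk-host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive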
For creating a JDBC connection, select JDBC from the Component Type drop-down list and provide connection details as explained below.
Note: The JDBC driver jar must be on the classpath while running a pipeline with a JDBC emitter or while testing a JDBC connection.
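As a minimal illustration only (not Gathr's documented mechanism), a Spark job run outside the platform would typically receive such a driver via the --jars option; within Gathr, the equivalent requirement is simply that the jar is present on the pipeline's classpath. The path and driver version below are placeholders:

    spark-submit --jars /path/to/mysql-connector-java-8.0.33.jar <application>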
For creating a Kafka connection, select Kafka from the Component Type drop-down list and provide connection details as explained below.
For creating a Kinesis connection, select Kinesis from the Component Type drop-down list and provide other details required for creating the connection.
Users can also choose to authenticate Kinesis connections using the Instance Profile option.
For creating a KUDU connection, select KUDU from the Component Type drop-down list and provide other details required for creating the connection.
For creating a Mongo DB connection, select Mongo DB from the Component Type drop-down list and provide the details required for creating the connection.
For creating an MQTT connection, select MQTT from the Component Type drop-down list and provide other details required for creating the connection.
For creating an OpenJMS connection, select OpenJMS from the Component Type drop-down list and provide other details required for creating the connection.
For creating a RedShift connection, select RedShift from the Component Type drop-down list and provide other details required for creating the connection.
For creating a RabbitMQ connection, select RabbitMQ from the Component Type drop-down list and provide connection details as explained below.
For creating a DBFS connection, select DBFS from the Component Type drop-down list and provide other details required for creating the connection.
For creating an RDS connection, select RDS from the Component Type drop-down list and provide other details required for creating the connection.
For creating an S3 connection, select S3 from the Component Type drop-down list and provide other details required for creating the connection.
For creating an SQS connection, select SQS from the Component Type drop-down list and provide other details required for creating the connection.
For creating a Salesforce connection, select Salesforce from the Component Type drop-down list and provide other details required for creating the connection.
For creating a Socket connection, select Socket from the Component Type drop-down list and provide connection details as explained below.
For creating a Solr connection, select Solr from the Component Type drop-down list and provide connection details as explained below.
For creating a Tibco connection, select Tibco from the Component Type drop-down list and provide connection details as explained below.
For creating a Twitter connection, select Twitter from the Component Type drop-down list and provide connection details as explained below.
For creating a Vertica connection, select Vertica from the Component Type drop-down list and provide connection details as explained below.
On updating a default connection, its respective configuration also gets updated.
The reverse is also possible: updating a configuration can auto-update the corresponding connection.
If you update any component's configuration property from the Configuration page, then the component's default connection will also be auto-updated.
For example, updating the RabbitMQ host URL configuration will auto-update the RabbitMQ default connection.
This widget shows system alerts with a brief description and timestamp. You can also check the generated alerts on the UI along with email notifications.
From the top right drop-down arrow, select System.
This widget shows the alerts generated by a pipeline when it goes into error mode or is killed from YARN.
System alerts are of two types:
Pipeline Stopped Alerts: Alerts thrown when a pipeline is killed from YARN.
Error Mode Alerts: Alerts thrown when the pipeline goes into error mode.
You can apply alerts on a streaming pipeline as well. You will see the description of the alert and its timestamp in this widget. The alert can have a customized description.
Audit Trail captures and presents all important activities and events in the platform for auditing. For more details about this feature, see Audit Trail.
User roles determine the level of permissions assigned to a user to perform a group of tasks.
Superuser credentials are provided to the user at the time of deployment and purchase.
Within a workspace, go to Manage Users and select Create New User.
Following are the properties for the Create User window:
Once the user fills in the required parameters on the User Details tab, they also have options to authenticate and connect to Databricks and EMR clusters.
Databricks authentication can be done using either an existing token or a new token.
EMR authentication can be done using either AWS Keys or the Instance Profile option.
A Developer can perform unrestricted operations within a workspace, such as the operations of a DevOps role along with pipeline creation, updating, and deletion.
NOTE: You cannot assign the Developer role using Manage Users; the Developer role can only be assigned via Superuser > Manage Workspace > Manage Users.
To create a DevOps user, log in as a superuser, go to Manage Users, and click Create New User.
To create a Tier-II user, log in as a superuser, go to Manage Users, and click Create New User.
In the User Role parameter, select Tier-II from the drop-down and proceed to create the user in the same way as explained for the DevOps user in the topic above.
A Tier-II user cannot perform the following operations:
• Create, update, delete, play, pause, and stop pipelines.
• Access the group, message, and alerts configuration.
When you edit a user, you also have the option to dissociate the user from the use of Databricks and EMR. Click Databricks or EMR and then click Revoke Permission.
The user permissions may vary for each role according to the authentication and authorization done in the Security configurations tab.
LDAP Authentication and Authorization
The table below specifies the user permissions for the applicable user roles in Gathr functionality.
To change the outline of any existing connection, component, or pipeline, a developer has to manually edit the JSON files residing in the /conf/common/template directory of the Gathr bundle. Templates allow you to update these from the UI. You can create many versions of them and switch to any desired version at any point. The changes in the outline of that component will be reflected immediately.
The Components tab allows you to edit the JSON and view the type of component for the Spark engine.
When you edit any component, the Version, Date Modified, and Comments added are viewable.
The Connection tab allows you to edit the JSON and create as many versions as required.
When you edit any connection, the Version, Date Modified, and Comments added are viewable.
Gathr Credit Points Consumption calculates credits based on the number of cores used in running jobs. It is a costing model that charges customers based on their usage.
The number of cores used is captured every minute for every running job. Currently, the number of cores used is calculated for two types of clusters: EMR and Databricks.
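The conversion from captured core usage to credit points is not detailed here; as a rough sketch only, assuming credits are proportional to core-minutes, the per-job aggregation could look like the following Python snippet (the RATE value is a made-up placeholder, not Gathr's pricing):

    # Hypothetical sketch: aggregate per-minute core samples into core-minutes,
    # then apply an assumed credits-per-core-minute rate. Not Gathr's actual formula.
    samples = [8, 8, 8, 16, 16]      # cores observed for one job, one sample per minute
    core_minutes = sum(samples)      # 56 core-minutes in this example
    RATE = 0.01                      # assumed credits per core-minute (placeholder)
    credits = core_minutes * RATE    # 0.56 credits under this assumption
    print(core_minutes, credits)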