Data Science

At times you may need to derive insights from structured or unstructured data. Data science lets you interpret large volumes of data and extract meaningful information to support decision-making.

Data science is a field that comprises everything related to data cleansing, preparation, and analysis. It is the method used to extract insights and information from data.

Gathr incorporates data science techniques for deriving meaningful information from data. Using Gathr's Machine Learning (ML) processors, you can train and score models on streaming and batch data.

Making predictions on real-time data streams or a batch of data involves building an offline model and applying it to a stream. Models incorporate one or more machine learning algorithms trained using the collected data.
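The train-offline, score-online pattern described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how Gathr implements its processors: the function names `train_linear_model` and `score_stream` are hypothetical, a one-variable least-squares fit stands in for a real algorithm, and a generator stands in for a real data stream.

```python
from typing import Iterable, Iterator, List, Tuple


def train_linear_model(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares on batch data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def score_stream(
    model: Tuple[float, float], stream: Iterable[float]
) -> Iterator[Tuple[float, float]]:
    """Apply the already-trained model to records one at a time as they arrive."""
    slope, intercept = model
    for x in stream:
        yield x, slope * x + intercept


# Offline step: train on collected (historical) data.
model = train_linear_model([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Online step: score incoming records; a generator stands in for a real stream.
incoming = (x for x in [5.0, 10.0])
predictions = list(score_stream(model, incoming))
```

The key design point is that the model parameters are fixed before scoring starts, so each incoming record can be scored independently without revisiting the training data.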

Machine Learning Processors

ML Analytics processors enable you to use predictive models built on top of the ML package.

ML provides a higher-level API built on data frames that helps you create and tune practical machine learning pipelines.

This topic is divided into the following sections:

  • Model Training

  • Model Scoring (Prediction)

Model Training

Models can be trained through Gathr with the help of ML processors. These models will be built on top of an ML package.

You can connect multiple models, using the same or different algorithms, and train them on the same message in a single pipeline.

When multiple analytics processors are trained in one pipeline, intermediate columns calculated through transformations in one analytics processor are not available to the next analytics processor.

Algorithms

In Gathr, there are eight algorithms under ML that support Model Training and Scoring.

Refer to each algorithm for its configuration details.

Configuring any of these models follows a wizard-like flow, as shown in the figure below:

[Figure: Models_ML_Flow]

  • Configuration Section

  • Feature Selection

  • Pre-Processing

  • Post-Processing

  • Model Evaluation

  • Hyper Parameters

Model Scoring (Prediction)

Once a model is trained in a training pipeline, it is registered and can be used for scoring in any pipeline.

To use a trained model for scoring in a pipeline, drag and drop the analytics processor into the pipeline and change the processor's mode from training to prediction.
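As a rough mental model of the training and prediction modes, the sketch below uses a hypothetical in-memory registry keyed by model name. None of these names (`MODEL_REGISTRY`, `analytics_processor`, `demo_model`) are Gathr APIs, and the "model" is a toy that always predicts the mean training label; the sketch only illustrates how one processor can register a model under a name in one mode and look it up by that name in the other.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

# Hypothetical in-memory registry keyed by model name (not a Gathr API).
MODEL_REGISTRY: Dict[str, Callable[[float], float]] = {}


def analytics_processor(
    operation: str, model_name: str, rows: List[dict]
) -> Optional[List[dict]]:
    """Run a toy processor in either 'training' or 'prediction' mode."""
    if operation == "training":
        # Toy "model": always predict the mean label seen during training.
        avg = mean(row["label"] for row in rows)
        MODEL_REGISTRY[model_name] = lambda _feature: avg
        return None
    if operation == "prediction":
        model = MODEL_REGISTRY[model_name]  # raises KeyError if never trained
        return [{**row, "prediction": model(row["feature"])} for row in rows]
    raise ValueError(f"unknown operation: {operation}")


# Training pipeline: registers the model under its name.
analytics_processor(
    "training",
    "demo_model",
    [{"feature": 1.0, "label": 10.0}, {"feature": 2.0, "label": 20.0}],
)

# Scoring pipeline: same processor, switched to prediction mode.
scored = analytics_processor("prediction", "demo_model", [{"feature": 3.0}])
```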

| Field Name | Description |
| --- | --- |
| Operation | Type of operation to be performed by the operator. It can be either Training or Prediction. |
| Algorithm Type | Select the model class: Regression or Classification. |
| Model Name | Name of the model to be created when training mode is active, or the name of the model to be used for prediction when prediction mode is active. |
| Detect Anomalies | Select to detect anomalies in the input data. |
| Anomaly Threshold | Threshold distance between a data point and a centroid. If an input data point's distance to its nearest centroid exceeds this value, the data point is considered an anomaly. |
| Is Anomaly Variable | Input message field that will contain the result of the anomaly test: true if a data record is an anomaly, false otherwise. |
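The anomaly threshold rule can be illustrated with a short sketch. The assumptions here are labeled: Euclidean distance is used because the description only says "distance", the centroids are given directly rather than learned, and the function name `flag_anomalies` and the output field name `is_anomaly` are hypothetical stand-ins for the processor and the Is Anomaly Variable setting.

```python
import math
from typing import Dict, List, Sequence


def flag_anomalies(
    records: List[Dict[str, float]],
    centroids: List[Sequence[float]],
    feature_fields: Sequence[str],
    anomaly_threshold: float,
    is_anomaly_field: str = "is_anomaly",
) -> List[dict]:
    """Mark each record whose distance to its nearest centroid exceeds the threshold."""
    flagged = []
    for record in records:
        point = [record[name] for name in feature_fields]
        # Euclidean distance is an assumption; the source only says "distance".
        nearest = min(math.dist(point, centroid) for centroid in centroids)
        flagged.append({**record, is_anomaly_field: nearest > anomaly_threshold})
    return flagged


centroids = [(0.0, 0.0), (10.0, 10.0)]
records = [{"x": 1.0, "y": 1.0}, {"x": 5.0, "y": 5.0}]
flagged = flag_anomalies(records, centroids, ["x", "y"], anomaly_threshold=3.0)
```

With a threshold of 3.0, the first record lies close to the centroid at the origin and is not flagged, while the second record sits roughly 7.07 away from both centroids and is flagged as an anomaly.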