Scala Processor
Scala is a general-purpose programming language that runs on the Java Virtual Machine. It can interact with data stored in a distributed manner and is widely used to process and analyze big data.
The Scala processor lets you write custom code in the Scala language.
Please email Gathr Support to enable the Scala Processor.
Several code snippets are available in the application; they are explained in this topic to get you started with the Scala processor.
Processor Configuration
Configure the processor parameters as explained below.
Package Name
Name of the package for the Scala code class.
Class Name
Name of the Class for the Scala code.
Imports
Import statements for the Scala code.
Input Source
Input Source for the Scala Code.
Scala Code
Scala code to perform the operations on the incoming dataset (exposed as a DataFrame named dataset in the sample snippets below).
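Taken together, these fields describe a complete Scala class. The sketch below shows how they might fit together; the wrapper class and the process method signature are illustrative assumptions (Gathr generates the actual skeleton), while the body follows the sample snippets in this topic, which receive the input as a DataFrame named dataset.
package com.example.processors // value of the Package Name field (placeholder)

import org.apache.spark.sql.DataFrame // statements from the Imports field
import org.apache.spark.sql.functions._

// Value of the Class Name field (placeholder); this wrapper and the
// method signature are assumptions for illustration only.
class MyScalaProcessor {
  def process(dataset: DataFrame): DataFrame = {
    // The Scala Code field supplies this body: it operates on the
    // incoming `dataset` and returns the transformed DataFrame.
    dataset.withColumn("Constant_Column", lit("Constant_Value"))
  }
}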
Ask AI Assistant
Use the AI assistant feature to simplify the creation of Scala queries.
It allows you to generate complex Scala queries effortlessly, using natural language inputs as your guide.
Describe your desired expression in plain, conversational language. The AI assistant will interpret your instructions and transform them into a functional Scala query.
Tailor queries to your specific requirements, whether for data transformation, filtering, calculations, or any other processing task.
Note: Press Ctrl + Space to list input columns and Ctrl + Enter to submit your request.
Input Example:
Select those records whose last_login_date is less than 60 days from current_date.
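For the request above, the assistant might produce an expression along the lines of this sketch (the exact output can differ; last_login_date is the column named in the example):
import org.apache.spark.sql.functions._

// Keep records whose last_login_date falls within 60 days of current_date.
dataset.filter(datediff(current_date(), col("last_login_date")) < 60)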
Jar Upload
A jar file is generated when you build the Scala code.
Here, you can upload third-party jars so that their APIs can be used in the Scala code by adding the corresponding import statements.
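For example, assuming an uploaded jar provides a hypothetical class com.thirdparty.text.Normalizer (a placeholder, not a real library), the Imports and Scala Code fields could use it as follows:
import com.thirdparty.text.Normalizer // hypothetical class from the uploaded jar
import org.apache.spark.sql.functions.{col, udf}

// Wrap the third-party API in a UDF so it can be applied to a column.
val normalize = udf((s: String) => Normalizer.normalize(s))
dataset.withColumn("Surname_Normalized", normalize(col("Surname")))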
Notes
Optionally, enter notes in the Notes tab and save the configuration.
Code Snippets
Described below are some sample Scala code use cases:
Add a new column with a constant value to an existing dataframe
Description
This script demonstrates how to add a column with a constant value.
In the sample code given below, a column Constant_Column with the value Constant_Value is added to an existing dataframe.
Sample Code
val sparkSession = dataset.sparkSession
import sparkSession.implicits._
import org.apache.spark.sql.functions._

// lit creates a literal column; every row receives "Constant_Value".
dataset.withColumn("Constant_Column", lit("Constant_Value"))
Example
Input Data Set
CustomerId | Surname | CreditScore | Geography | Age | Balance | EstSalary |
---|---|---|---|---|---|---|
15633059 | Fanucci | 413 | France | 34 | 0 | 6534.18 |
15604348 | Allard | 710 | Spain | 22 | 0 | 99645.04 |
15693683 | Yuille | 814 | Germany | 29 | 97086.4 | 197276.13 |
15738721 | Graham | 773 | Spain | 41 | 102827.44 | 64595.25 |
Output Data Set
CustomerId | Surname | CreditScore | Geography | Age | Balance | EstSalary | Constant_Column |
---|---|---|---|---|---|---|---|
15633059 | Fanucci | 413 | France | 34 | 0 | 6534.18 | Constant_Value |
15604348 | Allard | 710 | Spain | 22 | 0 | 99645.04 | Constant_Value |
15693683 | Yuille | 814 | Germany | 29 | 97086.4 | 197276.13 | Constant_Value |
15738721 | Graham | 773 | Spain | 41 | 102827.44 | 64595.25 | Constant_Value |
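lit is not limited to strings; the same pattern works for constants of other types, for example a numeric flag:
import org.apache.spark.sql.functions._

// lit infers the column type from the literal, here an integer.
dataset.withColumn("Constant_Flag", lit(1))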
Add a new column with random values to an existing dataframe
Description
This script demonstrates how to add a column with random values.
Here, a column Random_Column with random integer values is added.
Sample Code
val sparkSession = dataset.sparkSession
import sparkSession.implicits._
import org.apache.spark.sql.functions.rand

// rand returns a double in [0.0, 1.0); scaling by 100 and casting to
// bigint yields random whole numbers between 0 and 99.
dataset.withColumn("Random_Column", (rand * 100).cast("bigint"))
Example
Input Data Set
CustomerId | Surname | CreditScore | Geography | Age | Balance | EstSalary |
---|---|---|---|---|---|---|
15633059 | Fanucci | 413 | France | 34 | 0 | 6534.18 |
15604348 | Allard | 710 | Spain | 22 | 0 | 99645.04 |
15693683 | Yuille | 814 | Germany | 29 | 97086.4 | 197276.13 |
15738721 | Graham | 773 | Spain | 41 | 102827.44 | 64595.25 |
Output Data Set
CustomerId | Surname | CreditScore | Geography | Age | Balance | EstSalary | Random_Column |
---|---|---|---|---|---|---|---|
15633059 | Fanucci | 413 | France | 34 | 0 | 6534.18 | 2 |
15604348 | Allard | 710 | Spain | 22 | 0 | 99645.04 | 51 |
15693683 | Yuille | 814 | Germany | 29 | 97086.4 | 197276.13 | 26 |
15738721 | Graham | 773 | Spain | 41 | 102827.44 | 64595.25 | 84 |
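rand also accepts an optional seed, which helps when you want the generated values to be reproducible across runs:
import org.apache.spark.sql.functions.rand

// A fixed seed makes rand deterministic for a given partitioning of the data.
dataset.withColumn("Random_Column", (rand(42L) * 100).cast("bigint"))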
Add a new column using an expression with existing columns
Description
This script demonstrates how to add a new column derived from existing columns.
Here, a column Transformed_Column is added by multiplying the EstimatedSalary column with the Tenure column.
Sample Code
val sparkSession = dataset.sparkSession
import sparkSession.implicits._
import org.apache.spark.sql.functions._

// Multiply two existing columns row by row to populate the new column.
dataset.withColumn("Transformed_Column", col("EstimatedSalary") * col("Tenure"))
Example
Input Data Set
CustomerId | Surname | CrScore | Age | Tenure | Balance | EstimatedSalary |
---|---|---|---|---|---|---|
15633059 | Fanucci | 413 | 34 | 9 | 0 | 6534.18 |
15604348 | Allard | 710 | 22 | 8 | 0 | 99645.04 |
15693683 | Yuille | 814 | 29 | 8 | 97086.4 | 197276.13 |
15738721 | Graham | 773 | 41 | 9 | 102827.44 | 64595.25 |
Output Data Set
CustomerId | Surname | CrScore | Age | Tenure | Balance | EstimatedSalary | Transformed_Column |
---|---|---|---|---|---|---|---|
15633059 | Fanucci | 413 | 34 | 9 | 0 | 6534.18 | 58807.62 |
15604348 | Allard | 710 | 22 | 8 | 0 | 99645.04 | 797160.32 |
15693683 | Yuille | 814 | 29 | 8 | 97086.4 | 197276.13 | 1578209.04 |
15738721 | Graham | 773 | 41 | 9 | 102827.44 | 64595.25 | 581357.25 |
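The same transformation can also be written as a SQL-style string with expr, which is convenient when the expression is assembled dynamically:
import org.apache.spark.sql.functions.expr

// Equivalent to col("EstimatedSalary") * col("Tenure").
dataset.withColumn("Transformed_Column", expr("EstimatedSalary * Tenure"))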
Transform an existing column
Description
This script demonstrates how to transform a column.
Here, the values of the Balance column are rounded off and converted to integers.
Sample Code
val sparkSession = dataset.sparkSession
import sparkSession.implicits._
import org.apache.spark.sql.functions._

// Round Balance to the nearest whole number, then cast it to an integer.
dataset.withColumn("Balance", round(col("Balance")).cast("int"))
Example
Input Data Set
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance |
---|---|---|---|---|---|---|---|
15633059 | Fanucci | 413 | France | Male | 34 | 9 | 0 |
15604348 | Allard | 710 | Spain | Male | 22 | 8 | 0 |
15693683 | Yuille | 814 | Germany | Male | 29 | 8 | 97086.4 |
15738721 | Graham | 773 | Spain | Male | 41 | 9 | 102827.44 |
Output Data Set
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance |
---|---|---|---|---|---|---|---|
15633059 | Fanucci | 413 | France | Male | 34 | 9 | 0 |
15604348 | Allard | 710 | Spain | Male | 22 | 8 | 0 |
15693683 | Yuille | 814 | Germany | Male | 29 | 8 | 97086 |
15738721 | Graham | 773 | Spain | Male | 41 | 9 | 102827 |
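round also accepts a scale argument if you want to keep some decimal places rather than casting to an integer:
import org.apache.spark.sql.functions._

// Round Balance to two decimal places instead of a whole number.
dataset.withColumn("Balance", round(col("Balance"), 2))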
Filter data on the basis of a condition
Description
This script demonstrates how to filter data on the basis of some condition.
Here, customers having Age > 30 are selected.
Sample Code
val sparkSession = dataset.sparkSession
import sparkSession.implicits._

// The $"column" syntax comes from the implicits import above.
dataset.select($"*").filter($"Age" > 30)
Example
Input Data Set
CustomerId | Surname | CreditScore | Geography | Gender | Age |
---|---|---|---|---|---|
15633059 | Fanucci | 413 | France | Male | 34 |
15604348 | Allard | 710 | Spain | Male | 22 |
15693683 | Yuille | 814 | Germany | Male | 29 |
15738721 | Graham | 773 | Spain | Male | 41 |
Output Data Set
CustomerId | Surname | CreditScore | Geography | Gender | Age |
---|---|---|---|---|---|
15633059 | Fanucci | 413 | France | Male | 34 |
15738721 | Graham | 773 | Spain | Male | 41 |
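Predicates can be combined with the usual boolean operators to filter on several columns at once, for example:
val sparkSession = dataset.sparkSession
import sparkSession.implicits._

// && combines conditions; === is Spark's column equality operator.
dataset.filter($"Age" > 30 && $"Geography" === "Spain")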