Cassandra ETL Source

Note: The Cassandra connector is available to Gathr users on request.

See the Connector Marketplace topic. Ask your administrator to start a trial or subscribe to the Premium Cassandra connector.

In Gathr, Cassandra can be added as a channel to fetch customer and prospect data and transform it as needed before storing it in the desired data warehouse for further analytics.

Schema Type

See the topic Provide Schema for ETL Source → to learn how to provide schema details for data sources.

After providing schema type details, the next step is to configure the data source.

Data Source Configuration

Configure the data source parameters as explained below.

Connection Name

Connections are the service identifiers. Select a connection name from the list if you have previously created and saved connection details for Cassandra, or create one as explained in the topic Cassandra Connection →

Use the Test Connection option to ensure that the connection with the Cassandra channel is established successfully.

A success message confirms that the connection is available. If the test connection fails, edit the connection to resolve the issue before proceeding.

Schema Name

Schemas are listed as per the configured connection.

Select the schema to be read from.

Entity

Tables in Cassandra are statically defined to model Cassandra entities.

If you selected the Fetch From Source method to design the application, the entities are listed as per the configured connection. Select the entity to be read from Cassandra.

If you selected the Upload Data File method to design the application, provide the exact name of the entity to read the data from Cassandra.

If you selected the Fetch From Source method to design the application, the fields are listed as per the entity chosen in the previous configuration parameter. Select the fields or provide a custom query to read the desired records from Cassandra.

Fields

The conditions to fetch source data from a Cassandra table can be specified using this option.

Select Fields: Select the column(s) of the entity that should be read.

Custom Query: Provide an SQL query specifying the read conditions for the source data.

Example: SELECT "Id" FROM Companies

If you selected the Upload Data File method to design the application, provide a custom query to fetch records from the Cassandra entity specified in the previous configuration.

Query

The conditions to fetch source data from a Cassandra table can be specified using this option.

Provide an SQL query specifying the read conditions for the source data.

Example: SELECT "Id" FROM Companies

Read Options

This section contains additional read options that can be configured as needed.

Aggregations Supported

Whether or not to support aggregations in the Cassandra server. Note that in queries to the provider, you must use single quotes to define strings.

Allow Filtering

When true, slow-performing queries are processed on the server.

Cassandra by default does not allow filtering for queries that it predicts will have performance problems.

You can override the default behavior and rely on the server to process these queries by setting Allow Filtering to true.
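As a rough illustration of what this option means at the CQL level, the sketch below appends Cassandra's `ALLOW FILTERING` clause to a statement. The helper name `with_allow_filtering` and the table and column in the usage line are hypothetical. The exact statement the connector sends is not documented here, so treat this only as a picture of the server-side clause involved.

```python
def with_allow_filtering(cql: str) -> str:
    """Return the statement with an ALLOW FILTERING clause (hypothetical helper)."""
    # Drop trailing whitespace and an optional terminating semicolon.
    statement = cql.rstrip().rstrip(";").rstrip()
    # Leave the clause alone if the statement already carries it.
    if statement.upper().endswith("ALLOW FILTERING"):
        return statement + ";"
    return statement + " ALLOW FILTERING;"


# Hypothetical table and column, used only for illustration.
print(with_allow_filtering("SELECT * FROM companies WHERE industry = 'Retail'"))
```

Because such queries may scan many rows on the server, enabling the option is best reserved for cases where the filtered data set is known to be small.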

Case Sensitivity

Enables case sensitivity for the CQL sent to the server. If set to True, the identifiers in the CQL are enclosed in double quotation marks.

By default, SQL is case-insensitive. However, Cassandra supports case-sensitive table and column names. Setting this property to True will enable you to retrieve tables and columns based on their case-sensitive names.
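The quoting behavior can be sketched in a few lines of Python. The helper names `quote_identifier` and `build_select` are hypothetical and are not part of Gathr or any Cassandra driver. They only illustrate how enclosing an identifier in double quotation marks preserves its case in CQL.

```python
def quote_identifier(name: str, case_sensitive: bool) -> str:
    """Return a CQL identifier, double-quoted when its case must be preserved."""
    if not case_sensitive:
        # Unquoted identifiers are treated as case-insensitive by Cassandra.
        return name
    # A double quote inside a quoted identifier is escaped by doubling it.
    return '"' + name.replace('"', '""') + '"'


def build_select(column: str, table: str, case_sensitive: bool) -> str:
    """Build a simple SELECT statement with or without quoted identifiers."""
    col = quote_identifier(column, case_sensitive)
    tbl = quote_identifier(table, case_sensitive)
    return f"SELECT {col} FROM {tbl}"


print(build_select("Id", "Companies", case_sensitive=True))
print(build_select("Id", "Companies", case_sensitive=False))
```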

Flatten Objects

Set Flatten Objects to true to flatten object properties into columns of their own. Otherwise, objects nested in arrays are returned as strings of JSON.

The property name is concatenated onto the object name with an underscore to generate the column name.

For example, you can flatten the nested objects below at connection time:

[
     { "grade": "A", "score": 2 },
     { "grade": "A", "score": 6 },
     { "grade": "A", "score": 10 },
     { "grade": "A", "score": 9 },
     { "grade": "B", "score": 14 }
]

When FlattenObjects is set to true and FlattenArrays is set to 1, the preceding array (held in a field named grades, as the column names show) is flattened into the following table:

Column Name	Column Value
grades_0_grade	A
grades_0_score	2
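The naming rule above can be reproduced with a short Python sketch. The function `flatten_object_array` is a hypothetical illustration, not the connector's implementation, and it assumes the array is stored in a field named `grades`.

```python
from typing import Any, Dict, List


def flatten_object_array(field: str, items: List[Dict[str, Any]],
                         flatten_arrays: int) -> Dict[str, Any]:
    """Flatten the first `flatten_arrays` objects of an array into columns."""
    columns: Dict[str, Any] = {}
    for index, item in enumerate(items[:flatten_arrays]):
        for prop, value in item.items():
            # Column name: <field>_<zero-based index>_<property name>
            columns[f"{field}_{index}_{prop}"] = value
    return columns


grades = [
    {"grade": "A", "score": 2},
    {"grade": "A", "score": 6},
    {"grade": "A", "score": 10},
    {"grade": "A", "score": 9},
    {"grade": "B", "score": 14},
]

# Mirrors FlattenObjects = true with FlattenArrays = 1.
columns = flatten_object_array("grades", grades, flatten_arrays=1)
print(columns)
```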

Null To Unset

Use unset instead of NULL in the CQL query when performing INSERT operations.

In Cassandra 2.2 and above, a parameter value can be set to unset when executing an INSERT query. Cassandra ignores unset field values, which helps to avoid tombstones.

When NULL values are inserted, the tombstone threshold limits can be reached, which causes an exception to be thrown when querying the data. Setting this property to true and submitting unset values prevents these tombstones from being created.

Note: This option is only available on INSERT operations as Cassandra does not support changing existing values to unset.

Use Json Format

Whether to submit and return the JSON encoding for CQL data types.

Cassandra 2.2 introduced a CQL extension that allows you to JSON-encode CQL data types.

By default, you use the JSON syntax to manipulate data and SELECT statements return JSON through the connector.

Set this property to false to use CQL literals to interact with Cassandra data.

The syntax for CQL literals has several differences from JSON.

Example:

CQL strings are defined in single quotes, while JSON strings are defined in double quotes.
CQL sets, tuples, and lists are JSON-encoded as arrays.
User-defined types and CQL map types are JSON-encoded as objects.
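To make the differences concrete, the sketch below renders the same Python values as CQL literals and as JSON. The function `to_cql_literal` is a hypothetical helper that covers only a few types. It is not the connector's encoder.

```python
import json


def to_cql_literal(value) -> str:
    """Render a Python value as a CQL literal (strings, numbers, lists, sets, tuples)."""
    if isinstance(value, str):
        # CQL strings use single quotes; an embedded quote is doubled.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, bool):
        # Checked before int because bool is a subclass of int in Python.
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, list):
        return "[" + ", ".join(to_cql_literal(v) for v in value) + "]"
    if isinstance(value, (set, frozenset)):
        # Sorted only to make the output deterministic in this example.
        return "{" + ", ".join(to_cql_literal(v) for v in sorted(value)) + "}"
    if isinstance(value, tuple):
        return "(" + ", ".join(to_cql_literal(v) for v in value) + ")"
    raise TypeError(f"type not covered by this sketch: {type(value).__name__}")


name = "O'Brien"
tags = {"b", "a"}
pair = (1, "x")

print(to_cql_literal(name), "vs", json.dumps(name))
# Sets and tuples have no JSON counterpart, so both become JSON arrays.
print(to_cql_literal(tags), "vs", json.dumps(sorted(tags)))
print(to_cql_literal(pair), "vs", json.dumps(list(pair)))
```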

Varint To String

Maps Cassandra VARINT values to String.

Consistency Level

The consistency level determines how many of the replicas of the data you are interacting with need to respond for the query to be considered a success.

You need to specify the appropriate replicas in the Server property.

Below are the possible values:

ANY: At least one replica must return success in a write operation. This level guarantees that a write never fails; it delivers the lowest consistency and highest availability.
ALL: All replicas must respond. This level provides the highest consistency and the lowest availability.
ONE: At least one replica must respond. This is the default and suitable for most users, who do not typically require high consistency.
TWO: At least two replicas must respond.
THREE: At least three replicas must respond.
QUORUM: A quorum of nodes must respond. The QUORUM levels provide high consistency with some failure tolerance.
EACH_QUORUM: A quorum of nodes must respond where a quorum is calculated for each data center. This setting maintains consistency in each data center.
SERIAL: A quorum of replicas performs a consensus algorithm to allow lightweight transactions.
LOCAL_ONE: At least one replica in the local data center must respond.
LOCAL_SERIAL: The consensus algorithm is calculated for the local data center.
LOCAL_QUORUM: A quorum of nodes must respond where the quorum is calculated for the local data center.
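For a sense of scale, the sketch below computes how many replicas must respond for a few of these levels in a single data center. It assumes the standard Cassandra quorum formula, floor(replication factor / 2) + 1. The replication factor is a keyspace setting that is not part of this configuration, and the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must respond in a single data center for the given level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        required = fixed[level]
    elif level == "QUORUM":
        required = quorum(replication_factor)
    elif level == "ALL":
        required = replication_factor
    else:
        # Multi-data-center and serial levels are left out of this sketch.
        raise ValueError(f"level not covered by this sketch: {level}")
    if required > replication_factor:
        raise ValueError("level needs more replicas than the replication factor provides")
    return required


for level in ("ONE", "QUORUM", "ALL"):
    print(level, replicas_required(level, replication_factor=3))
```

With a replication factor of 3, QUORUM tolerates one unavailable replica, whereas ALL tolerates none.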

Flatten Arrays

By default, nested arrays are returned as strings of JSON.

The FlattenArrays property can be used to flatten the elements of nested arrays into columns of their own.

Note: This is only recommended for arrays that are expected to be short.

Set FlattenArrays to the number of elements you want to return from nested arrays.

The specified elements are returned as columns. The zero-based index is concatenated to the column name. Other elements are ignored.

Example:

You can return an arbitrary number of elements from an array of strings:

["FLOW-MATIC","LISP","COBOL"]

When FlattenArrays is set to 1, the preceding array (held in a field named languages, as the column name shows) is flattened into the following table:

Column Name	Column Value
languages_0	FLOW-MATIC
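The same rule can be written as a small Python sketch. The function `flatten_array` is a hypothetical illustration of the naming convention, not the connector's code, and it assumes the array is stored in a field named `languages`.

```python
from typing import Any, Dict, List


def flatten_array(field: str, values: List[Any], flatten_arrays: int) -> Dict[str, Any]:
    """Return the first `flatten_arrays` elements as columns named <field>_<index>."""
    # Elements beyond the configured count are ignored.
    return {f"{field}_{i}": v for i, v in enumerate(values[:flatten_arrays])}


languages = ["FLOW-MATIC", "LISP", "COBOL"]

print(flatten_array("languages", languages, flatten_arrays=1))
print(flatten_array("languages", languages, flatten_arrays=2))
```

Each additional element becomes one more column, which is why the option suits short arrays.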

Partitioning

This section contains partitioning-related configuration parameters.

Enable Partitioning

This enables parallel reading of the data from the entity.

Partitioning is disabled by default.

If enabled, an additional option will appear to configure the partitioning conditions.

Column

The selected column will be used to partition the data.

Max Rows per Partition: Enter the maximum number of rows to be read in a single request.

Example: 10,000

This means that at most 10,000 rows are read in one partition.
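As a back-of-the-envelope aid, the sketch below estimates how many partitions a read would produce for a given row count. How Gathr actually assigns rows to partitions is not described here, so the even split and the function name `partition_count` are assumptions.

```python
def partition_count(total_rows: int, max_rows_per_partition: int) -> int:
    """Estimate the number of partitions, assuming rows are split evenly."""
    if max_rows_per_partition <= 0:
        raise ValueError("max_rows_per_partition must be positive")
    if total_rows < 0:
        raise ValueError("total_rows must not be negative")
    # Integer ceiling division: the last partition may hold fewer rows.
    return -(-total_rows // max_rows_per_partition)


# 25,000 rows with at most 10,000 rows per partition.
print(partition_count(25_000, 10_000))
```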

Advanced Configuration

This section contains additional configuration parameters.

Fetch Size

The number of rows to be fetched per round trip. The default value is 1000.

Add Configuration: Additional properties can be added using this option as key-value pairs.

Detect Schema

Check the populated schema details. For more details, see Schema Preview →

Pre Action

To understand how to provide SQL queries or stored procedures that are executed during the pipeline run, see Pre-Actions →.

Notes

Optionally, enter notes in the Notes → tab and save the configuration.
