Couchbase Ingestion Source

💡

Couchbase connector is available on request to Gathr users.

See the Connector Marketplace topic. Please request your administrator to start a trial or subscribe to the Premium Couchbase connector.

In Gathr, it can be added as a channel to help in fetching customers’ and prospects’ data and transform it as needed before storing it in a desired data warehouse to run further analytics.

Data Source Configuration

Configure the data source parameters as explained below.

Fetch From Source/Upload Data File

To design the application, you can either fetch the sample data from the Couchbase source by providing the data source connection details or upload a sample data file in one of the supported formats to see the schema details during the application design phase.

Upload Data File

To design the application, please upload a data file containing sample records in a format supported by Gathr.

The sample data provided for application design should match the data source schema from which data will be fetched during runtime.

If Upload Data File method is selected to design the application, provide the below details.

File Format

Select the format of the sample file depending on the file type.

Gathr-supported file formats for Couchbase data sources are CSV, JSON, TEXT, Parquet and ORC.

For CSV file format, select its corresponding delimiter.

Header Included

Enable this option to read the first row as a header if your Couchbase sample data file is in CSV format.

Upload

Please upload the sample file as per the file format selected above.

👉

Make sure that the file size does not exceed 10 MB.

Fetch From Source

If Fetch From Source method is selected to design the application, then the data source connection details will be used to get sample data.

Continue to configure the data source.

Connection Name

Connections are the service identifiers. A connection name can be selected from the list if you have created and saved connection details for Couchbase earlier. Or create one as explained in the topic - Couchbase Connection →

Use the Test Connection option to ensure that the connection with the Couchbase channel is established successfully.

A success message states that the connection is available. In case of any error in test connection, edit the connection to resolve the issue before proceeding further.

Schema Name

Databases will list as per the configured connection. Select the database to be read from.

Entity

Tables in Couchbase are statically defined to model Couchbase entities.

If you selected the Fetch From Source method to design the application, the Entities will list as per the configured connection. Select the entity to be read from Couchbase.

If you selected the Upload Data File method to design the application, the exact name of the entity should be provided to read the data from Couchbase.

If you selected the Fetch From Source method to design the application, the Fields would list as per the Entity chosen in the previous configuration parameter. Select the fields or provide a custom query to read the desired records from Couchbase.

Fields

The conditions to fetch source data from a Couchbase table can be specified using this option.

Select Fields: Select the column(s) of the entity that should be read.

Custom Query: Provide an SQL query specifying the read conditions for the source data.

Example: SELECT "Id" FROM Companies

If you selected the Upload Data File method to design the application, provide a custom query to fetch records from the Couchbase entity specified in the previous configuration.

Query

The conditions to fetch source data from a Couchbase table can be specified using this option.

Provide an SQL query specifying the read conditions for the source data.

Example: SELECT "Id" FROM Companies

Read Options

This section contains additional read options.

Allow JSON Parameters

Allows raw JSON to be used in parameters when QueryPassthrough is enabled.

This option affects how string parameters are handled when using direct N1QL and SQL++ queries through QueryPassthrough.

By default, this option is disabled and string parameters are quoted and escaped into JSON strings. That means that any value can be safely used as a string parameter, but it also means that parameters cannot be used as raw JSON documents.

When this option is enabled, string parameters are assumed to be valid JSON. This means that raw JSON documents can be used as parameters, but it also means that all simple strings must be escaped.

Please refer to Validate JSON Parameters for more details on how parameters are validated when this option is enabled.

Insert Null Values

Determines whether an INSERT should include fields that have NULL values.

By default the connector uses NULL values provided in an INSERT statement and inserts them as JSON null values.

If this option is disabled, SQL NULL values are ignored during an INSERT. In the case of array columns (FlattenArrays must be set to retrieve these), this means that array indices are shifted over to compensate for the values that have been removed.

Query Passthrough

This option passes the query to the Couchbase server as is.

Flatten Objects

Set Flatten Objects to true to flatten object properties into columns of their own. Otherwise, objects nested in arrays are returned as strings of JSON.

The property name is concatenated onto the object name with an underscore to generate the column name.

For example, you can flatten the nested objects below at connection time:

address : { “street” : “123 Main St.”, “city” : “Nowhere”, “state” : “NY”, “zip” : “12345” }

When Flatten Objects is set to true, the preceding object is flattened into the following table:

Column Name	Column Value
address.street	123 Main St.
address.city	Nowhere
address.state	NY
address.zip	12345

Update Null Values

Determines whether an UPDATE writes NULL values as NULL, or removes them.

If this option is disabled SQL NULL values in an UPDATE will cause the connector to mark the field as MISSING. This removes the field from the object containing it, or if the field is contained in an array (per FlattenArrays) then that element is set to NULL.

This option should be used with care as the connector may not detect that the field exists if it is removed from enough documents within a bucket.

Flexible Schemas

Whether the provider allows queries to use columns that it has not discovered.

By default connector will only allow queries to use columns that it has found during the metadata discovery process. This means that the connector has the full information for each column it presents, but it also means that fields set on only a few documents may not be exposed. Disabling this option means that the connector will allow you to write a query with any columns you want. If you use columns in a query that have not been discovered the connector will assume that they are simple strings.

For example, the connector uses column type information to automatically convert dates for comparision since Couchbase cannot natively compare dates directly. If the connector detects that datecol is a date field, it can apply the STR_TO_MILLIS conversion automatically.

When using undiscovered columns the connector cannot make this type of conversion for you. You must apply any needed conversions manually to ensure that operations behave the way you want them to.

Expose TTL

Specifies whether document TTL information should be exposed.

By default the connector does not expose TTL values or consider document TTLs when performing DML operations. Enabling this option exposes TTL values in two ways:

All tables get a new column called Document.Expiration which contains the TTL value for each document. This column is an integer and returns whatever TTL value is stored in Couchbase directly. This column is read-write on bucket tables and read-only on child tables.
INSERT and UPDATE will use this field to set TTL values, or to preserve them (for update) when none is provided. Setting the field to either 0 or NULL will remove the TTL from any affected documents.

Note that enabling this features requires that your server be version 6.5.1 or later and that your CouchbaseService is set to N1QL. If either of these is not the case the connector will not connect.

Numeric Strings

Whether to allow string values to be treated as numbers.

By default this property is enabled and the connector will treat string values as numeric if they all the values it samples during schema detection are numeric. This can cause type errors later on if the field contains non-numeric values in other documents. If this property is disabled then numeric strings are left as strings although other string-based data types like timestamps will still be detected.

For example, the “code” field in the below bucket would be affected by this setting. By default it would be considered an integer but if this property were enabled it would be treated as a string.

{ “code”: “123”, “message”: “Please restart your computer” } { “code”: “456”, “message”: “Urgent update must be applied” }

Ignore Child Aggregates

Whether the provider exposes aggregate columns that are also available as child tables. Ignored if TableSupport is not set to Full.

The connector will expose array fields within a bucket as a separate child table, such as in the Games_scores example described in Automatic Schema Discovery. By default the connector will also expose these array fields as JSON aggregates on the base table. For example, either of these queries would return information on game scores:

Return each score as an individual row * SELECT value FROM Games_scores;
Return all scores for each Game as a JSON string * SELECT scores FROM Games;

Since these aggregates are exposed on the base table, they will be generated even when the information they contain is redundant. For example, when performing this join the scores aggregate on Games is populated as well as the value column on Games_scores. Internally this causes two copies of the scores data to be transferred from Couchbase.

Retrieves score data twice, once for Games.scores and once for Games_scores.value * SELECT * FROM Games INNER JOIN Games_scores ON Games.[Document.Id] = Games_scores.[Document.Id]

This option can be used to prevent the aggregate field from being exposed when the same information is also available from a child table. In the games example, setting this option to true means that the Games table would only expose a primary key column. The only way to retrieve information about scores would be the child table, so score data would only be read once from Couchbase.

Only exposes Document.Id, not scores * SELECT * FROM Games;
Only retrieves score data once for Games_scores.value * SELECT * FROM Games INNER JOIN Games_scores ON Games.[Document.Id] = Games_scores.[Document.Id]

Note that this option overrides FlattenArrays, since all data from flattened arrays is also avaialable as child tables. If this option is set then no array flattening is performed, even if FlattenArrays is set to a value over 0.

Use Collections For DDL

Whether to assume that CREATE TABLE statements use collections instead of flavors. Only takes effect when connecting to Couchbase v7+ and GenerateSchemaFiles is set to OnCreate.

Normally the connector will assume that compound table names referenced in a CREATE TABLE statement are flavors. For compatibility, this is still the default with Couchbase v7+ even though flavors are not recommended there.

CREATE TABLE [myBucket.myFlavor]( [Document.Id] VARCHAR PRIMARY KEY, docType VARCHAR, sometext VARCHAR, somenum INT )

Enable this option to assume that CREATE TABLE statements refer to collection instead. In that scenario this query will create the bucket and scope if necessary, before creating the colleciton and setting a primary index:

CREATE TABLE [myBucket.myScope.myCollection]( [Document.Id] VARCHAR PRIMARY KEY, sometext VARCHAR, somenum INT )

Validate JSON Parameters

Allows the provider to validate that string parameters are valid JSON before sending the query to Couchbase.

When Allow JSON Parameters and Query Passthrough are enabled, the query parameters given to the connector will be treated as raw JSON documents instead of arbitrary string values. This option controls what happens when invalid JSON is given to the connector in this mode.

When this option is enabled, the connector will check that all string parameters can be parsed as valid JSON. If any cannot be, an error will be raised and the query will not be run.

When this option is disabled, no check is performed and all string parameter values are substituted into the query directly. This makes executing prepared statements faster, but less safe since invalid N1QL or SQL++ may be sent to the Couchbase.

Child Separator

The character or characters used to denote child tables.

When creating a child table for an array underneath a bucket, the connector will generate the name of the child table by concatenating the name of the base table, along with this separator and each path element.

For example, if this document were in the bucket “customers”, then the child table for the addresses field would be called “customers_addresses”.

{ “addresses”: [ {“street”: “123 Main St”}, {“street”: “424 Pleasant Ct”}, {“street”: “719 Blue Way”} ] }

Create Table Ram Quota

The default RAM quota, in megabytes, to use when inserting buckets via the CREATE TABLE syntax.

Dataverse Separator

The character or characters used to denote Analytics dataverses and scopes/collections.

When using the Analytics serivce, the connector will scan all datasets from all available dataverses. To avoid potential name conflicts, it will include the dataverse name and the dataset name in the generated table name.

By default this is set to “.”, so that if there is a dataset called “users” on the “Default” dataverse, then the table generated will be “Default.users”.

This property is also used when generating table names for collections (on both N1QL and Analytics) on Couchbase 7 and later. For example, a bucket called “users” that has two collections called “active” and “inactive” under the “status” scope would be detected as the tables “users.status.active” and “users.status.inactive”.

Flatten Arrays

The number of elements to expose as columns from nested arrays. Ignored if IgnoreChildAggregates is enabled.

By default, nested arrays are returned as strings of JSON. The Flatten Arrays property can be used to flatten the elements of nested arrays into columns of their own. This is only recommended for arrays that are expected to be short.

Set Flatten Arrays to the number of elements you want to return from nested arrays. The specified elements are returned as columns. The zero-based index is concatenated to the column name. Other elements are ignored.

For example, you can return an arbitrary number of elements from an array of strings:

[“FLOW-MATIC”,“LISP”,“COBOL”]

When FlattenArrays is set to 1, the preceding array is flattened into the following table:

Column Name	Column Value
languages.0	FLOW-MATIC

Flavor Separator

The character or characters used to denote flavors.

When the connector detects a flavored table, using either a DocType or Infer TypeDetectionScheme, it names flavored tables by concatenating the underlying bucket name, this seprator, and the value of the bucket’s primary flavor.

For example, if the connector detects the flavor “docType = ‘beer’” on the “beer-sample” bucket, then it will generate the table “beer-sample.beer” which contains only documents in “beer-sample” which have the “beer” doctype.

Generate SchemaFiles

Indicates the user preference as to when schemas should be generated and saved.

Possible Values

Never, OnUse, OnStart, OnCreate

Generate SchemaFiles enables you to save the table definitions identified by Automatic Schema Discovery. This property outputs schemas to .rsd files in the path specified by Location.

Available settings are the following:

Never: A schema file will never be generated.
OnUse: A schema file will be generated the first time a table is referenced, provided the schema file for the table does not already exist.
OnStart: A schema file will be generated at connection time for any tables that do not currently have a schema file.
OnCreate: A schema file will be generated by when running a CREATE TABLE SQL query.

Note that if you want to regenerate a file, you will first need to delete it.

Generate Schemas with SQL

When you set Generate SchemaFiles to OnUse, the connector generates schemas as you execute SELECT queries. Schemas are generated for each table referenced in the query.

When you set Generate SchemaFiles to OnCreate, schemas are only generated when a CREATE TABLE query is executed.

Generate Schemas on Connection

Another way to use this property is to obtain schemas for every table in your database when you connect. To do so, set Generate SchemaFiles to OnStart and connect.

Alternatives to Static Schemas

If your data structures are volatile, consider setting Generate SchemaFiles to Never and using dynamic schemas.

Editing Schemas

Schema files have a simple format that makes them easy to modify.

Page size

The maximum number of results to return per page from Couchbase.

The Pagesize property affects the maximum number of results to return per page from Couchbase.

Setting a higher value may result in better performance at the cost of additional memory allocated per page consumed.

Periods Separator

The character or characters used to denote hierarchy.

When flattening objects and arrays, the connector will use this value to separate different levels of objects and arrays. For example, if your Couchbase server returns a document like this (and FlattenObjects is enabled), then the connector will return the columns “geo.latitude” and “geo.longitude” if the periods separator is set to “.”.

{ “geo”: { “latitude”: 35.9132, “longitude”: -79.0558 } }

Query Execution Timeout

This sets the server-side timeout for the query, which governs how long Couchbase will execute the query before returning a timeout error.

The default is -1, which disables the timeout. When enabling the timeout, the value must include both an amount and a unit, which can be one of: “ns” (nanoseconds), “us” (microseconds), “ms” (milliseconds), “s” (seconds), “m” (minutes) or “h” (hours). For example, “5m” and “300s” both set timeouts of 5 minutes.

There is a server-side timeout as well called the “index scan timeout”, which will override this one if it is lower. By default the index scan timeout is 2 minutes, but it can be changed by setting the “indexer.settings.scan_timeout” property on your Couchbase server.

Strict Comparison

Adjusts how precisely to translate filters on SQL input queries into Couchbase queries. This can be set to a comma-separated list of values, where each value can be one of: date, number, boolean, or string.

This option is empty by default, which means that WHERE clauses sent to Couchbase will include extra functions that convert values so that more comparisons work.

For example, leaving the “string” setting out of the list causes arrays to be converted, so that they can be compared with strings:

SELECT * FROM Bucket WHERE MyArrayColumn = ‘[1,2,3]’

If set to a value, queries including the relevant types of comparisons will be translated literally. This makes better use of Couchbase’s indexes, but means that the types of comparisons must be in a format Couchbase can compare directly.

For example, if “date” is provided as one of the options, then dates must match the format they are stored as in Couchbase since they will not be converted automatically.

Transaction Timeout

This sets the amount of time a transaction may execute before it is timed out by Couchbase.

If transactions are enabled, then the connector will default to the server’s default transaction timeout setting.

When enabling the timeout, the value must include both an amount and a unit, which can be one of: “ns” (nanoseconds), “us” (microseconds), “ms” (milliseconds), “s” (seconds), “m” (minutes) or “h” (hours). For example, “5m” and “300s” both set timeouts of 5 minutes.

There are also cluster-level and node-level transaction timeouts which override this one if they are smaller. For example, if the node-level timeout is set to a minute then setting this option to “5m” will have no effect.

Infer Num Sample Values

The maximum number of values for every field to scan before determining its data type. Applies to Automatic Schema Discovery when TypeDetectionScheme is set to INFER.

The maximum number of values to scan from every field of the sampled documents before determining the field’s data type. This property enables additional configuration of Automatic Schema Discovery when you are using the Couchbase Infer command – TypeDetectionScheme must also be set to Infer to use this propery.

Infer Sample Size

The maximum number of documents to scan for the columns available in the bucket. Applies to Automatic Schema Discovery when TypeDetectionScheme is set to INFER.

The maximum number of documents to scan for the columns available in the bucket. The Infer command will return column metadata by scanning a random sample of documents of the size specified here.

Setting a high value may decrease performance. Setting a low value may prevent the column and data type from being determined properly, especially when there is null data.

This property enables additional configuration of Automatic Schema Discovery when you are using the Couchbase Infer command – TypeDetectionScheme must also be set to Infer to use this propery.

InferSimilarityMetric

Specifies the similarity degree where different schemas will be considered to be the same flavor. Applies to Automatic Schema Discovery when TypeDetectionScheme is set to INFER.

This property specifies how similar two schemas must be to be considered to be the same flavor.

As an example, consider the following rows:

Row 1: ColA, ColB, ColC, ColD Row 2: ColA, ColB, ColE, ColF Row 3: ColB, ColF, ColX, ColY

You can configure the columns returned for each flavor with different InferSimilarityMetric values, as in the following examples:

If you set InferSimilarityMetric to 1, the connector will return no flavors.
If you set InferSimilarityMetric to 0.5, the connector will return 2 flavors, Row1 and Row2 making up one, and Row3 making up another.
If you set InferSimilarityMetric to 0.25, the connector will return a single flavor containing all rows.

You can then query document flavors using dot notation, as in the following statement:

SELECT * FROM [Items.Technology]

This property enables additional configuration of Automatic Schema Discovery when you are using the Couchbase Infer command – TypeDetectionScheme must also be set to Infer to use this propery.

Partitioning

This section contains partitioning-related configuration parameters.

Enable Partitioning

This enables parallel reading of the data from the entity.

Partitioning is disabled by default.

If enabled, an additional option will appear to configure the partitioning conditions.

Column

The selected column will be used to partition the data.

Max Rows per Partition: Enter the maximum number of rows to be read in a single request.

Example: 10,000

It implies that a maximum number of 10,000 rows can be read in one partition.

Advanced Configuration

This section contains additional configuration parameters.

Fetch Size

The number of rows to be fetched per round trip. The default value is 1000.

Add Configuration: Additional properties can be added using this option as key-value pairs.

Schema

Check the populated schema details. For more details, see Schema Preview →

If you have any feedback on Gathr documentation, please email us!

Couchbase Ingestion Source

Data Source Configuration #

Fetch From Source/Upload Data File #

Upload Data File #

File Format #

Header Included #

Upload #

Fetch From Source #

Connection Name #

Schema Name #

Entity #

Fields #

Query #

Read Options #

Allow JSON Parameters #

Insert Null Values #

Query Passthrough #

Flatten Objects #

Update Null Values #

Flexible Schemas #

Expose TTL #

Numeric Strings #

Ignore Child Aggregates #

Use Collections For DDL #

Validate JSON Parameters #

Child Separator #

Create Table Ram Quota #

Dataverse Separator #

Flatten Arrays #

Flavor Separator #

Generate SchemaFiles #

Generate Schemas with SQL #

Generate Schemas on Connection #

Alternatives to Static Schemas #

Editing Schemas #

Page size #

Periods Separator #

Query Execution Timeout #

Strict Comparison #

Transaction Timeout #

Infer Num Sample Values #

Infer Sample Size #

InferSimilarityMetric #

Partitioning #

Enable Partitioning #

Column #

Advanced Configuration #

Fetch Size #

Schema #