Pinecone ETL Target
Pinecone ETL Target allows you to emit and manage data from your Gathr application to Pinecone, leveraging the simplicity and performance of Pinecone’s vector database for AI applications.
Target Configuration
Configure the data emitter parameters as explained below.
Connection Name
Connections are the service identifiers. Select a connection name from the list if you have already created and saved connection details for Pinecone. Otherwise, create one as explained in the topic Pinecone Connection.
Use the Test Connection option to verify that the connection to Pinecone is established successfully.
A success message confirms that the connection is available. If the test fails, edit the connection to resolve the issue before proceeding.
Index
When emitting data, you can choose an existing index or create a new one. If you pick an existing index, your data will be written to it. Otherwise, you can create a new index with a unique name, and your processed data will be written to it during the application run.
Index Info
Additional information about the selected Index is provided here.
For example: Name, Metric, Dimension, Pod Type, Status, Shards, Replicas, Pods, Metadata Config, and Source Collection.
CONFIGURE INDEX
Configure the fields below to create the index.
Index Type
Select the type of Pinecone index to create.
Supported Types: Serverless Index and Pod Based Index
Metric and Dimension(s)
The distance metric to be used for similarity search. You can use 'euclidean', 'cosine', or 'dotproduct'.
Additionally, specify the dimension of the vectors to be inserted into the index. Enter a dimension size between 1 and 20000.
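To illustrate what the three metrics measure, here is a minimal sketch in plain Python. The function names and the check_dimension helper are hypothetical and are not part of Gathr or the Pinecone client; the only facts taken from this page are the metric names and the 1 to 20000 dimension range.

```python
import math

def euclidean(a, b):
    """Euclidean distance: a smaller value means the vectors are closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dotproduct(a, b):
    """Dot product: a larger value means the vectors are more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: the dot product divided by the product of the norms."""
    norm_a = math.sqrt(dotproduct(a, a))
    norm_b = math.sqrt(dotproduct(b, b))
    return dotproduct(a, b) / (norm_a * norm_b)

def check_dimension(vector, dimension):
    """Hypothetical helper: reject vectors whose length differs from the index dimension."""
    if not 1 <= dimension <= 20000:
        raise ValueError("dimension must be between 1 and 20000")
    if len(vector) != dimension:
        raise ValueError(f"expected {dimension} values, got {len(vector)}")
    return vector

a = check_dimension([1.0, 0.0, 0.0], 3)
b = check_dimension([0.0, 1.0, 0.0], 3)
```

Every vector written to an index must have exactly as many values as the configured dimension, so a check like this before ingestion can surface mismatches early.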
Parameters for Pod Based Index
Environment
Select or enter a valid environment name for creating the Pinecone index.
Pod Type
Pod Type determines the hardware configuration for Pinecone indexes; s1 is storage-optimized, p1 is performance-optimized, and p2 is optimized for query throughput.
Pods
The number of pods required for running your Pinecone service. Generally, more pods mean more storage capacity, lower latency, and higher throughput.
Replicas
The number of times the index should be duplicated. Replicas provide higher availability and throughput.
Pod Size
Pods come in four different sizes: x1, x2, x4, and x8.
Each step up in size doubles the index storage and compute capacity. The default size is x1.
Parameters for Serverless Index
Cloud Provider
Select or enter a valid cloud provider for creating the Pinecone index.
Region
Select or enter a valid cloud provider region for creating the Pinecone index.
Index Metadata
When creating a new index, you can select the metadata fields you want to index.
Indexing metadata fields can help speed up searches while saving memory space.
If no field is selected, then all the metadata is indexed.
COLUMNS TO INGEST
Specify the configuration options for the columns to be ingested into Pinecone.
Values
Select the columns with vectors to be ingested.
Constraints: Select an array (float or double) column or a JSON array string.
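As an illustration of the two accepted input shapes, the following sketch normalizes either an array value or a JSON array string into a list of floats. The to_vector function is a hypothetical helper written for this example; it does not describe how Gathr performs the conversion internally.

```python
import json

def to_vector(value):
    """Convert an array value or a JSON array string into a list of floats."""
    if isinstance(value, str):
        value = json.loads(value)  # for example "[0.1, 0.2, 0.3]"
    if not isinstance(value, (list, tuple)):
        raise TypeError("vector must be an array or a JSON array string")
    vector = []
    for item in value:
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            raise TypeError(f"non-numeric vector element: {item!r}")
        vector.append(float(item))
    return vector
```

For example, to_vector("[0.1, 0.2]") and to_vector([0.1, 0.2]) yield the same list, while a JSON object string such as '{"a": 1}' is rejected.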
ID
The ID is a unique identifier for each record.
Select the column to ingest IDs from, or choose to autogenerate them in serial order.
Supports: String, Integer, or Long columns.
Metadata
The metadata field allows associating additional information with each vector in an index. It uses key-value pairs, where keys are strings, and values can be strings, numbers, booleans, or lists of strings.
Key
The key in metadata is a label or category that identifies a specific piece of information. It’s like a name tag that helps you find what you’re looking for.
Mapping Value
Select the input data to be associated with the Key in a key-value pair within Metadata. It can be a string, number, Boolean, or a list of strings. Constraints: Only Boolean, Integer, Long, String, and Array columns are supported.
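The ID, Values, and Metadata settings above together describe one record per input row. A minimal sketch of that mapping is shown below, assuming each row is a Python dict. The build_record and validate_metadata_value functions and the column names in the usage note are hypothetical, and converting the ID to a string is an assumption of this sketch rather than documented Gathr behavior.

```python
def validate_metadata_value(value):
    """Accept strings, numbers, booleans, or lists of strings."""
    if isinstance(value, (str, bool, int, float)):
        return value
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        return value
    raise TypeError(f"unsupported metadata value: {value!r}")

def build_record(row, id_column, values_column, metadata_mapping):
    """Build one record from a row.

    metadata_mapping maps each metadata Key to the input column chosen
    as its Mapping Value.
    """
    metadata = {
        key: validate_metadata_value(row[column])
        for key, column in metadata_mapping.items()
    }
    return {
        "id": str(row[id_column]),  # assumption: IDs are stored as strings
        "values": list(row[values_column]),
        "metadata": metadata,
    }
```

With a row such as {"doc_id": 7, "embedding": [0.1, 0.2], "title": "intro"} and the mapping {"title": "title"}, the sketch produces a record whose id is "7", whose values are the embedding, and whose metadata holds the title.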
WRITE CONFIGURATION
Configure how to write the data to Pinecone.
Save Mode
Save mode specifies how to handle any existing data in the target.
The options are:
Upsert: Contents will be inserted or updated depending on the ID column values.
Overwrite: Existing data in the target Index will be overwritten with the current data.
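The difference between the two save modes can be sketched with a plain dictionary standing in for the index, keyed by record ID. This is only a model of the semantics described above; how Gathr carries out an overwrite against Pinecone is not stated on this page, and the function names are hypothetical.

```python
def upsert(index, records):
    """Insert records with new IDs and update those with existing IDs.

    Records already in the index that are not in the incoming data are kept.
    """
    result = dict(index)
    for record in records:
        result[record["id"]] = record["values"]
    return result

def overwrite(index, records):
    """Discard the existing contents and keep only the incoming records."""
    return {record["id"]: record["values"] for record in records}

existing = {"a": [0.1, 0.2], "b": [0.3, 0.4]}
incoming = [
    {"id": "b", "values": [0.9, 0.9]},
    {"id": "c", "values": [0.5, 0.6]},
]
```

Here upsert(existing, incoming) keeps "a", updates "b", and adds "c", whereas overwrite(existing, incoming) leaves only "b" and "c".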
Write Mode
Sync: Updates are applied immediately to maintain real-time consistency.
Async: Updates occur in the background, optimizing system performance.
Batch Size
Batch Size determines the number of rows to insert per request. Enter a batch size between 1 and 100.
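As a rough illustration of how a batch size splits the rows into requests, consider the sketch below. The batches function is a hypothetical helper; only the 1 to 100 range comes from this page.

```python
def batches(rows, batch_size):
    """Yield consecutive slices of rows, each holding at most batch_size rows."""
    if not 1 <= batch_size <= 100:
        raise ValueError("batch size must be between 1 and 100")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]
```

With 250 rows and a batch size of 100, this yields three batches of 100, 100, and 50 rows, which corresponds to three write requests.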
Parallelism
The maximum number of simultaneous connections that can be established with Pinecone.
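One way to picture the effect of this setting is a worker pool whose size caps the number of requests in flight. The sketch below uses Python's standard thread pool; send_all and the send_batch callable are hypothetical, and this is not a description of Gathr's internal implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def send_all(batches, send_batch, parallelism):
    """Send every batch using at most `parallelism` concurrent workers.

    send_batch is a caller-supplied function that writes one batch and
    returns a result; results come back in the same order as the batches.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(send_batch, batches))
```

A higher value can shorten the total write time when there are many batches, at the cost of more simultaneous connections to the target.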
Output Mode
Output Mode specifies how to write the data.
The options are:
Append: Output Mode in which only the new rows in the streaming data are written to the sink.
Complete: Output Mode in which all the rows in the streaming data are written to the sink whenever there are updates.
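The two output modes differ only in which rows are handed to the sink on each update. A minimal sketch of that choice follows, assuming the full result and the newly arrived rows are both available as lists; rows_to_write is a hypothetical function used for illustration only.

```python
def rows_to_write(output_mode, all_rows, new_rows):
    """Return the rows written to the sink for one update.

    all_rows is the full result so far (including new_rows);
    new_rows holds only the rows that arrived since the last write.
    """
    if output_mode == "append":
        return list(new_rows)
    if output_mode == "complete":
        return list(all_rows)
    raise ValueError(f"unknown output mode: {output_mode!r}")
```

If three rows were written earlier and two more arrive, Append writes only the two new rows, while Complete writes all five again.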
Enable Trigger
Trigger defines how frequently a streaming query should be executed.
If trigger is enabled, provide the processing time.
Processing Time
The time interval or conditions set to determine when streaming data results are emitted or processed.
Add Configuration: Additional properties can be added using this option as key-value pairs.
Notes
Optionally, enter notes in the Notes tab and save the configuration.