PostgreSQL Data Asset Source

Create a Data Asset Through PostgreSQL

To create a data asset through PostgreSQL Source, configure parameters as follows:

Connection Name

Connections are the service identifiers used to access data sources.

Select a connection name from the list if you have previously created and saved connection details for PostgreSQL.

Otherwise, create one as explained in the topic - JDBC Connection →


Schema Name

Select the name of the source schema for which you wish to view a list of tables.

The schema name serves as an organizational structure that groups related tables together within your PostgreSQL database.

Providing the schema name will enable you to access the tables contained within that specific schema.


Table Name

Select the source table to view its metadata.

You can access detailed information about the table’s structure, columns, data types, and other relevant attributes.

This metadata helps you understand and interact with the data effectively.
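Outside the UI, the same structural information is available from PostgreSQL's standard information_schema catalog. A minimal sketch (the schema and table names "public" and "orders" are placeholders, not values from this document):

```python
# Sketch: one way to inspect a table's structure directly in PostgreSQL.
# The schema and table names used below are illustrative placeholders.
def column_metadata_query(schema: str, table: str) -> str:
    """Build a query against the standard information_schema catalog."""
    return (
        "SELECT column_name, data_type, is_nullable "
        "FROM information_schema.columns "
        f"WHERE table_schema = '{schema}' AND table_name = '{table}' "
        "ORDER BY ordinal_position"
    )

print(column_metadata_query("public", "orders"))
```

Running this query in PostgreSQL returns one row per column, with its name, data type, and nullability.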


Max No of Rows

Specify the maximum number of sample records you wish to keep in the data asset.

This feature helps in obtaining a manageable subset of data for testing and design purposes, facilitating efficient application development while optimizing resource usage.


Sampling Method

This option offers flexibility in how you retrieve sample data.

The following methods are available:

  • Top N: Retrieve the specified number of initial records from the data source based on the specified maximum number of rows. This is particularly useful when you want to analyze or design with a specific set of initial records.

  • Random Sample: Fetch a random subset of records from your sample data, ensuring a diverse representation. This approach is valuable when you require a more comprehensive assessment of your data’s characteristics.
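The two strategies above correspond to simple query shapes in plain PostgreSQL. A hedged sketch (the table name and row limit are illustrative; the platform generates its own SQL internally):

```python
# Sketch: the two sampling strategies expressed as plain PostgreSQL queries.
def top_n_query(table: str, n: int) -> str:
    # Top N: take the first n rows the database returns.
    return f"SELECT * FROM {table} LIMIT {n}"

def random_sample_query(table: str, n: int) -> str:
    # Random sample: shuffle, then take n rows. ORDER BY random() scans the
    # whole table, so TABLESAMPLE may be preferable for very large tables.
    return f"SELECT * FROM {table} ORDER BY random() LIMIT {n}"

print(top_n_query("public.orders", 100))
print(random_sample_query("public.orders", 100))
```

Top N is cheap but biased toward physical storage order; the random sample costs a full scan but gives a more representative subset.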


More Configurations

Expand the More Configurations option to see the additional configuration parameters.

Query

A Hive-compatible SQL query to be executed in the component.

Design Time Query

Enable Query Partitioning: Enables parallel reading of data from the table. It is disabled by default; the table is partitioned only when this check box is selected.

If Enable Query Partitioning is selected, the following additional fields are displayed:

Type-in Partition Column: Select this option if the Partition Column list is empty or does not contain the required column.

Partition on Column: The column used to partition the data. It must be a numeric column, on which Spark performs partitioning to read data in parallel.

Data Type: If you typed in the partitioning column, specify the data type of that column here.

Autodetect Bounds: Select this option to auto-detect the partition boundaries.

If Autodetect Bounds is selected, the following additional fields are displayed:

Row count in Single Query: Enter the number of rows to be read in a single query.

Example: 10,000. This means that 10,000 records will be read in one partition.

Column has unique values: Select this option when the values in the partition column are unique in the table.

If Autodetect Bounds is disabled, update the following fields instead.

No. of Partitions: The number of parallel threads invoked to partition the table while reading the data.

Lower Bound: The lower-bound value of the partitioning column.

Upper Bound: The upper-bound value of the partitioning column.

Together, the two bounds determine the partition boundaries; the entire dataset is distributed into chunks according to these values.
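The bounds-based splitting described above can be sketched as follows. This mirrors the predicate-generation approach of Spark-style JDBC readers (rows below the lower bound or above the upper bound are still read; they simply fall into the first or last partition). The column and table names are illustrative, and the exact SQL the platform generates may differ:

```python
# Sketch: turn No. of Partitions, Lower Bound, and Upper Bound into
# per-partition WHERE clauses, one query per parallel reader (simplified).
def partition_predicates(column: str, lower: int, upper: int, num_partitions: int):
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo = lower + i * stride
        if i == 0:
            # First partition also catches values below the lower bound.
            preds.append(f"{column} < {lo + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition catches values above the upper bound.
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {lo + stride}")
    return preds

for p in partition_predicates("order_id", 0, 1000, 4):
    print(f"SELECT * FROM public.orders WHERE {p}")
```

Each predicate is issued as a separate query, so the bounds need not be the true min/max of the column; they only control how evenly the work is split.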

If Enable Query Partitioning is disabled, update the following field instead.

Fetch Size: The number of rows fetched per database round trip. The default value is 1000.
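Note that the fetch size does not limit the result set; it only controls how many rows cross the network per round trip, trading memory per trip against the number of trips. A small arithmetic sketch (the row counts are illustrative):

```python
import math

# Sketch: how many round trips a full read takes at a given fetch size.
def round_trips(total_rows: int, fetch_size: int = 1000) -> int:
    return math.ceil(total_rows / fetch_size)

# Reading 250,000 rows with the default fetch size of 1000:
print(round_trips(250_000))          # 250 round trips
print(round_trips(250_000, 10_000))  # larger fetch size -> fewer, bigger trips
```

A larger fetch size reduces network overhead at the cost of more client memory per batch.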
