Provide Schema for ETL Source

Designing ETL applications in Gathr involves configuring schema settings to accurately process and transform data.

Schema Type

Customize how your ETL app handles data on the Schema Type page. Select a method to design your application, specify data formats, and fine-tune options as needed.

Follow these schema configuration settings to ensure accurate interpretation of source data in your ETL applications.

Sample Data

Select the method to provide sample data for designing the application.

  • Fetch From Source: Fetch sample data from the data source. You can provide connection details in the next section.

  • Upload File: Upload a file with sample records.


Data Format

Choose the format of your source data.

Type of data formats supported: CSV, JSON, TEXT, XML, Fixed Length, Parquet, ORC, Avro, and Binary.

  • AVRO and Binary File data types are supported for S3 data source whereas Fixed Length is not supported.

  • MS Excel data type is supported for SFTP data source whereas XML and Fixed Length are not supported.

  • For SFTP source if MS Excel data type is selected, the following fields will appear:

    Specify Sheets: (If file type is selected as MS Excel) The option to specify how the data should be read from the workbook. Either choose the option Specific Sheets and enter the names of all the required sheets that needs to be read. Or, select the option All to read the entire excel file data.

    Schema Sheet: (If All the sheets in the excel file are to be read)
    Preferred sheet name should be entered from which the schema will get inferred.
    If left blank, the schema will be inferred from the first sheet of the Excel workbook.

    Sheet Name: (If Specific Sheets in the excel file are to be read)
    Preferred sheet(s) name should be entered from which the first sheet will be considered for detecting schema and all given sheets will be read during runtime.

    Once you choose either data type, you can edit the schema and then configure the component as per your requirement.


CSV

The data that is being fetched from the source is in CSV format. If CSV data format is selected, delimiter should be specified.

List of supported delimiters:

  • Tab

  • Comma (,)

  • Colon (:)

  • Semi Colon (;)

  • Pipe (|)

The default delimiter is comma (,).

Sample File Contains Headers?

Enable if header is included in source file.

Source File Contains Headers?

Specify whether the data source CSV file includes a header row.


JSON

Select JSON as your Type of Data. The data that is being fetched from the source is in JSON format. The source will read the data in the format it is available.


XML

System will fetch the incoming XML data from the source.

Select XML as your Type of Data and provide the XML XPath value.

XML XPath

Define the XML node path starting with “/” to fetch data. It guides the application to locate and extract the relevant data within the XML structure.

For example, in the path “/root/sales”, the data to be fetched is the value contained within the “sales” element, which is a direct child of the “root” element in the source XML structure.

Default value of XML XPath is ‘/’ which will parse the whole XML data.

For HTTPV2 Data Source:

  • If the application is designed using Source Data provide the row tag to extract data.

    A row tag in XML represents a single record in a dataset. such as in a library’s XML file or in a student database.

    Example 1: Simple XML with Row Tag :

    <library>
        <book>
            <title>Java Programming</title>
            <author>John Smith</author>
            <isbn>123456789</isbn>
        </book>
        <book>
            <title>Python Basics</title>
            <author>Jane Doe</author>
            <isbn>987654321</isbn>
        </book>
    </library>
    

    In this example, the element is the row tag. Each represents a single entry in the library dataset.

    In this example, the correct way to provide row tag in XML XPath field is: book

    Example 2: XML with Row Tag and Attributes:

    <class>
        <student id="1">
            <name>Alice</name>
            <grade>A</grade>
        </student>
        <student id="2">
            <name>Bob</name>
            <grade>B</grade>
        </student>
    </class>
    

    Here, the element serves as the row tag. Each entry represents a student in a class, and the id attribute uniquely identifies each student record.

    In this example, the correct way to provide row tag in XML XPath field is: student


Fixed Length

System will parse the incoming fixed length data.

If the field has data in continuation without delimiters, you can separate the data using Fixed Length data type. Field length value is a comma separated length of each field.

Example:

If there are 3 fields f1,f2,f3 and their max length is 4,10 & 6 respectively.Then the field length value for this data would be 4,10,6.

Parquet

When Parquet is selected as the type of data, the file is uploaded in the schema and the fields are editable.

TEXT

Text files can be uploaded to a source when the data available requires customized parsing. Text message parser is used to read data in any format from the source, which is not allowed when CSV, JSON, etc. parsers are selected in configuration.


Message with Multiple Records (For Kafka)

Enable if a single message form source contains multiple data records separated by a delimiter.

Record Delimiter (For Kafka)

Select or enter the separator for records in the message.


Record Metadata Needed? (For Kafka)

Activate the option to incorporate metadata into the dataset. Enabling this feature will result in the inclusion of metadata such as topic, partition, offset and timestamp, for each individual record.


Enable File Path Reading (For GCS)

Select this option to allow the connector to read and display file paths from GCS.


Enable Recursion (For GCS)

Select this option to include files and folders from subdirectories within the specified GCS path.


Maximum Records to Fetch

Maximum number of rows to be extracted from sample or source data for data sampling and schema detection. Auto schema detection will be done for this sampled count.

Its default value is 100.

Example: Upload Sample File option is used to provide sample data, and the file has 10000 rows. Now, if the Maximum Records to Fetch is set to 200, then only 200 rows will be fetched from the file according to the sampling method provided and the schema will be defined as per the 200 rows data.


Trigger Interval

Specify the interval (in seconds) to fetch data for streaming source.

This setting applies to both design and runtime.

The default value is 3 seconds, and the maximum trigger interval supported is 60 seconds.

Example: If it is 15 Secs, then system will first wait for 15 secs and will fetch all the rows available and create a dataset out of it.


Sampling Method

Choose a method to fetch sample records during the application design.

Following are the ways:

  • Top-N (Serially): Fetch the records serially. For example, if the maximum number of records to fetch is 100, the top 100 sample records will be fetched.

  • Random Sample: Fetch the records randomly. For example, if the maximum number of records to fetch is 100, 100 sample records will be fetched randomly.


Once the schema details are provided using any of the methods, the next tab is for data source configuration.

After configuring a Data Source, the schema is generated.

If Fetch from source is selected to provide schema, the next step will be data source configuration.

Whereas, with Upload Data File the next step is detect schema.

Top