Repartition Processor
The Repartition processor changes how pipeline data is partitioned by dividing large datasets into multiple parts.
Use the Repartition processor when you want to increase or decrease the parallelism in an executor. The degree of parallelism maps to the number of tasks running in an executor. The processor creates more or fewer partitions, balances the data across them, and shuffles data over the network to do so.
For example, you can use a single partition if you want to write data to a single file. Use multiple partitions to write data to multiple files.
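
The relationship between the partition count and the number of output files can be illustrated with a small conceptual sketch in plain Python. This is not Gathr's implementation: the function name `repartition_by_number` and the round-robin distribution are assumptions made for illustration only.

```python
from typing import Dict, List, Sequence

Record = Dict[str, object]


def repartition_by_number(records: Sequence[Record], num_partitions: int) -> List[List[Record]]:
    """Spread records round-robin across num_partitions partitions (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        # Each record goes to the next partition in turn, which keeps sizes balanced.
        partitions[index % num_partitions].append(record)
    return partitions


records = [{"id": i} for i in range(10)]

# One partition: a writer that emits one file per partition would produce a single file.
single = repartition_by_number(records, 1)

# Four partitions: the same writer would produce four files, written in parallel.
multiple = repartition_by_number(records, 4)
```

In this sketch every record has to be reassigned to a new partition, which is the reason a real repartition involves shuffling data over the network.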
Repartition Processor Configuration
To add a Repartition processor to your pipeline, drag the processor onto the pipeline canvas and right-click on it to configure.
Field | Description |
---|---|
Partition By | Select the repartition type: Number, Column, or Expression. |
Number | When Number is selected, enter the number of executors (threads) of the processor or channel. |
Column | When Column is selected, configure the following. Partition Columns: select the columns/fields on which the data is to be partitioned. Partition Number: enter the number of executors (threads) of the processor or channel. |
Expression | When Expression is selected, configure the following. Partition Expression: enter the expression according to which the data is to be repartitioned. Partition Number: enter the number of executors (threads) of the processor or channel. |
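
The difference between the Column and Expression options can be sketched conceptually in plain Python. This is a minimal illustration under stated assumptions, not a description of how Gathr assigns records internally: the helpers `stable_hash`, `repartition_by_key`, and `repartition_by_columns`, the hash-based assignment, and the sample `orders` data are all hypothetical.

```python
import zlib
from typing import Callable, Dict, List, Sequence

Record = Dict[str, object]


def stable_hash(key: object) -> int:
    """Deterministic hash (CRC32 of the key's repr), so results do not vary between runs."""
    return zlib.crc32(repr(key).encode("utf-8"))


def repartition_by_key(
    records: Sequence[Record],
    key_fn: Callable[[Record], object],
    num_partitions: int,
) -> List[List[Record]]:
    """Send each record to the partition chosen by hashing key_fn(record) (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[stable_hash(key_fn(record)) % num_partitions].append(record)
    return partitions


def repartition_by_columns(
    records: Sequence[Record],
    columns: Sequence[str],
    num_partitions: int,
) -> List[List[Record]]:
    """Column option: the key is the tuple of the selected column values."""
    return repartition_by_key(records, lambda r: tuple(r[c] for c in columns), num_partitions)


orders = [
    {"country": "IN", "amount": 120},
    {"country": "US", "amount": 80},
    {"country": "IN", "amount": 45},
    {"country": "DE", "amount": 300},
    {"country": "US", "amount": 10},
]

# Column option: records with the same country always share a partition.
by_column = repartition_by_columns(orders, ["country"], 3)

# Expression option: the key is computed per record, here "amount >= 100".
by_expression = repartition_by_key(orders, lambda r: r["amount"] >= 100, 2)
```

In both cases the partition number only sets how many partitions exist. Records that produce the same key always land together, so a skewed column or expression can still leave some partitions much larger than others.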
Enter the number of partitions in the Parallelism field.
Click on the NEXT button. Enter the notes in the space provided.
Click on the Done button after entering all the details.