Cache Processor
Configuring the Cache for Spark pipelines.
To add the processor to your pipeline, drag it to the canvas and right-click it to configure.
Field | Description |
---|---|
Storage Level | Select the storage level for the cached RDD. The following options are available in the drop-down list (see the sketch after this table for the corresponding Spark storage levels): Memory Only: Stores the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions are not cached and are recomputed on the fly each time they are needed (default). Memory Only_2: Same as Memory Only, but each partition is replicated on two cluster nodes. Memory Only SER: Stores the RDD as serialized Java objects (one byte array per partition) for space efficiency. Memory Only SER_2: Same as Memory Only SER, but each partition is replicated on two cluster nodes. Disk Only: Stores the RDD partitions only on disk. Disk Only_2: Same as Disk Only, but each partition is replicated on two cluster nodes. Memory and Disk: Stores the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, the partitions that don't fit are stored on disk and read from there when needed. Memory and Disk_2: Same as Memory and Disk, but each partition is replicated on two cluster nodes. Memory and Disk SER: Similar to Memory Only SER, but partitions that don't fit in memory are spilled to disk instead of being recomputed each time they are needed. Memory and Disk SER_2: Same as Memory and Disk SER, but each partition is replicated on two cluster nodes. |
Refresh Cache | Option to enable the cache refresh operation. |
Refresh Interval | The time interval after which the cached data is refreshed. |
ADD CONFIGURATION | Additional properties can be added using the +ADD CONFIGURATION button. |
Environment Params | |
ADD PARAMS | Additional environment parameters can be added as required using the +ADD PARAMS button. |
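
The storage levels above correspond to Spark's `StorageLevel` constants. As a point of reference, the following is a minimal, hypothetical sketch of what caching a dataset with one of these levels looks like in plain Spark code (a standalone job, not Gathr pipeline configuration); the object name and sample data are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheStorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-storage-level-sketch")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Hypothetical input standing in for the records flowing into the processor.
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

    // "Memory and Disk SER": kept serialized in memory, spilling to disk
    // when memory is full instead of recomputing.
    df.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // The first action materializes the cache; later actions reuse it.
    println(df.count())
    println(df.filter($"id" > 1).count())

    // Drop the cached partitions once they are no longer needed.
    df.unpersist()
    spark.stop()
  }
}
```

Other drop-down options map the same way, for example Memory Only to `StorageLevel.MEMORY_ONLY` and Disk Only_2 to `StorageLevel.DISK_ONLY_2`.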
Click Next to proceed.