Python Processor
The Python processor allows you to perform the following operations:
Write custom Python Spark code for defining transformations on Spark DataFrames.
Write custom Python code for processing input records at runtime.
Gathr supports Python 3. Multiple-version support lets a Python processor run on different Python versions.
Configuring Python processor for Spark pipelines
To add a Python processor to your pipeline, drag the processor onto the canvas and right-click it to configure. To choose which Python 3 version the processor runs on, use the Python Version property.
Field | Description |
---|---|
Input Source | Input type to the custom Python function that will be called by Gathr. The input provided to the custom python function will be a DataFrame. The custom function is expected to return a DataFrame after any processing. |
Input Type | There are two ways to configure the Python processor. Inline: lets you write Python code in a text editor. If selected, an additional field, Python Code, appears. Upload: lets you upload one or more Python scripts (.py files) and Python packages (.egg/.zip files). You must specify the module name (which should be part of the uploaded files or package) and the method name that the Python processor will call. When you select Upload, the UPLOAD FILE option appears on the screen; browse and select the files to be used in the Python processor. An additional field, Import Module, also appears when Upload is selected. |
Python Code | Lets you write custom Python code directly in the text editor. |
Import Module | Specify the name of the module that contains the function to be called by the Python processor. The drop-down list shows the uploaded files, limited to .py files. You can also type a module name if it does not appear in the drop-down list. |
Function Name | Name of the Python function defined in the inline Python code or in the uploaded script. |
Python Version | Select a Python 3 version. The code you write or the script you upload must be compatible with Python 3. |
Add Configuration | Lets you add additional properties. |
To pass configuration parameters to the Python processor:
You can provide configuration parameters to the Python processor as key-value pairs. These parameters are passed as a dictionary, in the second argument, to the function named in the Function Name field. That function therefore takes two arguments: (df, config_map)
The first argument is the DataFrame and the second argument is a dictionary that contains the configuration parameters as key-value pairs.
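The two-argument contract described above can be sketched as follows. This is a minimal illustration rather than a sample shipped with Gathr: the function name filter_by_amount, the column amount, and the configuration key min_amount are all hypothetical. It also assumes that configuration values may arrive as strings, so the value is converted explicitly.

```python
def filter_by_amount(df, config_map):
    """Keep rows whose 'amount' column is at least the configured minimum.

    df         -- the input Spark DataFrame handed over by the processor
    config_map -- dict of configuration parameters (key-value pairs)
    """
    # 'min_amount' is a hypothetical key; fall back to 0 when it is not set.
    # float() is applied in case the value arrives as a string.
    min_amount = float(config_map.get("min_amount", 0))

    # DataFrame.filter accepts a SQL expression string and returns a new DataFrame.
    return df.filter("amount >= {}".format(min_amount))
```

With this sketch, Function Name would be set to filter_by_amount, and an entry with key min_amount would be added under Add Configuration.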
For example, the HDFS path of a trained model can be provided in a configuration parameter with the key ‘model_path’ and then used in predictionFunction.
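A hedged sketch of what such a predictionFunction might look like is given here. Only the key model_path and the function name predictionFunction come from the text; the choice of Spark ML's PipelineModel as the saved model format is an assumption, and a model saved with a different class would need that class's own loader.

```python
def predictionFunction(df, config_map):
    """Load a trained model from the configured path and score the input DataFrame."""
    # Imported inside the function so Spark ML is only touched
    # when the function is actually called.
    from pyspark.ml import PipelineModel

    # 'model_path' is the configuration key named in the text; its value is
    # expected to be the HDFS location of the saved model.
    model_path = config_map["model_path"]

    # Assumption: the model was saved as a Spark ML PipelineModel.
    model = PipelineModel.load(model_path)

    # transform() returns a new DataFrame that includes the model's output columns.
    return model.transform(df)
```

Reading the path from config_map rather than hard-coding it means the same script can be pointed at a different model by changing only the configuration entry.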
If the Python code contains an error or raises an exception, the error or exception is reported.
Example:
A simple program is written in the text area of the Python processor. The variable model is used but never defined in the program, so an exception with the message ‘global name model is not defined’ is displayed on the screen.
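The failure described above can be reproduced outside Gathr with plain Python. The sketch below uses hypothetical names (score, error_message) and catches the exception only to show its text. Note that a Python 3 interpreter words the message as name 'model' is not defined; the "global name" form is the Python 2 wording, so the exact text shown on screen may differ.

```python
def score(df, config_map):
    # 'model' is never assigned anywhere, so looking it up raises NameError.
    return model.transform(df)


def error_message(func, df, config_map):
    """Call func and return the NameError text, or None if the call succeeds."""
    try:
        func(df, config_map)
    except NameError as exc:
        return str(exc)
    return None


print(error_message(score, None, {}))
```

Defining or loading model inside the function before using it, for instance from a path passed in config_map, removes the error.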
Click on the NEXT button after specifying the values for all the fields.
Enter the notes in the space provided.
Click on the SAVE button after entering all the details.