Use Spark 3 actions with a custom Python executable

Learn how to use a custom Python executable in a given Spark 3 action.

Like Spark 2, Spark 3 lets you define a custom Python executable for spark3-submit through the spark.pyspark.python configuration property. For more details, see the latest Spark 3 documentation. Consequently, if you include spark.pyspark.python in the Spark configuration of your Oozie Spark 3 action, the Python executable you specify is used when Oozie runs that action.
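As a minimal sketch, passing the property straight to Spark from an action might look like the following. It assumes the spark3 action accepts the same <master>, <name>, <jar>, and <spark-opts> elements as the regular Oozie Spark action. The application name, script name, master value, and the "end" and "fail" node names are hypothetical placeholders.

```xml
<!-- Sketch only: application name, script name, master value, and transition targets are assumptions -->
<action name="spark_action">
    <spark3 xmlns="uri:oozie:spark3-action:1.0">
        <resource-manager>${resourceManager}</resource-manager>
        <name-node>${nameNode}</name-node>
        <master>yarn</master>
        <name>pyspark_example</name>
        <jar>example_job.py</jar>
        <!-- The Python executable is handed directly to Spark as a conf argument -->
        <spark-opts>--conf spark.pyspark.python=/opt/python37-for-oozie/bin/python3</spark-opts>
    </spark3>
    <ok to="end"/>
    <error to="fail"/>
</action>
```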

To simplify the use of a custom Python executable with Oozie's Spark 3 action, you can use the oozie.service.Spark3ConfigurationService.spark.pyspark.python property. This property works like Spark 3's spark.pyspark.python configuration property: you specify a custom Python executable, and Oozie passes it to the underlying Spark 3 application that it runs.

You can set this property at several levels: globally in Cloudera Manager, for a whole workflow, or for a single Spark 3 action.

Setting Spark 3 actions with a custom Python executable globally

To set the property globally in Cloudera Manager, perform the following steps:
  1. Navigate to Oozie's configuration page in Cloudera Manager.
  2. Search for Python Executable for Spark3 Actions.
  3. Set its value to the path of your custom Python executable.
    For example, if you installed Python 3.7 to /opt/python37-for-oozie, then specify the value as /opt/python37-for-oozie/bin/python3.
  4. Save the modifications.
  5. Allow Cloudera Manager some time to recognize the changes.
  6. Redeploy Oozie.

Setting Spark 3 actions with a custom Python executable per workflow

You can also specify a custom Python executable for a given workflow using the same property:
<workflow-app name="spark_workflow" xmlns="uri:oozie:workflow:1.0">
    <global>
        <configuration>
            <property>
                <name>oozie.service.Spark3ConfigurationService.spark.pyspark.python</name>
                <value>/opt/python37-for-oozie/bin/python3</value>
            </property>
        </configuration>
    </global>
    <start to="spark_action"/>
    <action name="spark_action">
        ...

You can achieve the same workflow-level setting by specifying the property in your job.properties file.

Setting Spark 3 actions with a custom Python executable for a given Spark action only

Finally, you can change the Python executable for a single Spark 3 action only. For example:
<workflow-app name="spark_workflow" xmlns="uri:oozie:workflow:1.0">
    <start to="spark_action"/>
    <action name="spark_action">
        <spark3 xmlns="uri:oozie:spark3-action:1.0">
            <resource-manager>${resourceManager}</resource-manager>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.service.Spark3ConfigurationService.spark.pyspark.python</name>
                    <value>/opt/python37-for-oozie/bin/python3</value>
                </property>
            </configuration>
            ...
The following order of precedence applies to this configuration, from highest to lowest:
  1. If you have already set spark.pyspark.python in the <spark-opts> tag of your action definition, Oozie does not override it.
  2. Otherwise, if you have configured the property at the action level, that value is used and all lower-precedence settings are disregarded.
  3. If you have configured the property in the global configuration of the workflow, the value from there is used.
  4. If the property is not set in any of the previous locations, the value configured in your job.properties file is used.
  5. If none of the above applies, the global setting in Cloudera Manager takes effect.
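The precedence rules can be illustrated with a minimal sketch in which the global configuration of the workflow and a single action both set the property. The second interpreter path, /opt/python39-for-oozie/bin/python3, and the "end" and "fail" node names are hypothetical; the remaining elements are left out.

```xml
<workflow-app name="spark_workflow" xmlns="uri:oozie:workflow:1.0">
    <global>
        <configuration>
            <property>
                <!-- Workflow-wide value: used by Spark 3 actions that set nothing themselves -->
                <name>oozie.service.Spark3ConfigurationService.spark.pyspark.python</name>
                <value>/opt/python37-for-oozie/bin/python3</value>
            </property>
        </configuration>
    </global>
    <start to="spark_action"/>
    <action name="spark_action">
        <spark3 xmlns="uri:oozie:spark3-action:1.0">
            <resource-manager>${resourceManager}</resource-manager>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <!-- Action-level value (hypothetical path): takes precedence over the global value for this action -->
                    <name>oozie.service.Spark3ConfigurationService.spark.pyspark.python</name>
                    <value>/opt/python39-for-oozie/bin/python3</value>
                </property>
            </configuration>
            <!-- remaining action elements omitted -->
        </spark3>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <!-- further nodes omitted -->
</workflow-app>
```

Under the rules above, spark_action would run with the action-level executable, unless its <spark-opts> tag already sets spark.pyspark.python.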
You can also tell Oozie not to use a custom Python executable in a given Spark 3 action and to fall back to the default one configured for Spark 3, even if a custom executable is already configured at a lower level of precedence. For instance, if the Python Executable for Spark3 Actions property is set in Cloudera Manager to /opt/python37-for-oozie/bin/python3, but a workflow or a specific action should use the default Python executable configured for Spark 3, set the value of the property to default. For example:
<workflow-app name="spark_workflow" xmlns="uri:oozie:workflow:1.0">
    <global>
        <configuration>
            <property>
                <name>oozie.service.Spark3ConfigurationService.spark.pyspark.python</name>
                <value>default</value>
            </property>
        </configuration>
    </global>
    <start to="spark_action"/>
    <action name="spark_action">
        ...
In this scenario, the value set in Cloudera Manager is ignored, and the spark.pyspark.python Spark 3 conf is not set at all.
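The default value can also be applied to a single action instead of the whole workflow. The following is a minimal sketch that reuses the action layout from the per-action example earlier in this topic; the elements not relevant to the property are left out.

```xml
<action name="spark_action">
    <spark3 xmlns="uri:oozie:spark3-action:1.0">
        <resource-manager>${resourceManager}</resource-manager>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>oozie.service.Spark3ConfigurationService.spark.pyspark.python</name>
                <!-- "default" asks Oozie not to pass a custom Python executable for this action -->
                <value>default</value>
            </property>
        </configuration>
        <!-- remaining action elements omitted -->
    </spark3>
    <!-- transitions omitted -->
</action>
```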