Use Spark actions with a custom Python executable

Spark 2 supports both PySpark and Java Spark applications. Learn how to use a custom Python executable in a given Spark action.

For PySpark, Spark 2 lets you designate a custom Python executable for your Spark application with the spark.pyspark.python Spark configuration argument. For more details, see the Spark 2.4 documentation. Consequently, if you include the spark.pyspark.python configuration argument in your Oozie Spark action, the Python executable you specify is used when the Spark action is executed through Oozie.
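For instance, the configuration argument can be passed in the <spark-opts> element of a Spark action. The following fragment is only a sketch; the Python path is an example and must point to an executable that exists on your cluster nodes:
<spark xmlns="uri:oozie:spark-action:1.0">
    ...
    <spark-opts>--conf spark.pyspark.python=/opt/python37-for-oozie/bin/python3</spark-opts>
    ...
</spark>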

To simplify the use of a custom Python executable with Oozie's Spark action, you can use the oozie.service.SparkConfigurationService.spark.pyspark.python property. This property works like Spark's spark.pyspark.python conf argument, allowing you to specify a custom Python executable. Oozie then passes this executable to the underlying Spark application executed through Oozie.

You can specify the oozie.service.SparkConfigurationService.spark.pyspark.python property in different ways.

Setting Spark actions with a custom Python executable globally

You can set it globally in Cloudera Manager through a safety-valve. To do that, perform the following steps:
  1. Navigate to Oozie's configuration page in Cloudera Manager.

  2. Search for Oozie Server Advanced Configuration Snippet (Safety Valve) for oozie-site.xml.

  3. Add a new property named oozie.service.SparkConfigurationService.spark.pyspark.python.

  4. Specify its value to point to your custom Python executable.

    For example, if you installed Python 3.7 to /opt/python37-for-oozie, specify the value as /opt/python37-for-oozie/bin/python3 (the resulting entry is shown after these steps).


  5. Save the modifications.

  6. Allow Cloudera Manager some time to recognize the changes.

  7. Restart the Oozie service.
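If you edit the safety-valve in its XML view, the resulting entry might look like the following, using the example path from step 4:
<property>
    <name>oozie.service.SparkConfigurationService.spark.pyspark.python</name>
    <value>/opt/python37-for-oozie/bin/python3</value>
</property>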

Setting Spark actions with a custom Python executable per workflow

You can also specify a custom Python executable for a given workflow using the same property:
<workflow-app name="spark_workflow" xmlns="uri:oozie:workflow:1.0">
    <global>
        <configuration>
            <property>
                <name>oozie.service.SparkConfigurationService.spark.pyspark.python</name>
                <value>/opt/python37-for-oozie/bin/python3</value>
            </property>
        </configuration>
    </global>
    <start to="spark_action"/>
    <action name="spark_action">
        ...

You can achieve the same workflow-level Python executable by setting the property in your job.properties file.
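For example, a minimal job.properties entry, assuming the same example Python installation, could look like this:
oozie.service.SparkConfigurationService.spark.pyspark.python=/opt/python37-for-oozie/bin/python3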

Setting Spark actions with a custom Python executable for a given Spark action only

Finally, you can change the Python executable for a given Spark action only. For example:
<workflow-app name="spark_workflow" xmlns="uri:oozie:workflow:1.0">
    <start to="spark_action"/>
    <action name="spark_action">
        <spark xmlns="uri:oozie:spark-action:1.0">
            <resource-manager>${resourceManager}</resource-manager>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.service.SparkConfigurationService.spark.pyspark.python</name>
                    <value>/opt/python37-for-oozie/bin/python3</value>
                </property>
            </configuration>
            ...

The following order of precedence applies to this configuration (an illustrative sketch follows the list):

  1. If you have already set spark.pyspark.python in the <spark-opts> tag of your action definition, Oozie does not override it.

  2. If you have configured the property at the action level, it takes precedence over all other settings, and the remaining configurations are disregarded.

  3. If you have configured the property in the global configuration of the workflow, the value from there is used.

  4. If the setting is not available in either of the previous locations, the value configured in your job.properties file is used.

  5. Lastly, the global safety-valve setting comes into effect.
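To illustrate rules 2 and 3, the following sketch combines a workflow-global value with an action-level override; the /usr/bin/python3 path is hypothetical and only serves to show which value wins:
<workflow-app name="spark_workflow" xmlns="uri:oozie:workflow:1.0">
    <global>
        <configuration>
            <property>
                <name>oozie.service.SparkConfigurationService.spark.pyspark.python</name>
                <value>/opt/python37-for-oozie/bin/python3</value>
            </property>
        </configuration>
    </global>
    <start to="spark_action"/>
    <action name="spark_action">
        <spark xmlns="uri:oozie:spark-action:1.0">
            ...
            <configuration>
                <property>
                    <name>oozie.service.SparkConfigurationService.spark.pyspark.python</name>
                    <!-- hypothetical action-level path; this value is used because the action level takes precedence -->
                    <value>/usr/bin/python3</value>
                </property>
            </configuration>
            ...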

You can also tell Oozie not to use a custom Python executable in a given Spark action, but to fall back to the default one configured for Spark 2, even if you have already configured a custom executable at a lower level of precedence. For instance, if oozie.service.SparkConfigurationService.spark.pyspark.python is set as a safety-valve to /opt/python37-for-oozie/bin/python3, but in a workflow or in a specific action you want to use the default Python executable configured for Spark 2, set the value of the property to default. For example:
<workflow-app name="spark_workflow" xmlns="uri:oozie:workflow:1.0">
    <global>
        <configuration>
            <property>
                <name>oozie.service.SparkConfigurationService.spark.pyspark.python</name>
                <value>default</value>
            </property>
        </configuration>
    </global>
    <start to="spark_action"/>
    <action name="spark_action">
        ...
In this scenario, the value set in Cloudera Manager in Oozie's safety-valve configuration is ignored, and the spark.pyspark.python Spark conf is not set at all.