Running a Python-based job

You can run a Python script as a Spark job by submitting it with the spark-submit or pyspark command.

In this task, you execute the following Python script that creates a table and runs a few queries:
# spark-demo.py
from pyspark import SparkContext
from pyspark.sql import HiveContext

# Create a local SparkContext and a Hive-enabled SQL context.
sc = SparkContext("local", "first app")
hive_context = HiveContext(sc)

# Recreate the copy of default.sales_spark_2, then run a few queries against it.
hive_context.sql("DROP TABLE IF EXISTS default.sales_spark_2_copy")
hive_context.sql("CREATE TABLE IF NOT EXISTS default.sales_spark_2_copy AS SELECT * FROM default.sales_spark_2")
hive_context.sql("show tables").show()
hive_context.sql("select * from default.sales_spark_2_copy limit 10").show()
hive_context.sql("select count(*) from default.sales_spark_2_copy").show()
Before you begin, install Python 2.7, or Python 3.5 or higher, on the Spark gateway node.
  1. Log into a Spark gateway node.
  2. If your cluster is Kerberized, ensure that the required security token is authorized to compile and execute the workload.
  3. Execute the script using the spark-submit command. The spark-submit options must come before the application file; anything placed after the script is passed to it as application arguments.
    spark-submit --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 spark-demo.py
  4. Go to the Spark History Server web UI at http://<spark_history_server>:18088 and check the status and performance of the workload. A sketch for reading the same information from the History Server REST API follows this list.
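Besides the web UI, the Spark History Server exposes a monitoring REST API under /api/v1. The snippet below is an illustrative helper only, not part of the procedure; the host name is the same placeholder as above, and it assumes the requests package is installed.
# check_history.py -- hypothetical helper for querying the History Server REST API
import requests

HISTORY_SERVER = "http://<spark_history_server>:18088"

# /api/v1/applications lists the applications known to the History Server.
response = requests.get(HISTORY_SERVER + "/api/v1/applications", timeout=10)
response.raise_for_status()

for app in response.json():
    # Each application record carries an id, a name, and one entry per attempt.
    for attempt in app.get("attempts", []):
        print(app["id"], app["name"], "completed:", attempt.get("completed"))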

Using pyspark

Run your application with pyspark or with the Python interpreter.
Before you begin, install PySpark using pip.
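To confirm that the pip-installed package is visible to the interpreter you plan to use, a quick check such as the following can help (illustrative only):
# Sanity check: the pip-installed PySpark package should import cleanly.
import pyspark
print(pyspark.__version__)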
  1. Log into a Spark gateway node.
  2. If your cluster is Kerberized, ensure that the required security token is authorized to compile and execute the workload.
  3. Ensure that the user has access to the workload script (Python or shell script).
  4. Execute the script using pyspark.
    pyspark spark-demo.py --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1                  
  5. Execute the script using the Python interpreter (see the configuration sketch after this list).
    python spark-demo.py
  6. Go to the Spark History Server web UI at http://<spark_history_server>:18088 and check the status and performance of the workload.
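When the script is launched with the plain Python interpreter (step 5), there is no spark-submit command line to carry options such as --num-executors, so equivalent settings have to come from spark-defaults.conf or from a SparkConf built in the script. The following is a minimal sketch under that assumption; the property names are standard Spark configuration keys, and the values mirror the command line used earlier.
# spark-demo-conf.py -- hypothetical variant configured for `python spark-demo.py`
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Programmatic equivalents of --num-executors, --executor-memory, and --executor-cores.
# Driver memory generally must be set before the JVM starts (for example in
# spark-defaults.conf), so it is not set here.
conf = (SparkConf()
        .setAppName("first app")
        .set("spark.executor.instances", "3")
        .set("spark.executor.memory", "512m")
        .set("spark.executor.cores", "1"))

sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
hive_context.sql("select count(*) from default.sales_spark_2_copy").show()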