Configure Apache Spark 2 on CDH 5

Provides steps for configuring Apache Spark 2 on CDH 5.

  1. Install and configure the CDS 2.x Powered by Apache Spark parcel and CSD. For instructions, see Installing CDS 2.x Powered by Apache Spark.
  2. To use Spark 2, each user must have their own home directory in HDFS (/user/<username>). If you sign in to Hue first, these directories are created for you automatically. Alternatively, a cluster administrator can create them:
    hdfs dfs -mkdir /user/<username>
    hdfs dfs -chown <username>:<username> /user/<username>

    If you are using CDS 2.3 release 2 (or higher), review the associated known issues in CDS Powered By Apache Spark.

  3. Use Cloudera Manager to add gateway hosts to your CDH cluster.
    1. Create a new host template that includes gateway roles for HDFS, YARN, and Spark 2.
      (Required for CDH 6 and CDP Data Center) If you want to run workloads on dataframe-based tables, such as tables from PySpark, sparklyr, SparkSQL, or Scala, you must also add the Hive gateway role to the template.
    2. Use the instructions at Adding a Host to the Cluster to add gateway hosts to the cluster. Apply the template created in the previous step to these gateway hosts. If your cluster is kerberized, confirm that the krb5.conf file on your gateway hosts is correct.
  4. Test Spark 2 integration on the gateway hosts.
    1. SSH to a gateway host.
    2. If your cluster is kerberized, run kinit to authenticate to the CDH cluster’s Kerberos Key Distribution Center. The Kerberos ticket you create is not visible to Cloudera Data Science Workbench users.
    3. Submit a test job to Spark by executing the following command:
      CDH 5
      spark2-submit --class org.apache.spark.examples.SparkPi --master yarn \
      --deploy-mode client /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples*.jar 100
      CDH 6 and CDP Data Center 7
      spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
      --deploy-mode client /opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples*.jar 100
    4. View the status of the job in the CLI output or in the Spark web UI to confirm that the host you want to use for the Cloudera Data Science Workbench master functions properly as a Spark gateway.
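
The command-line parts of steps 2 and 4 can be sketched as a small dry-run script. This is only an illustration under stated assumptions: the function names print_home_dir_cmds and print_sparkpi_cmd, the platform labels cdh5 and cdh6, and the user names alice and bob are hypothetical and not part of any Cloudera tooling. The script only prints the commands from this topic so that they can be reviewed; it does not run anything against the cluster.

```shell
#!/bin/sh
# Dry-run sketch: every function only prints commands; nothing touches the cluster.

# Step 2: print the HDFS home directory commands for each user name given.
# (Hypothetical helper; the commands themselves are the ones shown in step 2.)
print_home_dir_cmds() {
    for user in "$@"; do
        printf 'hdfs dfs -mkdir /user/%s\n' "$user"
        printf 'hdfs dfs -chown %s:%s /user/%s\n' "$user" "$user" "$user"
    done
}

# Step 4: print the SparkPi test command for a platform label.
# The labels cdh5 and cdh6 are assumptions made for this sketch.
print_sparkpi_cmd() {
    case "$1" in
        cdh5) bin=spark2-submit; jars=/opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars ;;
        cdh6) bin=spark-submit;  jars=/opt/cloudera/parcels/CDH/lib/spark/examples/jars ;;
        *)    echo "usage: print_sparkpi_cmd cdh5|cdh6" >&2; return 1 ;;
    esac
    printf '%s --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client %s/spark-examples*.jar 100\n' "$bin" "$jars"
}

# Example: review the commands for two placeholder users and a CDH 5 gateway host.
print_home_dir_cmds alice bob
print_sparkpi_cmd cdh5
```

Printing rather than executing keeps the sketch safe to try on any machine. Once the output looks right, the home directory commands would need to be run by a user with HDFS superuser privileges, and the SparkPi command on a gateway host (after kinit, if the cluster is kerberized).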