Cloudera Data Science Workbench hosts must be added to your CDH cluster as gateway
hosts, with gateway roles properly configured.
To configure gateway hosts:
-
If you have not already done so and plan to use PySpark, install either the Anaconda parcel or Python (versions 2.7.11
and 3.6.1) on your CDH cluster. For more information, see Python Supported Versions.
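As a quick sanity check, a small shell helper such as the following (a hypothetical sketch, not part of the product) can report whether the expected interpreters are on the PATH of a gateway host:

```shell
# check_python: report whether a given Python interpreter is available.
# Sketch only; the docs call for Python 2.7.11 and 3.6.1 specifically,
# so also compare the reported version against those.
check_python() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found: $("$1" --version 2>&1)"
  else
    echo "$1 MISSING"
  fi
}

check_python python2.7
check_python python3.6
```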
-
Configure Apache Spark on your gateway hosts.
-
(CDH 5 only) Install and configure the CDS 2.x Powered by Apache Spark
parcel and CSD. For instructions, see Installing CDS 2.x Powered by Apache
Spark.
-
(Required for CDH 5 and CDH 6) To be able to use Spark 2, each user
must have their own home directory (/user/<username>) in HDFS. If you sign in to
Hue first, these directories are created for you automatically.
Alternatively, cluster administrators can create them:
hdfs dfs -mkdir /user/<username>
hdfs dfs -chown <username>:<username> /user/<username>
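When provisioning many users at once, the two commands above can be wrapped in a small helper and run in a loop (a sketch under the assumption that you run it as the HDFS superuser; the user list is illustrative):

```shell
# make_hdfs_home: create and chown an HDFS home directory for one user.
# Hypothetical helper; requires the `hdfs` CLI on a configured gateway host.
make_hdfs_home() {
  hdfs dfs -mkdir -p "/user/$1"
  hdfs dfs -chown "$1:$1" "/user/$1"
}

# Example (assumed user list):
# for u in alice bob; do make_hdfs_home "$u"; done
```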
-
Use Cloudera Manager to add gateway hosts to your CDH cluster.
-
Create a new host template that includes gateway
roles for HDFS, YARN, and Spark 2.
(Required for CDH 6) If you want to run workloads on dataframe-based
tables, such as tables from PySpark, sparklyr, SparkSQL, or Scala, you must
also add the Hive gateway role to the template.
-
Use the instructions at Adding a Host to the Cluster to add
gateway hosts to the cluster. Apply the template created in the previous step
to these gateway hosts. If your cluster is kerberized, confirm that the krb5.conf file on your gateway hosts
is correct.
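One way to spot-check krb5.conf on each gateway host is to confirm it names the realm you expect. The helper below is a sketch of such a check (the function name and default path are assumptions, not product tooling):

```shell
# check_krb5: print the default_realm line from a krb5.conf so it can be
# compared against the cluster's realm. Path defaults to /etc/krb5.conf.
check_krb5() {
  grep -E '^[[:space:]]*default_realm' "${1:-/etc/krb5.conf}"
}

# Example:
# check_krb5            # inspect /etc/krb5.conf
# check_krb5 /tmp/krb5.conf
```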
-
Test Spark 2 integration on the gateway hosts.
-
SSH to a gateway host.
-
If your cluster is kerberized, run kinit to authenticate to
the CDH cluster's Kerberos Key Distribution Center. The Kerberos ticket you
create is not visible to Cloudera Data Science Workbench users.
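The authentication step can be scripted as shown below (the principal is a placeholder; klist simply confirms that a ticket was obtained):

```shell
# kerberos_login: authenticate as the given principal and show the
# resulting ticket cache. Hypothetical wrapper around the standard
# MIT Kerberos kinit and klist commands.
kerberos_login() {
  kinit "$1" && klist
}

# Example (substitute your own principal and realm):
# kerberos_login jdoe@EXAMPLE.COM
```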
-
Submit a test job to Spark by executing the following command:
CDH 5
spark2-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples*.jar 100
CDH 6
spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client /opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples*.jar 100
-
View the status of the job in the CLI output or in the Spark web UI to
confirm that the host you want to use for the Cloudera Data Science Workbench
master functions properly as a Spark gateway.
Sample CLI output:
19/02/15 09:37:39 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.1.0
19/02/15 09:37:39 INFO spark.SparkContext: Submitted application: Spark Pi
...
19/02/15 09:37:40 INFO util.Utils: Successfully started service 'sparkDriver' on port 37050.
...
19/02/15 09:38:06 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 18.659033 s
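If you want to script this smoke test across several gateway hosts, a small pass/fail wrapper like the following can help (a sketch; the function is not part of the product, and the spark-submit command it wraps is the one shown in the step above):

```shell
# smoke_test: run a command and report whether the Spark gateway check
# passed, based on the command's exit status. Hypothetical helper.
smoke_test() {
  if "$@"; then
    echo "Spark gateway OK"
  else
    echo "Spark gateway FAILED"
  fi
}

# Example (CDH 6 path; adjust for your parcel layout):
# smoke_test spark-submit --class org.apache.spark.examples.SparkPi \
#   --master yarn --deploy-mode client \
#   /opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples*.jar 100
```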