Configure Gateway Hosts Using Cloudera Manager
Cloudera Data Science Workbench hosts must be added to your CDH cluster as gateway hosts, with gateway roles properly configured.
- If you have not already done so and plan to use PySpark, install either the Anaconda parcel or Python (version 2.7.11 or 3.6.1) on your CDH cluster. For more information, see Python Supported Versions.
Configure Apache Spark on your gateway hosts.
- (Required for CDH 6) To be able to use Spark 2, each user must have their own home directory in HDFS (/user/<username>). If you sign in to Hue first, these directories are created for you automatically. Alternatively, cluster administrators can create them:

  hdfs dfs -mkdir /user/<username>
  hdfs dfs -chown <username>:<username> /user/<username>

  If you are using CDS 2.3 release 2 (or higher), review the associated known issues here: CDS Powered By Apache Spark.
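If you need to provision home directories for many users at once, the two hdfs commands above can be wrapped in a small loop. This is a sketch, not part of the product: the usernames are hypothetical, it assumes the hdfs CLI is on your PATH and that you run it as the HDFS superuser, and by default (DRY_RUN=1) it only prints the commands so you can review them before setting DRY_RUN=0.

```shell
#!/bin/sh
# Create an HDFS home directory for each listed user.
# DRY_RUN=1 (default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi
}

for u in alice bob; do   # hypothetical usernames -- substitute your own
  run hdfs dfs -mkdir -p "/user/$u"
  run hdfs dfs -chown "$u:$u" "/user/$u"
done
```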
Use Cloudera Manager to add gateway hosts to your CDH cluster.
Create a new host template that includes gateway
roles for HDFS, YARN, and Spark 2.
(Required for CDH 6) If you want to run workloads on DataFrame-based tables, such as tables from PySpark, sparklyr, SparkSQL, or Scala, you must also add the Hive gateway role to the template.
- Use the instructions at Adding a Host to the Cluster to add gateway hosts to the cluster. Apply the template created in the previous step to these gateway hosts. If your cluster is kerberized, confirm that the krb5.conf file on your gateway hosts is correct.
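One quick way to confirm that krb5.conf on a gateway host is correct is to check that it names the realm your cluster actually uses. The helper below is illustrative only (check_realm and the EXAMPLE.COM realm are not part of Cloudera Manager); on a real gateway host you would point it at /etc/krb5.conf with your cluster's realm.

```shell
# Sanity-check that a krb5.conf file declares the expected default realm.
check_realm() {
  # $1 = path to a krb5.conf file, $2 = expected default realm
  grep -Eq "default_realm[[:space:]]*=[[:space:]]*$2" "$1"
}

# Demonstrate against a throwaway sample file:
printf '[libdefaults]\n  default_realm = EXAMPLE.COM\n' > /tmp/krb5.conf.sample
check_realm /tmp/krb5.conf.sample EXAMPLE.COM && echo "default_realm matches"
```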
Test Spark 2 integration on the gateway hosts.
- SSH to a gateway host.
If your cluster is kerberized, run kinit to authenticate to the CDH cluster's Kerberos Key Distribution Center. The Kerberos ticket you create is not visible to Cloudera Data Science Workbench users.
Submit a test job to Spark by executing the following command:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client SPARK_HOME/lib/spark-examples*.jar 100

For example:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client /opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples*.jar 100
View the status of the job in the CLI output or in the Spark web UI to
confirm that the host you want to use for the Cloudera Data Science Workbench
master functions properly as a Spark gateway.
Sample CLI output:

19/02/15 09:37:39 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.1.0
19/02/15 09:37:39 INFO spark.SparkContext: Submitted application: Spark Pi
...
19/02/15 09:37:40 INFO util.Utils: Successfully started service 'sparkDriver' on port 37050.
...
19/02/15 09:38:06 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 18.659033 s
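Rather than eyeballing the driver output, you can capture it to a log file and grep for the SparkPi result line ("Pi is roughly ..."), which only appears when the job completes. The log path and the printf that seeds it below are illustrative stand-ins; on a real gateway host you would redirect the spark-submit output into the log yourself (e.g. spark-submit ... 100 > /tmp/sparkpi.log 2>&1).

```shell
# Check a captured spark-submit log for the SparkPi result line.
log=/tmp/sparkpi.log

# Stand-in for real driver output, for illustration only:
printf '%s\n' \
  'INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38' \
  'Pi is roughly 3.14' > "$log"

if grep -q 'Pi is roughly' "$log"; then
  echo "SparkPi completed on this gateway host"
fi
```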