Administering CDS Powered by Apache Spark

Most administration tasks are the same whether you are using Spark 1 or Spark 2. To configure and manage Spark, follow the procedures in the Cloudera Enterprise Spark Guide.

Configuring Spark 2 Tools as the Default

When you start trying out Spark 2, you can do most of your testing by running the standard Spark 1 commands such as pyspark and spark-shell alongside their Spark 2 equivalents such as pyspark2 and spark2-shell. All of these commands are represented as symbolic links in /usr/bin.

If you are testing a workflow that has the original command names hardcoded in other scripts, you can configure the system so that issuing the pyspark command actually runs the pyspark2 script, and so on for the other Spark binaries. This change is made using the Linux alternatives mechanism, which keeps track of the appropriate target for each of the /usr/bin symlinks.
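For reference, the Spark 1 to Spark 2 name mapping that the procedure below relies on — pyspark to pyspark2, spark-shell to spark2-shell, and so on — comes from a simple sed substitution. This sketch only prints the mapping and changes nothing on the system:

```shell
# Print the Spark 1 -> Spark 2 command name mapping; purely illustrative.
for binary in pyspark spark-shell spark-submit; do
  echo "${binary} -> $(echo "${binary}" | sed -e 's/spark/spark2/')"
done
```

Because sed replaces only the first occurrence of spark on each line, spark-shell correctly becomes spark2-shell rather than spark2-shell2.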

To use Spark 2 tools as the default, run the following script on all hosts in the cluster:

for binary in pyspark spark-shell spark-submit; do
  # Generate the name of the Spark 2 binary, e.g. pyspark2, spark2-shell
  new_binary=$(echo "${binary}" | sed -e 's/spark/spark2/')
  # Point the alternative for the Spark 1 command at the Spark 2 equivalent.
  # Use priority 11 because these alternatives are created with priority 10
  # by default, and the highest priority wins in automatic mode.
  update-alternatives --install "/usr/bin/${binary}" "${binary}" "/usr/bin/${new_binary}" 11
done
# The configuration directory needs a separate command because it lives
# under /etc/ rather than /usr/bin. Use priority 52 because Cloudera Manager
# creates the configuration symlinks with priority 51.
update-alternatives --install /etc/spark/conf spark-conf /etc/spark2/conf 52
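Under the hood, the alternatives mechanism resolves each command through a chain of symbolic links (for example, /usr/bin/pyspark points into /etc/alternatives, which in turn points at the registered target). The sketch below reproduces that chain with scratch paths under /tmp; the file names are illustrative and nothing under /usr/bin or /etc is touched:

```shell
# Recreate an alternatives-style symlink chain in a scratch directory.
mkdir -p /tmp/altdemo
printf '#!/bin/sh\necho "Spark 2"\n' > /tmp/altdemo/pyspark2
chmod +x /tmp/altdemo/pyspark2
# Intermediate link (stands in for /etc/alternatives/pyspark) ...
ln -sf /tmp/altdemo/pyspark2 /tmp/altdemo/alternatives-pyspark
# ... and the front link (stands in for /usr/bin/pyspark).
ln -sf /tmp/altdemo/alternatives-pyspark /tmp/altdemo/pyspark
/tmp/altdemo/pyspark               # prints "Spark 2", resolving through both links
readlink -f /tmp/altdemo/pyspark   # shows the final target, /tmp/altdemo/pyspark2
```

On a real cluster host, update-alternatives --display pyspark shows the registered targets and priorities for a given command.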

To remove this setting and return to using the Spark version included with CDH, run the following script on all hosts in the cluster. It removes the Spark 2 targets of the symlinks and points those symlinks back to the original Spark scripts:

for binary in pyspark spark-shell spark-submit; do
  new_binary=$(echo "${binary}" | sed -e 's/spark/spark2/')
  update-alternatives --remove "${binary}" "/usr/bin/${new_binary}"
done
update-alternatives --remove spark-conf /etc/spark2/conf