Apache Spark

Monitoring Spark Application

To monitor Spark-on-YARN applications launched from CDSW, an embedded Spark UI is displayed next to the session or job. This is achieved by disabling the YARN ResourceManager (RM) proxy. As a result, attempts to access the same Spark application through the RM UI fail with Error 500 (connection refused).

Workaround: If an administrator needs to troubleshoot a running Spark-on-YARN application launched by an end user from the workbench, the end user must share their session using the Share button on the right side of the console. Alternatively, access the Spark application UI from Spark History Server UI > Incomplete Applications; note that this view does not provide real-time updates.

Cloudera Bug: DSE-4979

Spark UI does not work on HDP and CDP

The Spark UI in CDSW does not work on HDP and CDP clusters.

Scala sessions can fail if dependencies take longer than 15 minutes

If the dependencies listed in spark-defaults.conf (spark.jars, spark.jars.packages, etc.) take longer than 15 minutes to resolve, Scala sessions fail on the first attempt.

Workaround: Use one of the following:

Restart the session.
Mount the Spark dependency directory from the CDSW host machines.

On TLS-enabled CDSW deployments, the embedded Spark UI does not work

If you have a TLS-enabled CDSW deployment, the embedded Spark UI tab does not render as expected.

Workaround: Launch the Spark UI in a separate browser tab and append /jobs/ to the URL. For example, if your engine ID is tb0z9ydiua5q9v2d and the domain is example.com, view the Spark UI at: https://spark-tb0z9ydiua5q9v2d.example.com/jobs/
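The URL pattern in this workaround can be sketched in Python as follows. This is a minimal illustration, not part of the product: the function name spark_ui_jobs_url is hypothetical, and it only assembles the string from the engine ID and domain in the same way as the example above.

```python
def spark_ui_jobs_url(engine_id: str, domain: str) -> str:
    """Build the standalone Spark UI URL for a CDSW engine.

    Follows the pattern https://spark-<engine ID>.<domain>/jobs/ from the
    workaround text. The function name is hypothetical.
    """
    engine_id = engine_id.strip()
    # Tolerate a stray leading or trailing slash in the domain value.
    domain = domain.strip().strip("/")
    if not engine_id or not domain:
        raise ValueError("engine_id and domain must both be non-empty")
    return f"https://spark-{engine_id}.{domain}/jobs/"


# Values taken from the example in the workaround text.
print(spark_ui_jobs_url("tb0z9ydiua5q9v2d", "example.com"))
```

Inside a running session, the engine ID and domain may be available as environment variables (assumed here to be CDSW_ENGINE_ID and CDSW_DOMAIN); verify this on your deployment before relying on it.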

Alternative workaround: To view running Spark jobs, navigate to Spark History Server UI > Show Incomplete Applications > Application ID.

Affected Versions: This issue affects CDSW 1.6.x and CDSW 1.7.x on the following platforms:

CDH 5: CDS 2.4 release 2 (and lower)
CDH 6: Versions of Spark that ship with CDH 6.0.x, CDH 6.1.x, CDH 6.2.1 (and lower), CDH 6.3.2 (and lower)

Solution: Upgrade to CDSW version 1.7.1 or higher, and to one of the following:

CDH version 6.2.2, 6.3.3, 6.4.0, or higher
CDH 5 with Spark 2.4 release 3

Spark lineage collection is not supported with Cloudera Data Science Workbench

Lineage collection is enabled by default in Spark 2.3. This feature does not work with Cloudera Data Science Workbench because the lineage log directory is not automatically mounted into CDSW engines when a session/job is started.

Affected Versions: CDS 2.3 release 2 (and higher) Powered By Apache Spark

With Spark 2.3 release 3 (or higher), if Spark cannot find the lineage log directory, it will automatically disable lineage collection for that application. Spark jobs will continue to execute in Cloudera Data Science Workbench, but lineage information will not be collected.

With Spark 2.3 release 2, Spark jobs will fail in Cloudera Data Science Workbench. Either upgrade to Spark 2.3 release 3, which includes a partial fix (as described above), or use one of the following workarounds to disable Spark lineage:

Workaround 1: Disable Spark Lineage Per-Project in Cloudera Data Science Workbench

Set spark.lineage.enabled to false in a spark-defaults.conf file in your Cloudera Data Science Workbench project. Repeat this for each project that requires it.
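As an illustration of this per-project setting, the following Python sketch adds or replaces the spark.lineage.enabled line in a spark-defaults.conf file. The helper names and the default path (a spark-defaults.conf in the current working directory, assumed here to be the project root) are hypothetical; editing the file by hand has the same effect.

```python
import re
from pathlib import Path

LINEAGE_KEY = "spark.lineage.enabled"


def with_lineage_disabled(conf_text: str) -> str:
    """Return spark-defaults.conf text with spark.lineage.enabled set to false."""
    kept = []
    for line in conf_text.splitlines():
        # A property line is "key value" or "key=value"; compare the key only.
        key = re.split(r"[=\s]", line.strip(), maxsplit=1)[0]
        if key != LINEAGE_KEY:
            kept.append(line)
    # Append the setting once, after all other properties and comments.
    kept.append(f"{LINEAGE_KEY} false")
    return "\n".join(kept) + "\n"


def disable_lineage(conf_path: str = "spark-defaults.conf") -> None:
    """Rewrite the file at conf_path in place, creating it if it is missing.

    The default path is an assumption, not a documented location.
    """
    path = Path(conf_path)
    current = path.read_text() if path.exists() else ""
    path.write_text(with_lineage_disabled(current))
```

Calling disable_lineage() from a session terminal or script leaves every other property untouched and ends the file with the line spark.lineage.enabled false. The change applies only to Spark applications started after the file is updated.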

Workaround 2: Disable Spark Lineage for the Cluster

Log in to Cloudera Manager and go to the Spark 2 service.
Click Configuration.
Search for the Enable Lineage Collection property and uncheck the checkbox to disable lineage collection.
Click Save Changes.
Return to the Cloudera Manager home page and restart the CDSW service for the change to take effect.

Cloudera Bug: DSE-3720, CDH-67643