Apache Spark
Known issues with Apache Spark.
Monitoring Spark Application
To let you monitor Spark-on-YARN applications invoked from CDSW, an embedded Spark UI is displayed right next to the session/job. This was achieved by disabling the YARN ResourceManager (RM) proxy. As a result, attempts to access the same Spark application using the RM UI fail with Error 500 (connection refused).
Workaround: If the Administrator wants to troubleshoot a running Spark-on-YARN application invoked by an end user from the workbench, the user must share their session using the Share button on the right side of the console. An alternative workaround, which does not provide real-time updates, is to access the Spark Application UI from the Spark History Server UI > Incomplete Applications.
Cloudera Bug: DSE-4979
Spark UI does not work on HDP and CDP
The Spark UI in CDSW does not work on HDP and CDP clusters.
Scala sessions can fail if dependencies take longer than 15 minutes
If the dependencies in spark-defaults.conf (spark.jars, spark.packages, etc.) take longer than 15 minutes to resolve, Scala sessions will fail the first time they are started.
Workarounds:
- Restart the session.
- Mount the Spark dependency directory from the CDSW host machines.
On TLS-enabled CDSW deployments, the embedded Spark UI does not work
If you have a TLS-enabled CDSW deployment, the embedded Spark UI tab does not render as expected.
Workaround: To work around this issue, launch the Spark UI in a separate tab and append '/jobs' after the URL. For example, if your engineID is tb0z9ydiua5q9v2d and the DOMAIN is example.com then view the Spark UI at:
https://spark-tb0z9ydiua5q9v2d.example.com/jobs/
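The URL pattern in the example above can be sketched as a small Python helper. The function name spark_ui_url is hypothetical and only illustrates how the engine ID and domain combine; it is not part of CDSW.

```python
def spark_ui_url(engine_id: str, domain: str) -> str:
    """Build the standalone Spark UI URL for a CDSW engine.

    Hypothetical helper: the pattern is inferred from the documented
    example (https://spark-<engineID>.<DOMAIN>/jobs/), not from a CDSW API.
    """
    return f"https://spark-{engine_id}.{domain}/jobs/"


# Reproduces the documented example URL.
print(spark_ui_url("tb0z9ydiua5q9v2d", "example.com"))
```

Open the printed URL in a separate browser tab rather than in the embedded Spark UI tab.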
Alternative workaround: To view running Spark jobs, navigate to
- CDH 5: CDS 2.4 release 2 (and lower)
- CDH 6: Versions of Spark that ship with CDH 6.0.x, CDH 6.1.x, CDH 6.2.1 (and lower), CDH 6.3.2 (and lower)
- CDH version 6.4.0, 6.2.2, 6.3.3 or higher
- CDH 5 with Spark 2.4 release 3
Spark lineage collection is not supported with Cloudera Data Science Workbench
Lineage collection is enabled by default in Spark 2.3. This feature does not work with Cloudera Data Science Workbench because the lineage log directory is not automatically mounted into CDSW engines when a session/job is started.
Affected Versions: CDS 2.3 release 2 (and higher) Powered By Apache Spark
With Spark 2.3 release 3 (or higher), if Spark cannot find the lineage log directory, it will automatically disable lineage collection for that application. Spark jobs will continue to execute in Cloudera Data Science Workbench, but lineage information will not be collected.
With Spark 2.3 release 2, Spark jobs will fail in Cloudera Data Science Workbench. Either upgrade to Spark 2.3 release 3 which includes a partial fix (as described above) or use one of the following workarounds to disable Spark lineage:
Workaround 1: Disable Spark Lineage Per-Project in Cloudera Data Science Workbench
To do this, set spark.lineage.enabled to false in a spark-defaults.conf file in your Cloudera Data Science Workbench project. This will need to be done individually for each project as required.
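As a minimal sketch of Workaround 1, the following Python snippet adds the setting to a project's spark-defaults.conf while keeping any other entries. The helper name disable_lineage and the assumption that the file sits in the directory you run it from are illustrative only; editing the file by hand to contain the line "spark.lineage.enabled false" has the same effect.

```python
import re
from pathlib import Path


def disable_lineage(conf_path: Path) -> None:
    """Ensure conf_path contains exactly one 'spark.lineage.enabled false' entry.

    Hypothetical helper, not part of CDSW or Spark.
    """
    key = "spark.lineage.enabled"
    lines = conf_path.read_text().splitlines() if conf_path.exists() else []
    # Drop any existing entry for the key, whether written as
    # "key value" or "key=value", and keep every other line unchanged.
    kept = [
        ln for ln in lines
        if re.split(r"[=\s]+", ln.strip(), maxsplit=1)[0] != key
    ]
    kept.append(f"{key} false")
    conf_path.write_text("\n".join(kept) + "\n")


# Example call (assumed file location; adjust the path for your project):
# disable_lineage(Path("spark-defaults.conf"))
```

Because the setting is per project, the same change has to be repeated in every project that needs it.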
Workaround 2: Disable Spark Lineage for the Cluster
- Log in to Cloudera Manager and go to the Spark 2 service.
- Click Configuration.
- Search for the Enable Lineage Collection property and uncheck the checkbox to disable lineage collection.
- Click Save Changes.
- Go back to the Cloudera Manager homepage and restart the CDSW service for this change to go into effect.
Cloudera Bug: DSE-3720, CDH-67643