Known issues with Apache Spark.
CDSW session only allows one SparkContext to be instantiated at a time in client mode
A CDSW session allows only one SparkContext to be instantiated at a time. If you create a second SparkContext in client mode, it will fail.
Workaround: Stop the SparkContext session in the Jupyter notebook, or run spark-submit in cluster mode.
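As a rough illustration of the second workaround, the sketch below assembles a spark-submit command line that runs the driver in cluster mode. The helper name cluster_mode_command, the application file my_job.py, its arguments, and the yarn master are assumptions made for the example; only the --master and --deploy-mode options come from spark-submit itself.

```python
import shlex


def cluster_mode_command(app_path, app_args=(), master="yarn"):
    """Build a spark-submit argv that runs the driver on the cluster.

    Hypothetical helper: in cluster mode the driver (and its SparkContext)
    does not run inside the CDSW session, so it should not collide with a
    SparkContext that is already open there.
    """
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", "cluster",
        app_path,
        *app_args,
    ]


# my_job.py and its arguments are placeholders for your own application.
cmd = cluster_mode_command("my_job.py", ["--date", "2024-01-01"])

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(cmd))
```

The list form can be passed to subprocess.run as is. For the first workaround, stopping the existing context in the notebook is typically a call to stop() on the SparkContext or SparkSession object you already hold.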
Cloudera Bug: DOCS-16184
CDSW does not support Spark 3
CDSW does not support Spark 3; however, CML Private Cloud does.
Cloudera Bug: DSE-11729
Monitoring Spark Application
To let you monitor spark-on-yarn applications invoked from CDSW, an embedded Spark UI is displayed next to the session/job. This was achieved by disabling the RM proxy. However, with this change, attempts to access the same Spark application through the RM UI result in Error 500 (connection refused).
Workaround: If an administrator wants to troubleshoot a running spark-on-yarn application invoked by an end user from the workbench, the user must share their session using the Share button on the right side of the console. An alternative workaround, which does not provide real-time updates, is to access the Spark Application UI from the Spark History Server UI > Incomplete Applications.
Cloudera Bug: DSE-4979
Scala sessions can fail if dependencies take longer than 15 minutes
If the dependencies listed in spark-defaults.conf (spark.jars, spark.jars.packages, etc.) take longer than 15 minutes to resolve, Scala sessions will fail on the first attempt.
Workaround: Use either of the following:
- Restart the session.
- Mount the Spark dependency directory from the CDSW host machines.
Spark UI does not work on HDP and CDP versions up to 7.1.7 SP1
The Spark UI in CDSW does not work on HDP and CDP (up to 7.1.7 SP1) clusters. For CDP Private Base, the Spark UI has been fixed in 7.1.8, and the fix is also expected in an upcoming 7.1.7 Service Pack release.
On TLS-enabled CDSW deployments, the embedded Spark UI does not work
If you have a TLS-enabled CDSW deployment, the embedded Spark UI tab does not render as expected.
Workaround: Launch the Spark UI in a separate tab and append '/jobs' to the URL. For example, if your engineID is tb0z9ydiua5q9v2d and the DOMAIN is example.com, then view the Spark UI at:
Alternative workaround: To view running Spark jobs, navigate to
- CDH 5: CDS 2.4 release 2 (and lower)
- CDH 6: Versions of Spark that ship with CDH 6.0.x, CDH 6.1.x, CDH 6.2.1 (and lower), CDH 6.3.2 (and lower)
- CDH version 6.4.0, 6.2.2, 6.3.3 or higher
- CDH 5 with Spark 2.4 release 3
Spark lineage collection is not supported with Cloudera Data Science Workbench
Lineage collection is enabled by default in Spark 2.3. This feature does not work with Cloudera Data Science Workbench because the lineage log directory is not automatically mounted into CDSW engines when a session/job is started.
Affected Versions: CDS 2.3 release 2 (and higher) Powered By Apache Spark
With Spark 2.3 release 3 (or higher), if Spark cannot find the lineage log directory, it will automatically disable lineage collection for that application. Spark jobs will continue to run in Cloudera Data Science Workbench, but lineage information will not be collected.
With Spark 2.3 release 2, Spark jobs will fail in Cloudera Data Science Workbench. Either upgrade to Spark 2.3 release 3, which includes a partial fix (as described above), or use one of the following workarounds to disable Spark lineage:
Workaround 1: Disable Spark Lineage Per-Project in Cloudera Data Science Workbench
To do this, set the Spark lineage collection property to false in a spark-defaults.conf file in your Cloudera Data Science Workbench project. This must be done individually for each project that needs it.
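A per-project override like this could be scripted along the following lines. This is a minimal sketch, not a Cloudera-provided tool: the helper set_spark_default is hypothetical, and the property name spark.lineage.enabled is an assumption (the text above does not name the property), so verify it against the documentation for your Spark release before relying on it.

```python
import re


def set_spark_default(conf_text, key, value):
    """Return conf_text with `key` set to `value`.

    Hypothetical helper: replaces an existing entry for `key` (written as
    "key value" or "key=value"), otherwise appends a new entry. Comment
    lines starting with '#' and unrelated entries are left untouched.
    """
    out = []
    replaced = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            # The first token before whitespace or '=' is the property name.
            name = re.split(r"[\s=]+", stripped, maxsplit=1)[0]
            if name == key:
                if not replaced:
                    out.append(f"{key} {value}")
                    replaced = True
                continue  # drop duplicate entries for the same key
        out.append(line)
    if not replaced:
        out.append(f"{key} {value}")
    return "\n".join(out) + "\n"


# ASSUMPTION: the property name below is not given in the text above.
LINEAGE_KEY = "spark.lineage.enabled"

# Example project file content; spark.executor.memory is a real Spark
# property used here only as filler.
existing = "# project-level Spark settings\nspark.executor.memory 2g\n"
print(set_spark_default(existing, LINEAGE_KEY, "false"), end="")
```

The function works on text rather than on a file path, so the result can be reviewed before it is written to the project's spark-defaults.conf.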
Workaround 2: Disable Spark Lineage for the Cluster
- Log in to Cloudera Manager and go to the Spark 2 service.
- Click Configuration.
- Search for the Enable Lineage Collection property and uncheck the checkbox to disable lineage collection.
- Click Save Changes.
- Go back to the Cloudera Manager homepage and restart the CDSW service for this change to go into effect.
Cloudera Bug: DSE-3720, CDH-67643