CDS Powered by Apache Spark Known Issues

The following sections describe the current known issues and limitations in CDS Powered by Apache Spark. In some cases, a feature from the upstream Apache Spark project is currently not considered reliable enough to be supported by Cloudera. In addition, a number of integration features in CDH that rely on Spark do not work with CDS Powered by Apache Spark, because CDH 5 components do not introduce dependencies on Spark 2.

Structured Streaming exactly-once fault tolerance constraints

In Spark Structured Streaming, the exactly-once fault-tolerance guarantee of the file sink applies only to files that are listed in the manifest. The manifest files are located in the _spark_metadata subdirectory of the file sink output directory. Process only the files in that directory whose names start with digits; temporary files, typically with names starting with a period ("."), can also appear there and should not be processed.

You can list the valid manifest files, excluding the temporary files, by running a command like the following as the appropriate user. The example assumes your output directory is located at /tmp/output:

hadoop fs -ls /tmp/output/_spark_metadata/[0-9]*
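
For context, the following is a minimal sketch of a file sink streaming query that produces this manifest, assuming a streaming DataFrame named events; the /tmp/output and /tmp/checkpoint paths are illustrative:

// Write the stream with the file sink; Spark maintains the manifest
// under /tmp/output/_spark_metadata as each batch is committed.
val query = events.writeStream
  .format("parquet")
  .option("path", "/tmp/output")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()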

DecimalType push-down to Parquet data sources has been disabled

The support for DecimalType push-down to Parquet data sources, introduced in upstream Apache Spark 2.4 (SPARK-24549), has been disabled in CDS 2.4 release 1.

JOB_SUMMARY_LEVEL Parquet flag is not supported

The JOB_SUMMARY_LEVEL Parquet flag is not supported in CDS 2.4. When writing to Parquet files, users should use the ENABLE_JOB_SUMMARY flag instead.
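
For example, the following is a minimal sketch of enabling job summary files through the Hadoop property that backs the ENABLE_JOB_SUMMARY flag, assuming a SparkSession named spark and a DataFrame named df; the output path is illustrative:

// Enable Parquet job summary files for subsequent writes.
spark.sparkContext.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", true)
df.write.parquet("/tmp/output")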

LZ4, BROTLI, and ZSTD codecs are not supported

CDS 2.4 does not support the LZ4, BROTLI, and ZSTD codecs.

SparkSQL StringStartsWith filter is not supported

The SparkSQL StringStartsWith filter is not supported with CDS 2.4.

Apache Spark XSS vulnerability in UI CVE-2018-8024

A malicious user can construct a URL pointing to a Spark UI's job and stage info pages that can be used to execute arbitrary scripts and expose user information, if this URL is accessed by another user who is unaware of the malicious intent.

Products affected: CDS Powered By Apache Spark

Releases affected:
  • CDS 2.1.0 release 1 and release 2
  • CDS 2.2.0 release 1 and release 2
  • CDS 2.3.0 release 2

Users affected: Potentially any user who uses the Spark UI.

Date/time of detection: May 28, 2018

Detected by: Spencer Gietzen (Rhino Security Labs)

Severity (Low/Medium/High): High

Impact: XSS vulnerabilities can be used to steal credentials or to perform arbitrary actions as the targeted user.

CVE: CVE-2018-8024

Immediate action required: Upgrade to a version of CDS Powered by Apache Spark where this issue is fixed, or as a workaround, disable the Spark UI for jobs and the Spark History Server.
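
As an illustration of the workaround, the following is a minimal sketch of disabling the per-job UI from application code; the application name is illustrative, and the Spark History Server must be disabled separately (for example, by stopping its role in Cloudera Manager):

import org.apache.spark.sql.SparkSession

// Build the session with the job UI turned off.
val spark = SparkSession.builder()
  .appName("example-app")
  .config("spark.ui.enabled", "false")
  .getOrCreate()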

Addressed in release/refresh/patch: CDS 2.3.0 release 3.

CDS 2.3 Release 3 requires additional configuration to run PySpark on Cloudera Data Science Workbench versions 1.3.x (and lower)

Due to a security fix in CDS 2.3 release 3, there is now a mismatch between the versions of py4j that ship with the two products:
  • Cloudera Data Science Workbench 1.3.x (and lower) includes py4j 0.10.4, and,
  • CDS 2.3 release 3 includes py4j 0.10.7.

This version mismatch results in PySpark session/job failures on Cloudera Data Science Workbench.

Workaround: The Cloudera Data Science Workbench documentation includes more details about this known issue and its workarounds.

Cloudera Bug: CDH-69733, DSE-4316

UnsatisfiedLinkError observed when using Snappy compression in the spark2-shell

In CDS 2.3 release 2, when you use spark2-shell to read or write to a Parquet table with Snappy compression, the following UnsatisfiedLinkError occurs:
java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.uncompressedLength(Ljava/nio/ByteBuffer;II)I
at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:561)
at parquet.hadoop.codec.SnappyDecompressor.decompress(SnappyDecompressor.java:62)

This happens because snappy 1.0.4.1 needs to access the top-most class loader, which conflicts with a change made to fix userClassPathFirst handling (SPARK-18646).

Applications run with spark2-submit are not affected by this issue.

Workaround: Copy snappy-java-1.1.4.jar to /opt/cloudera/parcels/SPARK2/lib/spark2/jars/ on every node in the cluster. You can download snappy-java-1.1.4.jar from: https://repository.cloudera.com/cloudera/list/repo1/org/xerial/snappy/snappy-java/1.1.4/snappy-java-1.1.4.jar

Cloudera Bug: CDH-67889

Resolution: Upgrade to CDS 2.3 Release 3, which contains the fix.

Spark jobs fail when lineage collection is enabled

In CDS 2.3 release 2, Spark jobs fail when lineage is enabled because Cloudera Manager does not automatically create the associated lineage log directory (/var/log/spark2/lineage) on all required cluster hosts. Note that this feature is enabled by default in CDS 2.3 release 2.

Implement one of the following workarounds to continue running Spark jobs.

Workaround 1 - Deploy the Spark gateway role on all hosts that are running the YARN NodeManager role

Cloudera Manager only creates the lineage log directory on hosts with Spark 2 roles deployed on them. However, this is not sufficient because the Spark driver can run on any host that is running a YARN NodeManager. To ensure Cloudera Manager creates the log directory, add the Spark 2 gateway role to every cluster host that is running the YARN NodeManager role.

For instructions on how to add a role to a host, see the Cloudera Manager documentation: Adding a Role Instance

Workaround 2 - Disable Spark Lineage Collection

To disable the feature, log in to Cloudera Manager and go to the Spark 2 service. Click Configuration. Search for the Enable Lineage Collection property and uncheck the checkbox to disable lineage collection. Click Save Changes.

Cloudera Bug: CDH-67643, CDH-68832

Partial Resolution: Upgrade to CDS 2.3 Release 3, which contains a partial fix. Spark jobs do not fail, but lineage is not collected.

Spark on Kubernetes is not supported

The "Spark On Kubernetes" feature of Apache Spark 2.3 (and higher) is currently not supported. This feature is currently still designated as experimental within Apache Spark.

ORC file format is not supported

Currently, Cloudera does not support reading and writing Hive tables containing data files in the Apache ORC (Optimized Row Columnar) format from Spark applications. Cloudera recommends using the Apache Parquet format for columnar data; Parquet can be used with Spark, Hive, and Impala.
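
For instance, the following is a minimal sketch of persisting columnar data as Parquet rather than ORC, assuming a DataFrame named df; the table name is illustrative:

// Write columnar data as Parquet, which is readable by Spark, Hive, and Impala.
df.write.format("parquet").saveAsTable("db_name.parquet_table")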

Accessing multiple clusters simultaneously is not supported

Spark does not support accessing multiple clusters in the same application.

Parquet logical type TIMESTAMP_MICROS unavailable

Although SPARK-10365 introduces the Parquet logical type TIMESTAMP_MICROS, this logical type is not available in the Parquet support libraries included with CDS Powered by Apache Spark.

Spark SQL does not respect Sentry ACLs when communicating with Hive metastore

Even if a user is configured through Sentry to have no read permission to a Hive table, a Spark SQL job running as that user can still read the table's metadata directly from the Hive metastore.

Cloudera Bug: CDH-33658

Empty result when reading Parquet table created by saveAsTable()

After a Parquet table is created by the saveAsTable() function, Spark SQL queries against the table return an empty result set. The issue is caused by the "path" property of the table not being written to the Hive metastore during the saveAsTable() call.

Cloudera Bug: CDH-60037

Affects: CDS 2.2 Release 1

Severity: High

Workaround: You can set the path manually before the call to saveAsTable():

// Passing the path option records the table location in the Hive metastore.
val options = Map("path" -> "/path/to/hdfs/directory/containing/table")
df.write.options(options).saveAsTable("db_name.table_name")

Or you can add the path to the metastore when the table already exists, for example:

spark.sql("alter table db_name.table_name set SERDEPROPERTIES ('path'='hdfs://host.example.com:8020/warehouse/path/db_name.db/table_name')")
spark.catalog.refreshTable("db_name.table_name")

Resolution: Upgrade to CDS 2.2 Release 2, which contains the fix.

Spark 2 Version requirement for clusters managed by Cloudera Manager

Although Spark 1 and Spark 2 can coexist in the same CDH cluster, you cannot use multiple Spark 2 versions simultaneously in the same Cloudera Manager instance. All CDH clusters managed by the same Cloudera Manager Server must use exactly the same version of CDS Powered by Apache Spark. For example, you cannot run a CDS 2.1 service and a CDS 2.2 service alongside the built-in CDH Spark service; you must choose a single CDS 2 Powered by Apache Spark release. Make sure to install or upgrade the CDS 2 service descriptor and parcels across all machines of all clusters at the same time.

Spark Standalone is not supported

Spark Standalone is not supported for CDS Powered by Apache Spark.

HiveOnSpark is not supported with CDS

The HiveOnSpark module is a CDH 5 component that has a dependency on Apache Spark 1.6. Because CDH 5 components do not have any dependencies on Spark 2, the HiveOnSpark module does not work with CDS Powered by Apache Spark. You can still use Spark 2 with Hive using other methods.

SparkOnHBase is not supported with CDS

The SparkOnHBase module is a CDH 5 component that has a dependency on Apache Spark 1.6. Because CDH 5 components do not have any dependencies on Spark 2, the SparkOnHBase module does not work with CDS Powered by Apache Spark. You can still use Spark 2 with HBase using other methods.

Using the JDBC Datasource API to access Hive or Impala is not supported

Dynamic allocation and Spark Streaming

If you are using Spark Streaming, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.
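
For example, the following is a minimal sketch of disabling dynamic allocation when constructing a streaming application; the application name and batch interval are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Disable dynamic allocation for this streaming application.
val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.dynamicAllocation.enabled", "false")
val ssc = new StreamingContext(conf, Seconds(10))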

Oozie Spark2 action is not supported

The Oozie Spark action is a CDH component that has a dependency on Spark 1.6. Because CDH components do not have any dependencies on Spark 2, the Oozie Spark action does not work with CDS Powered by Apache Spark.

SparkR is not supported

SparkR is not supported for CDS Powered by Apache Spark. (SparkR is also not supported in CDH with Spark 1.6.)

GraphX is not supported

GraphX is not supported for CDS Powered by Apache Spark. (GraphX is also not supported in CDH with Spark 1.6.)

Thrift server is not supported

The Thrift JDBC/ODBC server is not supported for CDS Powered by Apache Spark. (The Thrift server is also not supported in CDH with Spark 1.6.)

Spark SQL CLI is not supported

The Spark SQL CLI is not supported for CDS Powered by Apache Spark. (The Spark SQL CLI is also not supported in CDH with Spark 1.6.)

Rolling upgrades are not supported

Rolling upgrades are not possible from the Spark 1.6 bundled with CDH to CDS Powered by Apache Spark.

Package-based installation is not supported

CDS Powered by Apache Spark is only installable as a parcel.

Spark-Avro Library is not supported; Use Built-In Avro Data Source

The spark-avro library is not integrated into the CDS Powered by Apache Spark parcel. Starting with CDS 2.4 release 1, you can use the built-in Avro data source instead. For documentation, see Apache Avro Data Source Guide.
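
For illustration, the following is a minimal sketch of reading and writing with the built-in Avro data source, assuming a SparkSession named spark; the paths are illustrative:

// Use the built-in Avro data source (available in CDS 2.4 release 1 and higher).
val df = spark.read.format("avro").load("/tmp/events.avro")
df.write.format("avro").save("/tmp/events_copy")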

Hardware acceleration for MLlib is not supported

This feature, part of the GPL Extras package for CDH, is not supported with CDS 2 Powered by Apache Spark. It is supported for Spark 1.6.

Cost based optimization is not supported

The cost based optimization feature is not supported in CDS Powered by Apache Spark. Do NOT set the spark.sql.cbo.enabled configuration option to true.

Running spark2-submit with --principal and --keytab arguments does not work in client mode

The spark2-submit script's --principal and --keytab arguments do not work with Spark-on-YARN's client mode. Use cluster mode instead.

Long-running apps on a secure cluster might fail if driver is restarted

If you submit a long-running app on a secure cluster using the --principal and --keytab options in cluster mode, and a failure causes the driver to restart after 7 days (the default maximum HDFS delegation token lifetime), the new driver fails with an error similar to the following:

Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token <token_info> can't be found in cache

Workaround: None

Affected Versions: All CDS 2.0, 2.1, and 2.2 releases

Fixed Versions: CDS 2.3 Release 2

Apache Issue: SPARK-23361

Cloudera Issue: CDH-64865