CDS Powered by Apache Spark Known Issues

The following sections describe the current known issues and limitations in CDS Powered by Apache Spark. In some cases, a feature from the upstream Apache Spark project is not yet considered reliable enough to be supported by Cloudera. In other cases, a CDH integration feature that relies on Spark does not work with CDS Powered by Apache Spark because CDH components depend on Spark 1.6 and have no dependencies on Spark 2.

Continue reading:

PySpark broadcast variables fail when disk encryption is enabled
Spark SQL does not respect Sentry ACLs when communicating with Hive metastore
Empty result when reading Parquet table created by saveAsTable()
Spark 2 Version Requirement for Clusters Managed by Cloudera Manager
Spark Standalone
HiveOnSpark is not Supported with Spark 2
SparkOnHBase is not Supported with Spark 2
Using the JDBC Datasource API to access Hive or Impala is not supported
Dynamic allocation and Spark Streaming
Structured Streaming is not supported
Oozie Spark2 Action is not Supported
SparkR is not Supported
GraphX is not Supported
Thrift Server
Spark SQL CLI is not Supported
Rolling Upgrades are not Supported
Package Install is not Supported
Spark Avro is not Supported
Accessing Multiple Clusters Simultaneously Not Supported
Hardware Acceleration for MLlib is not Supported
Cost Based Optimization is not Supported
Running spark2-submit with --principal and --keytab arguments does not work in client mode
Long-running apps on a secure cluster might fail if driver is restarted

PySpark broadcast variables fail when disk encryption is enabled

When disk encryption is enabled, PySpark broadcast variables fail with the following stack trace:

Traceback (most recent call last):
  File "broadcast.py", line 37, in <module>
    words_new.value
  File "/pyspark.zip/pyspark/broadcast.py", line 137, in value
  File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path
  File "pyspark.zip/pyspark/broadcast.py", line 128, in load
EOFError: Ran out of input

Workaround: None

Affected Versions: CDS 2.2 Release 4

Apache Issue: SPARK-26201

Cloudera Issue: CDH-76116

Spark SQL does not respect Sentry ACLs when communicating with Hive metastore

Even if a user is configured via Sentry to have no read permission on a Hive table, a Spark SQL job running as that user can still read the table's metadata directly from the Hive metastore.

Cloudera Bug: CDH-33658

Empty result when reading Parquet table created by saveAsTable()

After a Parquet table is created by the saveAsTable() function, Spark SQL queries against the table return an empty result set. The issue is caused by the "path" property of the table not being written to the Hive metastore during the saveAsTable() call.

Cloudera Bug: CDH-60037

Affects: CDS 2.2 Release 1

Severity: High

Workaround: You can set the path manually before the call to saveAsTable():

val options = Map("path" -> "/path/to/hdfs/directory/containing/table")
df.write.options(options).saveAsTable("db_name.table_name")

Or you can add the path to the metastore when the table already exists, for example:

spark.sql("alter table db_name.table_name set SERDEPROPERTIES ('path'='hdfs://host.example.com:8020/warehouse/path/db_name.db/table_name')")
spark.catalog.refreshTable("db_name.table_name")

Resolution: Upgrade to CDS 2.2 Release 2, which contains the fix.

Spark 2 Version Requirement for Clusters Managed by Cloudera Manager

Although Spark 1 and Spark 2 can coexist in the same CDH cluster, you cannot use multiple Spark 2 versions simultaneously in the same Cloudera Manager instance. All CDH clusters managed by the same Cloudera Manager Server must use exactly the same version of CDS Powered by Apache Spark. For example, you cannot run a CDS 2.1 service and a CDS 2.2 service side by side, even though either one can coexist with the built-in CDH Spark service. You must choose only one CDS 2 Powered by Apache Spark release. Make sure to install or upgrade the CDS 2 service descriptor and parcels across all machines of all clusters at the same time.

Spark Standalone

Spark Standalone is not supported for Spark 2.

HiveOnSpark is not Supported with Spark 2

The HiveOnSpark module is a CDH 5 component that has a dependency on Apache Spark 1.6. Because CDH 5 components do not have any dependencies on Spark 2, the HiveOnSpark module does not work with CDS Powered by Apache Spark. You can still use Spark 2 with Hive using other methods.

SparkOnHBase is not Supported with Spark 2

The SparkOnHBase module is a CDH 5 component that has a dependency on Apache Spark 1.6. Because CDH 5 components do not have any dependencies on Spark 2, the SparkOnHBase module does not work with CDS Powered by Apache Spark. You can still use Spark 2 with HBase using other methods.

Using the JDBC Datasource API to access Hive or Impala is not supported

Dynamic allocation and Spark Streaming

If you are using Spark Streaming, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.
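The recommendation above can also be applied in application code. The following is a minimal sketch, assuming the application builds its own SparkConf; the object name, application name, and executor count are hypothetical placeholders and not values prescribed by Cloudera. Because executors are no longer added or removed automatically once dynamic allocation is off, the sketch also sets a fixed executor count.

```scala
import org.apache.spark.SparkConf

object StreamingConfExample {
  // Builds a SparkConf for a streaming application with dynamic allocation turned off.
  def build(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      // Disable dynamic allocation, as recommended for Spark Streaming.
      .set("spark.dynamicAllocation.enabled", "false")
      // Assumption: a fixed executor count of 4; size this for your own workload.
      .set("spark.executor.instances", "4")

  def main(args: Array[String]): Unit = {
    // "my-streaming-app" is a hypothetical application name.
    val conf = build("my-streaming-app")
    // Print the effective settings so they can be checked before creating a StreamingContext.
    println(conf.toDebugString)
  }
}
```

The resulting SparkConf can then be passed to the StreamingContext constructor. The same property can instead be supplied on the command line with --conf spark.dynamicAllocation.enabled=false.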

Structured Streaming is not supported

Cloudera does not support the Structured Streaming API.

Oozie Spark2 Action is not Supported

The Oozie Spark action is a CDH component that has a dependency on Spark 1.6. Because CDH components do not have any dependencies on Spark 2, the Oozie Spark action does not work with Spark 2.

SparkR is not Supported

SparkR is not supported for Spark 2. (SparkR is also not supported in CDH with Spark 1.6.)

GraphX is not Supported

GraphX is not supported for Spark 2. (GraphX is also not supported in CDH with Spark 1.6.)

Thrift Server

The Thrift JDBC/ODBC server is not supported for Spark 2. (The Thrift server is also not supported in CDH with Spark 1.6.)

Spark SQL CLI is not Supported

The Spark SQL CLI is not supported for Spark 2. (The Spark SQL CLI is also not supported in CDH with Spark 1.6.)

Rolling Upgrades are not Supported

Rolling upgrades from the Spark 1.6 bundled with CDH to CDS 2 Powered by Apache Spark are not possible.

Package Install is not Supported

CDS 2 Powered by Apache Spark can be installed only as a parcel. Package-based installation is not available.

Spark Avro is not Supported

The spark-avro library is not integrated into the CDS 2.x Powered by Apache Spark parcel.

Accessing Multiple Clusters Simultaneously Not Supported

Spark does not support accessing multiple clusters in the same application.

Hardware Acceleration for MLlib is not Supported

Hardware acceleration for MLlib, part of the GPL Extras package for CDH, is not supported with CDS 2 Powered by Apache Spark. This feature is supported for Spark 1.6.

Cost Based Optimization is not Supported

The Cost Based Optimization feature is not supported in Spark 2.2. Do not set the spark.sql.cbo.enabled configuration option to true.
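To guard against the option being turned on by accident, an application can pin it explicitly and check the effective value at startup. The following is a minimal sketch, assuming a Spark 2 application that creates its own SparkSession; the object name and application name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object CboDisabledExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cbo-disabled-example") // hypothetical application name
      // Explicitly keep cost-based optimization off.
      .config("spark.sql.cbo.enabled", "false")
      .getOrCreate()

    // Read the effective value back, falling back to "false" if the key is unset.
    val cboEnabled = spark.conf.get("spark.sql.cbo.enabled", "false")
    require(cboEnabled == "false", "spark.sql.cbo.enabled must not be set to true")

    spark.stop()
  }
}
```

The check fails fast with an IllegalArgumentException if the effective value is not false, which is easier to diagnose than unexpected query plans later in the job.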

Running spark2-submit with --principal and --keytab arguments does not work in client mode

The spark2-submit script's --principal and --keytab arguments do not work with Spark-on-YARN's client mode. Use cluster mode instead.

Long-running apps on a secure cluster might fail if driver is restarted

If you submit a long-running app on a secure cluster using the --principal and --keytab options in cluster mode, and a failure causes the driver to restart after 7 days (the default maximum HDFS delegation token lifetime), the new driver fails with an error similar to the following:

Exception in thread "main" org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token <token_info> can't be found in cache