Apache Spark Known Issues
Continue reading:
- CVE-2019-10099: Apache Spark local files left unencrypted
- Partitioned tables in the Avro format with upper case column names return NULL values from Spark SQL
- Missing results in Hive, Spark, Pig, Custom MapReduce jobs, and other Java applications when filtering Parquet data written by Impala
- Apache Spark experimental features are not supported unless specifically identified as supported
- Spark 2.x not supported on unmanaged clusters
- ADLS not Supported for All Spark Components
- IPython / Jupyter notebooks not supported
- Certain Spark Streaming features not supported
- Certain Spark SQL features not supported
- Spark Dataset API not supported
- JDBC Datasource API not supported
- GraphX not supported
- SparkR not supported
- Scala 2.11 not supported
- Spark Streaming cannot consume from secure Kafka until it starts using the Kafka 0.9 Consumer API
- Tables saved with the Spark SQL DataFrame.saveAsTable method are not compatible with Hive
- Cannot create Parquet tables containing date fields in Spark SQL
- Spark SQL does not support the union type
- Spark SQL does not respect size limit for the varchar type
- Spark SQL does not support the char type
- Spark SQL does not support transactional tables
- Spark SQL does not prevent you from writing key types not supported by Avro tables
- Spark SQL does not support timestamp in Avro tables
- Spark SQL does not support all ‘ANALYZE TABLE COMPUTE STATISTICS’ syntax
- Spark SQL statements that can result in table partition metadata changes may fail
- Spark SQL does not respect Sentry ACLs when communicating with Hive metastore
- Dynamic allocation and Spark Streaming
- Spark uses Akka version 2.2.3
- Spark standalone mode does not work on secure clusters
- Limitation with Region Pruning for HBase Tables
- Input Format Change on spark.yarn.am.waitTime
- History link in ResourceManager web UI broken for killed Spark applications
- Spark application using dynamic allocation fails with java.lang.AssertionError
CVE-2019-10099: Apache Spark local files left unencrypted
Certain operations in Spark leave local files unencrypted on disk, even when local file encryption is enabled with “spark.io.encryption.enabled”.
This includes cached blocks that are fetched to disk (controlled by spark.maxRemoteBlockSizeFetchToMem) in the following cases:
- In SparkR when parallelize is used
- In PySpark when broadcast and parallelize are used
- In PySpark when Python UDFs are used
Products affected:
- CDH
- CDS Powered by Apache Spark
Releases affected:
- CDH 5.15.1 and earlier
- CDH 6.0.0
- CDS 2.1.0 release 1 and release 2
- CDS 2.2.0 release 1 and release 2
- CDS 2.3.0 release 3
Users affected: All users who run Spark on CDH and CDS in a multi-user environment.
Date/time of detection: July 2018
Severity (Low/Medium/High): 6.3 Medium (CVSS AV:L/AC:H/PR:N/UI:R/S:U/C:H/I:H/A:N)
Impact: Unencrypted data accessible.
CVE: CVE-2019-10099
Immediate action required: Upgrade to a version of CDH containing the fix.
Workaround: Do not use PySpark or the fetch-to-disk options described above.
Fixed Versions:
- CDH 5.15.2
- CDH 5.16.0
- CDH 6.0.1
- CDS 2.1.0 release 3
- CDS 2.2.0 release 3
- CDS 2.3.0 release 4
Partitioned tables in the Avro format with upper case column names return NULL values from Spark SQL
When you use Spark SQL to query external partitioned Hive tables created in the Avro format that contain uppercase column names, Spark SQL returns NULL values for those columns.
Workaround: In Spark 1.6, for each column name, create an alias that does not contain uppercase characters.
Affected Versions: CDH 5.8.2 - CDH 5.14.3
Fixed Versions: CDH 5.14.4 and later
Apache Bug: SPARK-15848
Cloudera Bug: CDH-72396, CDH-45978
Missing results in Hive, Spark, Pig, Custom MapReduce jobs, and other Java applications when filtering Parquet data written by Impala
Apache Hive and Apache Spark rely on Apache Parquet's parquet-mr Java library to perform filtering of Parquet data stored in row groups. Those row groups contain statistics that make the filtering efficient without having to examine every value within the row group.
Recent versions of the parquet-mr library contain a bug described in PARQUET-1217. This bug causes filtering to behave incorrectly if only some of the statistics for a row group are written. Starting in CDH 5.13, Apache Impala populates statistics in this way for Parquet files. As a result, Hive and Spark may incorrectly filter Parquet data that is written by Impala.
In CDH 5.13, Impala started writing Parquet's null_count metadata field without writing the min and max fields. This is valid, but it triggers the PARQUET-1217 bug in the predicate push-down code of the Parquet Java library (parquet-mr). If the null_count field is set to a non-zero value, parquet-mr assumes that min and max are also set and reads them without checking whether they are actually there. If those fields are not set, parquet-mr reads their default value instead.
For integer SQL types, the default value is 0, so parquet-mr incorrectly assumes that the min and max values are both 0. This causes the problem when filtering data. Unless the value 0 itself matches the search condition, all row groups are discarded due to the incorrect min/max values, which leads to missing results.
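For illustration, the following minimal Scala sketch shows how the missing results can surface in Spark; the table name impala_written_sales and the column amount are hypothetical, and the table is assumed to hold Parquet data written by Impala in CDH 5.13 or later with NULL values in the integer column:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-1217-illustration"))
val hsc = new HiveContext(sc)

// With predicate push-down enabled (the default), parquet-mr sees a non-zero
// null_count, assumes min = max = 0 for the integer column, and discards row
// groups that actually contain matching values.
val matches = hsc.sql("SELECT * FROM impala_written_sales WHERE amount = 100")
matches.count()  // Can return 0 even though rows with amount = 100 exist.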
Products affected:
- Hive
- Spark
- Pig
- Custom MapReduce jobs
Releases affected:
- CDH 5.13.0, 5.13.1, 5.13.2, and 5.14.0
- CDS 2.2 Release 2 Powered by Apache Spark and earlier releases on CDH 5.13.0 and later
Who Is Affected: Anyone writing Parquet files with Impala and reading them back with Hive, Spark, or other Java-based components that use the parquet-mr libraries for reading Parquet files.
Severity (Low/Medium/High): High
Impact: Parquet files containing null values for integer fields written by Impala produce missing results in Hive, Spark, and other Java applications when filtering by the integer field.
Upgrade: Upgrade to one of the fixed maintenance releases listed below.
Workaround: This issue can be avoided, at the cost of performance, by disabling predicate push-down optimizations:
- In Hive, use the following SET command:
  SET hive.optimize.ppd = false;
- In Spark, disable the following configuration setting:
  --conf spark.sql.parquet.filterPushdown=false
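The Spark setting can also be applied programmatically. The following is a minimal Scala sketch, assuming an existing SparkContext named sc (as in the other examples on this page); the table and column names are hypothetical:
import org.apache.spark.sql.hive.HiveContext

val hsc = new HiveContext(sc)

// Equivalent to passing --conf spark.sql.parquet.filterPushdown=false to
// spark-submit: every row group is scanned instead of being pruned from
// row-group statistics, so results are complete but queries run slower.
hsc.setConf("spark.sql.parquet.filterPushdown", "false")
hsc.sql("SELECT * FROM impala_written_sales WHERE amount = 100").show()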
Fixed Versions:
- CDH 5.13.3 and higher
- CDH 5.14.2 and higher
- CDH 5.15.0 and higher
- CDS 2.3 Release 2 and higher
For the latest update on this issue, see the corresponding Knowledge Base article.
Apache Spark experimental features are not supported unless specifically identified as supported
In general, Cloudera does not provide support for Apache Spark features or APIs that are identified as experimental.
Spark 2.x not supported on unmanaged clusters
Launching Spark 2.x jobs from machines not managed by Cloudera Manager is not supported.
ADLS not Supported for All Spark Components
Microsoft Azure Data Lake Store (ADLS) is a cloud-based filesystem that you can access through Spark applications. Spark with Kudu is not currently supported for ADLS data. (Hive-on-Spark is available for ADLS in CDH 5.12 and higher.)
IPython / Jupyter notebooks not supported
- The IPython notebook system (renamed to Jupyter as of IPython 4.0) is not supported.
Certain Spark Streaming features not supported
- The mapWithState method is unsupported because it is a nascent unstable API.
- Affected Versions: All CDH 5 versions
- Bug: SPARK-2629.
Certain Spark SQL features not supported
- Thrift JDBC/ODBC server
- Spark SQL CLI
Spark Dataset API not supported
Cloudera distribution of Spark 1.6 does not support the Spark Dataset API. However, Spark 2.0 and higher supports the Spark Dataset API.
JDBC Datasource API not supported
Using the JDBC Datasource API to access Hive or Impala is not supported.
GraphX not supported
Cloudera does not support GraphX.
Affected Versions: All CDH 5 versions.
SparkR not supported
Cloudera does not support SparkR.
Affected Versions: All CDH 5 versions.
Scala 2.11 not supported
Spark in CDH 5.x does not support Scala 2.11 because Scala 2.11 is binary incompatible with Scala 2.10 and the Scala 2.11 builds of Spark are not yet full-featured. However, Spark 2.0 and higher supports Scala 2.11.
Affected Versions: All CDH 5 versions.
Spark Streaming cannot consume from secure Kafka until it starts using the Kafka 0.9 Consumer API
Bug: SPARK-12177.
Workaround: Use Spark 2.
Tables saved with the Spark SQL DataFrame.saveAsTable method are not compatible with Hive
Writing a DataFrame directly to a Hive table creates a table that is not compatible with Hive; the metadata stored in the metastore can only be correctly interpreted by Spark. For example:
val hsc = new HiveContext(sc)
import hsc.implicits._
val df = sc.parallelize(data).toDF()
df.write.format("parquet").saveAsTable(tableName)
creates a table with this metadata:
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
This also occurs when using an explicit schema, such as:
val schema = StructType(Seq(...))
val data = sc.parallelize(Seq(Row(...), …))
val df = hsc.createDataFrame(data, schema)
df.write.format("parquet").saveAsTable(tableName)
Affected Versions: CDH 5.5 and higher
Workaround: Explicitly create a Hive table to store the data. For example:
df.registerTempTable(tempName)
hsc.sql(s"""
  CREATE TABLE $tableName (
    // field definitions
  )
  STORED AS $format
""")
hsc.sql(s"INSERT INTO TABLE $tableName SELECT * FROM $tempName")
Cloudera Bug: CDH-33639
Cannot create Parquet tables containing date fields in Spark SQL
Creating a Parquet table that contains date fields in Spark SQL fails with an exception similar to the following:
Exception in thread "main" org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.UnsupportedOperationException: Parquet does not support date.
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:433)
This is due to a limitation (HIVE-6384) in the version of Hive (1.1) included in CDH 5.5.0.
Cloudera Bug: CDH-33640
Spark SQL does not support the union type
Tables containing union fields cannot be read or created using Spark SQL.
Affected Versions: CDH 5.5.0 and higher
Cloudera Bug: CDH-33641
Spark SQL does not respect size limit for the varchar type
Spark SQL treats varchar as string (that is, there is no size limit). The observed behavior is that Spark reads and writes these columns as regular strings; if inserted values exceed the size limit, no error will occur. The data will be truncated when read from Hive, but not when read from Spark.
Affected Versions: CDH 5.5.0 and higher
Bug: SPARK-5918
Cloudera Bug: CDH-33642
Spark SQL does not support the char type
Spark SQL does not support the char type (fixed-length strings). As with the union type, tables containing char fields cannot be read or created using Spark SQL.
Affected Versions: CDH 5.5.0 to CDH 5.6.1
Fixed in Versions: CDH 5.7.0 and higher
Cloudera Bug: CDH-33643
Spark SQL does not support transactional tables
Spark SQL does not support Hive transactions ("ACID").
Affected Versions: CDH 5.5.0 and higher
Cloudera Bug: CDH-33644
Spark SQL does not prevent you from writing key types not supported by Avro tables
Spark allows you to declare DataFrames with any key type, but Avro supports only string keys; attempting to write any other key type to an Avro table fails.
Affected Versions: CDH 5.5.0 and higher
Cloudera Bug: CDH-33648
Spark SQL does not support timestamp in Avro tables
Affected Versions: CDH 5.5.0 and higher
Cloudera Bug: CDH-33649
Spark SQL does not support all ‘ANALYZE TABLE COMPUTE STATISTICS’ syntax
ANALYZE TABLE <table name> COMPUTE STATISTICS NOSCAN works. ANALYZE TABLE <table name> COMPUTE STATISTICS (without NOSCAN) and ANALYZE TABLE <table name> COMPUTE STATISTICS FOR COLUMNS both return errors.
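As a brief illustration, the supported NOSCAN form can be issued through HiveContext, assuming an existing SparkContext named sc; the table name my_table is hypothetical:
import org.apache.spark.sql.hive.HiveContext

val hsc = new HiveContext(sc)

// Supported: gathers statistics that do not require scanning the data,
// such as file counts and sizes.
hsc.sql("ANALYZE TABLE my_table COMPUTE STATISTICS NOSCAN")

// Not supported; both of the following return errors in this version:
// hsc.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")
// hsc.sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS")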
Affected Versions: CDH 5.5.0 and higher
Cloudera Bug: CDH-33650
Spark SQL statements that can result in table partition metadata changes may fail
Because Spark does not have access to Sentry data, it may not know that a user has permission to execute an operation, and may fail the operation instead. SQL statements that can result in table partition metadata changes, such as ALTER TABLE or INSERT, may fail.
Affected Versions: CDH 5.5.0 to CDH 5.6.1
Fixed in Versions: 5.7.0 and higher
Cloudera Bug: CDH-33446
Spark SQL does not respect Sentry ACLs when communicating with Hive metastore
Even if a user is configured via Sentry not to have read permission to a Hive table, a Spark SQL job running as that user can still read the table's metadata directly from the Hive metastore.
Cloudera Bug: CDH-33658
Dynamic allocation and Spark Streaming
If you are using Spark Streaming, Cloudera recommends that you disable dynamic allocation by setting spark.dynamicAllocation.enabled to false when running streaming applications.
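The following is a minimal Scala sketch of disabling dynamic allocation for a streaming application; the application name and batch interval are arbitrary examples:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-example")
  .set("spark.dynamicAllocation.enabled", "false")  // disable dynamic allocation

val ssc = new StreamingContext(conf, Seconds(10))
// ... define input streams and transformations here ...
ssc.start()
ssc.awaitTermination()
The same setting can also be passed on the command line with --conf spark.dynamicAllocation.enabled=false.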
Spark uses Akka version 2.2.3
The CDH 5.5 version of Spark 1.5 differs from the Apache Spark 1.5 release in using Akka version 2.2.3, the version used by Spark 1.1 and CDH 5.2. Apache Spark 1.5 uses Akka version 2.3.11.
Spark standalone mode does not work on secure clusters
Cloudera Bug: CDH-16998, "fixed" by workaround.
Workaround: On secure clusters, run Spark applications on YARN.
Limitation with Region Pruning for HBase Tables
When SparkSQL accesses an HBase table through the HiveContext, region pruning is not performed. This limitation can result in slower performance for some SparkSQL queries against tables that use the HBase SerDes than when the same table is accessed through Impala or Hive.
Input Format Change on spark.yarn.am.waitTime
The spark.yarn.am.waitTime property does not accept time values in otherwise valid formats, such as "250s".
Affected Versions: CDH 5.10.0 to CDH 5.10.2
Fixed in Versions: CDH 5.10.3, CDH 5.11.3, CDH 5.12.2, CDH 5.13.0 and higher
Workaround: Specify the value in milliseconds, without the "s" suffix. For example, use 250000 instead of 250s.
History link in ResourceManager web UI broken for killed Spark applications
When a Spark application is killed, the history link in the ResourceManager web UI does not work.
Workaround: To view the history for a killed Spark application, see the Spark HistoryServer web UI instead.
Affected Versions: All CDH versions
Apache Issue: None
Cloudera Issue: CDH-49165
Spark application using dynamic allocation fails with java.lang.AssertionError
When dynamic allocation is enabled, Spark applications can fail with a java.lang.AssertionError: assertion failed error when the number of running executors exceeds the target number of executors. For example, this can occur if an application has many small tasks and Spark requests a large number of executors before reducing the target, triggering a race condition that results in an error message similar to the following:
ERROR yarn.YarnAllocator: Failed to launch executor 16 on container container_e11_1513498336069_33453_01_000022
java.lang.AssertionError: assertion failed
Workaround: Disable dynamic allocation by specifying spark.dynamicAllocation.enabled=false for the problematic job.
Affected Versions:
- CDH versions lower than 5.12
- CDH 5.12.0, 5.12.1
- CDH 5.13.0
Fixed Versions:
- CDH 5.12.2
- CDH 5.13.1
- CDH 5.14 and higher
Apache Issue: SPARK-17511
Cloudera Issue: CDH-56799