Known Issues in Apache Spark
This topic describes known issues and workarounds for using Spark in this release of Cloudera Runtime.
- CDPD-22670 and CDPD-23103: The "Atlas dependency" and "spark_lineage_enabled" configurations conflict
- Spark has two configurations, "Atlas dependency" and "spark_lineage_enabled", that conflict when the Atlas dependency is turned off but spark_lineage_enabled is turned on. In that state, Spark applications log error messages and cannot continue. To recover, correct the configurations, redeploy the client configurations, and restart the Spark service.
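As a hedged illustration only: assuming the spark_lineage_enabled toggle corresponds to the spark.lineage.enabled property in the Spark client configuration (an assumption; the exact mapping depends on your Cloudera Manager version), a consistent spark-defaults.conf with the Atlas dependency turned off would contain:

  # Assumption: spark_lineage_enabled maps to spark.lineage.enabled.
  # With the Atlas dependency off, lineage must also be off so the two settings agree.
  spark.lineage.enabled=false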
- CDPD-217: HBase/Spark connectors are not supported
- The Spark HBase Connector (SHC) from HDP and the hbase-spark module from CDH are not supported.
- CDPD-3038: Launching pyspark displays several HiveConf warning messages
- When pyspark starts, several Hive configuration warning messages are displayed, similar to the following:
  19/08/09 11:48:04 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.checked.expressions does not exist
  19/08/09 11:48:04 WARN conf.HiveConf: HiveConf of name hive.tez.cartesian-product.enabled does not exist
- CDPD-3293: Cannot create views (CREATE VIEW statement) from Spark
- Apache Ranger in CDP disallows Spark users from running CREATE VIEW statements, as sketched below.
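A minimal pyspark sketch of the blocked operation (the view and table names are hypothetical):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()
  # Under Ranger authorization in CDP, this statement is rejected for Spark users.
  spark.sql("CREATE VIEW sales_view AS SELECT * FROM sales")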
- CDPD-11720: HDFS ACLs not set on Hive external warehouse if Impala is not on cluster
- If Impala is installed on the cluster, Impala sets HDFS ACLs on both the managed and external Hive warehouse. This allows Spark to write to tables created in the Hive external warehouse. If Impala is not installed, then these HDFS ACLs are not set, and Spark is not able to write to external tables created by Hive.
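One quick way to check whether the ACLs are present is to inspect the external warehouse directory; the path below is the CDP default and may differ on your cluster:

  hdfs dfs -getfacl /warehouse/tablespace/external/hive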
- CDPD-12622: Sentry GRANTs given to a role pre-migration do not work post-migration in the Spark shell
- Spark requires SELECT privileges on the default database, regardless of the databases referenced in the query.
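A minimal sketch of the behavior (the database and table names are hypothetical): even a query that never references the default database fails without SELECT on it.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.enableHiveSupport().getOrCreate()
  # Fails unless the user also holds SELECT on the `default` database,
  # even though the query only references mydb.
  spark.sql("SELECT * FROM mydb.mytable").show()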
Technical Service Bulletins
- TSB 2021-441: CDP Powered by Apache Spark may incorrectly read/write pre-Gregorian timestamps
- Spark may incorrectly read or write TIMESTAMP data for values before the start of the Gregorian calendar ('1582-10-15 00:00:00.0'). This can happen when Spark is:
  - Using dynamic partition inserts
  - Reading or writing from an ORC table when:
    - the spark.sql.hive.convertMetastoreOrc property is set to false (its default value is true), or
    - the spark.sql.hive.convertMetastoreOrc property is set to true but the spark.sql.orc.impl property is set to hive (its default is native)
  - Reading or writing from a Parquet table when:
    - the spark.sql.hive.convertMetastoreParquet property is set to false (its default value is true)
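As a hedged sketch, the following pyspark lines print the three settings that determine exposure, supplying the defaults noted above as fallbacks:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  # ORC tables are affected when convertMetastoreOrc is false, or when it is
  # true but spark.sql.orc.impl is set to hive.
  print(spark.conf.get("spark.sql.hive.convertMetastoreOrc", "true"))
  print(spark.conf.get("spark.sql.orc.impl", "native"))
  # Parquet tables are affected when convertMetastoreParquet is false.
  print(spark.conf.get("spark.sql.hive.convertMetastoreParquet", "true"))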
- Knowledge article
- For the latest update on this issue, see the corresponding Knowledge article: TSB 2021-441: Spark may incorrectly read/write pre-Gregorian timestamps