Known Issues in Apache Spark

This topic describes known issues and workarounds for using Spark in this release of Cloudera Runtime.

CDPD-22670 and CDPD-23103: Spark applications fail when the "Atlas dependency" and "spark_lineage_enabled" configurations conflict
The two Spark configurations "Atlas dependency" and "spark_lineage_enabled" conflict when the Atlas dependency is turned off but spark_lineage_enabled is turned on. In this state, running a Spark application causes Spark to log error messages and fail.
Workaround: Correct the configurations so they agree, redeploy the client configurations, and restart the Spark service.
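As a sketch of a consistent state, if the Atlas dependency is to stay off, lineage collection should be off as well. The property name below, spark.lineage.enabled, is assumed to be the client-side setting that the spark_lineage_enabled switch controls; verify the name in your generated spark-defaults.conf:

# spark-defaults.conf: keep lineage collection off while the Atlas dependency is off
spark.lineage.enabled false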
CDPD-217: HBase/Spark connectors are not supported
The Spark HBase Connector (SHC) from HDP and the hbase-spark module from CDH are not supported.
Workaround: Migrate to the Apache HBase Connectors integration for Apache Spark (hbase-connectors/spark) available in CDP. More details on working with HBase data from Spark in CDP are available in the Cloudera Community article, HBase and Spark in CDP.
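As an illustration, a minimal PySpark sketch of writing a DataFrame through the hbase-connectors/spark data source. The table name, column family, and column mapping are hypothetical, and the sketch assumes the connector jars are already on the application classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hbase-connectors-example").getOrCreate()

# Hypothetical rows: 'key' maps to the HBase row key, 'name' to column family cf0.
df = spark.createDataFrame([("row1", "alice"), ("row2", "bob")], ["key", "name"])

df.write.format("org.apache.hadoop.hbase.spark") \
    .option("hbase.columns.mapping", "key STRING :key, name STRING cf0:name") \
    .option("hbase.table", "example_table") \
    .option("hbase.spark.use.hbasecontext", False) \
    .save()

The target HBase table (here example_table with column family cf0) must exist before the write.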
CDPD-3038: Launching pyspark displays several HiveConf warning messages
When pyspark starts, several Hive configuration warning messages are displayed, similar to the following:
19/08/09 11:48:04 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.checked.expressions does not exist
19/08/09 11:48:04 WARN conf.HiveConf: HiveConf of name hive.tez.cartesian-product.enabled does not exist
Workaround: These warning messages can be safely ignored.
CDPD-3293: Cannot create views (CREATE VIEW statement) from Spark
Apache Ranger in CDP disallows Spark users from running CREATE VIEW statements.
Workaround: Create the view using Hive or Impala.
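For example, the view can be created from Beeline instead (the database, view, and table names here are hypothetical):

CREATE VIEW sales.order_totals AS
SELECT order_id, SUM(amount) AS total
FROM sales.orders
GROUP BY order_id;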
CDPD-11720: HDFS ACLs not set on Hive external warehouse if Impala is not on cluster
If Impala is installed on the cluster, Impala sets HDFS ACLs on both the managed and external Hive warehouse. This allows Spark to write to tables created in the Hive external warehouse. If Impala is not installed, then these HDFS ACLs are not set, and Spark is not able to write to external tables created by Hive.
Workaround: Set HDFS ACLs manually.
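For example, ACLs similar to the following could be applied with the hdfs dfs -setfacl command; the path shown assumes the default CDP external warehouse location, and the user name spark is a placeholder for whichever users run the Spark jobs:

hdfs dfs -setfacl -R -m user:spark:rwx /warehouse/tablespace/external/hive
hdfs dfs -setfacl -R -m default:user:spark:rwx /warehouse/tablespace/external/hive

The second command sets a default ACL so that directories created later under the warehouse inherit the same access.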
CDPD-12622: Sentry GRANTs given to a role before migration do not work in Spark Shell after migration
Spark requires SELECT privileges on the default database, regardless of the databases referenced in the query.
Workaround: Add SELECT privileges in Ranger on the default database for all users who will run Spark queries.
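For example, assuming the Ranger Hive plugin in your cluster accepts SQL-style grants through Beeline (the user name etl_user is hypothetical; the equivalent policy can also be created directly in the Ranger Admin UI):

GRANT SELECT ON DATABASE default TO USER etl_user;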

Technical Service Bulletins

TSB 2021-441: CDP Powered by Apache Spark may incorrectly read/write pre-Gregorian timestamps
Spark may incorrectly read or write TIMESTAMP data for values before the start of the Gregorian calendar ('1582-10-15 00:00:00.0'). This could happen when Spark is:
  • Using dynamic partition inserts
  • Reading from or writing to an ORC table when either:
    • the spark.sql.hive.convertMetastoreOrc property is set to false (its default value is true), or
    • the spark.sql.hive.convertMetastoreOrc property is set to true but the spark.sql.orc.impl property is set to hive (its default value is native)
  • Reading from or writing to a Parquet table when:
    • the spark.sql.hive.convertMetastoreParquet property is set to false (its default value is true)
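As an illustration, a minimal PySpark sketch for checking which of these properties are in effect in a running session; the fallback values passed to conf.get are the defaults quoted in the list above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fallback values below are the defaults quoted in the property descriptions above.
for key, default in [
    ("spark.sql.hive.convertMetastoreOrc", "true"),
    ("spark.sql.orc.impl", "native"),
    ("spark.sql.hive.convertMetastoreParquet", "true"),
]:
    print(key, "=", spark.conf.get(key, default))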
Knowledge article
For the latest update on this issue, see the corresponding Knowledge article: TSB 2021-441: Spark may incorrectly read/write pre-Gregorian timestamps