Known Issues in Apache Spark
Learn about the known issues in Spark, their impact on functionality, and the available workarounds.
- DOCS-9260: The Spark version in CDP Private Cloud 7.1.6 is 2.4.5. However, the jar names still carry the version number 2.4.0, so Maven repositories also refer to the Spark version as 2.4.0, even though the Spark content is 2.4.5.
- CDPD-23817: In an upgraded cluster, the permissions on /tmp/spark are restricted because of the HDP configuration hive.exec.scratchdir=/tmp/spark.
- If you are using the /tmp/spark directory in the CDP cluster, you must grant the required additional policy/ACL permissions.
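As a sketch of one way to grant such permissions, assuming HDFS ACLs are enabled and that `sparkuser` is a placeholder for your application user (in practice a Ranger policy may be preferable), you could add an ACL entry on the scratch directory:

```shell
# Hypothetical example: grant rwx on /tmp/spark to the application user.
# "sparkuser" is a placeholder; adjust to your environment.
hdfs dfs -setfacl -m user:sparkuser:rwx /tmp/spark

# Verify the new ACL entry.
hdfs dfs -getfacl /tmp/spark
```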
- CDPD-22670 and CDPD-23103: The two Spark configurations "Atlas dependency" and "spark_lineage_enabled" can conflict. The issue occurs when the Atlas dependency is turned off but spark_lineage_enabled is turned on.
- If you run a Spark application in this state, Spark logs error messages and cannot continue. To recover, correct the configurations, restart the Spark component, and redeploy the client configurations.
- CDPD-23007: Mismatch in the Spark default DB location. In HDP 3.1.5, hive_db entities have one attribute, 'location', which is configured to the '/managed' path. In a fresh install of CDP 7.1.6, hive_db entities have two attributes: 'location', configured to the '/external' path, and 'managedLocation', configured to the '/managed' path. In the AM2CM migration (HDP 3.1.5 -> CDP 7.1.6), the 'location' attribute of hive_db entities comes over from HDP 3.1.5 unaltered, and hence still maps to the '/managed' path.
- This issue arises only if you are upgrading from HDP 3.1.5 to CDP 7.1.6. If you are performing a fresh install of CDP 7.1.6, you can ignore this issue.
- CDPD-217: HBase/Spark connectors are not supported
- The Apache HBase Spark Connector (hbase-connectors/spark) and the Apache Spark - Apache HBase Connector (shc) are not supported in the initial CDP release.
- CDPD-3038: Launching pyspark displays several HiveConf warning messages
- When pyspark starts, several Hive configuration warning messages are displayed, similar to the following:
19/08/09 11:48:04 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.checked.expressions does not exist
19/08/09 11:48:04 WARN conf.HiveConf: HiveConf of name hive.tez.cartesian-product.enabled does not exist
- These warnings can be safely ignored.
- CDPD-2650: Spark cannot write ZSTD and LZ4 compressed Parquet to dynamically partitioned tables
- Use a different compression algorithm.
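As a sketch of this workaround, you can fall back to Snappy compression via the `spark.sql.parquet.compression.codec` setting when writing Parquet to dynamically partitioned tables (the application name below is a placeholder):

```shell
# Sketch: override the Parquet codec so writes to dynamically partitioned
# tables avoid ZSTD/LZ4. "my_app.py" is a placeholder for your application.
spark-submit \
  --conf spark.sql.parquet.compression.codec=snappy \
  my_app.py
```

The same setting can be applied per session with `spark.conf.set(...)` before the write.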
- CDPD-3293: Cannot create views (CREATE VIEW statement) from Spark
- Apache Ranger in CDP disallows Spark users from running CREATE VIEW statements.
- Create the view using Hive or Impala.
- CDPD-3783: Cannot create databases from Spark
- Attempting to create a database using Spark results in an error similar to the following:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [sparkuser] does not have [ALL] privilege on [hdfs://ip-10-1-2-3.cloudera.site:8020/tmp/spark/warehouse/spark_database.db]);
- Create the database using Hive or Impala, or specify the external data warehouse location in the create command. For example:
sql("create database spark_database location '/warehouse/tablespace/external/hive/spark_database.db'")