Fixed Issues in Spark

Review the list of Spark issues that are resolved in Cloudera Runtime 7.2.18.

CDPD-3038: Launching pyspark displays several HiveConf warning messages
When pyspark starts, several Hive configuration warning messages are displayed, similar to the following:
19/08/09 11:48:04 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.checked.expressions does not exist
19/08/09 11:48:04 WARN conf.HiveConf: HiveConf of name hive.tez.cartesian-product.enabled does not exist
These warnings can be safely ignored. This issue is now fixed.
CDPD-65717: SPARK-46793 Revert S3A endpoint fixup logic of SPARK-35878
Reverted the S3A endpoint fixup logic of SPARK-35878 (SPARK-46793).
CDPD-64638: Slowness / broadcast timeout issues due to SPARK-33290: REFRESH TABLE should invalidate cache even though the table itself may not be cached (Spark 2.4.8)
Slowness or broadcast timeouts could occur in Spark 2.4.8 because of SPARK-33290, which made REFRESH TABLE invalidate the cache even when the table itself is not cached. A new spark.sql.legacy.refreshOnlyCachedTables feature flag has been introduced to restore the behavior prior to Spark 2.4.8. When the flag is set to false (default), REFRESH TABLE invalidates the cache even though the table itself may not be cached, as introduced by SPARK-33290 in Spark 2.4.8. When the flag is set to true, the behavior prior to Spark 2.4.8 is restored. The fix was manually tested with customer data that had caused the timeout and slowness issues.
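As an illustration of how the flag might be supplied at launch time, the following sketch renders it as --conf arguments. The helper names legacy_refresh_conf and to_submit_args are hypothetical. Only the property name and its true/false meaning are taken from the description above.

```python
# Property name taken from the release note; the helpers are illustrative only.
REFRESH_FLAG = "spark.sql.legacy.refreshOnlyCachedTables"


def legacy_refresh_conf(restore_pre_248_behavior):
    """Return a Spark conf dict; True restores the pre-2.4.8 REFRESH TABLE behavior."""
    return {REFRESH_FLAG: "true" if restore_pre_248_behavior else "false"}


def to_submit_args(conf):
    """Render a conf dict as a flat list of --conf key=value arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    # Print the arguments that would opt in to the legacy behavior.
    print(" ".join(to_submit_args(legacy_refresh_conf(True))))
```

The resulting arguments can be appended to a spark-submit or pyspark command line. Leaving the flag unset keeps the default (false).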
CDPD-64546: Performance: Spark TPCDS Queries are slower in 7.2.18 compared to 7.2.17
Fixed by disabling checksum on the client side while reading data. Read performance is now similar to what it was earlier, with no regression.
CDPD-61564: Spark - Caused by: java.lang.NoClassDefFoundError: org/datanucleus/store/query/cache/QueryCompilationCache
Upgraded the datanucleus-core dependency to 5.2.10.
CDPD-57535: Revert: CDPD-48171: Temporary workaround pinning snakeyaml to 2.0 not vulnerable to CVE-2022-1471
Reverted the temporary workaround that pinned snakeyaml to 2.0. The snakeyaml Representer constructor has been added back. The other reverted constructors can be found here: https://bitbucket.org/snakeyaml/snakeyaml/commits/3e755d254aeaa902675053047fd53368a175565a/raw
CDPD-58558: Simple DML insert into table via spark3-shell sparks.sql is creating orphan spark_process in atlas
Spark no longer creates a spark_process entity in Atlas for INSERT INTO ... VALUES ... statements. Only the INSERT INTO ... SELECT ... action may create a spark_process entity in Atlas, as described in the official documentation: https://docs.cloudera.com/cdp-private-cloud-base/7.1.8/atlas-reference/topics/atlas-spark-actions.html https://docs.cloudera.com/runtime/7.2.17/atlas-reference/topics/atlas-spark-actions.html
CDPD-58191: Spark - Upgrade kubernetes library to 5.7.4/5.8.1/5.10.2/5.11.2+ due to CVE-2021-4178
Upgraded the kubernetes-client dependency to 5.7.4.
CDPD-58080: Backport SPARK-32951 to Spark 2
SPARK-32951 Foldable propagation from Aggregate
CDPD-56594: Lineage (spark_process) is not created for views created on iceberg tables
Added CREATE VIEW lineage support to the Spark Atlas Connector for Spark 3, which is required for Iceberg tables.
CDPD-56342: Upgrade Parquet to 1.12.3 in Spark
Upgraded the Parquet dependency to 1.12.3.
CDPD-55243: Fix case sensitivity of Iceberg's CachingCatalog
Previously, using inconsistent casing for the database and table names of Iceberg tables in queries could lead to Spark reading a stale cached snapshot after a write to the table (append, update, delete) in the same Spark session. Now the cache is insensitive to the case of database and table names and is always refreshed on a write in the session.
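The idea behind the fix can be sketched with a toy cache that normalizes names before using them as keys. This is not Iceberg's CachingCatalog implementation. The class CaseInsensitiveTableCache and its methods are hypothetical and only model the behavior described above.

```python
class CaseInsensitiveTableCache:
    """Toy cache keyed by lower-cased (database, table) names."""

    def __init__(self, loader):
        # loader is a callable(database, table) returning the current snapshot.
        self._loader = loader
        self._entries = {}

    @staticmethod
    def _key(database, table):
        # Normalizing the case makes "DB.Events" and "db.events" share one entry.
        return (database.lower(), table.lower())

    def load(self, database, table):
        """Return the cached snapshot, loading it on first access."""
        key = self._key(database, table)
        if key not in self._entries:
            self._entries[key] = self._loader(*key)
        return self._entries[key]

    def invalidate(self, database, table):
        """Drop the entry after a write, whatever casing the writer used."""
        self._entries.pop(self._key(database, table), None)


if __name__ == "__main__":
    snapshots = {("db", "events"): 1}
    cache = CaseInsensitiveTableCache(lambda d, t: snapshots[(d, t)])
    cache.load("DB", "Events")          # caches snapshot 1
    snapshots[("db", "events")] = 2     # simulate a write to the table
    cache.invalidate("Db", "EVENTS")    # write path invalidates the shared entry
    print(cache.load("db", "events"))   # reads the fresh snapshot
```

With case-sensitive keys, the invalidation in this example would miss the entry cached under a different spelling, which corresponds to the stale read described above.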
CDPD-55116: Fix Spark vulnerability CVE-2023-22946
This fix disallows the spark.submit.deployMode and spark.submit.proxyUser.allowCustomClasspathInClusterMode Spark configurations in the Livy create session REST API. A new Livy configuration, livy.server.session.allow-custom-classpath, has been added to allow a custom class path. To disable or roll back this fix, set livy.server.session.allow-custom-classpath to true in the Livy configuration through the Cloudera Manager safety valve.
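The check described above can be approximated with a small client-side sketch that inspects a create session request before it is sent. It assumes the request body carries Spark settings in a conf map. The function blocked_confs_in_request is hypothetical and is not Livy's implementation. How the server responds to a disallowed key is not stated here.

```python
# Spark configuration keys named in the fix description.
BLOCKED_SPARK_CONFS = (
    "spark.submit.deployMode",
    "spark.submit.proxyUser.allowCustomClasspathInClusterMode",
)


def blocked_confs_in_request(payload, allow_custom_classpath=False):
    """Return the disallowed Spark conf keys present in a create session payload.

    allow_custom_classpath stands in for livy.server.session.allow-custom-classpath;
    when it is True the check is skipped, mirroring the rollback option.
    """
    if allow_custom_classpath:
        return []
    conf = payload.get("conf") or {}
    return [key for key in BLOCKED_SPARK_CONFS if key in conf]


if __name__ == "__main__":
    request = {
        "kind": "pyspark",
        "conf": {
            "spark.submit.deployMode": "cluster",
            "spark.executor.memory": "2g",
        },
    }
    print(blocked_confs_in_request(request))
```

Running such a check before submitting can help identify which settings to remove from existing session templates after the upgrade.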
CDPD-44454: MAPREDUCE-7432. Make manifest committer default on abfs and gcs stores
MAPREDUCE-7432. Make manifest committer default on abfs and gcs stores
CDPD-44227: Ranger improvement - Roles Import/export API for ranger admin
Added a roles import/export API for Ranger Admin.

Apache patch information

  • SPARK-46793
  • SPARK-39441
  • SPARK-32951
  • LIVY-975
  • MAPREDUCE-7432