Known Issues in Apache Spark

Learn about the known issues in Apache Spark, their impact on functionality, and the available workarounds.

Known Issues identified in Cloudera Runtime 7.3.1.700 SP3 CHF 2

There are no new known issues identified in this release.

Known Issues identified in Cloudera Runtime 7.3.1.600 SP3 CHF 1

The following section lists the known issues identified in this release:

CDPD-95322: Missing Atlas lineage for Spark Iceberg tables from MERGE INTO
7.3.1, 7.3.1.100, 7.3.1.200, 7.3.1.300, 7.3.1.400, 7.3.1.500, 7.3.1.600
Spark SQL MERGE INTO statements on Iceberg tables do not send lineage data to Atlas.
None.
CDPD-94393: "RuntimeWarning: Failed to add file" message appears even when Spark successfully loads files
7.3.1, 7.3.1.100, 7.3.1.200, 7.3.1.300, 7.3.1.400, 7.3.1.500, 7.3.1.600
In both Spark 2 and Spark 3, an exception raised while adding files to the Python path causes the "RuntimeWarning: Failed to add file" message to appear even when the Python JAR file is successfully loaded.
None. You can safely ignore the message as the file is loaded successfully and the message does not affect job completion.
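If the message clutters driver output, one option is to filter it out in the PySpark application itself. The following is a minimal sketch, not a documented workaround. It assumes the message is raised through Python's standard warnings module in the driver process and that its text begins with "Failed to add file"; the helper name is hypothetical. If the message reaches your logs by another route, the filter has no effect.

```python
import warnings


def ignore_failed_to_add_file_warning():
    """Hide only RuntimeWarnings whose text starts with 'Failed to add file'.

    Assumption: the message is emitted with warnings.warn(..., RuntimeWarning)
    in the driver's Python process. Other RuntimeWarnings stay visible.
    """
    warnings.filterwarnings(
        "ignore",
        message=r"Failed to add file",  # regex matched against the start of the text
        category=RuntimeWarning,
    )


# Install the filter before the SparkSession or SparkContext is created,
# because the files are added to the Python path during context startup.
ignore_failed_to_add_file_warning()
```

The filter is deliberately narrow, so unrelated runtime warnings continue to be reported.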

Known Issues identified in Cloudera Runtime 7.3.1.500 SP3

There are no new known issues identified in this release.

Known Issues identified in Cloudera Runtime 7.3.1.400 SP2

There are no new known issues identified in this release.

Known Issues identified in Cloudera Runtime 7.3.1.300 SP1 CHF1

There are no new known issues identified in this release.

Known Issues identified in Cloudera Runtime 7.3.1.200 SP1

There are no new known issues identified in this release.

Known Issues identified in Cloudera Runtime 7.3.1.100 CHF1

The following section lists the known issues identified in this release:

CDPD-80239: Non-deterministic SQL expressions should set indeterminate map stage output level
7.3.1, 7.3.1.100 CHF1, 7.3.1.200 SP1, 7.3.1.300 SP1 CHF1, 7.3.1.400 SP2
Spark is supposed to handle non-deterministic keys, as long as they are marked with deterministic=false in their data type attributes. For Spark's random data, this contract is not honored when a task fails. As a result, duplicate or missing data can be produced when the Spark executors are relaunched in new node managers.
Use the client configuration spark.global.deterministic to override any input-level deterministic configuration. If set to true, all inputs are treated as deterministic; if set to false, all inputs are treated as indeterminate.
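One way to pass a client configuration for a single job is the --conf option of spark-submit. The following sketch only builds the argument list. The helper name and the application path are hypothetical, and which value is appropriate depends on your workload.

```python
def spark_submit_command(app_path, deterministic):
    """Build a spark-submit argument list that sets spark.global.deterministic.

    app_path is a placeholder for your own application file.
    """
    value = "true" if deterministic else "false"
    return [
        "spark-submit",
        "--conf", f"spark.global.deterministic={value}",
        app_path,
    ]


# Example: treat all inputs as indeterminate for this job.
command = spark_submit_command("my_job.py", deterministic=False)
print(" ".join(command))
```

The resulting list can be handed to subprocess.run, or the printed command can be run directly from a shell.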

Known Issues identified in Cloudera Runtime 7.3.1

The following section lists the known issues identified in this release:

Spark 3: RAPIDS Accelerator is not available
7.3.1, 7.3.1.100 CHF1, 7.3.1.200 SP1, 7.3.1.300 SP1 CHF1, 7.3.1.400 SP2
The RAPIDS Accelerator for Apache Spark is currently not available in Cloudera Runtime 7.3.1.
None.
The CHAR(n) type is handled inconsistently, depending on whether the table is partitioned or not
7.3.1, 7.3.1.100 CHF1
In upstream Spark 3, the spark.sql.legacy.charVarcharAsString configuration was introduced, but it does not resolve all incompatibilities with Spark 2.
None. A new configuration, spark.cloudera.legacy.charVarcharLegacyPadding, will be introduced in a future version to maintain compatibility with Spark 2, but it is not available in 7.3.1.

Apache Jira: SPARK-33480