Known issues in 7.1.9 CHF 4

Review the known issues and limitations, their areas of impact, and the available workarounds in Cloudera Runtime 7.1.9 CHF 4.

After you upgrade a cluster from CDP Private Cloud Base 7.1.8 to CDP Private Cloud Base 7.1.9 with Ozone in a non-HA environment, an exception message appears during the finalization of the Ozone upgrade.

During the finalization of the upgrade, a ClassNotFoundException for the org.cloudera.log4j.redactor.RedactorAppender class is logged. The error message is harmless because the upgrade succeeds. The error existed previously and does not affect the Ozone service or its operation.

Workaround: None.

When using S3A committer fs.s3a.committer.name=directory with fs.s3a.committer.staging.conflict-mode=replace to write to FSO buckets, the client fails with the following error.

DIRECTORY_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Failed to find parent directory of xxxxxxxx
    at org.apache.hadoop.ozone.om.request.file.OMFileRequest.getParentID(OMFileRequest.java:1008)
    at org.apache.hadoop.ozone.om.request.file.OMFileRequest.getParentID(OMFileRequest.java:958)
    at org.apache.hadoop.ozone.om.request.file.OMFileRequest.getParentId(OMFileRequest.java:1038)
    at org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequestWithFSO.getDBOzoneKey(S3MultipartUploadCompleteRequestWithFSO.java:114)
    at org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequest.validateAndUpdateCache(S3MultipartUploadCompleteRequest.java:157)
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:378)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:568)
    at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:363)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

This occurs because S3A uses multipart upload (MPU) to commit job results in a batch. The staging committer's replace mode deletes the target directory before completing the MPU. FSO does not create intermediate directories during an MPU; it does so only for regular file, directory, and key requests.

Workaround: Use fs.s3a.committer.name=magic for the affected versions.

OPSAPS-69846: If Ozone is installed with custom Kerberos principals for its roles, operations on encrypted buckets can fail because Ranger KMS does not have proxy users and groups configured for the custom S3 Gateway user.

Workaround: Add the following configurations to the Ranger KMS safety valve, based on the custom S3 Gateway (s3g) user. In this example, the user is s3gfoo0, so the parameters are:
hadoop.kms.proxyuser.s3gfoo0.hosts = *
hadoop.kms.proxyuser.s3gfoo0.groups = *
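
As a minimal sketch, the two property names above can be derived for any custom S3 Gateway user. The helper name kms_proxyuser_properties is hypothetical and is not part of any Cloudera tooling; it only formats the key/value pairs to enter in the Ranger KMS safety valve.

```python
def kms_proxyuser_properties(s3g_user: str, hosts: str = "*", groups: str = "*") -> dict:
    """Build the Ranger KMS proxy-user settings for a custom S3 Gateway user.

    Hypothetical helper for illustration; the property name pattern
    (hadoop.kms.proxyuser.<user>.hosts / .groups) is taken from the workaround text.
    """
    user = s3g_user.strip()
    if not user:
        raise ValueError("the S3 Gateway user name must not be empty")
    prefix = f"hadoop.kms.proxyuser.{user}"
    return {
        f"{prefix}.hosts": hosts,
        f"{prefix}.groups": groups,
    }


if __name__ == "__main__":
    # s3gfoo0 is the example user from the workaround above.
    for key, value in kms_proxyuser_properties("s3gfoo0").items():
        print(f"{key} = {value}")
```

Replace s3gfoo0 with the short name of the custom S3 Gateway principal used in your cluster.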

CDPD-66508: Shallow listing is enabled by default in 7.1.9. A bug in shallow listing causes the following error when listing an empty directory in a LEGACY/OBS bucket:

mkdir: getFileStatus on s3a://testbucket/data/test: com.amazonaws.services.s3.model.AmazonS3Exception: Server Error (Service: Amazon S3; Status Code: 500; Error Code: 500 Server Error; Request ID: null; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null:500 Server Error: Server Error (Service: Amazon S3; Status Code: 500; Error Code: 500 Server Error; Request ID: null; S3 Extended Request ID: null; Proxy: null)

In the S3 Gateway log:
Caused by: java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
    at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
    at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
    at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
    at java.base/java.util.Objects.checkIndex(Objects.java:372)
    at java.base/java.util.ArrayList.remove(ArrayList.java:535)
    at org.apache.hadoop.ozone.client.OzoneBucket$KeyIterator.getNextShallowListOfKeys(OzoneBucket.java:1234)
    at org.apache.hadoop.ozone.client.OzoneBucket$KeyIterator.getNextListOfKeys(OzoneBucket.java:1136)
    at org.apache.hadoop.ozone.client.OzoneBucket$KeyIterator.hasNext(OzoneBucket.java:1110)
    at org.apache.hadoop.ozone.s3.endpoint.BucketEndpoint.get(BucketEndpoint.java:208)
    at jdk.internal.reflect.GeneratedMethodAccessor90.invoke(Unknown Source)

Workaround: Disable shallow listing by performing the following steps:

Log in to Cloudera Manager
Navigate to Clusters
Select the Ozone service
Go to Configurations
In S3 Gateway Advanced Configuration Snippet (Safety Valve) for ozone-conf/ozone-site.xml, set ozone.s3g.list-keys.shallow.enabled = false.
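
The setting in the last step can also be expressed as a Hadoop-style property element. The sketch below assumes the safety valve is being edited in an XML form; the helper name safety_valve_property is hypothetical and only renders the snippet, it does not talk to Cloudera Manager.

```python
import xml.etree.ElementTree as ET


def safety_valve_property(name: str, value: str) -> str:
    """Render one Hadoop-style <property> element as a string.

    Hypothetical helper for illustration only; it builds the XML text
    and leaves pasting it into the safety valve to the administrator.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # Property name and value are taken from the workaround step above.
    print(safety_valve_property("ozone.s3g.list-keys.shallow.enabled", "false"))
```

Building the element with ElementTree rather than string concatenation keeps the snippet well-formed even if a value contains characters that need XML escaping.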

CDPD-64394: Ozone Manager may fail to start and log the following exception in the Ozone Manager log file: java.lang.IllegalArgumentException: Trying to set updateID to XXX which is not greater than the current value of XXX for OMKeyInfo{XXX}.

Workaround: This issue currently cannot be recovered automatically. Contact Cloudera Support.

CDPD-65801: An intermittent issue in the native RocksDB tool corrupts in-memory RocksDB metadata.

Workaround: Set ozone.om.snapshot.load.native.lib to false and restart the Ozone Manager (OM).

CDPD-66142: When Solr is slow or down, it takes a long time to respond to the Recon heatmap query, or does not respond at all, so Recon keeps trying to load the heatmap data indefinitely. This issue will be addressed in a future release. A possible solution is to introduce a health check for Solr, or to time out the Recon query to Solr and show a meaningful message in the Recon UI, such as "Solr is not responding".

Workaround: Perform the following steps in order:
Stop Recon
Restart Solr
Start Recon

CDPD-66247: The TestOzoneFileSystem.testListStatusOnKeyNameContainDelimiter test fails intermittently.

Workaround: None.

CDPD-66382: When the bucket layout is LEGACY and the ozone.om.enable.filesystem.paths property is set to true, delete operations do not work completely if the key name contains "/".

Workaround: None.

CDPD-66252: du space calculation support for OBS and LEGACY (fsPath disabled).

The du space usage for OBS buckets and LEGACY buckets (fsPath disabled) can be viewed using the CLI command.

OPSAPS-69539: CDP Runtime 7.1.9 does not support Oracle JDK 8u401 or OpenJDK 1.8.0_402 (8u402). Some services will fail to start. This can be a problem on RHEL 9.x, where 8u402 is the default OpenJDK 8 version installed by the OS.
Workaround: Install an earlier version of JDK 8, for example Oracle jdk-8u291 (1.8.0_291) or OpenJDK 8u292 (1.8.0_292).
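
A minimal sketch of a pre-install check for the two builds named above. It assumes the usual first line of `java -version` output (`java version "1.8.0_NNN"` for Oracle JDK, `openjdk version "1.8.0_NNN"` for OpenJDK); the function name is hypothetical, and only the exact builds listed in this known issue are flagged.

```python
import re

# (vendor prefix, update number) pairs named in the known issue:
# Oracle JDK 8u401 and OpenJDK 8u402. Nothing else is assumed unsupported.
UNSUPPORTED_JDK8_BUILDS = {("java", 401), ("openjdk", 402)}

# Matches the first line printed by `java -version` for JDK 8 builds.
_JDK8_VERSION_RE = re.compile(
    r'^(java|openjdk) version "1\.8\.0_(\d+)', re.MULTILINE
)


def is_unsupported_jdk8(version_output: str) -> bool:
    """Return True if the `java -version` text names one of the listed builds."""
    match = _JDK8_VERSION_RE.search(version_output)
    if match is None:
        return False  # not a JDK 8 version string this sketch recognizes
    vendor, update = match.group(1), int(match.group(2))
    return (vendor, update) in UNSUPPORTED_JDK8_BUILDS


if __name__ == "__main__":
    sample = 'openjdk version "1.8.0_402"\nOpenJDK Runtime Environment (build 1.8.0_402-b06)'
    print(is_unsupported_jdk8(sample))
```

Note that `java -version` writes to standard error, so capture stderr when feeding its output to a check like this.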
A fresh install of 7.1.9 or its CHFs does not allow the user to bypass the Setup Database screen for YARN Queue Manager: YARN Queue Manager in Cloudera Data Platform (CDP) Private Cloud Base 7.1.9 CHF 2 does not require you to install a PostgreSQL database, so users should not see the Setup Database screen and should be able to skip it. Because of this known issue, users performing a fresh install of 7.1.9 or its CHFs cannot bypass the Setup Database screen as expected.
Workaround: When performing a fresh install of YARN Queue Manager in 7.1.9 or its CHFs, ensure that both CDP and Cloudera Manager are upgraded to 7.1.9 CHF 2.

When you reach the Setup Database screen in the Cloudera Manager installation wizard for Queue Manager, enter any dummy values for the following fields:

Database name: configstore

Database Username: dbuser

Database Password: dbpassword

Database Hostname: localhost

YARN Queue Manager will not connect to PostgreSQL with the above details and will fall back to the embedded database.

Run the following script command in a browser console to enable the Continue button:
document.querySelector('.btn.next').removeAttribute('disabled');

Click Continue and proceed with the YARN Queue Manager installation.

After installation is complete, SSH into the host that has Queue Manager installed, and run this command:
sed -i 's/migrationCompleted=true/migrationCompleted=false/' /var/lib/hadoop-yarn/migration.properties
Enable Queue Manager in the YARN configurations, and restart YARN.

Restart YARN Queue Manager.
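
The sed substitution in the post-installation step only flips one flag in migration.properties. The sketch below mirrors that substitution on an in-memory string to show its effect; the function name and the sample file content are hypothetical, and the real file may contain other keys.

```python
def reset_migration_flag(properties_text: str) -> str:
    """Mirror sed 's/migrationCompleted=true/migrationCompleted=false/'.

    Like sed without the g flag, only the first match on each line is replaced.
    Lines that do not contain the flag are returned unchanged.
    """
    return "\n".join(
        line.replace("migrationCompleted=true", "migrationCompleted=false", 1)
        for line in properties_text.split("\n")
    )


if __name__ == "__main__":
    # Hypothetical file content, used for illustration only.
    before = "someOtherKey=value\nmigrationCompleted=true\n"
    print(reset_migration_flag(before))
```

Running the substitution a second time leaves the text unchanged, so repeating the step is harmless.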

CDPD-61524: Ozone Storage Container Manager fails to start when upgrading from CDP Private Cloud Base 7.1.6 to 7.1.9 CHF1. The upgrade also fails if you upgraded from CDP Private Cloud Base 7.1.6 to 7.1.7 or 7.1.8 and then to 7.1.9. Workaround: None. Cloudera recommends that you contact Cloudera Support before upgrading to CDP Private Cloud Base 7.1.9.
CDPD-62254: Ozone is not supported on SLES15 with CHF1. Workaround: If your cluster has Ozone, Cloudera recommends that you do not upgrade to 7.1.9 CHF1.
CDPD-62834: Status of the deleted table is seen as ACTIVE in Atlas after the completion of navigator2atlas migration process: The status of the deleted table displays as ACTIVE. Workaround: None.
CDPD-62837: During the navigator2atlas process, the hive_storagedesc is incomplete in Atlas: For the hive_storagedesc entity, some of the attributes are not populated. Workaround: None.
CDPD-63690: RuntimeException encountered when generating snapshotDiff report between two snapshots: When the snapshot feature is enabled, the KeyDeletingService, SSTFilteringService, and SnapDiff threads fall into a deadlock when accessing the Snapshot Cache. Workaround: Restart the Ozone Manager.
CDPD-64238: Snapshot diff request failing when setting ozone.om.snapshot.db.max.open.files=-1: When the snapshot feature is enabled, the KeyDeletingService, SSTFilteringService, and SnapDiff threads fall into a deadlock when accessing the Snapshot Cache. Workaround: Restart the Ozone Manager.
CDPD-76035: Resource lookup for Atlas service is failing: Once the Atlas configuration snippet atlas.authentication.method.file is enabled and a classification is created, the classification is not synchronized correctly to the Type Category resource field setting of Apache Ranger. The newly created classification cannot be selected as the Type Name.
CDPD-79160: NPE while deleting BusinessMetadata: If business metadata is created without adding any applicable types, a NullPointerException is thrown when you try to delete that business metadata. Workaround: None. Apache Jira: ATLAS-4863
CDPD-76789: Creating tag with name description throws java.lang.ClassCastException: Creating a classification with a reserved name such as "name", "description", "owner", "version", "serviceType", or "options" can lead to an HTTP 500 error in Apache Atlas. Workaround: Avoid creating classifications with reserved names such as "name", "description", "owner", "version", "serviceType", and "options".
CDPD-77435: RAZ: Import-hive on Cloudera Data Engineering cluster fails: Running the import-hive command on a RAZ-enabled Cloudera Data Engineering cluster fails with the error "NoClassDefFoundError: com/sun/jersey/core/spi/factory/ResponseBuilderImpl".
CDPD-75994: Post DL regular upgrade (non ZDU) to 7.3.1, "Exception in getKafkaConsumer ,WakeupException: null" is seen: After the data lake is upgraded to 7.3.1, the Atlas Hook sometimes does not function when Apache Atlas and Apache Kafka are started at the same time, because Atlas cannot connect to Kafka while Kafka is still being set up. Atlas makes only three attempts.
Workaround: Restart the cluster after the upgrade to trigger a reconnection to Apache Kafka. The Kafka consumer creation should be retried if the Kafka service is unavailable during Atlas startup.
UnsupportedClassVersionError: JDK 8 deployments support the Nashorn JavaScript engine, which is built-in and fully compatible, whereas JDK 17 deployments support the GraalJS script engine because Nashorn is unavailable.
When your cluster supports both JDK 8 and JDK 17 and a Java application running on JDK 8 uses generic interfaces such as ScriptEngine (from the javax.script package), the ScriptEngineManager class scans the classpath for available script engine implementations through the service provider mechanism. It detects GraalJS as a provider for JavaScript, because the GraalJS library (version 22.3.0) is also included on the classpath. When a "js" or "javascript" engine is requested, ScriptEngineManager attempts to instantiate GraalJS, which triggers an UnsupportedClassVersionError. Workaround: Remove the GraalJS library from the classpath.
CDPD-79099: Unable to import hive tables using import script in Public Cloud: The validation of import .zip files can cause the org.apache.atlas.AtlasServiceException error if the .zip file contains Atlas shell entities missing mandatory attributes such as a name. Therefore, the import of shell entities fails.

Technical Service Bulletins

TSB 2024-749: Possible Ozone Snapshot Chain Corruption in 7.1.9.0: The Apache Ozone (Ozone) snapshots feature provides the ability to save consistent and immutable copies of the Ozone namespace. A chain of snapshots is tracked and maintained in the Ozone RocksDB that is utilized to calculate the difference between two snapshots required for the snapDiff API. A race condition on the snapshot delete path was found during testing. This condition might result in two consecutive snapshots pointing to the same parent, which breaks the snapshot chain.
Upstream JIRA: HDDS-10524, HDDS-10590, and HDDS-9198
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2024-749: Possible Ozone Snapshot Chain Corruption in 7.1.9.0
TSB 2025-820: Potential Data Integrity Issues Found in Ozone: The Cloudera Engineering team has identified the following data integrity issues with Apache Ozone (Ozone):

In certain situations, handling of failure paths when recovering from disk hardware failures, disk full situations, or over-replication can result in the incorrect deletion of some storage containers on those disk(s). In rare cases, all replicas of the container can be affected, leading to the data within that container becoming unavailable. Under certain extreme conditions, permanent data loss could occur.
Reference: CDPD-83416

A bug in the snapshot deep cleaning service and the object deletion path can lead to potential missing blocks of a snapshot key. This can happen only for the keys that were deleted from the active object store after the snapshot was created.
Reference: CDPD-83417
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2025-820: Potential Data Integrity Issues Found in Ozone
TSB 2025-835: Dry run of incremental Ozone replication can cause failure to replicate some changes in Cloudera Replication Manager: Executing the "Dry Run" action for Ozone replication schedules with a "Listing type" of "Incremental only" or "Incremental with fallback to full file listing" will result in a run where the changes are not replicated and also omitted from the subsequent replication runs.
Unless a "Full file listing" replication run is executed, the changes made between the dry run and the previous run are not replicated to the target. This scenario may occur because the dry run action of an Ozone replication policy with the INCREMENTAL_ONLY or INCREMENTAL_WITH_FALLBACK_TO_FULL_FILE_LISTING replication type generates a temporary snapshot on the source that is not deleted. On the next incremental run, all changes that occurred on the source Ozone bucket between the last successful run and the last dry run operation go unnoticed by Replication Manager. As a result, those changes are not replicated to the destination Ozone bucket.
Knowledge article: For the latest update on this issue see the corresponding Knowledge article: TSB 2025-835: Dry run of incremental Ozone replication can cause failure to replicate some changes in Cloudera Replication Manager