Fixed Issues in Ozone
Cloudera Runtime 7.1.9 SP2 resolves identified Ozone functional errors and includes technical patches to improve service stability and performance.
- CDPD-92933: The delete pending keys summary was not displayed correctly in the Recon UI
- Previously, the Pending Delete Keys summary card on the Recon Overview page incorrectly displayed a value of 0 for all data. This occurred because the Recon UI was attempting to reference a field in the API response that was no longer in use. This issue has been resolved, and the summary card now accurately reflects the pending delete keys data.
- CDPD-62755: Ozone DataNode shares the same port with HDFS DataNode
- Previously, the Ozone DataNode client service (hdds.datanode.client.port) and the HDFS DataNode HTTP server (dfs.datanode.http.address) were both configured to use port 9864 by default. This port conflict prevented Ozone and HDFS DataNode services from running simultaneously on the same node, resulting in an Address already in use error. This issue is now fixed, and the Ozone DataNode client port has been changed from 9864 to 19864 to avoid this conflict, restoring compatibility between Ozone and HDFS DataNode services on the same host.
- CDPD-66919: The ozone admin reconfig command fails with security enabled
- Previously, the ozone admin reconfig command failed in clusters where Kerberos security was enabled, preventing administrators from dynamically updating configurations. This issue is now fixed, and the command works correctly in secure clusters, allowing properties for the Ozone Manager (OM), Storage Container Manager (SCM), and DataNode (DN) to be reconfigured without a cluster restart.
- CDPD-91562: Ozone certificate expiration dates calculated incorrectly during Daylight Saving Time
- Previously, when generating Ozone certificates in time zones that observe Daylight Saving Time (DST), the expiration duration could be miscalculated by one hour. This occurred because the internal logic did not account for time zone offsets during DST transitions, potentially causing premature certificate validation failures. This issue has been resolved, and certificate duration calculations now accurately account for daylight saving impacts.
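To illustrate how a one-hour error of this kind can arise, the following Python sketch compares a validity period derived from local wall-clock values with one derived from absolute instants. It is not Ozone's certificate code: the fixed UTC-5 and UTC-4 offsets, the dates, and the function names are assumptions chosen to mimic a spring-forward transition.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets standing in for one zone before and after
# a spring-forward DST transition (assumption for illustration only).
STANDARD = timezone(timedelta(hours=-5))
DAYLIGHT = timezone(timedelta(hours=-4))


def wall_clock_expiry(not_before: datetime, days: int) -> datetime:
    """Add days to the local wall-clock reading, ignoring the offset change."""
    local = not_before.replace(tzinfo=None) + timedelta(days=days)
    # After the transition, the same wall-clock reading carries the DST offset.
    return local.replace(tzinfo=DAYLIGHT)


def instant_expiry(not_before: datetime, days: int) -> datetime:
    """Add days to the absolute instant, so the duration is exact."""
    return not_before.astimezone(timezone.utc) + timedelta(days=days)


not_before = datetime(2024, 3, 9, 12, 0, tzinfo=STANDARD)
naive_end = wall_clock_expiry(not_before, days=1)
exact_end = instant_expiry(not_before, days=1)

# Subtracting aware datetimes with different tzinfo objects compares
# the underlying instants, so these are true elapsed durations.
naive_validity = naive_end - not_before
exact_validity = exact_end - not_before
shortfall = exact_validity - naive_validity
```

In this sketch the wall-clock calculation yields a validity period one hour shorter than intended, which is the kind of premature expiry the fix is described as preventing.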
- CDPD-97209: Container Balancer attempts to move unhealthy containers due to inconsistent validation
- Previously, ContainerBalancerSelectionCriteria applied a less stringent health check than MoveManager. This allowed unhealthy containers to be selected for balancing, only to be rejected later by MoveManager's stricter criteria, resulting in wasted cycles and inefficient balancing operations. This issue has been resolved by unifying the validation logic, ensuring that the same rigorous health checks are applied upfront. Additionally, if a balancing attempt fails, source DataNodes are now correctly returned to the priority queue so they remain eligible for subsequent balancing cycles.
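The two ideas in this fix, one shared health check and returning the source node to the queue after a failed attempt, can be sketched in Python as follows. All names here (`Container`, `is_movable`, `select_candidates`, `try_move`, `balance_once`) are hypothetical, and the health rule is an assumption; none of this reflects the actual ContainerBalancerSelectionCriteria or MoveManager code.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    container_id: int
    healthy: bool
    closed: bool


def is_movable(c: Container) -> bool:
    """Single health predicate shared by selection and move (assumed rule)."""
    return c.healthy and c.closed


def select_candidates(containers):
    """Selection applies the same check the mover applies later."""
    return [c for c in containers if is_movable(c)]


def try_move(c: Container) -> bool:
    """The mover re-validates with the identical predicate."""
    return is_movable(c)


def balance_once(queue, containers_by_node):
    """Pop the most-utilized source node and attempt one move from it.

    queue holds (negative_utilization, node_name) tuples so that heapq,
    a min-heap, yields the most-utilized node first.
    """
    entry = heapq.heappop(queue)
    _, node = entry
    candidates = select_candidates(containers_by_node[node])
    moved = bool(candidates) and try_move(candidates[0])
    if not moved:
        # Return the source node so it stays eligible for the next cycle.
        heapq.heappush(queue, entry)
    return moved
```

Because selection and move share one predicate, a container that passes selection cannot be rejected later for a health reason, so no cycle is spent on a move that is certain to fail.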
- CDPD-96710: Non-recursive directory deletion fails with the Directory is not empty error
- Previously, S3 clients could unexpectedly encounter a Directory is not empty error when attempting to recursively delete a directory through the S3 Gateway on Ozone. This was caused by a transient race condition that occurred when the Ozone Manager's internal double buffer had not yet flushed child deletion transactions to the database. This issue is now fixed by properly checking key tombstones in the cache in checkSubFileExists or checkSubDirectoryExists.
- CDPD-95716: File descriptor leak in the Ozone Manager during checkpoint transfers
- Previously, a directory stream was not properly closed in the OMDBCheckpointServletInodeBasedXfer.writeDBToArchive() method. This caused file descriptors to accumulate on the Ozone Manager host, potentially leading to performance degradation or service instability over time. This issue has been resolved by ensuring the directory stream is explicitly closed after execution.
- CDPD-93006: Snapshot read cache lock is not released during runtime exceptions
- Previously, if a runtime exception occurred during Ozone Manager (OM) snapshot processing, the snapshot read cache lock could remain held instead of being released. This unreleased lock could stall downstream operations, leading to deadlocks or failures during the OM bootstrap process. This issue has been resolved by ensuring that exception handling routines properly and safely release the read cache lock.
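The general pattern behind this kind of fix, releasing a lock in a block that runs whether or not an exception is raised, can be sketched in Python as follows. This is an illustration only: `snapshot_cache_lock`, `process_snapshot`, and `failing_work` are hypothetical names, and a plain `threading.Lock` stands in for the read cache lock.

```python
import threading

# Stand-in for the snapshot read cache lock (hypothetical; Python's
# standard library has no read-write lock, so a plain Lock is used).
snapshot_cache_lock = threading.Lock()


def process_snapshot(work):
    """Run work() under the lock and always release it, even on errors."""
    snapshot_cache_lock.acquire()
    try:
        return work()
    finally:
        # Runs on normal return and on any exception alike.
        snapshot_cache_lock.release()


def failing_work():
    raise RuntimeError("simulated failure during snapshot processing")


try:
    process_snapshot(failing_work)
except RuntimeError:
    pass

# The lock is free again, so later callers are not blocked.
lock_released = not snapshot_cache_lock.locked()
```

Without the `finally` block, the simulated failure would leave the lock held and every later caller would wait indefinitely, which mirrors the stall described above.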
- CDPD-94578: Out-of-lock initialization causes potential race conditions during OM snapshot bootstrap
- Previously, a race condition could occur during the Ozone Manager (OM) snapshot bootstrap process because the snapshot directory list was initialized outside of the critical lock. If changes occurred before the lock was finally secured, the initialized list could become stale, causing potential metadata inconsistencies. This issue has been resolved; the system now safely re-reads the snapshot list from the OM checkpoint database immediately after acquiring the lock.
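As a rough illustration of why the read has to happen inside the critical section, the Python sketch below contrasts a list captured before the lock is held with one read after acquisition. The dictionary-based `checkpoint_db`, the lock, and the function names are assumptions made for the example, not Ozone Manager internals.

```python
import threading

bootstrap_lock = threading.Lock()  # hypothetical stand-in for the OM lock


def list_snapshots(checkpoint_db):
    """Return a copy of the snapshot names currently in the checkpoint."""
    return list(checkpoint_db["snapshots"])


def collect_snapshot_dirs(checkpoint_db):
    """Read the snapshot list only after the lock is held."""
    with bootstrap_lock:
        # Reading here, not before the with-block, avoids a stale list if
        # snapshots were added or removed while waiting for the lock.
        return list_snapshots(checkpoint_db)


db = {"snapshots": ["snap1"]}
stale = list_snapshots(db)        # captured before the lock is held
db["snapshots"].append("snap2")   # simulated change made in the meantime
fresh = collect_snapshot_dirs(db)  # re-read under the lock
```

Here `stale` misses the snapshot added in the meantime, while `fresh` includes it.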
- CDPD-94576: Configuration property ozone.om.snapshot.db.max.open.files rejects -1 as a valid value
- Previously, the validation logic for the ozone.om.snapshot.db.max.open.files configuration property incorrectly rejected -1 as an invalid value. This restriction prevented administrators from setting the value to -1, which instructs the underlying RocksDB instance to keep all file descriptors open to avoid expensive table cache calls and optimize performance. This issue has been resolved, and the validation logic now properly accepts -1.
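A validation rule that treats -1 as a valid sentinel might look like the following Python sketch. The function name and the exact rule (accept -1 or any positive integer, reject everything else) are assumptions for illustration; the release note confirms only that -1 is now accepted.

```python
def validate_max_open_files(value: int) -> int:
    """Accept -1 (keep all files open) or a positive limit (assumed rule)."""
    if value == -1 or value > 0:
        return value
    raise ValueError(f"invalid max open files value: {value}")


unlimited = validate_max_open_files(-1)
bounded = validate_max_open_files(100)
```

Checking the sentinel explicitly before the range check keeps other non-positive values, such as 0 or -2, rejected in this sketch.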
- CDPD-92655: Spurious WARN log messages emitted by RocksDBCheckpointDiffer compaction tracker
- Previously, the RocksDBCheckpointDiffer compaction tracker emitted spurious WARN log messages when a snapshot was not present at the beginning of a compaction job. This issue has been resolved. The compaction tracker now monitors the compaction job ID and only issues a warning if a snapshot was explicitly verified to be present when the compaction process started.
- CDPD-89578: Ozone Manager tracks snapshot SST files from the active database, causing incorrect snapshot diffs
- Previously, during Ozone Manager (OM) snapshot initialization, the LocalSnapshotMetadata process incorrectly tracked SST files from the active object store database instead of the isolated snapshot checkpoint. If a background compaction was committed immediately after the checkpoint was created, this mismatch resulted in incorrect snapshot diff data. This issue has been resolved by ensuring that snapshot metadata generation strictly tracks files directly from the snapshot checkpoint.
- CDPD-61603: Temporary files accumulate on the host after Ozone Manager bootstrapping
- Previously, temporary files generated during the Ozone Manager (OM) bootstrapping process (located in the /tmp directory) were not automatically deleted after the process completed. This caused unnecessary disk space consumption and file accumulation over time on the OM host. This issue has been resolved, and these temporary files are now automatically cleaned up upon successful bootstrap completion.
- CDPD-94559: Insufficient locking during snapshot checkpoints causes potential data inconsistencies
- Previously, the Ozone Manager (OM) only acquired an individual snapshot lock during the final stage of copying snapshot data. Because the full snapshot cache lock was not held while taking the active database checkpoint during a snapshot export, concurrent operations could introduce data inconsistencies. This issue has been resolved by ensuring that the comprehensive snapshot cache lock is safely acquired before the active database checkpoint is initiated.
- CDPD-89287: Snapshot creation log messages lack bucket and volume context
- Previously, log messages generated during Ozone snapshot creation only included the metadata directory and the snapshot name. This limited context made it difficult for system administrators to identify or trace specific snapshots within the cluster logs. This issue has been resolved, and these log messages have been enhanced to explicitly include both the bucket and volume names for better traceability.
- CDPD-98374: Incomplete Erasure Coded (EC) pipelines are cached, causing persistent "insufficient DataNodes" errors
- Previously, the Ozone Manager (OM) container pipeline cache could store incomplete Erasure Coded (EC) pipelines if certain DataNodes had not yet reported their status. This resulted in false "insufficient datanodes" errors when clients attempted to read EC files. Because these incomplete pipelines remained cached for up to 6 hours, the read failures would persist long after the physical DataNodes had returned to a normal, operational state. This issue has been resolved by ensuring that incomplete EC pipelines are excluded from the container pipeline cache.
- CDPD-101434: Risk of missing blocks in Ratis pipelines due to initial under-replication
- Previously, when data was written through a Ratis pipeline, blocks were occasionally written only to the leader DataNode, despite the system returning a successful write acknowledgment to the client. Consequently, if the leader DataNode failed or went offline before the data was replicated to other nodes, the associated blocks or chunk data went missing, resulting in temporary data unavailability. This vulnerable state persisted until the Storage Container Manager (SCM) background processes detected the under-replicated container and completed the proper replication. This issue is fixed now.
