Fixed Issues in Ozone

CDPD-92933: The Pending Delete Keys summary was not displayed correctly in the Recon UI

Previously, the Pending Delete Keys summary card on the Recon Overview page incorrectly displayed a value of 0 for all data. This occurred because the Recon UI was attempting to reference a field in the API response that was no longer in use. This issue has been resolved, and the summary card now accurately reflects the pending delete keys data.

CDPD-62755: Ozone DataNode shares the same port with HDFS DataNode

Previously, the Ozone DataNode client service (hdds.datanode.client.port) and the HDFS DataNode HTTP server (dfs.datanode.http.address) were both configured to use port 9864 by default. This port conflict prevented Ozone and HDFS DataNode services from running simultaneously on the same node, resulting in an Address already in use error. This issue is now fixed: the default Ozone DataNode client port has been changed from 9864 to 19864, so Ozone and HDFS DataNode services can again run on the same host.
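
The following is a minimal, generic Java sketch of why two services cannot listen on the same port. It is not Ozone or HDFS code: the class and method names are hypothetical, and an OS-assigned port stands in for the shared default 9864.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.ServerSocket;

final class PortConflictSketch {
    /** Returns true when a second listener on an already-bound port is rejected. */
    static boolean secondBindFails() {
        // Port 0 asks the OS for any free port; it stands in for the shared default.
        try (ServerSocket first = new ServerSocket(0)) {
            int sharedPort = first.getLocalPort();
            try (ServerSocket second = new ServerSocket(sharedPort)) {
                return false; // not expected while the first listener is still open
            } catch (BindException e) {
                return true;  // typically reported as "Address already in use"
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second bind rejected: " + secondBindFails());
    }
}
```

Giving each service its own default port removes the conflict without requiring either service to change how it binds.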

Apache JIRA: HDDS-9512

CDPD-66919: The ozone admin reconfig command fails with security enabled

Previously, the ozone admin reconfig command failed in clusters where Kerberos security was enabled, preventing administrators from dynamically updating configurations. This issue is now fixed, and the command works correctly in secure clusters, allowing properties for the Ozone Manager (OM), Storage Container Manager (SCM), and DataNode (DN) to be reconfigured without requiring a cluster restart.

Apache JIRA: HDDS-10404

CDPD-91562: Ozone certificate expiration dates calculated incorrectly during Daylight Saving Time

Previously, when generating Ozone certificates in time zones that observe Daylight Saving Time (DST), the expiration duration could be miscalculated by one hour. This occurred because the internal logic did not account for time zone offsets during DST transitions, potentially causing premature certificate validation failures. This issue has been resolved, and certificate duration calculations now accurately account for daylight saving impacts.
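
As a generic illustration of how a DST transition can shift a duration by one hour, the sketch below compares two ways of computing an expiry time with java.time. It is not the Ozone certificate code: the class, the helper methods, the time zone, and the dates are assumptions chosen for the example.

```java
import java.time.Duration;
import java.time.ZoneId;
import java.time.ZonedDateTime;

final class DstDurationSketch {
    // Hypothetical helper: adds calendar days on the local time-line,
    // so the wall-clock time is preserved across a DST change.
    static ZonedDateTime expiryByCalendarDays(ZonedDateTime start, long days) {
        return start.plusDays(days);
    }

    // Hypothetical helper: adds an exact amount of elapsed time on the instant time-line.
    static ZonedDateTime expiryByElapsedTime(ZonedDateTime start, Duration validity) {
        return start.plus(validity);
    }

    public static void main(String[] args) {
        // US clocks moved forward one hour on 2024-03-10, the day after this start time.
        ZonedDateTime start =
            ZonedDateTime.of(2024, 3, 9, 12, 0, 0, 0, ZoneId.of("America/New_York"));

        ZonedDateTime byDays = expiryByCalendarDays(start, 1);
        ZonedDateTime byElapsed = expiryByElapsedTime(start, Duration.ofDays(1));

        // The calendar-day result is only 23 hours after the start.
        System.out.println(Duration.between(start, byDays).toHours());    // 23
        System.out.println(Duration.between(start, byElapsed).toHours()); // 24
    }
}
```

A validity period computed one way and checked the other way can therefore differ by an hour around a DST transition.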

Apache JIRA: HDDS-13781

CDPD-97209: Container Balancer attempts to move unhealthy containers due to inconsistent validation

Previously, ContainerBalancerSelectionCriteria applied a less stringent health check than MoveManager. This allowed unhealthy containers to be selected for balancing, only to be rejected later by MoveManager's stricter criteria, resulting in wasted cycles and inefficient balancing operations.

This issue has been resolved by unifying the validation logic, ensuring that the same rigorous health checks are applied upfront. Additionally, if a balancing attempt fails, source DataNodes are now correctly returned to the priority queue so they remain eligible for subsequent balancing cycles.

Apache JIRA: HDDS-14614

CDPD-96710: Non-recursive directory deletion fails with the Directory is not empty error

Previously, S3 clients could unexpectedly encounter a Directory is not empty error when attempting to recursively delete a directory through the S3 Gateway on Ozone. This was caused by a transient race condition that occurred when the Ozone Manager's internal double buffer had not yet flushed child deletion transactions to the database. This issue is now fixed by checking key tombstones in the cache in checkSubFileExists or checkSubDirectoryExists.

Apache JIRA: HDDS-14600

CDPD-95716: File descriptor leak in the Ozone Manager during checkpoint transfers

Previously, a directory stream was not properly closed in the OMDBCheckpointServletInodeBasedXfer.writeDBToArchive() method. This caused file descriptors to accumulate on the Ozone Manager host, potentially leading to performance degradation or service instability over time. This issue has been resolved by ensuring the directory stream is explicitly closed after execution.
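
The general pattern for avoiding this kind of leak can be sketched with try-with-resources. This is a minimal example under stated assumptions, not the code in writeDBToArchive(): the class, method names, and file names are hypothetical, and Files.list is used as one example of a directory stream that holds a file descriptor until it is closed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class DirectoryStreamSketch {
    /** Lists a directory and closes the underlying directory handle before returning. */
    static List<Path> listEntries(Path dir) {
        // try-with-resources closes the stream even if collecting the entries throws.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temporary directory with two files and returns how many entries were listed. */
    static int demo() {
        try {
            Path dir = Files.createTempDirectory("checkpoint-sketch");
            Files.createFile(dir.resolve("000001.sst"));
            Files.createFile(dir.resolve("000002.sst"));
            return listEntries(dir).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("entries listed: " + demo());
    }
}
```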

Apache JIRA: HDDS-14376

CDPD-93006: Snapshot read cache lock is not released during runtime exceptions

Previously, if a runtime exception occurred during Ozone Manager (OM) snapshot processing, the snapshot read cache lock could remain held instead of being released. This unreleased lock could stall downstream operations, leading to deadlocks or failures during the OM bootstrap process. This issue has been resolved by ensuring that exception handling routines properly and safely release the read cache lock.
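
The usual way to guarantee that a lock is released when an exception is thrown is a try/finally block. The sketch below shows that pattern with a standard ReentrantReadWriteLock. It is an illustration only: the class and method names are hypothetical and do not correspond to the Ozone Manager snapshot cache implementation.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

final class SnapshotCacheLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Runs the given work under the read lock and always releases the lock afterwards. */
    <T> T withReadLock(Supplier<T> work) {
        lock.readLock().lock();
        try {
            return work.get();
        } finally {
            // Runs on normal return and when work.get() throws a runtime exception.
            lock.readLock().unlock();
        }
    }

    /** Number of read locks currently held; 0 means nothing was left locked. */
    int readLockCount() {
        return lock.getReadLockCount();
    }

    public static void main(String[] args) {
        SnapshotCacheLockSketch sketch = new SnapshotCacheLockSketch();
        try {
            sketch.withReadLock(() -> {
                throw new IllegalStateException("simulated failure during snapshot processing");
            });
        } catch (IllegalStateException expected) {
            System.out.println("read locks still held: " + sketch.readLockCount());
        }
    }
}
```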

Apache JIRA: HDDS-13904

CDPD-94578: Out-of-lock initialization causes potential race conditions during OM snapshot bootstrap

Previously, a race condition could occur during the Ozone Manager (OM) snapshot bootstrap process because the snapshot directory list was initialized outside of the critical lock. If changes occurred before the lock was acquired, the initialized list could become stale, causing potential metadata inconsistencies. This issue has been resolved; the system now re-reads the snapshot list from the OM checkpoint database immediately after acquiring the lock.

Apache JIRA: HDDS-13772

CDPD-94576: Configuration property ozone.om.snapshot.db.max.open.files rejects -1 as a valid value

Previously, the validation logic for the ozone.om.snapshot.db.max.open.files configuration property incorrectly rejected -1 as an invalid value. This restriction prevented administrators from setting the value to -1, which instructs the underlying RocksDB instance to keep all file descriptors open to avoid expensive table cache calls and optimize performance. This issue has been resolved, and the validation logic now properly accepts -1.
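
A validation rule that treats -1 as a valid "keep all files open" sentinel might look like the sketch below. The class and method are hypothetical, and the handling of other values (positive values accepted, 0 and values below -1 rejected) is an assumption made for the example, not the documented Ozone rule.

```java
final class MaxOpenFilesValidationSketch {
    /**
     * Hypothetical check: -1 is accepted as the "no limit" sentinel.
     * Assumption: any positive limit is valid; 0 and values below -1 are not.
     */
    static boolean isValidMaxOpenFiles(int value) {
        return value == -1 || value > 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidMaxOpenFiles(-1));   // true
        System.out.println(isValidMaxOpenFiles(100));  // true
        System.out.println(isValidMaxOpenFiles(-2));   // false
    }
}
```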

Apache JIRA: HDDS-13473

CDPD-92655: Spurious WARN log messages emitted by RocksDBCheckpointDiffer compaction tracker

Previously, the RocksDBCheckpointDiffer compaction tracker emitted spurious WARN log messages when a snapshot was not present at the beginning of a compaction job. This issue has been resolved. The compaction tracker now monitors the compaction job ID and only issues a warning if a snapshot was explicitly verified to be present when the compaction process started.

Apache JIRA: HDDS-13863

CDPD-89578: Ozone Manager tracks snapshot SST files from the active database, causing incorrect snapshot diffs

Previously, during Ozone Manager (OM) snapshot initialization, the LocalSnapshotMetadata process incorrectly tracked SST files from the active object store database instead of the isolated snapshot checkpoint. If a background compaction was committed immediately after the checkpoint was created, this mismatch resulted in incorrect snapshot diff data. This issue has been resolved by ensuring that snapshot metadata generation strictly tracks files directly from the snapshot checkpoint.

Apache JIRA: HDDS-13628

CDPD-61603: Temporary files accumulate on the host after Ozone Manager bootstrapping

Previously, temporary files generated during the Ozone Manager (OM) bootstrapping process (located in the /tmp directory) were not automatically deleted after the process completed. This caused unnecessary disk space consumption and file accumulation over time on the OM host. This issue has been resolved, and these temporary files are now automatically cleaned up upon successful bootstrap completion.

Apache JIRA: HDDS-9337

CDPD-94559: Insufficient locking during snapshot checkpoints causes potential data inconsistencies

Previously, the Ozone Manager (OM) only acquired an individual snapshot lock during the final stage of copying snapshot data. Because the full snapshot cache lock was not held while taking the active database checkpoint during a snapshot export, concurrent operations could introduce data inconsistencies. This issue has been resolved by ensuring that the comprehensive snapshot cache lock is safely acquired before the active database checkpoint is initiated.

Apache JIRA: HDDS-13768

CDPD-89287: Snapshot creation log messages lack bucket and volume context

Previously, log messages generated during Ozone snapshot creation only included the metadata directory and the snapshot name. This limited context made it difficult for system administrators to identify or trace specific snapshots within the cluster logs. This issue has been resolved, and these log messages have been enhanced to explicitly include both the bucket and volume names for better traceability.

Apache JIRA: HDDS-13604

CDPD-98374: Incomplete Erasure Coded (EC) pipelines are cached, causing persistent "insufficient datanodes" errors

Previously, the Ozone Manager (OM) container pipeline cache could store incomplete Erasure Coded (EC) pipelines if certain DataNodes had not yet reported their status. This resulted in false "insufficient datanodes" errors when clients attempted to read EC files. Because these incomplete pipelines remained cached for up to 6 hours, the read failures would persist long after the physical DataNodes had returned to a normal, operational state. This issue has been resolved by ensuring that incomplete EC pipelines are excluded from the container pipeline cache.
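
The idea of excluding incomplete entries from a cache can be sketched as follows. This is a simplified model, not the OM container pipeline cache: the class names are hypothetical, and "complete" is assumed here to mean that the number of reported nodes has reached the number the EC scheme requires (for example, 5 for a 3 data + 2 parity layout).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class EcPipelineCacheSketch {
    /** Hypothetical, simplified pipeline: only node counts are modeled. */
    static final class Pipeline {
        final int reportedNodes;
        final int requiredNodes;

        Pipeline(int reportedNodes, int requiredNodes) {
            this.reportedNodes = reportedNodes;
            this.requiredNodes = requiredNodes;
        }

        // Assumption: a pipeline is complete once every required node has reported.
        boolean isComplete() {
            return reportedNodes >= requiredNodes;
        }
    }

    private final Map<Long, Pipeline> cache = new ConcurrentHashMap<>();

    /** Caches the pipeline only if it is complete; returns whether it was cached. */
    boolean cacheIfComplete(long containerId, Pipeline pipeline) {
        if (!pipeline.isComplete()) {
            return false; // skip caching so the next lookup fetches a fresh pipeline
        }
        cache.put(containerId, pipeline);
        return true;
    }

    boolean isCached(long containerId) {
        return cache.containsKey(containerId);
    }

    public static void main(String[] args) {
        EcPipelineCacheSketch sketch = new EcPipelineCacheSketch();
        System.out.println(sketch.cacheIfComplete(1L, new Pipeline(3, 5))); // false
        System.out.println(sketch.cacheIfComplete(2L, new Pipeline(5, 5))); // true
    }
}
```

Because an incomplete pipeline is never stored, a later read triggers a new lookup instead of reusing a stale entry until it expires.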

Apache JIRA: HDDS-11209

CDPD-101434: Risk of missing blocks in Ratis pipelines due to initial under-replication

Previously, when data was written through a Ratis pipeline, blocks were occasionally written only to the leader DataNode, despite the system returning a successful write acknowledgment to the client. Consequently, if the leader DataNode failed or went offline before the data was replicated to other nodes, the associated blocks or chunk data went missing, resulting in temporary data unavailability. This vulnerable state persisted until the Storage Container Manager (SCM) background processes detected the under-replicated container and completed the proper replication. This issue is now fixed.

Apache JIRA: HDDS-15052