Fixed issues in Ozone

Review the list of Ozone issues that are resolved in Cloudera Runtime 7.3.2, its service packs and cumulative hotfixes.

Cloudera Runtime 7.3.2

Cloudera Runtime 7.3.2 resolves Ozone issues and incorporates fixes from the service packs and cumulative hotfixes from 7.3.1.100 through 7.3.1.706. For a comprehensive record of all fixes in Cloudera Runtime 7.3.1.x, see Fixed Issues.

CDPD-80567: Snapshot garbage collection issues
7.3.2
Resolved multiple issues that prevented the snapshot garbage collection system from collecting garbage. This release improves the efficiency and reliability of snapshot garbage collection, resulting in better performance and ensuring that storage is reclaimed in a timely manner.
Apache JIRA: HDDS-12558
CDPD-84361: The KeyDeletingService failed when the key size exceeded the Ratis buffer size
7.3.2
Previously, when the KeyDeletingService fetched keys to be deleted based on keyLimitPerTask, the deletion operation failed if the key size exceeded the Ratis buffer limit (default 32 MB). This issue is now fixed, and deletion no longer fails when the key size exceeds the Ratis buffer size.
Apache JIRA: HDDS-13213
CDPD-80739: Ozone Recon - Containers page displayed an incorrect label for the count of blocks in various unhealthy container states
7.3.2
Previously, the Ozone Recon UI showed the Number of Keys label instead of the Number of Blocks label for various unhealthy container states. This issue is now fixed, and the labels are displayed correctly.
Apache JIRA: HDDS-12588
CDPD-84620: Ozone Recon is unable to serve requests and throws a 500 error ServiceNotReadyException on /keys/open
7.3.2
Previously, Ozone Recon would return an HTTP 500 error with a ServiceNotReadyException when the /keys/open API was called while the NSSummary tree was being rebuilt or was temporarily inconsistent. This issue is fixed now.
Apache JIRA: HDDS-13763
CDPD-87883: The processed_keys_metrics table failed to update when converting deleted keys
7.3.2
Previously, when the Ozone tiering workflow attempted to convert deleted keys, the conversion failed as expected; however, the failure details were not recorded in the processed_keys_metrics table. This occurred because deleted keys lacked required fields, such as replication type and replication factor. This issue is now fixed, and the processed_keys_metrics table is updated correctly.
CDPD-69122: Ozone Manager database checkpoint generation failure
7.3.2
Previously, Ozone Manager database checkpoint generation failed with an InterruptedException (Unable to process metadata snapshot request) during parallel snapshot operations and cluster restarts. This issue is now fixed.
Apache JIRA: HDDS-10739
CDPD-92017: Set ozone.om.group.rights configuration default value to ALL
7.3.2
Previously, the ozone.om.group.rights configuration value READ, LIST could not be processed by lower versions of Ozone. This issue is now fixed by setting the default value of the ozone.om.group.rights configuration to ALL.
CDPD-75981: Default native ACL limits to user and user's primary group
7.3.2
Previously, the default native ACL created for an object such as a volume, bucket, or file was limited to the object owner and the owner's primary group. If Ranger was enabled, those ACLs did not take effect, but they were saved into KeyInfo regardless. This issue is now fixed.
Apache JIRA: HDDS-11656
CDPD-87831: Storage Container Manager (SCM) schedules replications on full DataNodes by failing to track pending operation size
7.3.2
Previously, Storage Container Manager (SCM) scheduled replication commands to fix under-replication or misreplication for container moves, decommissioning, and other operations for both Ratis and EC containers. SCM checked whether a target DataNode had space equal to twice the container size value before selecting it as the target node for container replication. However, SCM did not account for the pending operation size of the scheduled tasks. Consequently, SCM could over-schedule replications to a target DataNode that did not have enough space. This issue is now fixed.
Apache JIRA: HDDS-13437
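The space check described above can be sketched as follows. This is a minimal illustration and not SCM's actual code: the function name, its parameters, and the way pending bytes are tracked are assumptions. Only the "twice the container size" rule comes from the description.

```python
def can_accept_replica(free_bytes, pending_bytes, container_size):
    """Return True if a DataNode can take one more container replica.

    free_bytes: space currently free on the target DataNode.
    pending_bytes: total size of replications already scheduled to this
        node but not yet completed (hypothetical bookkeeping).
    container_size: configured container size.
    """
    required = 2 * container_size
    # Space already promised to pending operations is not really free.
    return free_bytes - pending_bytes >= required


GB = 1024 ** 3
# 12 GB free, 5 GB containers, two replications (10 GB) already pending.
print(can_accept_replica(12 * GB, 0, 5 * GB))        # pending work ignored
print(can_accept_replica(12 * GB, 10 * GB, 5 * GB))  # pending work counted
```

In this sketch, the same node passes the check when pending work is ignored and is rejected once the scheduled replications are subtracted from its free space.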
CDPD-80178: Missing check for space availability on all DataNodes during container creation in a pipeline
7.3.2
Previously, if the leader node in the pipeline did not have the capacity to create a new container, it might have returned a container creation failure. If the follower node did not have the capacity to create a new container, it might have failed and repeatedly attempted to find another follower node. This behavior could cause excessive disk space consumption by parallel write blocks through a state machine, resulting in slower write performance and delayed failure responses. This issue is now fixed by checking whether a DataNode has enough space for a new container before allocating one. This improves write performance and reduces container creation failures in scenarios where DataNodes have less than 5 GB of disk space remaining.
Apache JIRA: HDDS-12468
CDPD-87749: No logs are available about on-demand scan triggering
7.3.2
Previously, there were no logs or debug information about why the on-demand scans were triggering on the containers. This issue is fixed now, and logs are available specifying the reason for on-demand container scans.
Apache JIRA: HDDS-13423
CDPD-85250: The OzoneTokenIdentifier does not serialize or deserialize correctly
7.3.2
Previously, a null omServiceId was deserialized as an empty string, which could cause delegation token cleanup issues in RocksDB. This issue is now fixed.
Apache JIRA: HDDS-13264
CDPD-82295: AWS S3 DeleteObject failures for FSO bucket keys containing special characters
7.3.2
Previously, AWS S3 DeleteObject could fail for FSO bucket keys containing special characters. This issue has been fixed now by removing name validation during deletion.
Apache JIRA: HDDS-12911
CDPD-74686: DirectoryDeletion task ignored through Ratis
7.3.2
Previously, the directory deletion task was ignored when submitted through Ratis, and the deletion was retried instead of actually being performed. This issue is now resolved.
Apache JIRA: HDDS-11491
CDPD-74685: Directory deletion gets stuck when there are millions of directories
7.3.2
Previously, background directory deletion cleanup failed if there was a very large number of empty directories (millions) whose combined metadata size exceeded the request size that Ratis allows for deletion. This issue is now resolved.
Apache JIRA: HDDS-11492
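The note above does not say how the fix is implemented. One common way to keep a request under a size limit is to split the work into batches, as in the minimal sketch below. The function name, the size callback, and the limit are all hypothetical and are not taken from Ozone.

```python
def batch_by_size(items, size_of, max_bytes):
    """Yield lists of items whose combined size stays within max_bytes.

    An item that is larger than max_bytes on its own is still emitted,
    alone in its batch, so that no item is silently dropped.
    """
    batch, batch_bytes = [], 0
    for item in items:
        item_bytes = size_of(item)
        # Close the current batch if adding this item would exceed the limit.
        if batch and batch_bytes + item_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(item)
        batch_bytes += item_bytes
    if batch:
        yield batch


# Directory names stand in for deletion metadata; len() stands in for its size.
for request in batch_by_size(["aa", "bbb", "c", "dddd"], len, max_bytes=5):
    print(request)
```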
CDPD-87270: Secret key premature expiration and invalidation
7.3.2
Resolved the secret key issue and adjusted the secret key expiry calculation (hdds.secret.key.expiry.duration = 9d) to account for delegation token lifetimes. This change ensures that tokens remain valid for their full configured duration, preventing premature key invalidation and improving authentication stability.
Apache JIRA: HDDS-13343
CDPD-76523: ozone debug ldb --with-keys defaults to false instead of true
7.3.2
Previously, the ozone debug ldb --with-keys option defaulted to false instead of true when specified without a value, and the keys were not printed. This issue is now fixed, and the ozone debug ldb --with-keys option defaults to true when specified without a value, so the output includes keys.
Apache JIRA: HDDS-11782
CDPD-84609: The --output-dir option was removed from replicas verify command
7.3.2
Previously, the --output-dir option of the ozone debug replicas verify command was not used by the command. Because writing the output to a specified file will be supported in the future, the option is made optional rather than removed. This issue is now fixed: the --output-dir option is optional for the replicas verify command and is no longer a required field.
Apache JIRA: HDDS-13248
CDPD-76520: DataNode aborts if hdds.datanode.wait.on.all.followers = true
7.3.2
Previously, the DataNode aborted if the hdds.datanode.wait.on.all.followers configuration was set to true. This issue is fixed and the DataNode no longer aborts if the hdds.datanode.wait.on.all.followers configuration is set to true.
Apache JIRA: HDDS-11785
CDPD-76501: DataNode Ratis is taking snapshots frequently
7.3.2
Previously, DataNode Ratis was taking snapshots every 5 to 8 seconds causing overhead. This issue is fixed now and the hdds.ratis.snapshot.threshold and hdds.container.ratis.statemachine.max.pending.apply-transactions configuration limits are increased to 100k to avoid frequent DataNode Ratis snapshotting.
Apache JIRA: HDDS-11773
CDPD-75112: HBase RegionServer crashes due to inconsistency caused by Ozone client failover handling
7.3.2
Previously, the HBase RegionServer crashed due to an inconsistency caused by Ozone client failover handling. This issue is now fixed by making the Ozone Manager client retry idempotent so that clients do not crash due to inconsistent results.
Apache JIRA: HDDS-11558
CDPD-77938: Local Refresh button for current selected path is missing in the new Ozone Recon UI
7.3.2
Previously, refreshing the Recon UI page would reset the current path selection and return users to the root directory, causing loss of context and requiring manual navigation. This issue is now fixed, and a new Path Reload button is added to the Namespace page in the new Recon UI.
Apache JIRA: HDDS-12085
CDPD-77728: Calendar disappears while setting custom date range in the Heatmap page in New Recon UI
7.3.2
Previously, setting the custom date range in the Heatmap page of the new Recon UI would cause the calendar widget to close unexpectedly. Specifically, when you clicked the back arrow to navigate to a previous month in the datepicker, the entire calendar and dropdown would disappear, preventing date selection. This issue is fixed, and the calendar remains visible until a date is selected and confirmed, allowing users to set custom date ranges as intended.
Apache JIRA: HDDS-12044
CDPD-77356: Recon UI displayed identical and duplicate values for Quota Allowed and Quota In Bytes
7.3.2
Previously, in the Ozone Recon UI, the Quota Allowed and Quota In Bytes fields incorrectly displayed the same value. This duplication prevented users from accurately distinguishing between the allocated quota and the actual disk space consumed. This issue is fixed now and the Recon UI displays the values correctly.
Apache JIRA: HDDS-11987
CDPD-74437: Multiple IOzoneAuthorizer instances may be created during Ratis snapshot installation failures
7.3.2
Previously, if a failure occurred during the installation of a Ratis snapshot after the metadata manager was stopped, multiple instances of the Ozone authorizer could be created and retained in memory. This led to excessive heap usage and, in some cases, crashes due to long garbage collection pauses, especially in environments with Ranger+Ozone integration. The issue is fixed, and the old authorizer instances are properly cleaned up, preventing heap exhaustion.
Apache JIRA: HDDS-11472
CDPD-92003: Container Size Count Task showing empty in new Recon UI
7.3.2
Previously, in the Ozone Recon UI, the Container Size Count Task page appeared empty when accessed through the new user interface. This issue is fixed now.
Apache JIRA: HDDS-13821
CDPD-88628: Ozone Recon Overview page does not load until all APIs are loaded
7.3.2
Previously, the Recon Overview page waited for all API calls to complete before displaying any results, causing delays and poor responsiveness. This issue is fixed now, and each card on the Overview page now loads independently as soon as its corresponding API call resolves. This change improves overall page responsiveness and ensures that API errors only affect the relevant cards, rather than preventing the entire page from loading.
Apache JIRA: HDDS-13542
CDPD-88541: Namespace Usage page becomes blank when Recon DB is missing
7.3.2
Previously, the Namespace Usage page could appear blank if the Recon DB was missing during a fresh installation. This issue is fixed now.
Apache JIRA: HDDS-13528
CDPD-88383: Accessing the new Ozone Recon UI through Knox breaks the UI
7.3.2
Previously, accessing the new Ozone Recon UI through a reverse proxy such as Knox caused the UI to break. This issue is fixed now.
Apache JIRA: HDDS-13512
CDPD-56281: Ozone Manager database updates were blocked while Recon was reprocessing all Recon tasks
7.3.2
Previously, when Recon was reprocessing all Recon tasks, Ozone Manager database updates were blocked, which could cause repeated full snapshots and impact performance. This issue has been fixed by allowing Ozone Manager database updates to proceed concurrently with Recon task processing, preventing unnecessary full snapshots and improving system efficiency.
Apache JIRA: HDDS-8633
CDPD-77805: Improper error handling in the NSSummaryTask
7.3.2
Previously, improper error handling in the NSSummaryTask could lead to data inconsistencies in Ozone Recon. This issue is now fixed, and error handling in the NSSummaryTask is more robust.
Apache JIRA: HDDS-12062
CDPD-80826: Ozone Recon failures during the bootstrapping process
7.3.2
Previously, Ozone Recon failures during the bootstrapping process were not handled properly. This issue is now fixed: if any Ozone Manager (OM) task fails during bootstrapping, Recon correctly handles and reprocesses the failed OM tasks to ensure successful bootstrapping. Additionally, if a partial or corrupted OM database tarball is received, Recon cleans up the corrupted file and restarts the fetch process from scratch, maintaining data consistency and integrity.
Apache JIRA: HDDS-12615
CDPD-76226: The Recon ListKeys API did not return an appropriate HTTP response
7.3.2
Previously, the Recon ListKeys API did not return an appropriate HTTP response when an NSSummary rebuild was in progress. With this fix, the API returns HTTP status code 503 (Service Unavailable) to indicate that the service is temporarily unavailable due to the ongoing NSSummary rebuild. This allows clients to properly handle the "too busy" or "try again later" scenario.
Apache JIRA: HDDS-11708
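A client can treat the 503 response as a signal to retry later. The following is a minimal sketch of that pattern and is not part of Recon or any Ozone client library: the function name, the retry count, and the backoff values are assumptions, and the request is modeled as a plain callable so that the example does not depend on a particular HTTP library or endpoint.

```python
import time


def call_with_retry(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request() until it stops returning HTTP 503.

    request: zero-argument callable returning (status_code, body).
    Returns the last (status_code, body) pair, which may still be a 503
    if every attempt was rejected.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = request()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return status, body


# Simulated service: busy twice (rebuild in progress), then available.
responses = iter([(503, "rebuilding"), (503, "rebuilding"), (200, "keys")])
print(call_with_retry(lambda: next(responses), sleep=lambda seconds: None))
```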
CDPD-76248: The default volume choosing policy is not updated correctly in the ozone-default.xml
7.3.2
Previously, the default volume choosing policy was changed from RoundRobinVolumeChoosingPolicy to CapacityVolumeChoosingPolicy. However, ozone-default.xml was not updated correctly and still listed RoundRobinVolumeChoosingPolicy as the default volume choosing policy. RoundRobinVolumeChoosingPolicy did not consider available space on volumes when creating or replicating containers, which could result in allocate block failures (though retried) or the creation of small containers. This issue is now fixed, and the default volume choosing policy in ozone-default.xml is changed from RoundRobinVolumeChoosingPolicy to CapacityVolumeChoosingPolicy. Available capacity is now taken into account during container allocation, improving reliability and resource utilization.
Apache JIRA: HDDS-11735
CDPD-73809: Multithreading issue in the ContainerBalancerTask
7.3.2
Previously, concurrent access to shared data structures in the getCurrentIterationsStatistic method could cause unpredictable errors. This issue is now fixed. The getCurrentIterationsStatistic method is made thread-safe by synchronizing access to the iterationsStatistic list and by using ConcurrentHashMap for concurrent access to the maps from findTargetStrategy and findSourceStrategy.
Apache JIRA: HDDS-11386
CDPD-88723: The FSORepairTool fails to distinguish Unreachable and Unreferenced objects
7.3.2
Previously, the FSORepairTool logic to distinguish between Unreachable and Unreferenced objects was incorrect. This issue is fixed now, and the logic has been corrected. The unreachable objects are not marked for repair as background cleanup processes will eventually handle them, while objects that are neither reachable nor unreachable are classified as unreferenced and marked for repair.
Apache JIRA: HDDS-13549
CDPD-87575: The ozone admin container create command runs forever without kinit
7.3.2
Previously, the ozone admin container create command would run indefinitely on secure Ozone clusters with multiple Storage Container Manager (SCM) nodes if authentication failed (for example, when kinit was not performed). This issue is now fixed, and the retry logic has been updated to fail fast on authentication exceptions, providing immediate feedback to the user instead of hanging. This issue was observed only in SCM HA cluster configurations.
Apache JIRA: HDDS-13405
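The fail-fast behavior can be illustrated with a small retry loop that keeps failing over on connection errors but stops immediately on an authentication error. This is a sketch of the general pattern only: the exception class, function name, and node names are hypothetical and are not the SCM client's actual retry policy.

```python
class AuthenticationError(Exception):
    """Hypothetical stand-in for an authentication failure."""


def call_with_failover(call, nodes, max_rounds=3):
    """Try call(node) on each node in turn, for up to max_rounds rounds."""
    last_error = None
    for _ in range(max_rounds):
        for node in nodes:
            try:
                return call(node)
            except AuthenticationError:
                # Retrying cannot help without valid credentials: fail fast.
                raise
            except ConnectionError as error:
                last_error = error  # try the next node
    raise RuntimeError("all nodes failed") from last_error


attempts = []


def unauthenticated_call(node):
    attempts.append(node)
    raise AuthenticationError("no valid credentials")


try:
    call_with_failover(unauthenticated_call, ["scm1", "scm2", "scm3"])
except AuthenticationError:
    print("failed after", len(attempts), "attempt(s)")
```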
CDPD-90362: Container Balancer stop command fails with an error if the balancer is already stopped
7.3.2
Previously, the stopBalancer command for the Ozone Container Balancer would fail with an error if the balancer was already stopped, instead of returning a successful response. This issue is now fixed. The stopBalancer operation is now idempotent and will return success if the balancer is already stopped.
Additionally, a race condition was resolved where, during an SCM leadership change, the balancer could unintentionally restart due to the persisted state not being updated. Now, the system ensures that the balancer's stopped state is correctly persisted, preventing unintended restarts during leadership transitions.
Apache JIRA: HDDS-13694
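Both parts of this fix can be pictured with a small model: a stop operation that succeeds regardless of the current state, and a persisted flag that a new leader consults before resuming. This is a minimal sketch with hypothetical class, method, and field names. It does not reflect the actual Container Balancer implementation.

```python
class Balancer:
    """Toy model of a balancer whose desired state is persisted."""

    def __init__(self):
        self.running = False
        self.persisted_should_run = False  # stand-in for persisted state

    def start(self):
        self.running = True
        self.persisted_should_run = True

    def stop(self):
        # Idempotent: stopping an already stopped balancer also succeeds.
        self.running = False
        self.persisted_should_run = False
        return True

    def on_leader_change(self):
        # A new leader resumes the balancer only if the persisted state says so.
        self.running = self.persisted_should_run


balancer = Balancer()
balancer.start()
print(balancer.stop(), balancer.stop())  # second stop is not an error
balancer.on_leader_change()
print(balancer.running)  # stays stopped after the leadership change
```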
CDPD-89400: DataNode pipeline closes frequently
7.3.2
Previously, when the DataNode (DN) Ratis identified issues with a pipeline (such as a slow follower, prolonged leader election, or disk failures), it repeatedly triggered Close Pipeline actions, even if a close action was already pending in the DN command queue. This could result in excessive close actions being queued on every heartbeat, leading to inefficiency and potential command queue bloat. This issue is now fixed. A check ensures that a Close Pipeline action for a specific pipeline is not added to the command queue if one is already pending, which prevents redundant triggers and optimizes the signaling mechanism.
Apache JIRA: HDDS-13618
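The pending-action check can be sketched as a queue keyed by pipeline ID, so that a second close request for the same pipeline is dropped until the first one has been sent. The class and method names below are hypothetical and are meant only to illustrate the deduplication idea, not the DataNode's actual queue.

```python
class ActionQueue:
    """Toy queue that holds at most one pending close action per pipeline."""

    def __init__(self):
        self._pending = {}  # pipeline_id -> action, in insertion order

    def add_close_pipeline(self, pipeline_id, reason):
        """Queue a close action; return False if one is already pending."""
        if pipeline_id in self._pending:
            return False
        self._pending[pipeline_id] = ("CLOSE_PIPELINE", pipeline_id, reason)
        return True

    def drain(self):
        """Return all pending actions (for example, on a heartbeat) and clear them."""
        actions = list(self._pending.values())
        self._pending.clear()
        return actions


queue = ActionQueue()
print(queue.add_close_pipeline("pipeline-1", "slow follower"))
print(queue.add_close_pipeline("pipeline-1", "slow follower"))  # duplicate
print(queue.drain())
```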
CDPD-80991: Non-administrative users could attempt to perform Ozone Manager (OM) decommission
7.3.2
Previously, non-administrative users could attempt to perform OM decommission, which could lead to unauthorized or unintended changes. This issue has now been fixed, and only users with administrative privileges are authorized to perform OM decommission actions, enhancing the security and integrity of cluster management.
Apache JIRA: HDDS-12646