Fixed issues in Ozone

CDPD-84457: Recon logs can be flooded by Negative usedBytes WARN messages in large Ozone clusters

7.3.2

Previously, in Ozone Recon, frequent “Negative usedBytes … treating it as 0” messages were logged at the WARN level and could flood Recon logs in large clusters. This issue is now fixed.

Apache JIRA: HDDS-13220

CDPD-80567: Snapshot garbage collection fails to reclaim storage

7.3.2

Previously, multiple issues prevented the snapshot garbage collection system from identifying and removing deleted data. This issue is now resolved. Improvements to the efficiency and reliability of the snapshot garbage collection process ensure that storage is reclaimed in a timely manner, resulting in better overall performance.

Apache JIRA: HDDS-12558

CDPD-84361: KeyDeletingService fails when the key size exceeds Ratis buffer

7.3.2

Previously, when the KeyDeletingService was fetching keys to be deleted based on keyLimitPerTask, the deletion operation failed if the key size exceeded the Ratis buffer limit (default 32 MB). This issue is now fixed. The key deletion operations no longer depend on the Ratis buffer size.

Apache JIRA: HDDS-13213

CDPD-80739: Ozone Recon - Containers page displays incorrect labels for unhealthy containers

7.3.2

Previously, the Ozone Recon UI incorrectly displayed the Number of Keys label instead of the Number of Blocks label for containers in various unhealthy states. This issue is now fixed. The labels now display the correct information.

Apache JIRA: HDDS-12588

CDPD-84620: Ozone Recon returns 500 error ServiceNotReadyException on /keys/open during NSSummary tree rebuild

7.3.2

Previously, Ozone Recon returned an HTTP 500 error with a ServiceNotReadyException when the /keys/open API was called while the NSSummary tree was being rebuilt or was temporarily inconsistent. This issue is now fixed.

Apache JIRA: HDDS-13763

CDPD-87883: The processed_keys_metrics table fails to update when converting deleted keys

7.3.2

Previously, the processed_keys_metrics table failed to record details when the Ozone tiering workflow attempted to convert deleted keys. This occurred because deleted keys lacked required fields, such as replication type or replication factor. This issue is now fixed, and the processed_keys_metrics table updates correctly.

CDPD-69122: Ozone Manager database checkpoint generation failure

7.3.2

Previously, the Ozone Manager database checkpoint generation failed due to an InterruptedException Unable to process metadata snapshot request error during parallel snapshot operations or cluster restarts. This issue is now fixed.

Apache JIRA: HDDS-10739

CDPD-92017: Lower Ozone versions cannot process ozone.om.group.rights default value

7.3.2

Previously, lower versions of Ozone could not process the ozone.om.group.rights configuration when it was set to READ, LIST. This issue is now fixed by setting the default value to ALL.

CDPD-75981: Default native ACL limits to user and user's primary group

7.3.2

Previously, the default native ACLs for an object, such as a volume, bucket, or file, were limited to the object owner and the owner's primary group. If Ranger was enabled, these ACLs did not take effect, but were saved to KeyInfo regardless. This issue is now fixed.

Apache JIRA: HDDS-11656

CDPD-87831: SCM over-schedules replications to full DataNodes

7.3.2

Previously, Storage Container Manager (SCM) scheduled replication commands to fix under-replication or mis-replication for container moves, decommissioning, and other operations for both Ratis and EC containers. SCM checked whether a target DataNode had space equal to twice the container size value before selecting it as the target node for container replication. However, SCM did not account for the pending operation size of the scheduled tasks. Consequently, SCM could over-schedule replications to a target DataNode that did not have enough space. This issue is now fixed.

Apache JIRA: HDDS-13437
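
The space check described above can be illustrated with a minimal sketch. This is not the SCM implementation: the class, the field names, and the has_room helper are hypothetical, and the only rule taken from the description is that a target needs twice the container size and that already scheduled replications should count against its free space.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class DataNodeSpace:
    """Hypothetical view of a target DataNode's space (illustrative names)."""
    free_bytes: int
    pending_replication_bytes: int = 0  # size of replications already scheduled


def has_room(node: DataNodeSpace, container_size: int,
             count_pending: bool = True) -> bool:
    """Return True if the node can accept one more container replica.

    The target must have twice the container size available. When
    count_pending is True, space reserved by already scheduled
    replications is subtracted first.
    """
    available = node.free_bytes
    if count_pending:
        available -= node.pending_replication_bytes
    return available >= 2 * container_size


node = DataNodeSpace(free_bytes=12 * GB, pending_replication_bytes=5 * GB)
old_decision = has_room(node, 5 * GB, count_pending=False)  # ignores pending work
new_decision = has_room(node, 5 * GB)                        # accounts for pending work
```

In this example the node looks acceptable when pending work is ignored, but is rejected once the 5 GB of scheduled replication is counted.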

CDPD-80178: Missing check for space availability for all DNs while container creation is in pipeline

7.3.2

Previously, if the leader node in the pipeline did not have the capacity to create a new container, it might have returned a container creation failure. If the follower node did not have the capacity to create a new container, it might have failed and repeatedly attempted to find another follower node. This behavior could cause excessive disk space consumption by parallel write blocks through a state machine, resulting in slower write performance and delayed failure responses. This issue is now fixed by checking whether a DataNode has enough space for a new container before allocating one. This improves write performance and reduces container creation failures in scenarios where DataNodes have less than 5 GB of disk space remaining.

Apache JIRA: HDDS-12468

CDPD-87749: No logs are available about on-demand scan triggering

7.3.2

Previously, no logs or debug information existed to explain why on-demand scans were triggered on the containers. This issue is now fixed, and logs are available specifying the reason for on-demand container scans.

Apache JIRA: HDDS-13423

CDPD-85250: The OzoneTokenIdentifier does not serialize or deserialize correctly

7.3.2

Previously, a null omServiceId was deserialized as an empty string, which caused delegation token cleanup issues in RocksDB. This issue is now fixed.

Apache JIRA: HDDS-13264

CDPD-82295: AWS S3 DeleteObject failures for FSO bucket keys containing special characters

7.3.2

Previously, AWS S3 DeleteObject could fail for File System Optimized (FSO) bucket keys containing special characters. This issue is now fixed by removing name validation during deletion.

Apache JIRA: HDDS-12911

CDPD-74686: DirectoryDeletion task ignored by Ratis

7.3.2

Previously, directory deletion tasks were ignored by Ratis, leading to repeated deletion retries instead of actual deletion. This issue is now resolved.

Apache JIRA: HDDS-11491

CDPD-74685: Directory deletion fails with millions of directories

7.3.2

Previously, background directory deletion cleanup failed when attempting to delete millions of empty directories because their combined metadata size exceeded the allowed Ratis request size. This issue is now resolved.

Apache JIRA: HDDS-11492

CDPD-87270: Secret key premature expiration and invalidation

7.3.2

Previously, secret keys could expire before the end of a delegation token lifetime, causing premature authentication failures. This issue is now fixed. The secret key expiry calculation (hdds.secret.key.expiry.duration) is adjusted to 9 days. This ensures that tokens remain valid for their full configured duration and keeps authentication stable.

Apache JIRA: HDDS-13343

CDPD-76523: The ozone debug ldb --with-keys option defaults to false instead of true

7.3.2

Previously, the ozone debug ldb --with-keys option defaulted to false when specified without a value and did not print the keys. This issue is now fixed. The option defaults to true when specified without a value and includes keys in the output by default.

Apache JIRA: HDDS-11782

CDPD-84609: The --output-dir option is unavailable for replicas verify command

7.3.2

Previously, the ozone debug replicas verify command did not support the --output-dir option. This issue is now fixed. The --output-dir option is now available as an optional parameter for the replicas verify command.

Apache JIRA: HDDS-13248

CDPD-76520: DataNode aborts if hdds.datanode.wait.on.all.followers = true

7.3.2

Previously, the DataNode aborted if the hdds.datanode.wait.on.all.followers configuration was set to true. This issue is now fixed.

Apache JIRA: HDDS-11785

CDPD-76501: DataNode Ratis is taking snapshots frequently

7.3.2

Previously, DataNode Ratis was taking snapshots every 5 to 8 seconds, causing overhead. This issue is now fixed. The hdds.ratis.snapshot.threshold and hdds.container.ratis.statemachine.max.pending.apply-transactions configuration limits are increased to 100k to avoid taking frequent DataNode Ratis snapshots.

Apache JIRA: HDDS-11773

CDPD-75112: HBase RegionServer crashes due to inconsistency caused by Ozone client failover handling

7.3.2

Previously, the HBase RegionServer crashed due to inconsistencies caused by Ozone client failover handling. This issue is now fixed by making the Ozone Manager client retry idempotent, which prevents the client from crashing when encountering inconsistent results.

Apache JIRA: HDDS-11558

CDPD-77938: Local Refresh button for current selected path is missing in the new Ozone Recon UI

7.3.2

Previously, refreshing the Recon UI page reset the current path selection and returned to the root directory, causing loss of context and requiring manual navigation. This issue is now fixed. A Path Reload button is introduced on the Namespace page of the new Recon UI.

Apache JIRA: HDDS-12085

CDPD-77728: Calendar disappears while setting custom date range in the Heatmap page in new Recon UI

7.3.2

Previously, setting the custom date range in the Heatmap page of the new Recon UI caused the calendar widget to close unexpectedly. Specifically, clicking the back arrow to navigate to a previous month in the date picker caused the entire calendar and the drop-down menu to disappear, preventing date selection. This issue is now fixed, and the calendar remains visible until a date is selected and confirmed, allowing users to set custom date ranges as intended.

Apache JIRA: HDDS-12044

CDPD-77356: Recon UI displayed identical and duplicate values for Quota Allowed and Quota In Bytes

7.3.2

Previously, in the Ozone Recon UI, the Quota Allowed and Quota In Bytes fields incorrectly displayed the same value. This duplication prevented you from accurately distinguishing between the allocated quota and the actual consumed disk space. This issue is now fixed, and the Recon UI displays the values correctly.

Apache JIRA: HDDS-11987

CDPD-74437: Multiple IOzoneAuthorizer instances might be created during Ratis snapshot installation failures

7.3.2

Previously, if a failure occurred during the installation of a Ratis snapshot after the metadata manager was stopped, multiple instances of the Ozone authorizer could be created and retained in memory. This led to excessive heap usage and, in some cases, crashes due to long garbage collection pauses, especially in environments with Ranger and Ozone integration. The issue is now fixed, and the old authorizer instances are properly cleaned up, preventing heap exhaustion.

Apache JIRA: HDDS-11472

CDPD-92003: Container Size Count Task appears empty in the new Recon UI

7.3.2

Previously, in the Ozone Recon UI, the Container Size Count Task page was displayed empty when accessed through the new user interface. This issue is now fixed.

Apache JIRA: HDDS-13821

CDPD-88628: Ozone Recon Overview page does not load until all APIs are loaded

7.3.2

Previously, the Recon Overview page waited for all API calls to complete before displaying any results, causing delays and poor responsiveness. This issue is now fixed, and each card on the Overview page now loads independently as soon as its corresponding API call resolves. This change improves overall page responsiveness and ensures that API errors only affect the relevant cards, rather than preventing the entire page from loading.

Apache JIRA: HDDS-13542
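
The independent loading behavior described above can be sketched as follows. This is only an illustration of the pattern, not the Recon UI code: the function names, the card names, and the stub API calls are hypothetical.

```python
import asyncio


async def load_card(name, fetch, cards):
    """Fill one card as soon as its own call resolves; keep its errors local."""
    try:
        cards[name] = await fetch()
    except Exception as exc:
        # A failing API call only marks this card, not the whole page.
        cards[name] = f"unavailable: {exc}"


async def load_overview(fetchers):
    """Start every card's call concurrently and return the filled cards."""
    cards = {}
    await asyncio.gather(*(load_card(n, f, cards) for n, f in fetchers.items()))
    return cards


async def fetch_container_count():  # hypothetical stub for a healthy API
    return 42


async def fetch_broken_api():  # hypothetical stub for a failing API
    raise RuntimeError("API error")


cards = asyncio.run(load_overview({
    "containers": fetch_container_count,
    "pipelines": fetch_broken_api,
}))
```

Here the failing call leaves an error marker on its own card while the other card still receives its value.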

CDPD-88541: Namespace Usage page becomes blank when Recon DB is missing

7.3.2

Previously, the Namespace Usage page could appear blank if the Recon DB was missing during a fresh installation. This issue is now fixed.

Apache JIRA: HDDS-13528

CDPD-88383: Accessing the new Ozone Recon UI through Knox breaks the UI

7.3.2

Previously, accessing the new Ozone Recon UI through a reverse proxy such as Knox caused the UI to break. This issue is now fixed.

Apache JIRA: HDDS-13512

CDPD-56281: Ozone Manager database updates are blocked while Recon is reprocessing all Recon tasks

7.3.2

Previously, when Recon was reprocessing all Recon tasks, Ozone Manager database updates were blocked, which could cause repeated full snapshots and impact performance. This issue is now fixed by allowing Ozone Manager database updates to proceed concurrently with Recon task processing, preventing unnecessary full snapshots and improving system efficiency.

Apache JIRA: HDDS-8633

CDPD-77805: Improper error handling in the NSSummaryTask

7.3.2

Previously, improper error handling in the NSSummaryTask could lead to data inconsistencies in Ozone Recon. This issue is now fixed, and Ozone Recon handles errors in the NSSummaryTask robustly.

Apache JIRA: HDDS-12062

CDPD-80826: Ozone Recon fails during the bootstrapping process

7.3.2

Previously, Ozone Recon did not properly handle failures that occurred during the bootstrapping process. This issue is now fixed. If an Ozone Manager (OM) task fails during bootstrapping, Recon now correctly handles and reprocesses the task to ensure a successful start. Additionally, if Recon receives a partial or corrupted OM database tarball, it cleans up the corrupted file and restarts the fetch process from scratch to maintain data consistency and integrity.

Apache JIRA: HDDS-12615

CDPD-76226: The Recon ListKeys API returns an inappropriate HTTP response

7.3.2

Previously, the Recon ListKeys API did not return an appropriate HTTP response when an NSSummary rebuild was in progress. This issue is now fixed. The API now returns the 503 (Service Unavailable) HTTP status code to indicate that the service is temporarily unavailable due to the ongoing NSSummary rebuild. This allows clients to properly handle the too busy or try again later scenario.

Apache JIRA: HDDS-11708
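
A client can treat the 503 response as a signal to try again later. The following is a minimal sketch under stated assumptions: call_with_retry and the stubbed responses are hypothetical, and the fetch callable stands in for whatever HTTP client issues the ListKeys request.

```python
import time
from typing import Callable, Tuple


def call_with_retry(fetch: Callable[[], Tuple[int, str]],
                    attempts: int = 3,
                    delay_seconds: float = 0.0) -> Tuple[int, str]:
    """Call fetch up to `attempts` times, retrying only on HTTP 503."""
    status, body = fetch()
    for _ in range(attempts - 1):
        if status != 503:
            break
        time.sleep(delay_seconds)  # back off before asking again
        status, body = fetch()
    return status, body


# Stubbed responses: two "rebuild in progress" replies, then success.
responses = iter([
    (503, "NSSummary rebuild in progress"),
    (503, "NSSummary rebuild in progress"),
    (200, "[]"),
])
result = call_with_retry(lambda: next(responses), attempts=5)
```

Any status other than 503 is returned to the caller immediately, so genuine errors are not retried.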

CDPD-76248: The default volume choosing policy is not updated correctly in the ozone-default.xml

7.3.2

Previously, the ozone-default.xml file incorrectly listed the RoundRobinVolumeChoosingPolicy as the default volume choosing policy. This policy did not consider available volume space during container creation or replication, which could result in block allocation failures (though retried) or the creation of small containers. This issue is now fixed. The default volume choosing policy is changed to CapacityVolumeChoosingPolicy in the ozone-default.xml file. This ensures that available capacity is now taken into account during container allocation, improving reliability and resource utilization.

Apache JIRA: HDDS-11735

CDPD-73809: Multithreading issues in the ContainerBalancerTask

7.3.2

Previously, concurrent access to shared data structures in the getCurrentIterationsStatistic method could cause unpredictable errors. This issue is now fixed. Inside the getCurrentIterationsStatistic method, the system now ensures thread safety by synchronizing access to the iterationsStatistic list and using ConcurrentHashMap for concurrent access to maps from findTargetStrategy and findSourceStrategy.

Apache JIRA: HDDS-11386

CDPD-88723: The FSORepairTool fails to distinguish Unreachable and Unreferenced objects

7.3.2

Previously, the FSORepairTool logic to distinguish between Unreachable and Unreferenced objects was incorrect. This issue is now fixed, and the logic is corrected. Unreachable objects are not marked for repair because background cleanup processes eventually handle them, while objects that are neither reachable nor unreachable are classified as unreferenced and marked for repair.

Apache JIRA: HDDS-13549

CDPD-87575: The ozone admin container create command runs forever without kinit

7.3.2

Previously, the ozone admin container create command ran indefinitely on secure Ozone clusters with multiple SCM nodes if authentication failed, for example, when kinit was not performed. This issue was specifically observed in SCM HA cluster configurations. This issue is now fixed, and the retry logic is updated to fail fast on authentication exceptions, providing immediate feedback to you instead of hanging.

Apache JIRA: HDDS-13405

CDPD-90362: Container Balancer stop command fails with an error

7.3.2

Previously, the stopBalancer command for the Ozone Container Balancer failed with an error if the balancer was already stopped, instead of returning a successful response. This issue is now fixed. The stopBalancer operation is now idempotent and returns success if the balancer is already stopped.

Additionally, a race condition during an SCM leadership change caused the balancer to restart unintentionally due to the persisted state not being updated. This issue is also now resolved. The system correctly persists the stopped state of the balancer, preventing unintended restarts during leadership transitions.

Apache JIRA: HDDS-13694
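
The two behaviors described above, an idempotent stop and a persisted stopped state that survives a leadership change, can be sketched as follows. The class and its members are hypothetical and do not mirror the SCM code.

```python
class BalancerSketch:
    """Hypothetical balancer with an in-memory state and a persisted flag."""

    def __init__(self):
        self.running = False
        self.persisted_should_run = False  # stands in for the persisted state

    def start(self) -> None:
        self.running = True
        self.persisted_should_run = True

    def stop(self) -> bool:
        """Stop the balancer; succeed even if it is already stopped."""
        self.running = False
        self.persisted_should_run = False  # persist the stopped state too
        return True

    def on_leader_change(self) -> None:
        """A new leader resumes the balancer only if the persisted flag says so."""
        self.running = self.persisted_should_run


balancer = BalancerSketch()
balancer.start()
first_stop = balancer.stop()
second_stop = balancer.stop()   # already stopped, still succeeds
balancer.on_leader_change()     # does not restart a stopped balancer
```

Because stop also updates the persisted flag, a later leadership change reads the stopped state and leaves the balancer off.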

CDPD-89400: DataNode pipeline closes frequently

7.3.2

Previously, the DataNode (DN) Ratis repeatedly triggered Close Pipeline actions when it identified issues with a pipeline, such as a slow follower, prolonged leader election, or disk failures, even if a close action was already pending in the DN command queue. This could result in excessive close actions being queued on every heartbeat, leading to inefficiency and potential command queue bloat. The issue is now fixed. A check is introduced to ensure that a Close Pipeline action for a specific pipeline is not added to the command queue if one is already pending, preventing redundant triggers and optimizing the signaling mechanism.

Apache JIRA: HDDS-13618
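
The pending-action check described above can be sketched as a queue that refuses duplicates. The class, the method name, and the tuple representation of an action are hypothetical and only illustrate the idea.

```python
class ActionQueueSketch:
    """Hypothetical DataNode action queue that skips duplicate close actions."""

    def __init__(self):
        self.pending = []

    def add_close_pipeline(self, pipeline_id: str) -> bool:
        """Queue a Close Pipeline action unless one is already pending.

        Returns True if the action was queued, False if it was a duplicate.
        """
        action = ("CLOSE_PIPELINE", pipeline_id)
        if action in self.pending:
            return False
        self.pending.append(action)
        return True


queue = ActionQueueSketch()
# The same pipeline problem is detected on three consecutive heartbeats.
queued = [queue.add_close_pipeline("pipeline-1") for _ in range(3)]
queue.add_close_pipeline("pipeline-2")
```

Only the first request for each pipeline is queued, so repeated detections no longer grow the queue.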

CDPD-80991: Non-administrative users could attempt to perform OM decommission

7.3.2

Previously, non-administrative users could attempt to perform OM decommission, which could lead to unauthorized or unintended changes. This issue is now fixed. Only users with administrative privileges are authorized to perform OM decommission actions, enhancing the security and integrity of cluster management.

Apache JIRA: HDDS-12646

Cloudera Runtime 7.3.2