Review the list of Ozone issues that are resolved in Cloudera Runtime 7.3.2, its service packs and cumulative hotfixes.
Cloudera Runtime 7.3.2
Cloudera Runtime 7.3.2 resolves Ozone issues and incorporates fixes
from the service packs and cumulative hotfixes from 7.3.1.100 through 7.3.1.706. For
a comprehensive record of all fixes in Cloudera Runtime 7.3.1.x,
see Fixed Issues.
- CDPD-80567: Snapshot garbage collection issues
- 7.3.2
- Resolved multiple issues that prevented the
snapshot garbage collection system from collecting garbage. This release
improves the efficiency and reliability of snapshot garbage collection,
resulting in better performance and ensuring that storage is reclaimed in a
timely manner.
- Apache JIRA:
HDDS-12558
- CDPD-84361: The
KeyDeletingService
failed when the key size exceeded the Ratis buffer size
- 7.3.2
- Previously, when the
KeyDeletingService was fetching keys to be deleted
based on keyLimitPerTask, the deletion operation failed if the total key size
exceeded the Ratis buffer limit (default 32 MB). This issue is now fixed, and
the deletion operation no longer depends on the Ratis buffer size.
- Apache JIRA:
HDDS-13213
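The change can be illustrated with a minimal sketch. This is hypothetical logic in Python, not the actual KeyDeletingService implementation: instead of relying on a fixed key count per task, deletion requests are grouped so that each batch stays under the consensus-layer buffer limit. The function and variable names are illustrative assumptions.

```python
# Hypothetical sketch (not Ozone code): batch deletion entries by
# cumulative serialized size instead of a fixed key count, so no
# single batch exceeds the consensus-layer buffer.

RATIS_BUFFER_LIMIT = 32 * 1024 * 1024  # default Ratis buffer size, 32 MB

def batch_keys_by_size(keys, size_of, limit=RATIS_BUFFER_LIMIT):
    """Group keys into batches whose combined size stays under `limit`.

    `keys` is an iterable of key identifiers; `size_of(key)` returns the
    serialized size of that key's deletion entry in bytes.
    """
    batches, current, current_size = [], [], 0
    for key in keys:
        key_size = size_of(key)
        # Flush the current batch before it would overflow the limit.
        # A single oversized key still gets its own batch.
        if current and current_size + key_size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += key_size
    if current:
        batches.append(current)
    return batches

# Example: three 20 MB keys are split so no batch exceeds 32 MB.
sizes = {"a": 20 * 1024**2, "b": 20 * 1024**2, "c": 20 * 1024**2}
print(batch_keys_by_size(sizes, sizes.__getitem__))  # → [['a'], ['b'], ['c']]
```

Size-aware batching of this kind keeps each request within the buffer regardless of how large individual keys are.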
- CDPD-80739: Ozone Recon - Containers page displayed an
incorrect label for the count of blocks in various unhealthy container
states
- 7.3.2
- Previously, the Ozone Recon UI showed the
Number of Keys label instead of the
Number of Blocks for various unhealthy states of
containers. This issue is now fixed and the labels are displayed
correctly.
- Apache JIRA:
HDDS-12588
- CDPD-84620: Ozone Recon was unable to serve requests
and threw a 500 error with
ServiceNotReadyException on
/keys/open
- 7.3.2
- Previously, Ozone Recon would return an HTTP
500 error with a
ServiceNotReadyException when the
/keys/open API was called while the NSSummary tree was
being rebuilt or was temporarily inconsistent. This issue is fixed now.
- Apache JIRA:
HDDS-13763
- CDPD-87883: The
processed_keys_metrics table failed to update when
converting deleted keys
- 7.3.2
- Previously, when the Ozone tiering workflow
attempted to convert the deleted keys, the conversion failed as expected;
however, the failure details were not recorded in the
processed_keys_metrics table. This occurred because
deleted keys lacked required fields, such as replication type, replication
factor, and so on. This issue is now fixed, and the
processed_keys_metrics table is updated
correctly.
- CDPD-69122: Ozone Manager database checkpoint
generation failure
- 7.3.2
- Previously, the Ozone Manager database
checkpoint generation failed with an InterruptedException
(Unable to process metadata snapshot request) during parallel
snapshot operations and cluster restarts. This issue is now fixed.
- Apache JIRA:
HDDS-10739
- CDPD-92017: Set
ozone.om.group.rights
configuration default value to ALL
- 7.3.2
- Previously, the
ozone.om.group.rights configuration value READ,
LIST could not be processed by lower versions of Ozone. This
issue is now fixed by setting the default value of the
ozone.om.group.rights configuration to
ALL.
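For clusters that need to set this value explicitly, the property can be placed in ozone-site.xml. The property name and the ALL value come from this fix; expressing it as an ozone-site.xml override is the usual convention and is shown here as a sketch:

```xml
<!-- ozone-site.xml: grant all native ACL group rights,
     matching the new default introduced by this fix -->
<property>
  <name>ozone.om.group.rights</name>
  <value>ALL</value>
</property>
```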
- CDPD-75981: Default native ACL limits to user and
user's primary group
- 7.3.2
- Previously, the default native ACL created for
an object such as a volume, bucket, or file was limited to the object owner
and the owner's primary group. If Ranger was enabled, those ACLs did not take
effect but were still saved into KeyInfo. This issue is now
fixed.
- Apache JIRA:
HDDS-11656
- CDPD-87831: Storage Container Manager (SCM) scheduled
replications on full DataNodes because it failed to track pending operation size
- 7.3.2
- Previously, Storage Container Manager (SCM)
scheduled replication commands to fix under-replication or misreplication
for container moves, decommissioning, and other operations for both Ratis
and EC containers. SCM checked whether a target DataNode had space equal to
twice the container size value before selecting it as the target node for
container replication. However, SCM did not account for the pending
operation size of the scheduled tasks. Consequently, SCM could over-schedule
replications to a target DataNode that did not have enough space. This issue
is now fixed.
- Apache JIRA:
HDDS-13437
- CDPD-80178: Missing space availability check for
all DataNodes during container creation in a pipeline
- 7.3.2
- Previously, if the leader node in the pipeline
did not have the capacity to create a new container, it might have returned
a container creation failure. If the follower node did not have the capacity
to create a new container, it might have failed and repeatedly attempted to
find another follower node. This behavior could cause excessive disk space
consumption by parallel write blocks through a state machine, resulting in
slower write performance and delayed failure responses. This issue is now
fixed by checking whether a DataNode has enough space for a new container
before allocating one. This improves write performance and reduces container
creation failures in scenarios where DataNodes have less than 5 GB of disk
space remaining.
- Apache JIRA:
HDDS-12468
- CDPD-87749: No logs are available about on-demand scan
triggering
- 7.3.2
- Previously, there were no logs or debug
information about why the on-demand scans were triggering on the containers.
This issue is fixed now, and logs are available specifying the reason for
on-demand container scans.
- Apache JIRA:
HDDS-13423
- CDPD-85250: The OzoneTokenIdentifier does not
serialize or deserialize correctly
- 7.3.2
- Previously, a null
omServiceId was deserialized as an empty string, which could
cause delegation token cleanup issues in RocksDB. This issue is now
fixed.
- Apache JIRA:
HDDS-13264
- CDPD-82295: AWS S3 DeleteObject failures for FSO
bucket keys containing special characters
- 7.3.2
- Previously, AWS S3 DeleteObject could fail for
FSO bucket keys containing special characters. This issue has been fixed now
by removing name validation during deletion.
- Apache JIRA:
HDDS-12911
- CDPD-74686: DirectoryDeletion task ignored through
Ratis
- 7.3.2
- Previously, the directory deletion task was ignored
by Ratis, and the deletion was retried instead of being performed.
This issue is now resolved.
- Apache JIRA:
HDDS-11491
- CDPD-74685: Directory deletion got stuck with
millions of directories
- 7.3.2
- Previously, background directory deletion cleanup
failed when there was a huge number of empty directories (millions),
whose combined metadata size exceeded the request size that Ratis
allows for deletion. This issue is now resolved.
- Apache JIRA:
HDDS-11492
- CDPD-87270: Secret key premature expiration and
invalidation
- 7.3.2
- Resolved the secret key issue and adjusted the
secret key expiry calculation (hdds.secret.key.expiry.duration = 9d) to
account for delegation token lifetimes. This change ensures that tokens
remain valid for their full configured duration, preventing premature key
invalidation and improving authentication stability.
- Apache JIRA:
HDDS-13343
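The adjusted expiry corresponds to the property cited above. The XML form is a sketch of how such a value is typically overridden in ozone-site.xml; only the property name and the 9d value come from this fix:

```xml
<!-- ozone-site.xml: secret key expiry adjusted to account for
     delegation token lifetimes (value cited in this fix) -->
<property>
  <name>hdds.secret.key.expiry.duration</name>
  <value>9d</value>
</property>
```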
- CDPD-76523:
ozone debug ldb
--with-keys option defaults to false instead of
true
- 7.3.2
- Previously, the
ozone debug ldb
--with-keys option defaulted to false instead
of true when specified without a value and did not print
the keys. This issue is now fixed, and the
--with-keys option defaults to
true when specified without a value, so keys are
included in the output by default.
- Apache JIRA:
HDDS-11782
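A hedged usage sketch follows. Only the ozone debug ldb command and the --with-keys flag come from this fix; the database path and the scan subcommand shown are illustrative assumptions and may vary by version:

```shell
# Scan an Ozone RocksDB instance and include keys in the output
# (the default behavior after this fix when the flag has no value).
# The db path below is illustrative; point it at an actual OM database.
ozone debug ldb --db=/var/lib/hadoop-ozone/om/data/om.db scan --with-keys
```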
- CDPD-84609: The
--output-dir option
was made optional for the replicas verify command
- 7.3.2
- Previously, the
--output-dir option of the Ozone debug replicas
verify command was required but not used. Because writing the output to a
specified file will be supported in the future, the option was made optional
instead of being removed. This issue is now fixed: the
--output-dir option is optional for the replicas
verify command and is no longer a required field.
- Apache JIRA:
HDDS-13248
- CDPD-76520: DataNode aborts if
hdds.datanode.wait.on.all.followers = true
- 7.3.2
- Previously, the DataNode aborted if the
hdds.datanode.wait.on.all.followers configuration
was set to
true. This issue is fixed and the DataNode no
longer aborts if the
hdds.datanode.wait.on.all.followers configuration
is set to true.
- Apache JIRA:
HDDS-11785
- CDPD-76501: DataNode Ratis is taking snapshots
frequently
- 7.3.2
- Previously, DataNode Ratis was taking snapshots
every 5 to 8 seconds, causing overhead. This issue is now fixed, and the
hdds.ratis.snapshot.threshold and
hdds.container.ratis.statemachine.max.pending.apply-transactions
configuration limits have been increased to
100k to avoid
frequent DataNode Ratis snapshotting.
- Apache JIRA:
HDDS-11773
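The raised limits map to two properties. These are the new defaults, so no action is needed, but as a sketch they could be pinned explicitly in ozone-site.xml; expressing 100k as the plain number 100000 is an assumption about the exact format:

```xml
<!-- ozone-site.xml: limits raised to reduce DataNode Ratis snapshot
     frequency (100000 as the literal form of 100k is an assumption) -->
<property>
  <name>hdds.ratis.snapshot.threshold</name>
  <value>100000</value>
</property>
<property>
  <name>hdds.container.ratis.statemachine.max.pending.apply-transactions</name>
  <value>100000</value>
</property>
```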
- CDPD-75112: HBase RegionServer crashes due to
inconsistency caused by Ozone client failover handling
- 7.3.2
- Previously, the HBase RegionServer crashed due
to inconsistency caused by Ozone client failover handling. This issue is
fixed now by making the Ozone Manager client retry idempotent so that the
clients do not crash due to inconsistent results.
- Apache JIRA:
HDDS-11558
- CDPD-77938: Local Refresh button for current selected
path is missing in the new Ozone Recon UI
- 7.3.2
- Previously, refreshing the Recon UI page would
reset the current path selection and return users to the root directory,
causing loss of context and requiring manual navigation. This issue is now
fixed, and a new Path Reload button has been added to
the Namespace page in the new Recon UI.
- Apache JIRA:
HDDS-12085
- CDPD-77728: Calendar disappears while setting custom
date range in the Heatmap page in New Recon UI
- 7.3.2
- Previously, setting the custom date range in
the Heatmap page of the new Recon UI would cause the calendar widget to
close unexpectedly. Specifically, when you clicked the back arrow to
navigate to a previous month in the datepicker, the entire calendar and
dropdown would disappear, preventing date selection. This issue is fixed,
and the calendar remains visible until a date is selected and confirmed,
allowing users to set custom date ranges as intended.
- Apache JIRA:
HDDS-12044
- CDPD-77356: Recon UI displayed identical and duplicate
values for Quota Allowed and Quota In Bytes
- 7.3.2
- Previously, in the Ozone Recon UI, the
Quota Allowed and Quota In
Bytes fields incorrectly displayed the same value. This
duplication prevented users from accurately distinguishing between the
allocated quota and the actual disk space consumed. This issue is fixed now
and the Recon UI displays the values correctly.
- Apache JIRA:
HDDS-11987
- CDPD-74437: Multiple IOzoneAuthorizer instances may be
created during Ratis snapshot installation failures
- 7.3.2
- Previously, if a failure occurred during the
installation of a Ratis snapshot after the metadata manager was stopped,
multiple instances of the Ozone authorizer could be created and retained in
memory. This led to excessive heap usage and, in some cases, crashes due to
long garbage collection pauses, especially in environments with Ranger+Ozone
integration. The issue is fixed, and the old authorizer instances are
properly cleaned up, preventing heap exhaustion.
- Apache JIRA:
HDDS-11472
- CDPD-92003: Container Size Count Task showing empty in
new Recon UI
- 7.3.2
- Previously, in the Ozone Recon UI, the
Container Size Count Task page appeared empty when accessed through the new
user interface. This issue is fixed now.
- Apache JIRA:
HDDS-13821
- CDPD-88628: Ozone Recon Overview page does not load
until all APIs are loaded
- 7.3.2
- Previously, the Recon Overview page waited for
all API calls to complete before displaying any results, causing delays and
poor responsiveness. This issue is fixed now, and each card on the Overview
page now loads independently as soon as its corresponding API call resolves.
This change improves overall page responsiveness and ensures that API errors
only affect the relevant cards, rather than preventing the entire page from
loading.
- Apache JIRA:
HDDS-13542
- CDPD-88541: Namespace Usage page becomes blank when
Recon DB is missing
- 7.3.2
- Previously, the Namespace Usage page could
appear blank if the Recon DB was missing during a fresh installation. This
issue is fixed now.
- Apache JIRA:
HDDS-13528
- CDPD-88383: Accessing the new Ozone Recon UI through
Knox breaks the UI
- 7.3.2
- Previously, accessing the new Ozone Recon UI
through a reverse proxy such as Knox caused the UI to break. This issue is
fixed now.
- Apache JIRA:
HDDS-13512
- CDPD-56281: Ozone Manager database updates were
blocked while Recon was reprocessing all Recon tasks
- 7.3.2
- Previously, when Recon was reprocessing all
Recon tasks, Ozone Manager database updates were blocked, which could cause
repeated full snapshots and impact performance. This issue has been fixed by
allowing Ozone Manager database updates to proceed concurrently with Recon
task processing, preventing unnecessary full snapshots and improving system
efficiency.
- Apache JIRA:
HDDS-8633
- CDPD-77805: Improper error handling in the
NSSummaryTask
- 7.3.2
- Previously, improper error handling in the
NSSummaryTask could lead to data inconsistencies in Ozone Recon. This
issue is now fixed, ensuring robust error handling in Ozone Recon.
- Apache JIRA:
HDDS-12062
- CDPD-80826: Ozone Recon failures during the
bootstrapping process
- 7.3.2
- Previously, Ozone Recon failures during the
bootstrapping process were not handled properly. This issue is fixed, and if
any Ozone Manager (OM) task fails during bootstrapping, Recon will now
correctly handle and reprocess failed OM tasks to ensure successful
bootstrapping. Additionally, if a partial or corrupted OM database tarball
is received, Recon will clean up the corrupted file and restart the fetch
process from scratch, maintaining data consistency and integrity.
- Apache JIRA:
HDDS-12615
- CDPD-76226: The Recon ListKeys API did not return an
appropriate HTTP response
- 7.3.2
- Previously, the Recon ListKeys API did not
return an appropriate HTTP response when an NSSummary rebuild was in
progress. With this fix, the API now returns HTTP status code 503 (Service
Unavailable) to indicate that the service is temporarily unavailable due to
the ongoing NSSummary rebuild. This allows clients to properly handle the
too busy or try again later
scenario.
- Apache JIRA:
HDDS-11708
- CDPD-76248: The default volume choosing policy is not
updated correctly in the ozone-default.xml
- 7.3.2
- Previously, the default volume choosing policy
was updated from
RoundRobinVolumeChoosingPolicy to
CapacityVolumeChoosingPolicy. However, the
ozone-default.xml was not updated accordingly and
still listed the RoundRobinVolumeChoosingPolicy as the
default volume choosing policy. The
RoundRobinVolumeChoosingPolicy did not consider
available space on volumes when creating or replicating containers, which
could result in allocate block failures (though retried) or the creation of
small containers. This issue is fixed now, and the default volume choosing
policy has been changed from RoundRobinVolumeChoosingPolicy
to CapacityVolumeChoosingPolicy in the
ozone-default.xml. This ensures that available
capacity is now taken into account during container allocation, improving
reliability and resource utilization.
- Apache JIRA:
HDDS-11735
- CDPD-73809: Multithreading issue in the
ContainerBalancerTask
- 7.3.2
- Previously, the concurrent access to shared
data structures in the
getCurrentIterationsStatistic method
could cause unpredictable errors. This issue is now fixed; thread safety
is ensured inside the
getCurrentIterationsStatistic method by synchronizing
access to the iterationsStatistic list and using
ConcurrentHashMap for concurrent access to maps from
findTargetStrategy and
findSourceStrategy.
- Apache JIRA:
HDDS-11386
- CDPD-88723: The FSORepairTool fails to distinguish
Unreachable and Unreferenced
objects
- 7.3.2
- Previously, the FSORepairTool logic to
distinguish between
Unreachable and
Unreferenced objects was incorrect. This issue is fixed
now, and the logic has been corrected. The unreachable objects are not
marked for repair as background cleanup processes will eventually handle
them, while objects that are neither reachable nor unreachable are
classified as unreferenced and marked for repair.
- Apache JIRA:
HDDS-13549
- CDPD-87575: The ozone admin container
create command runs forever without kinit
- 7.3.2
- Previously, the ozone admin container
create command would run indefinitely on secure Ozone clusters
with multiple Storage Container Manager (SCM) nodes if authentication failed
(for example, when kinit was not performed). This issue is now fixed, and
the retry logic has been updated to fail fast on authentication exceptions,
providing immediate feedback to the user instead of hanging. This issue was
observed only in SCM HA cluster configurations.
- Apache JIRA:
HDDS-13405
- CDPD-90362: Container Balancer stop command failed with
an error when the balancer was already stopped
- 7.3.2
- Previously, the stopBalancer
command for the Ozone Container Balancer would fail with an error if the
balancer was already stopped, instead of returning a successful response.
This issue is now fixed. The stopBalancer operation is
now idempotent and will return success if the balancer is already
stopped.
- Additionally, a race condition was resolved
where, during an SCM leadership change, the balancer could unintentionally
restart due to the persisted state not being updated. Now, the system
ensures that the balancer's stopped state is correctly persisted, preventing
unintended restarts during leadership transitions.
- Apache JIRA:
HDDS-13694
- CDPD-89400: DataNode pipeline closes frequently
- 7.3.2
- Previously, when the DataNode (DN) Ratis
identified issues with a pipeline—such as a slow follower, prolonged leader
election, or disk failures—it repeatedly triggered
Close
Pipeline actions, even if a close action was already pending in
the DN command queue. This could result in excessive close actions being
queued on every heartbeat, leading to inefficiency and potential command
queue bloat. This issue is now fixed by introducing a check to ensure that a
Close Pipeline action for a specific pipeline is not
added to the command queue if one is already pending, thereby preventing
redundant triggers and optimizing the signaling mechanism.
- Apache JIRA:
HDDS-13618
- CDPD-80991: Non-administrative users could attempt to
perform Ozone Manager (OM) decommission
- 7.3.2
- Previously, non-administrative users could
attempt to perform OM decommission, which could lead to unauthorized or
unintended changes. This issue has now been fixed, and only users with
administrative privileges are authorized to perform OM decommission actions,
enhancing the security and integrity of cluster management.
- Apache JIRA:
HDDS-12646