Review the Ozone issues that are fixed in Cloudera Runtime 7.3.2, its service packs, and cumulative
hotfixes.
Cloudera Runtime 7.3.2
Cloudera Runtime 7.3.2 resolves Ozone issues and incorporates fixes
from the service packs and cumulative hotfixes from 7.3.1.100 through 7.3.1.706. For
a comprehensive record of all fixes in Cloudera Runtime 7.3.1.x,
see Fixed Issues.
- CDPD-84457: Recon logs can be flooded by Negative
usedBytes WARN messages in large Ozone clusters
- 7.3.2
- Previously, in Ozone Recon, frequent
“Negative usedBytes … treating it as 0” messages were
logged at the WARN level and could flood Recon logs in large clusters. This
issue is now fixed.
- Apache JIRA:
HDDS-13220
- CDPD-80567: Snapshot garbage collection fails to
reclaim storage
- 7.3.2
- Previously, multiple issues prevented the
snapshot garbage collection system from identifying and removing deleted
data. This issue is now resolved. Improvements to the efficiency and
reliability of the snapshot garbage collection process ensure that storage is
reclaimed in a timely manner, resulting in better overall performance.
- Apache JIRA:
HDDS-12558
- CDPD-84361:
KeyDeletingService fails
when the key size exceeds Ratis buffer
- 7.3.2
- Previously, when the
KeyDeletingService was fetching keys to be deleted
based on keyLimitPerTask, the deletion operation failed if
the key size exceeded the Ratis buffer limit (default 32 MB). This issue is
now fixed. The key deletion operations no longer depend on the Ratis buffer
size.
- Apache JIRA:
HDDS-13213
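One common way to keep a deletion request below a transport size limit is to cap each batch by both entry count and accumulated size. The following is a minimal sketch of that idea only, not the actual change made in HDDS-13213; the function name `batch_by_size`, the `max_bytes` parameter, and the per-key sizes are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def batch_by_size(
    keys: Iterable[Tuple[str, int]], max_keys: int, max_bytes: int
) -> Iterator[List[str]]:
    """Group (name, serialized_size) pairs into batches.

    A batch is closed when adding the next key would exceed either the
    key-count limit or the byte limit. A single key larger than max_bytes
    is still emitted, alone in its own batch.
    """
    batch: List[str] = []
    batch_bytes = 0
    for name, size in keys:
        if batch and (len(batch) >= max_keys or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(name)
        batch_bytes += size
    if batch:
        yield batch
```

With this approach, the count limit (similar in spirit to keyLimitPerTask) and the size limit are enforced independently, so a few large entries cannot push a single request over the buffer limit.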
- CDPD-80739: Ozone Recon - Containers page displays
incorrect labels for unhealthy containers
- 7.3.2
- Previously, the Ozone Recon UI incorrectly
displayed the Number of Keys label instead of the
Number of Blocks label for containers in various
unhealthy states. This issue is now fixed. The labels now display the
correct information.
- Apache JIRA:
HDDS-12588
- CDPD-84620: Ozone Recon returns 500 error
ServiceNotReadyException on /keys/open
during NSSummary tree rebuild
- 7.3.2
- Previously, Ozone Recon returned an HTTP 500
error with a
ServiceNotReadyException when the
/keys/open API was called while the
NSSummary tree was being rebuilt or was temporarily
inconsistent. This issue is now fixed.
- Apache JIRA:
HDDS-13763
- CDPD-87883: The
processed_keys_metrics table fails to update when
converting deleted keys
- 7.3.2
- Previously, the
processed_keys_metrics table failed to record details
when the Ozone tiering workflow attempted to convert deleted keys. This
occurred because deleted keys lacked required fields, such as replication
type or replication factor. This issue is now fixed, and the
processed_keys_metrics table updates correctly.
- CDPD-69122: Ozone Manager database checkpoint
generation failure
- 7.3.2
- Previously, Ozone Manager database
checkpoint generation failed with an
InterruptedException and the message "Unable
to process metadata snapshot request" during parallel
snapshot operations or cluster restarts. This issue is now fixed.
- Apache JIRA:
HDDS-10739
- CDPD-92017: Lower Ozone versions cannot process
ozone.om.group.rights default value
- 7.3.2
- Previously, lower versions of Ozone could not
process the
ozone.om.group.rights configuration when it was
set to READ, LIST. This issue is now fixed by setting the
default value to ALL.
- CDPD-75981: Default native ACL limits to user and
user's primary group
- 7.3.2
- Previously, the default native ACLs for an
object, such as a volume, bucket, or file, were limited to the object owner and
the owner's primary group. If Ranger was enabled, these ACLs did not take
effect, but were saved to KeyInfo regardless. This issue is now fixed.
- Apache JIRA:
HDDS-11656
- CDPD-87831: SCM over-schedules replications to full
DataNodes
- 7.3.2
- Previously, Storage Container Manager (SCM)
scheduled replication commands to fix under-replication or mis-replication
for container moves, decommissioning, and other operations for both Ratis
and EC containers. SCM checked whether a target DataNode had space equal to
twice the container size value before selecting it as the target node for
container replication. However, SCM did not account for the pending
operation size of the scheduled tasks. Consequently, SCM could over-schedule
replications to a target DataNode that did not have enough space. This issue
is now fixed.
- Apache JIRA:
HDDS-13437
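The space check described above can be sketched as follows. This is an illustration under stated assumptions, not SCM code: the `DataNodeUsage` class, its field names, and the `schedule` helper are hypothetical, and the only rule taken from the description is that a target needs free space equal to twice the container size, now measured after subtracting space already reserved by pending operations.

```python
from dataclasses import dataclass


@dataclass
class DataNodeUsage:
    available_bytes: int
    pending_bytes: int = 0  # space reserved by replications already scheduled


def can_accept_replica(node: DataNodeUsage, container_size: int) -> bool:
    """Require 2x the container size after subtracting pending operations."""
    return node.available_bytes - node.pending_bytes >= 2 * container_size


def schedule(node: DataNodeUsage, container_size: int) -> bool:
    """Reserve space for one replication if the node can still take it."""
    if not can_accept_replica(node, container_size):
        return False
    node.pending_bytes += container_size
    return True
```

Without the `pending_bytes` term, every call would see the same free space and keep selecting the same node, which is the over-scheduling behavior described in this item.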
- CDPD-80178: Missing check for space availability for
all DataNodes in the pipeline during container creation
- 7.3.2
- Previously, if the leader node in the pipeline
did not have the capacity to create a new container, it might have returned
a container creation failure. If the follower node did not have the capacity
to create a new container, it might have failed and repeatedly attempted to
find another follower node. This behavior could cause excessive disk space
consumption by parallel write blocks through a state machine, resulting in
slower write performance and delayed failure responses. This issue is now
fixed by checking whether a DataNode has enough space for a new container
before allocating one. This improves write performance and reduces container
creation failures when DataNodes have less than 5 GB of disk space
remaining.
- Apache JIRA:
HDDS-12468
- CDPD-87749: No logs are available about on-demand scan
triggering
- 7.3.2
- Previously, no logs or debug information
existed to explain why on-demand scans were triggered on the containers.
This issue is now fixed, and logs are available specifying the reason for
on-demand container scans.
- Apache JIRA:
HDDS-13423
- CDPD-85250: The OzoneTokenIdentifier does not
serialize or deserialize correctly
- 7.3.2
- Previously, a
null omServiceId
was deserialized as an empty string, which caused delegation token cleanup
issues in RocksDB. This issue is now fixed.
- Apache JIRA:
HDDS-13264
- CDPD-82295: AWS S3 DeleteObject failures for FSO
bucket keys containing special characters
- 7.3.2
- Previously, AWS S3 DeleteObject could fail for
File System Optimized (FSO) bucket keys containing special characters. This
issue is now fixed by removing name validation during deletion.
- Apache JIRA:
HDDS-12911
- CDPD-74686: DirectoryDeletion task ignored by Ratis
- 7.3.2
- Previously, directory deletion tasks were
ignored by Ratis, leading to repeated deletion retries instead of actual
deletion. This issue is now resolved.
- Apache JIRA:
HDDS-11491
- CDPD-74685: Directory deletion fails with millions
of directories
- 7.3.2
- Previously, background directory deletion
cleanup failed when attempting to delete millions of empty directories
because their combined metadata size exceeded the allowed Ratis request
size. This issue is now resolved.
- Apache JIRA:
HDDS-11492
- CDPD-87270: Secret key premature expiration and
invalidation
- 7.3.2
- Previously, secret keys could expire before the
end of a delegation token lifetime, causing premature authentication
failures. This issue is now fixed. The secret key expiry calculation
(hdds.secret.key.expiry.duration) is adjusted to 9
days. This ensures that tokens remain valid for their full configured
duration, which keeps authentication stable.
- Apache JIRA:
HDDS-13343
- CDPD-76523:
ozone debug ldb
--with-keys option defaults to false instead of
true
- 7.3.2
- Previously, the
ozone debug ldb
--with-keys option defaulted to false when
specified without a value and did not print the keys. This issue is now
fixed. The option defaults to true when specified without a
value and includes keys in the output by default.
- Apache JIRA:
HDDS-11782
- CDPD-84609: The
--output-dir option
is unavailable for replicas verify command
- 7.3.2
- Previously, the ozone debug replicas
verify command did not support the
--output-dir option. This issue is now fixed. The
--output-dir option is now an optional field for the
replicas verify command.
- Apache JIRA:
HDDS-13248
- CDPD-76520: DataNode aborts if
hdds.datanode.wait.on.all.followers = true
- 7.3.2
- Previously, the DataNode aborted if the
hdds.datanode.wait.on.all.followers configuration
was set to
true. This issue is now fixed.
- Apache JIRA:
HDDS-11785
- CDPD-76501: DataNode Ratis is taking snapshots
frequently
- 7.3.2
- Previously, DataNode Ratis was taking snapshots
every 5 to 8 seconds causing overhead. This issue is now fixed. The
hdds.ratis.snapshot.threshold and
hdds.container.ratis.statemachine.max.pending.apply-transactions
configuration limits are increased to
100k to avoid taking
frequent DataNode Ratis snapshots.
- Apache JIRA:
HDDS-11773
- CDPD-75112: HBase RegionServer crashes due to
inconsistency caused by Ozone client failover handling
- 7.3.2
- Previously, the HBase RegionServer crashed due
to inconsistencies caused by Ozone client failover handling. This issue is
now fixed by making the Ozone Manager client retry idempotent, which prevents
the client from crashing when encountering inconsistent results.
- Apache JIRA:
HDDS-11558
- CDPD-77938: Local Refresh button for current selected
path is missing in the new Ozone Recon UI
- 7.3.2
- Previously, refreshing the Recon UI page reset
the current path selection and returned to the root directory, causing loss
of context and requiring manual navigation. This issue is now fixed. The new
Path Reload button is introduced in the new Recon
UI for the Namespace page.
- Apache JIRA:
HDDS-12085
- CDPD-77728: Calendar disappears while setting custom
date range in the Heatmap page in New Recon UI
- 7.3.2
- Previously, setting the custom date range in
the Heatmap page of the new Recon UI caused the
calendar widget to close unexpectedly. Specifically, clicking the back arrow
to navigate to a previous month in the date picker caused the entire
calendar and the drop-down menu to disappear, preventing date selection.
This issue is now fixed, and the calendar remains visible until a date is
selected and confirmed, allowing users to set custom date ranges as
intended.
- Apache JIRA:
HDDS-12044
- CDPD-77356: Recon UI displayed identical and duplicate
values for Quota Allowed and Quota In
Bytes
- 7.3.2
- Previously, in the Ozone Recon UI, the
Quota Allowed and Quota In
Bytes fields incorrectly displayed the same value. This
duplication prevented you from accurately distinguishing between the
allocated quota and the actual consumed disk space. This issue is now fixed,
and the Recon UI displays the values correctly.
- Apache JIRA:
HDDS-11987
- CDPD-74437: Multiple IOzoneAuthorizer instances might
be created during Ratis snapshot installation failures
- 7.3.2
- Previously, if a failure occurred during the
installation of a Ratis snapshot after the metadata manager was stopped,
multiple instances of the Ozone authorizer could be created and retained in
memory. This led to excessive heap usage and, in some cases, crashes due to
long garbage collection pauses, especially in environments with Ranger and
Ozone integration. The issue is now fixed, and the old authorizer instances
are properly cleaned up, preventing heap exhaustion.
- Apache JIRA:
HDDS-11472
- CDPD-92003: Container Size Count
Task showing empty in new Recon UI
- 7.3.2
- Previously, in the Ozone Recon UI, the
Container Size Count Task page was displayed empty
when accessed through the new user interface. This issue is now fixed.
- Apache JIRA:
HDDS-13821
- CDPD-88628: Ozone Recon Overview
page does not load until all APIs are loaded
- 7.3.2
- Previously, the Recon
Overview page waited for all API calls to complete
before displaying any results, causing delays and poor responsiveness. This
issue is now fixed, and each card on the Overview page
now loads independently as soon as its corresponding API call resolves. This
change improves overall page responsiveness and ensures that API errors only
affect the relevant cards, rather than preventing the entire page from
loading.
- Apache JIRA:
HDDS-13542
- CDPD-88541: Namespace Usage page
becomes blank when Recon DB is missing
- 7.3.2
- Previously, the Namespace
Usage page could appear blank if the Recon DB was missing
during a fresh installation. This issue is now fixed.
- Apache JIRA:
HDDS-13528
- CDPD-88383: Accessing the new Ozone Recon UI through
Knox breaks the UI
- 7.3.2
- Previously, accessing the new Ozone Recon UI
through a reverse proxy such as Knox caused the UI to break. This issue is
now fixed.
- Apache JIRA:
HDDS-13512
- CDPD-56281: Ozone Manager database updates are blocked
while Recon is reprocessing all Recon tasks
- 7.3.2
- Previously, when Recon was reprocessing all
Recon tasks, Ozone Manager database updates were blocked, which could cause
repeated full snapshots and impact performance. This issue is now fixed by
allowing Ozone Manager database updates to proceed concurrently with Recon
task processing, preventing unnecessary full snapshots and improving system
efficiency.
- Apache JIRA:
HDDS-8633
- CDPD-77805: Improper error handling in the
NSSummaryTask
- 7.3.2
- Previously, improper error handling in the
NSSummaryTask could lead to data inconsistencies in
Ozone Recon. This issue is now fixed, and the task now handles errors
robustly.
- Apache JIRA:
HDDS-12062
- CDPD-80826: Ozone Recon fails during the bootstrapping
process
- 7.3.2
- Previously, Ozone Recon did not properly handle
failures that occurred during the bootstrapping process. This issue is now
fixed. If an Ozone Manager (OM) task fails during bootstrapping, Recon now
correctly handles and reprocesses the task to ensure a successful start.
Additionally, if Recon receives a partial or corrupted OM database tarball,
it cleans up the corrupted file and restarts the fetch process from scratch
to maintain data consistency and integrity.
- Apache JIRA:
HDDS-12615
- CDPD-76226: The Recon ListKeys API returns an
inappropriate HTTP response
- 7.3.2
- Previously, the Recon
ListKeys API did not return an appropriate HTTP
response when an
NSSummary rebuild was in progress. This
issue is now fixed. The API now returns the 503 (Service Unavailable) HTTP
status code to indicate that the service is temporarily unavailable due to
the ongoing NSSummary rebuild. This allows clients to
properly handle the "too busy" or "try again
later" scenario.
- Apache JIRA:
HDDS-11708
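A client can treat the 503 status as a signal to wait and retry. The sketch below shows one possible client-side pattern, assuming a caller-supplied `fetch` function that returns a `(status, body)` pair; `fetch_with_retry`, its parameters, and the backoff values are hypothetical and are not part of Recon or any Ozone client library.

```python
import time
from typing import Callable, Tuple


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, str]],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch() until it returns a status other than 503.

    Waits with exponential backoff between attempts and returns the last
    response if every attempt is answered with 503.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        status, body = fetch()
        if status != 503:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body
```

Only 503 is retried here. Other error codes are returned to the caller immediately, because retrying them does not help while they indicate a different kind of failure.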
- CDPD-76248: The default volume choosing policy is not
updated correctly in the ozone-default.xml
- 7.3.2
- Previously, the
ozone-default.xml file incorrectly listed the
RoundRobinVolumeChoosingPolicy as the default volume
choosing policy. This policy did not consider available volume space during
container creation or replication, which could result in block allocation
failures (though retried) or the creation of small containers. This issue is
now fixed. The default volume choosing policy is changed to
CapacityVolumeChoosingPolicy in the
ozone-default.xml file. This ensures that available
capacity is now taken into account during container allocation, improving
reliability and resource utilization.
- Apache JIRA:
HDDS-11735
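The difference between the two kinds of policy can be illustrated with a simplified sketch. The classes and functions below are hypothetical stand-ins, and the real CapacityVolumeChoosingPolicy might select among candidate volumes differently. The sketch only contrasts a chooser that ignores free space, as this item describes, with one that takes it into account.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Volume:
    name: str
    available_bytes: int


class RoundRobinChooser:
    """Cycles through volumes without looking at free space."""

    def __init__(self) -> None:
        self._next = 0

    def choose(self, volumes: List[Volume], required_bytes: int) -> Volume:
        volume = volumes[self._next % len(volumes)]
        self._next += 1
        return volume


def choose_by_capacity(volumes: List[Volume], required_bytes: int) -> Volume:
    """Pick the volume with the most free space among those that fit."""
    candidates = [v for v in volumes if v.available_bytes >= required_bytes]
    if not candidates:
        raise RuntimeError("no volume has enough free space")
    return max(candidates, key=lambda v: v.available_bytes)
```

In this sketch, the round-robin chooser can return a nearly full volume, while the capacity-aware function never returns a volume that cannot hold the requested size.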
- CDPD-73809: Multithreading issues in the
ContainerBalancerTask
- 7.3.2
- Previously, the concurrent access to shared
data structures in the
getCurrentIterationsStatistic method
could cause unpredictable errors. This issue is now fixed. Inside the
getCurrentIterationsStatistic method, the system now
ensures thread safety by synchronizing access to the
iterationsStatistic list and using
ConcurrentHashMap for concurrent access to maps from
findTargetStrategy and
findSourceStrategy.
- Apache JIRA:
HDDS-11386
- CDPD-88723: The FSORepairTool fails to distinguish
Unreachable and Unreferenced
objects
- 7.3.2
- Previously, the FSORepairTool logic to
distinguish between
Unreachable and
Unreferenced objects was incorrect. This issue is now
fixed, and the logic is corrected. Unreachable objects are not marked
for repair because background cleanup processes eventually handle them,
while objects that are neither reachable nor unreachable are classified as
unreferenced and marked for repair.
- Apache JIRA:
HDDS-13549
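The three-way classification can be sketched as a walk up the parent chain. This is an illustration, not the FSORepairTool implementation. It assumes that an object is reachable when its ancestors lead to a live bucket root, and unreachable when they lead to a directory that is already pending deletion. That second definition, along with the names `classify`, `parent_of`, `live_roots`, and `pending_delete`, is an assumption made for the example.

```python
from enum import Enum
from typing import Dict, Set


class State(Enum):
    REACHABLE = "reachable"
    UNREACHABLE = "unreachable"
    UNREFERENCED = "unreferenced"


def classify(
    obj_id: str,
    parent_of: Dict[str, str],
    live_roots: Set[str],
    pending_delete: Set[str],
) -> State:
    """Walk up the parent chain until a known root or a dead end is found."""
    seen: Set[str] = set()
    current = obj_id
    while current is not None and current not in seen:
        seen.add(current)  # guards against cycles in damaged metadata
        if current in live_roots:
            return State.REACHABLE
        if current in pending_delete:
            return State.UNREACHABLE
        current = parent_of.get(current)
    return State.UNREFERENCED


def needs_repair(state: State) -> bool:
    """Only unreferenced objects are marked for repair."""
    return state is State.UNREFERENCED
```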
- CDPD-87575: The ozone admin container
create command runs forever without kinit
- 7.3.2
- Previously, the ozone admin container
create command ran indefinitely on secure Ozone clusters with
multiple SCM nodes if authentication failed, for example, when kinit was not
performed. This issue was specifically observed in SCM HA cluster
configurations. This issue is now fixed, and the retry logic is updated to
fail fast on authentication exceptions, providing immediate feedback to you
instead of hanging.
- Apache JIRA:
HDDS-13405
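The fail-fast behavior can be illustrated with a small failover loop. The sketch below is not the Ozone client retry policy. `AuthenticationError`, `call_with_failover`, and the node names are hypothetical, and the loop only demonstrates the principle that connection problems justify trying the next node while authentication failures do not.

```python
from typing import Callable, List, Optional


class AuthenticationError(Exception):
    """Raised when the caller has no valid credentials."""


def call_with_failover(
    call: Callable[[str], str], nodes: List[str], max_rounds: int = 3
) -> str:
    """Try each node in turn, but stop immediately on authentication errors."""
    last_error: Optional[Exception] = None
    for _ in range(max_rounds):
        for node in nodes:
            try:
                return call(node)
            except AuthenticationError:
                # Another node would reject the same credentials: fail fast.
                raise
            except ConnectionError as exc:
                last_error = exc  # transient problem: try the next node
    raise RuntimeError("all nodes failed") from last_error
```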
- CDPD-90362: Container Balancer stop command fails with
an error
- 7.3.2
- Previously, the stopBalancer
command for the Ozone Container Balancer failed with an error if the
balancer was already stopped, instead of returning a successful response.
This issue is now fixed. The stopBalancer operation is
now idempotent and returns success if the balancer is already
stopped.
- Additionally, a race condition during an SCM
leadership change caused the balancer to restart unintentionally due to the
persisted state not being updated. This issue is also now resolved. The
system correctly persists the stopped state of the balancer, preventing
unintended restarts during leadership transitions.
- Apache JIRA:
HDDS-13694
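Both parts of this fix follow a familiar pattern: make the stop operation safe to repeat, and persist the desired state so that a new leader reads it back. The sketch below illustrates that pattern under stated assumptions. The `Balancer` class, the `should_run` key, and the dictionary used as a persistent store are hypothetical and do not reflect the SCM implementation.

```python
from typing import Dict


class Balancer:
    """Toy balancer whose desired state lives in a shared store."""

    def __init__(self, store: Dict[str, bool]) -> None:
        self._store = store
        # A new leader derives its state from what was persisted.
        self.running = store.get("should_run", False)

    def start(self) -> bool:
        self._store["should_run"] = True
        self.running = True
        return True

    def stop(self) -> bool:
        # Persist the stopped state even if nothing is running locally,
        # then report success: stopping twice is not an error.
        self._store["should_run"] = False
        self.running = False
        return True
```

Because the stopped state is written to the store on every stop call, an instance created after a leadership change does not restart the balancer.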
- CDPD-89400: DataNode pipeline closes frequently
- 7.3.2
- Previously, the DataNode (DN) Ratis repeatedly
triggered
Close Pipeline actions when it identified issues
with a pipeline, such as a slow follower, prolonged leader election, or disk
failures, even if a close action was already pending in the DN command
queue. This could result in excessive close actions being queued on every
heartbeat, leading to inefficiency and potential command queue bloat. The
issue is now fixed. A check is introduced to ensure that a Close
Pipeline action for a specific pipeline is not added to the
command queue if one is already pending, preventing redundant triggers and
optimizing the signaling mechanism.
- Apache JIRA:
HDDS-13618
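The check described above amounts to deduplicating actions by pipeline ID before they enter the queue. A minimal sketch of that idea follows. The `CommandQueue` class and its methods are hypothetical and only illustrate the deduplication, not the DataNode code.

```python
from typing import List, Set, Tuple


class CommandQueue:
    """Queue that holds at most one pending close action per pipeline."""

    def __init__(self) -> None:
        self._pending_close: Set[str] = set()
        self._actions: List[Tuple[str, str]] = []

    def add_close_pipeline(self, pipeline_id: str, reason: str) -> bool:
        """Queue a close action unless one is already pending."""
        if pipeline_id in self._pending_close:
            return False
        self._pending_close.add(pipeline_id)
        self._actions.append((pipeline_id, reason))
        return True

    def drain(self) -> List[Tuple[str, str]]:
        """Hand over queued actions, for example on the next heartbeat."""
        actions = self._actions
        self._actions = []
        self._pending_close.clear()
        return actions
```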
- CDPD-80991: Non-administrative users could attempt to
perform OM decommission
- 7.3.2
- Previously, non-administrative users could
attempt to perform OM decommission, which could lead to unauthorized or
unintended changes. This issue is now fixed. Only users with administrative
privileges are authorized to perform OM decommission actions, enhancing the
security and integrity of cluster management.
- Apache JIRA:
HDDS-12646