Fixed issues in Ozone

Review the list of Ozone issues that are resolved in Cloudera Runtime 7.3.1, its service packs and cumulative hotfixes.

Cloudera Runtime 7.3.1.400 SP2

CDPD-82201: OMKeyAclRequestWithFSO is incorrectly setting full path as key name
7.3.1.400
When you set, add, or remove an ACL for an FSO bucket, the key name gets corrupted with the full key path. This fix ensures the correct key name is set during the ACL calls.

Apache Jira: HDDS-12891

CDPD-81939: Volume scanner should fail volume if rocksDB is inaccessible
7.3.1.400
When RocksDB becomes unreadable on a DataNode due to disk-related issues, the DataNode will mark the affected storage volume as unhealthy. This proactive health marking enables the system to initiate data replication processes more rapidly, thereby maintaining data availability and integrity.

Apache Jira: HDDS-12723

CDPD-78932: Container replication should be atomic
7.3.1.400
During container replication, the destination node imports the container from the source node. If any issues are encountered during the import process, the Datanode is responsible for gracefully cleaning up any residual or stale container metadata to maintain system integrity.

Apache Jira: HDDS-12233

CDPD-73278: Update OM, SCM, Datanode conf for RATIS-2135
7.3.1.400
Set raft.grpc.message.size.max to be 1MB larger than raft.server.log.appender.buffer.byte-limit for OM, SCM and Datanode.

Apache Jira: HDDS-11320
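
The sizing rule above can be sketched as a small calculation. This is only an illustration: the 32 MB buffer limit is a hypothetical value, and the property names are shown exactly as they appear in this note, without any service-specific prefix that OM, SCM, or Datanode configuration may require.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RatisMessageSizeSketch {
    static final long ONE_MB = 1024L * 1024L;

    /** Returns a gRPC max message size that is 1 MB above the log appender buffer limit. */
    static long grpcMessageSizeMax(long appenderBufferByteLimit) {
        return appenderBufferByteLimit + ONE_MB;
    }

    public static void main(String[] args) {
        // Hypothetical buffer limit chosen for illustration, not a documented default.
        long bufferLimit = 32 * ONE_MB;

        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("raft.server.log.appender.buffer.byte-limit", Long.toString(bufferLimit));
        conf.put("raft.grpc.message.size.max", Long.toString(grpcMessageSizeMax(bufferLimit)));

        // Print the pair of values that would be applied to each service.
        conf.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```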

CDPD-57559: New Ozone Manager leader cannot verify the Ozone delegation token signed by old Ozone Manager leader
7.3.1.400
If an Ozone cluster is upgraded and later downgraded, the Ozone Manager running after the downgrade cannot verify the Ozone delegation tokens that were issued before the downgrade. As a result, clients that are still running during the upgrade and downgrade period fail. Clusters that are not downgraded are unaffected.

This issue is fixed by changing Ozone delegation token signing from an asymmetric key to a symmetric key.

Apache Jira: HDDS-8829
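
As a rough illustration of symmetric-key signing, the sketch below signs and verifies a token identifier with one shared secret using the JDK's HMAC support. The class name, the token bytes, and the choice of HmacSHA256 are assumptions made for this example; the note does not state which algorithm or key management Ozone uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.SecretKey;

class SymmetricTokenSignSketch {
    // Assumption: HmacSHA256 is used here only for illustration.
    static final String ALGORITHM = "HmacSHA256";

    /** Generates a fresh shared secret key. */
    static SecretKey newSharedKey() {
        try {
            return KeyGenerator.getInstance(ALGORITHM).generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Computes the signature (MAC) of the token identifier with the shared key. */
    static byte[] sign(SecretKey key, byte[] tokenIdentifier) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(key);
            return mac.doFinal(tokenIdentifier);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Any holder of the same key can verify by recomputing and comparing the MAC. */
    static boolean verify(SecretKey key, byte[] tokenIdentifier, byte[] signature) {
        return MessageDigest.isEqual(sign(key, tokenIdentifier), signature);
    }

    public static void main(String[] args) {
        SecretKey shared = newSharedKey();
        byte[] token = "hypothetical-token-identifier".getBytes(StandardCharsets.UTF_8);
        byte[] signature = sign(shared, token);
        System.out.println("verified with shared key: " + verify(shared, token, signature));
        System.out.println("verified with other key: " + verify(newSharedKey(), token, signature));
    }
}
```

With a symmetric scheme, verification depends only on having the shared secret, not on which node produced the signature.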

CDPD-70409: Recon Overview Page UI fails to load if Recon Solr Health throws error
7.3.1.400
This fixes an issue where the Recon UI failed to load if the Solr Health check API threw an error.
CDPD-80742: ConstraintViolationException was crashing the ContainerHealthTask in Ozone Recon
7.3.1.400
ConstraintViolationException was crashing the ContainerHealthTask in Recon. After this fix, the task no longer crashes and continues to identify unhealthy containers in SCM, if any.

Apache Jira: HDDS-12585

Cloudera Runtime 7.3.1.300 SP1 CHF 1

CDPD-80823: Snapshot creation is removing extra keys from the Active Object Storage's DB
7.3.1.300
Wrong keys were trapped in the DeletedTable of the snapshot if the OBS bucket name was a prefix of another OBS bucket name, resulting in orphaned blocks. After the fix, snapshot creation no longer removes any extra keys from the DeletedTable of the Active Object Store.

Apache Jira: HDDS-12611
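
The prefix collision described above can be illustrated with plain string matching. This is a simplified sketch, not Ozone code: the /volume/bucket/key layout, the class name, and the sample keys are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class BucketPrefixSketch {
    /** Naive match: a bucket named "buck" also matches keys of "bucket2". */
    static List<String> matchNaive(List<String> keys, String volume, String bucket) {
        String prefix = "/" + volume + "/" + bucket;
        return keys.stream().filter(k -> k.startsWith(prefix)).collect(Collectors.toList());
    }

    /** Delimited match: the trailing separator restricts the match to one bucket. */
    static List<String> matchExact(List<String> keys, String volume, String bucket) {
        String prefix = "/" + volume + "/" + bucket + "/";
        return keys.stream().filter(k -> k.startsWith(prefix)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical deleted-key entries for two buckets in the same volume.
        List<String> keys = Arrays.asList("/vol1/buck/key-a", "/vol1/bucket2/key-b");
        System.out.println("naive: " + matchNaive(keys, "vol1", "buck"));
        System.out.println("exact: " + matchExact(keys, "vol1", "buck"));
    }
}
```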

CDPD-73375: Publishing hadoop metrics immediately in Prometheus sink fills up SinkQueue quickly
7.3.1.300
The Prometheus sink already has a mechanism to publish metrics every 10 seconds by default, using a callback on a timer event. The code that published metrics immediately has been removed, which fixes this issue.

Apache Jira: HDDS-12193

CDPD-78671: Metric timer task is blocking installSnapshotFromLeader on follower node
7.3.1.300
ozone.om.snapshot.rocksdb.metrics.enabled is now available in Ozone to disable metric collection on the snapshotted DB if necessary.

Apache Jira: HDDS-11339

CDPD-78781: Tarball creation interfering with snapshot purge
7.3.1.300
Synchronize SnapshotDeletingService on BootstrapStateHandler.Lock to make sure that no background service is running when Tarball creation is in progress.

Apache Jira: HDDS-12210

Cloudera Runtime 7.3.1.200 SP1

CDPD-74556: EC Checksum throws IllegalArgumentException because the buffer limit is negative
7.3.1.200
When ozone.client.bytes.per.checksum is set to a low value (for example, 16 KB), the parity checksum calculation during the validation phase was wrong, leading to an IllegalArgumentException on the client. This is now fixed.

Apache Jira: HDDS-11482

CDPD-75981: Default native ACL limits to user and user's primary group
7.3.1.200
The default native ACL created for an object such as a volume, bucket, or file is limited to the object owner and the owner's primary group.

Apache Jira: HDDS-11656

CDPD-72782: Ozone write does not work when http proxy is set for the JVM
7.3.1.200
gRPC uses HTTP internally for its connections. As a result, if an HTTP proxy is configured for any Ozone process that uses gRPC, every gRPC call is directed through the proxy, which is undesirable for performance. This fix disables the proxy for the gRPC connections that Ozone uses.

Apache Jira: HDDS-11257

CDPD-65714: Allow FS client to specify EC as default filesystem replication
7.3.1.200
This fix allows you to specify EC as the default replication type for a file uploaded through the Hadoop FileSystem API to Ozone through a client side configuration option.

Apache Jira: HDDS-10336

Cloudera Runtime 7.3.1.100 CHF 1

There are no fixed issues in this release.

Cloudera Runtime 7.3.1

OPSAPS-71474: In Cloudera Manager UI, the Ozone service Snapshot tab displays label label.goToBucket and it must be changed to Go to bucket.
7.3.1
This issue is now resolved.
OPSAPS-70288: Improvements in master node decommissioning.
7.3.1
This issue is now resolved by making usability and functional improvements to the Ozone master node decommissioning.
CDPD-74756: Update Ratis to 3.1.1
7.3.1
Updated Ratis dependency version from 3.1.0 to 3.1.1.

Apache Jira: HDDS-11504

CDPD-74241: OmSnapshotPurge should be in a different Ozone manager double buffer batch.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-11453

CDPD-74200: Recon UI shows incorrect data about volume, bucket, and keys. Recon is unable to sync its data with OM DB.
7.3.1
This issue is now resolved.
CDPD-74074: The /v1/triggerdbsync/om api is working with non-admin user even if security is enabled.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-11436

CDPD-73775: Replace solr.version with solr_lkgb_jar_version for Ozone to use downstream version of Solr.
7.3.1
Replaced pom solr.version with solr_lkgb_jar_version for Ozone to use downstream version of Solr.
CDPD-73447: Incorrect number of deleted containers shown in Recon UI.
7.3.1
The addition of the EMPTY_MISSING state to the retainOrUpdateRecord method enables Recon to correctly identify and manage the lifecycle of these containers, ensuring that no stale or deleted containers remain in memory or in Recon's records.
CDPD-73330: The namespace quota and namespace dist commands fail and display the Path not found in the system error for the existing volume or bucket.
7.3.1
Fixed the Ozone admin namespace summary.

Apache Jira: HDDS-10581

CDPD-72142: Keys from DeletedTable and DeletedDirTable of Active Object Store (AOS) should be deleted on batch operation while creating a Snapshot.
7.3.1
On snapshot creation, the DeletedTable and DeletedDirTable of AOS are cleared. This operation was not performed in the same transaction as the snapshot creation, which could cause orphan block objects in the case of bootstrapping and a lagging follower. This issue is now resolved: snapshot creation and the clearing of the DeletedTable and DeletedDirTable on AOS are now a single batch operation.

Apache Jira: HDDS-11183

CDPD-72076: The OMDoubleBuffer error is displayed when handling OMRequest: cmdType: SnapshotMoveDeletedKeys.
7.3.1
This fixes the OM crash issue when the follower is lagging and it executes purgeKeys or snapshotMoveDeletedKeys for the Snapshot in one transaction.

Apache Jira: HDDS-11152

CDPD-72019: Remove the locks from SnapshotPurge and SnapshotSetProperty APIs.
7.3.1
This fixes the OM crash issue when the follower is lagging and it executes purgeKeys or snapshotMoveDeletedKeys for the Snapshot in one transaction.

Apache Jira: HDDS-11137

CDPD-71702: Ozone Manager is down due to Snapshot Chain Corruption.
7.3.1
SSTFilteringService directly updates the snapshotInfoTable which can cause the snapshot chain corruption if OM crashes before DB gets flushed for snapshot purge and SSTFilteringService has updated the next snapshot in the chain.

Apache Jira: HDDS-11068

CDPD-71584: Ozone Recon DecomissioningInfo API throws the NPE error.
7.3.1
This issue is resolved by fixing the NullPointerException that occurred when running the DecomissioningInfo API.

Apache Jira: HDDS-11045

CDPD-71502: Ozone Recon - Decommissioned datanodes show up even after removing it from the Recon Datanodes page.
7.3.1
Recon previously allowed decommissioned datanodes to be removed and deleted them from the Recon RocksDB nodes table. However, decommissioned datanodes continue to send heartbeats until they are shut down. These heartbeats registered the datanodes again in the Recon memory map, which made them show up again in the Datanodes UI. This issue is now resolved: Recon allows only decommissioned datanodes to be removed and skips datanodes in any other node status or node operational status.

Apache Jira: HDDS-11032

CDPD-70469: Ozone Recon - Handle startup failure and log reasons as error because SCM non-HA is enabled.
7.3.1
This issue is now resolved by fixing the Recon startup failure when SCM runs in non-HA mode.

Apache Jira: HDDS-10937

CDPD-68912: Ozone Recon - Improve Recon startup failure handling.
7.3.1
This issue is now resolved. Recon should recover from runtime or unexpected failures during startup and provide information on the Recon UI. Recon can fail to start for several reasons:
  • Failure to register datanodes, or an invalid topology.
  • Initialization of pipelines.

Apache Jira: HDDS-10702

CDPD-67668: Ozone Recon - Provide DN decommissioning detailed status and information inline with current CLI command output.
7.3.1
This issue is resolved by adding an API in Recon for DN decommissioning. The status and information are now in line with the current CLI command output.

Apache Jira: HDDS-10514

CDPD-67460: Container Balancer should only move containers with size greater than 0 bytes.
7.3.1
This issue is now resolved by introducing a check on the size of the containers allowed to leave the source node during the balancing process.

Apache Jira: HDDS-10483

CDPD-67278: Fix the DN links on the Ozone SCM UI. This is a backport of KNOX-3012.
7.3.1
A change in Ozone affected Knox on the Ozone SCM UI. The links for the datanodes did not route through Knox. This issue is now resolved and the DN links will redirect to the correct Knox URLs.
CDPD-67095: DN URL in SCM Page through Knox redirects to non-Knox URL.
7.3.1
A change in Ozone affected Knox on the Ozone SCM UI. The links for the datanodes did not route through Knox. With CDPD-67278 and CDPD-69143, this issue is now resolved and the DN links will redirect to the correct Knox URLs.
CDPD-64874: Intermittent failure in TestOzoneRpcClientAbstract.testListSnapshot.
7.3.1
This issue is now resolved by fixing the intermittent wrong data returned by the listSnapshot API. The listSnapshot API uses org.apache.hadoop.ozone.om.ListIterator.MinHeapIterator, which internally uses both CacheIterator and DBIterator. DBIterator checked whether a RocksDB key was present in the cache in org.apache.hadoop.ozone.om.ListIterator.DbTableIter#getNextKey. That check read the table cache, which may be flushed intermittently, so a duplicate entry could be added to org.apache.hadoop.ozone.om.ListIterator.MinHeapIterator. The fix is to use the pre-loaded keys in org.apache.hadoop.ozone.om.ListIterator.CacheIter#cacheKeyMap in org.apache.hadoop.ozone.om.ListIterator.CacheIter instead.

Apache Jira: HDDS-9967

CDPD-64815: NSSummary commands should close OzoneClient.
7.3.1
NSSummaryAdmin creates an OzoneClient for some bucket-related checks. This issue is now resolved with the following changes:
  • Close client when no longer needed
  • Reuse client (or even bucket after lookup) for all checks

Apache Jira: HDDS-9944
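
The two changes listed above follow a common Java pattern: look the resource up once, reuse it for every check, and close it with try-with-resources. The sketch below uses a hypothetical FakeClient stand-in rather than the real OzoneClient API, and the check names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ClientReuseSketch {
    /** Hypothetical stand-in for a client; not the real OzoneClient API. */
    static class FakeClient implements AutoCloseable {
        boolean closed = false;
        int lookups = 0;

        String lookupBucket(String volume, String bucket) {
            lookups++;
            return "/" + volume + "/" + bucket;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Runs all checks against one bucket lookup, then closes the client. */
    static List<String> runChecks(FakeClient client, String volume, String bucket) {
        List<String> results = new ArrayList<>();
        try (FakeClient c = client) {
            // Look the bucket up once and reuse the result for every check.
            String bucketRef = c.lookupBucket(volume, bucket);
            results.add("exists:" + bucketRef);
            results.add("layout-checked:" + bucketRef);
        } // The client is closed here, even if a check throws.
        return results;
    }

    public static void main(String[] args) {
        FakeClient client = new FakeClient();
        System.out.println(runChecks(client, "vol1", "bucket1"));
        System.out.println("closed=" + client.closed + ", lookups=" + client.lookups);
    }
}
```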

CDPD-64209: Ozone Recon - Potential memory overflow in Container Health Task.
7.3.1
This issue is now resolved by fixing the potential memory overflow in the Container Health Task of Recon.

Apache Jira: HDDS-9819

CDPD-63596: Do not include SpotBugs at compile scope.
7.3.1
This issue is now resolved by removing spotbugs-annotation, an LGPL third-party dependency, from the Ozone package.

Apache Jira: HDDS-9692

CDPD-62991: Recon UI - Bucket Drop down filter is not getting disabled when more than 1 volume is selected. This is a backport of HDDS-9556.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-9556

CDPD-62931: Incorrect pipeline ID for closed container.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-9544

CDPD-62925: Ozone debug chunkinfo command shows incorrect number of entries.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-9542

CDPD-62471: Recon UI - Disk Usage page should reflect the information it displays.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-9465

CDPD-62466: Improve thread names in Recon.
7.3.1
This issue is resolved by improving the thread naming in the Recon process:
  1. Pass Recon as a thread name prefix in Recon.
  2. Ensure all other threads created in Recon code also include Recon in their name.

Apache Jira: HDDS-9470
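
One common way to apply a thread name prefix in Java is a custom ThreadFactory, as sketched below. The class name and the "Recon-N" naming format are assumptions for illustration; the note does not describe how the Recon code builds its thread names.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

class PrefixedThreadFactorySketch implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger();

    PrefixedThreadFactorySketch(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        // Every thread created by this factory carries the prefix in its name.
        Thread thread = new Thread(task, prefix + "-" + counter.incrementAndGet());
        thread.setDaemon(true);
        return thread;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadFactory factory = new PrefixedThreadFactorySketch("Recon");
        Thread worker = factory.newThread(
            () -> System.out.println("running in " + Thread.currentThread().getName()));
        worker.start();
        worker.join();
    }
}
```

Passing such a factory to an executor, for example Executors.newFixedThreadPool(int, ThreadFactory), names every pooled thread the same way, which makes thread dumps easier to attribute to a process.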

CDPD-61700: Ozone debug chunkinfo shows incorrect block path for some nodes in a phatcat cluster.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-9356

CDPD-60647: Snapshot purge should be an atomic operation.
7.3.1
This issue is resolved by fixing the OM crash issue when the follower is lagging and it executes purgeKeys or snapshotMoveDeletedKeys for the Snapshot in one transaction.

Apache Jira: HDDS-9198

CDPD-51724: SCM should avoid sending delete transactions for under-replicated containers.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-4368