Fixed issues in Ozone

Review the list of Ozone issues that are resolved in Cloudera Runtime 7.3.1.

OPSAPS-71474: In the Cloudera Manager UI, the Snapshot tab of the Ozone service displays the raw label key label.goToBucket instead of the text Go to bucket.
This issue is now resolved.
OPSAPS-70288: Improvements in master node decommissioning.
This issue is now resolved by making usability and functional improvements to the Ozone master node decommissioning.
CDPD-74756: Update Ratis to 3.1.1
Updated Ratis dependency version from 3.1.0 to 3.1.1.

Apache Jira: HDDS-11504

CDPD-74241: OmSnapshotPurge should be in a different Ozone manager double buffer batch.
This issue is now resolved.

Apache Jira: HDDS-11453

CDPD-74200: Recon UI shows incorrect data about volume, bucket, and keys. Recon is unable to sync its data with OM DB.
This issue is now resolved.
CDPD-74074: The /v1/triggerdbsync/om API works for non-admin users even when security is enabled.
This issue is now resolved.

Apache Jira: HDDS-11436

CDPD-73775: Replace solr.version with solr_lkgb_jar_version for Ozone to use downstream version of Solr.
Replaced pom solr.version with solr_lkgb_jar_version for Ozone to use downstream version of Solr.
CDPD-73447: Incorrect number of deleted containers shown in Recon UI.
Adding the EMPTY_MISSING state to the retainOrUpdateRecord method enables Recon to correctly identify and manage the lifecycle of these containers, ensuring that no stale or deleted containers remain in memory or in Recon's records.
CDPD-73330: The namespace quota and namespace dist commands fail and display the Path not found in the system error for an existing volume or bucket.
Fixed the Ozone admin namespace summary.

Apache Jira: HDDS-10581

CDPD-72142: Keys from DeletedTable and DeletedDirTable of Active Object Store (AOS) should be deleted on batch operation while creating a Snapshot.
On snapshot creation, the DeletedTable and DeletedDirTable of AOS are cleared. Previously, this operation was not performed in the same transaction as snapshot creation, which could leave orphan block objects when an OM was bootstrapping or a follower was lagging. This issue is now resolved: snapshot creation and the clearing of the DeletedTable and DeletedDirTable on AOS are a single batch operation.

Apache Jira: HDDS-11183

CDPD-72076: The OMDoubleBuffer error is displayed when handling OMRequest: cmdType: SnapshotMoveDeletedKeys.
This fixes the OM crash issue when the follower is lagging and it executes purgeKeys or snapshotMoveDeletedKeys for the Snapshot in one transaction.

Apache Jira: HDDS-11152

CDPD-72019: Remove the locks from SnapshotPurge and SnapshotSetProperty APIs.
This fixes the OM crash issue when the follower is lagging and it executes purgeKeys or snapshotMoveDeletedKeys for the Snapshot in one transaction.

Apache Jira: HDDS-11137

CDPD-71702: Ozone Manager is down due to Snapshot Chain Corruption.
SSTFilteringService directly updated the snapshotInfoTable, which could corrupt the snapshot chain if OM crashed before the DB was flushed for a snapshot purge and SSTFilteringService had already updated the next snapshot in the chain. This issue is now resolved.

Apache Jira: HDDS-11068

CDPD-71584: Ozone Recon DecomissioningInfo API throws an NPE error.
This issue is resolved by fixing the NullPointerException thrown when running the DecomissioningInfo API.

Apache Jira: HDDS-11045

CDPD-71502: Ozone Recon - Decommissioned datanodes show up even after removing them from the Recon Datanodes page.
Recon previously allowed decommissioned datanodes to be removed and deleted them from the Recon RocksDB nodes table. However, decommissioned datanodes continue to send heartbeats until they are shut down. Each heartbeat registered the datanode again in the Recon memory map, so it reappeared in the datanodes UI. This issue is now resolved: only decommissioned datanodes can be removed, and datanodes with any other node status or node operational status are skipped.

Apache Jira: HDDS-11032
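
For CDPD-71502, the removal rule described above can be pictured with a minimal sketch. It is only an illustration: the NodeRemovalSketch class, the OpState enum, and the removeNodes method are hypothetical names, not Recon classes or APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NodeRemovalSketch {
    // Hypothetical operational states for illustration; not the Ozone enum.
    enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED }

    /**
     * Removes only DECOMMISSIONED nodes from the map and returns the ids
     * that were skipped because they are unknown or in any other state.
     */
    static List<String> removeNodes(Map<String, OpState> nodes, List<String> requested) {
        List<String> skipped = new ArrayList<>();
        for (String id : requested) {
            if (nodes.get(id) == OpState.DECOMMISSIONED) {
                nodes.remove(id);
            } else {
                skipped.add(id);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        Map<String, OpState> nodes = new HashMap<>();
        nodes.put("dn1", OpState.DECOMMISSIONED);
        nodes.put("dn2", OpState.IN_SERVICE);
        List<String> skipped = removeNodes(nodes, Arrays.asList("dn1", "dn2"));
        System.out.println("skipped=" + skipped + " remaining=" + nodes.keySet());
    }
}
```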

CDPD-70469: Ozone Recon - Handle startup failure and log reasons as error because SCM non-HA is enabled.
This issue is now resolved by fixing the Recon startup failure when SCM runs in non-HA mode.

Apache Jira: HDDS-10937

CDPD-68912: Ozone Recon - Improve Recon startup failure handling.
This issue is now resolved. Recon should recover from runtime or unexpected failures during startup and provide information about them on the Recon UI. Recon can fail to start for several reasons, including:
  • Failure to register datanodes, or an invalid topology.
  • Failure to initialize pipelines.

Apache Jira: HDDS-10702

CDPD-67668: Ozone Recon - Provide DN decommissioning detailed status and information in line with current CLI command output.
This issue is resolved by adding an API in Recon for DN decommissioning. The status and information are now in line with the current CLI command output.

Apache Jira: HDDS-10514

CDPD-67460: Container Balancer should only move containers with size greater than 0 bytes.
This issue is now resolved by introducing a check on the size of the containers allowed to leave the source node during the balancing process.

Apache Jira: HDDS-10483
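
For CDPD-67460, a size check of this kind might look like the sketch below. The BalancerSizeCheckSketch and ContainerStub classes and the movableContainers method are hypothetical and exist only for illustration; they are not the Container Balancer implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BalancerSizeCheckSketch {
    /** Hypothetical stand-in for a container; not an Ozone class. */
    static final class ContainerStub {
        final long id;
        final long usedBytes;

        ContainerStub(long id, long usedBytes) {
            this.id = id;
            this.usedBytes = usedBytes;
        }
    }

    /** Keeps only containers whose size is greater than 0 bytes. */
    static List<ContainerStub> movableContainers(List<ContainerStub> candidates) {
        List<ContainerStub> movable = new ArrayList<>();
        for (ContainerStub c : candidates) {
            if (c.usedBytes > 0) {
                movable.add(c);
            }
        }
        return movable;
    }

    public static void main(String[] args) {
        List<ContainerStub> candidates = Arrays.asList(
            new ContainerStub(1, 0), new ContainerStub(2, 1024));
        for (ContainerStub c : movableContainers(candidates)) {
            System.out.println("movable container id=" + c.id);
        }
    }
}
```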

CDPD-67278: Fix the DN links on the Ozone SCM UI. This is a backport of KNOX-3012.
A change in Ozone affected Knox on the Ozone SCM UI. The links for the datanodes did not route through Knox. This issue is now resolved and the DN links will redirect to the correct Knox URLs.
CDPD-67095: DN URL in SCM Page through Knox redirects to non-Knox URL.
A change in Ozone affected Knox on the Ozone SCM UI. The links for the datanodes did not route through Knox. With CDPD-67278 and CDPD-69143, this issue is now resolved and the DN links will redirect to the correct Knox URLs.
CDPD-64874: Intermittent failure in TestOzoneRpcClientAbstract.testListSnapshot.
This issue is now resolved by fixing intermittent wrong data returned by the listSnapshot API. The listSnapshot API uses org.apache.hadoop.ozone.om.ListIterator.MinHeapIterator, which internally uses both CacheIterator and DBIterator. DBIterator checked whether a RocksDB key was present in the cache in org.apache.hadoop.ozone.om.ListIterator.DbTableIter#getNextKey. That check read the table cache, which may be flushed intermittently, so a duplicate entry could be added to org.apache.hadoop.ozone.om.ListIterator.MinHeapIterator. The check now uses the pre-loaded keys in org.apache.hadoop.ozone.om.ListIterator.CacheIter#cacheKeyMap.

Apache Jira: HDDS-9967
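
For CDPD-64874, the idea of checking DB keys against a pre-loaded set of cache keys, rather than against a live cache that may be flushed mid-iteration, can be sketched as follows. This is a simplified illustration under stated assumptions: ListMergeSketch and mergeKeys are hypothetical names, and the real iterators in org.apache.hadoop.ozone.om.ListIterator are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

public class ListMergeSketch {
    /**
     * Merges DB keys with cache keys through a min-heap. A DB key is skipped
     * when it is already in the pre-loaded cache key set, so the heap never
     * receives the same key from both sources. The set is fixed for the whole
     * merge, so the answer cannot change between checks.
     */
    static List<String> mergeKeys(List<String> dbKeys, Set<String> preloadedCacheKeys) {
        PriorityQueue<String> heap = new PriorityQueue<>(preloadedCacheKeys);
        for (String key : dbKeys) {
            if (!preloadedCacheKeys.contains(key)) {
                heap.add(key);
            }
        }
        List<String> ordered = new ArrayList<>();
        while (!heap.isEmpty()) {
            ordered.add(heap.poll());
        }
        return ordered;
    }

    public static void main(String[] args) {
        Set<String> cacheKeys = new HashSet<>(Arrays.asList("snap2", "snap3"));
        List<String> dbKeys = Arrays.asList("snap1", "snap2");
        System.out.println(mergeKeys(dbKeys, cacheKeys));
    }
}
```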

CDPD-64815: NSSummary commands should close OzoneClient.
NSSummaryAdmin creates an OzoneClient for some bucket-related checks. This issue is now resolved with the following changes:
  • The client is closed when it is no longer needed.
  • The client (or even the bucket after lookup) is reused for all checks.

Apache Jira: HDDS-9944
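
For CDPD-64815, the close-and-reuse pattern can be illustrated with Java's try-with-resources statement. The sketch below is an assumption about the general shape only: ClientReuseSketch, StubClient, and runChecks are hypothetical and do not use the Ozone client API.

```java
public class ClientReuseSketch {
    /** Hypothetical stand-in for a client handle; not the Ozone client API. */
    static final class StubClient implements AutoCloseable {
        static int opened = 0;
        static int closed = 0;

        StubClient() {
            opened++;
        }

        boolean volumeExists(String volume) {
            return !volume.isEmpty();
        }

        boolean bucketExists(String volume, String bucket) {
            return !bucket.isEmpty();
        }

        @Override
        public void close() {
            closed++;
        }
    }

    /** Runs every check against one client and closes it on exit. */
    static boolean runChecks(String volume, String bucket) {
        try (StubClient client = new StubClient()) {
            return client.volumeExists(volume) && client.bucketExists(volume, bucket);
        }
    }

    public static void main(String[] args) {
        boolean ok = runChecks("vol1", "bucket1");
        System.out.println("checks passed: " + ok
            + ", opened=" + StubClient.opened + ", closed=" + StubClient.closed);
    }
}
```

Because the client is declared in the try header, it is closed even if one of the checks throws.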

CDPD-64209: Ozone Recon - Potential memory overflow in Container Health Task.
This issue is now resolved by fixing the potential memory overflow in the Container Health Task of Recon.

Apache Jira: HDDS-9819

CDPD-63596: Do not include SpotBugs at compile scope.
This issue is now resolved by removing spotbugs-annotation, an LGPL third-party dependency, from the Ozone package.

Apache Jira: HDDS-9692

CDPD-62991: Recon UI - Bucket Drop down filter is not getting disabled when more than 1 volume is selected. This is a backport of HDDS-9556.
This issue is now resolved.

Apache Jira: HDDS-9556

CDPD-62931: Incorrect pipeline ID for closed container.
This issue is now resolved.

Apache Jira: HDDS-9544

CDPD-62925: Ozone debug chunkinfo command shows incorrect number of entries.
This issue is now resolved.

Apache Jira: HDDS-9542

CDPD-62471: Recon UI - Disk Usage page should reflect the information it displays.
This issue is now resolved.

Apache Jira: HDDS-9465

CDPD-62466: Improve thread names in Recon.
This issue is resolved by improving thread naming in the Recon process:
  1. Pass Recon as a thread name prefix in Recon.
  2. Ensure all other threads created in Recon code also include Recon in their name.

Apache Jira: HDDS-9470
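
For CDPD-62466, one common way to give every pool thread a shared name prefix is a custom java.util.concurrent.ThreadFactory. The sketch below assumes a naming scheme of prefix-pool-sequence; the PrefixedThreadsSketch class, its methods, and that scheme are hypothetical and not taken from the Recon code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class PrefixedThreadsSketch {
    /** Builds a factory whose threads are named prefix-pool-N (assumed scheme). */
    static ThreadFactory prefixed(String prefix, String pool) {
        AtomicInteger seq = new AtomicInteger();
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + pool + "-" + seq.getAndIncrement());
            t.setDaemon(true);
            return t;
        };
    }

    /** Runs one task on a prefixed pool and returns the worker thread's name. */
    static String nameOfWorker(String prefix, String pool) {
        ExecutorService executor = Executors.newSingleThreadExecutor(prefixed(prefix, pool));
        try {
            Callable<String> task = () -> Thread.currentThread().getName();
            return executor.submit(task).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(nameOfWorker("Recon", "SyncOM"));
    }
}
```

A shared prefix makes it easier to tell which process or component a thread belongs to when reading thread dumps and logs.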

CDPD-61700: Ozone debug chunkinfo shows incorrect block path for some nodes in a phatcat cluster.
This issue is now resolved.

Apache Jira: HDDS-9356

CDPD-60647: Snapshot purge should be an atomic operation.
This issue is resolved by fixing the OM crash issue when the follower is lagging and it executes purgeKeys or snapshotMoveDeletedKeys for the Snapshot in one transaction.

Apache Jira: HDDS-9198

CDPD-51724: SCM should avoid sending delete transactions for under-replicated containers.
This issue is now resolved.

Apache Jira: HDDS-4368