Fixed issues in Ozone

Review the list of Ozone issues that are resolved in Cloudera Runtime 7.3.1, its service packs and cumulative hotfixes.

Cloudera Runtime 7.3.1.500 SP3

CDPD-80742: ConstraintViolationException was crashing the ContainerHealthTask in Ozone Recon
7.3.1.500
The ConstraintViolationException was crashing the ContainerHealthTask in Recon. This issue is now fixed: the task no longer crashes and continues to identify unhealthy containers in SCM.

Apache Jira: HDDS-12585

CDPD-78266: Ozone Manager displays NullPointerException (NPE) when overwriting empty file using multipart upload
7.3.1.500
Previously, the S3MultipartUploadCompleteRequestWithFSO class threw a NullPointerException when an empty file was overwritten by a non-empty file, because omBucketInfo can be null. A null value is passed when no update is needed. This issue is now fixed by checking omBucketInfo before use.

Apache Jira: HDDS-12131

CDPD-80791: Ozone Recon - Unknown tar ball cleanup for Recon Ozone Manager DB snapshot
7.3.1.500
Previously, an old tar file left in the Recon Ozone Manager DB snapshot directory was not removed when Recon started or restarted. The file could remain there indefinitely, occupying disk space, which is significant when the tarball is large. This issue is now fixed: Recon cleans up any unused or leftover tar files.

Apache Jira: HDDS-12601

CDPD-81786: Auto-compact tables which can tend to be large in size at intervals
7.3.1.500
Previously, when too many files and directories were deleted in a short period, RocksDB could accumulate too many tombstones, which impeded the performance of regular file system operations. This issue is now fixed. The fix adds a background thread in Ozone Manager that can optionally be enabled to compact RocksDB tombstones at a fixed interval by using the following configuration properties:
  • ozone.compaction.service.enabled
    Default value: false
    Description: Enables or disables a background job that periodically compacts RocksDB tables flagged for compaction.
  • ozone.om.compaction.service.run.interval
    Default value: 6h
    Description: Specifies the interval at which the background job compacts RocksDB tables flagged for compaction. Units are defined with a postfix and can be ns, ms, s, m, h, or d.
  • ozone.om.compaction.service.timeout
    Default value: 10m
    Description: Specifies a timeout value for the compaction service. If this is set to a value greater than 0, the service stops waiting for compaction to complete after the set time. Units are defined with a postfix and can be ns, ms, s, m, h, or d.
  • ozone.om.compaction.service.columnfamilies
    Default value: keyTable,fileTable,directoryTable,deletedTable,deletedDirectoryTable,multipartInfoTable
    Description: Specifies a comma-separated list, without spaces, of the column families that are compacted by the compaction service. If this is empty, no column families are compacted.
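
As an illustration only, the properties above could be rendered as Hadoop-style property entries. The following is a minimal Python sketch; the render_properties helper is hypothetical, and where these entries are placed (for example, an Ozone Manager configuration override) is an assumption that depends on your deployment. Only the enabled flag is changed from its default.

```python
from xml.sax.saxutils import escape

# Property names and default values are taken from the list above.
# Only the "enabled" flag differs from its default value here.
COMPACTION_PROPERTIES = {
    "ozone.compaction.service.enabled": "true",
    "ozone.om.compaction.service.run.interval": "6h",
    "ozone.om.compaction.service.timeout": "10m",
    "ozone.om.compaction.service.columnfamilies": (
        "keyTable,fileTable,directoryTable,deletedTable,"
        "deletedDirectoryTable,multipartInfoTable"
    ),
}


def render_properties(properties):
    """Render name/value pairs as Hadoop-style <property> XML entries."""
    # The column family list must be comma separated with no spaces.
    families = properties.get("ozone.om.compaction.service.columnfamilies", "")
    if " " in families:
        raise ValueError("column family list must not contain spaces")

    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_properties(COMPACTION_PROPERTIES))
```

Leaving ozone.compaction.service.enabled at its default of false keeps the background compaction job disabled, regardless of the other three values.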

Apache Jira: HDDS-12819

CDPD-87719: Ozone snapshot diff command is failing
7.3.1.500
The snapdiff operation could fail with a NullPointerException error if a directory had metadata changes, such as Access Control List (ACL) updates or modification time updates between snapshots. This issue is now resolved.
CDPD-83223: Validate Container Balancer move timeout and replication timeout configurations
7.3.1.500
Previously, if the value of move.timeout - move.replication.timeout - event.timeout.datanode.offset was less than 9 minutes, the expiry time set on the delete commands was in the past, so those delete commands were invalid. This issue is now fixed by adding a validation rule for the container balancer configurations: the value of move.timeout - move.replication.timeout - event.timeout.datanode.offset must be at least 9 minutes.
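
The rule can be checked with simple arithmetic. The following Python sketch is an illustration under stated assumptions, not the validation code in SCM: the function name and the sample values of 65, 50, and 6 minutes are chosen for the example and are not taken from this document.

```python
from datetime import timedelta

# Minimum margin stated in the fix description above.
MIN_MARGIN = timedelta(minutes=9)


def validate_balancer_timeouts(move_timeout, replication_timeout, datanode_offset):
    """Check move.timeout - move.replication.timeout - event.timeout.datanode.offset.

    Returns the computed margin, or raises ValueError if it is below 9 minutes.
    """
    margin = move_timeout - replication_timeout - datanode_offset
    if margin < MIN_MARGIN:
        raise ValueError(
            f"margin {margin} is less than the required {MIN_MARGIN}"
        )
    return margin


if __name__ == "__main__":
    # Illustrative values only: 65m - 50m - 6m leaves exactly 9 minutes.
    margin = validate_balancer_timeouts(
        timedelta(minutes=65), timedelta(minutes=50), timedelta(minutes=6)
    )
    print(f"margin is {margin}")
```

With these sample values, lowering move.timeout to 60 minutes would leave a margin of 4 minutes and fail the check.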

Apache Jira: HDDS-13068

CDPD-83224: Container Balancer delete commands are sent with an expiration time in the past
7.3.1.500
Previously, delete commands were getting discarded in the container balancer due to incorrect expiration time. This issue is now fixed and the container balancer can now send delete commands correctly.

Apache Jira: HDDS-13067

Improvements for large scale deletion
7.3.1.500
This release resolves multiple issues to speed up Ozone data deletion. The fixes focus on the following areas:
  • Performance Improvements

    Improving the speed at which space is reclaimed from the system, and optimizing deletion processes to handle large-scale operations more efficiently.

  • Bug Fixes

    Addressing issues that occur specifically when deleting large amounts of data.

  • Observability Enhancements

    Improving observability throughout the deletion process through logs, metrics, and dashboards. New metrics have been introduced and added to Grafana, with a lightweight dashboard created to track the deletion progress.

Apache Jira: HDDS-11633, HDDS-13057, HDDS-13103, HDDS-13142, HDDS-13259, HDDS-12474, HDDS-11512, HDDS-11511, HDDS-11509, HDDS-11711, HDDS-11779, HDDS-12442

Cloudera Runtime 7.3.1.400 SP2

CDPD-82201: OMKeyAclRequestWithFSO is incorrectly setting full path as key name
7.3.1.400
Previously, when you set, added, or removed an ACL for an FSO bucket, the key name was corrupted with the full key path. This fix ensures that the correct key name is set during the ACL calls.

Apache Jira: HDDS-12891

CDPD-81939: Volume scanner should fail volume if rocksDB is inaccessible
7.3.1.400
When RocksDB becomes unreadable on a DataNode due to disk-related issues, the DataNode will mark the affected storage volume as unhealthy. This proactive health marking enables the system to initiate data replication processes more rapidly, thereby maintaining data availability and integrity.

Apache Jira: HDDS-12723

CDPD-78932: Container replication should be atomic
7.3.1.400
During container replication, the destination node imports the container from the source node. If any issues are encountered during the import process, the Datanode is responsible for gracefully cleaning up any residual or stale container metadata to maintain system integrity.

Apache Jira: HDDS-12233

CDPD-73278: Update OM, SCM, Datanode conf for RATIS-2135
7.3.1.400
This fix sets raft.grpc.message.size.max to be 1 MB larger than raft.server.log.appender.buffer.byte-limit for OM, SCM, and Datanode.
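
The relationship between the two values is a simple offset. The following Python sketch shows the arithmetic, assuming that 1 MB means 1024 * 1024 bytes; the 32 MB buffer limit is a hypothetical input for the example, not a value stated in this document.

```python
# Assumption: "1 MB" is interpreted as 1024 * 1024 bytes.
ONE_MB = 1024 * 1024


def grpc_message_size_max(appender_buffer_byte_limit):
    """Return a raft.grpc.message.size.max value that is 1 MB larger than
    raft.server.log.appender.buffer.byte-limit (both in bytes)."""
    return appender_buffer_byte_limit + ONE_MB


if __name__ == "__main__":
    # Hypothetical buffer limit of 32 MB, used only for illustration.
    byte_limit = 32 * ONE_MB
    print(grpc_message_size_max(byte_limit) // ONE_MB, "MB")
```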

Apache Jira: HDDS-11320

CDPD-57559: New Ozone Manager leader cannot verify the Ozone delegation token signed by old Ozone Manager leader
7.3.1.400
If an Ozone cluster was upgraded and later downgraded, the Ozone Manager running after the downgrade could not verify the Ozone delegation tokens issued before the downgrade. This caused clients that were still running during the upgrade and downgrade period to fail. Clusters that were not downgraded were not affected.

This issue is now fixed. The fix changes Ozone delegation token signing from an asymmetric key to a symmetric key.

Apache Jira: HDDS-8829

CDPD-70409: Recon Overview Page UI fails to load if Recon Solr Health throws error
7.3.1.400
This fixes an issue where the Recon UI failed to load if the Solr Health check API threw an error.

Cloudera Runtime 7.3.1.300 SP1 CHF 1

CDPD-80823: Snapshot creation is removing extra keys from the Active Object Storage's DB
7.3.1.300
Previously, wrong keys were trapped in the DeletedTable of the snapshot if the name of one OBS bucket was a prefix of another OBS bucket name, resulting in orphaned blocks. With this fix, snapshot creation no longer removes extra keys from the DeletedTable of the Active Object Store.
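
The general pitfall behind this kind of issue can be sketched as follows. This Python example is a hypothetical illustration, not the Ozone Manager code: the /volume/bucket/key layout and both function names are assumptions made for the example.

```python
def naive_bucket_keys(db_keys, volume, bucket):
    """Select keys by a prefix WITHOUT a trailing delimiter.

    This also matches buckets whose names merely start with `bucket`.
    """
    prefix = f"/{volume}/{bucket}"
    return [key for key in db_keys if key.startswith(prefix)]


def delimited_bucket_keys(db_keys, volume, bucket):
    """Select keys by a prefix that ends with the path delimiter,
    so only keys that belong to exactly this bucket match."""
    prefix = f"/{volume}/{bucket}/"
    return [key for key in db_keys if key.startswith(prefix)]


if __name__ == "__main__":
    # Hypothetical deleted-key entries for two buckets, where "bucket1"
    # is a prefix of "bucket10".
    keys = ["/vol1/bucket1/key-a", "/vol1/bucket10/key-b"]
    print(naive_bucket_keys(keys, "vol1", "bucket1"))
    print(delimited_bucket_keys(keys, "vol1", "bucket1"))
```

In this sketch, the naive match for bucket1 also returns the key that belongs to bucket10, while the delimited match returns only the key in bucket1.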

Apache Jira: HDDS-12611

CDPD-73375: Publishing hadoop metrics immediately in Prometheus sink fills up SinkQueue quickly
7.3.1.300
The Prometheus sink already has a mechanism to publish metrics every 10 seconds by default, using a callback driven by a timer event. The code that published metrics immediately has been removed, which fixes this issue.

Apache Jira: HDDS-12193

CDPD-78671: Metric timer task is blocking installSnapshotFromLeader on follower node
7.3.1.300
The ozone.om.snapshot.rocksdb.metrics.enabled property is now available in Ozone to disable metric collection on the snapshotted DB if necessary.

Apache Jira: HDDS-11339

CDPD-78781: Tarball creation interfering with snapshot purge
7.3.1.300
This fix synchronizes SnapshotDeletingService on BootstrapStateHandler.Lock to ensure that no background service runs while tarball creation is in progress.

Apache Jira: HDDS-12210

Cloudera Runtime 7.3.1.200 SP1

CDPD-74556: EC Checksum throws IllegalArgumentException because the buffer limit is negative
7.3.1.200
When ozone.client.bytes.per.checksum was set to a low value (for example, 16 KB), the parity checksum calculation during the validation phase was wrong, leading to an IllegalArgumentException on the client. This issue is now fixed.

Apache Jira: HDDS-11482

CDPD-75981: Default native ACL limits to user and user's primary group
7.3.1.200
The default native ACL created for an object such as a volume, bucket, or file is now limited to the object owner and the owner's primary group.

Apache Jira: HDDS-11656

CDPD-72782: Ozone write does not work when http proxy is set for the JVM
7.3.1.200
gRPC uses HTTP internally for its connections. As a result, if an HTTP proxy is configured for any Ozone process that uses gRPC, each gRPC call is also directed through the proxy, which is undesirable for performance. This fix disables the proxy for the gRPC connections that Ozone uses.

Apache Jira: HDDS-11257

CDPD-65714: Allow FS client to specify EC as default filesystem replication
7.3.1.200
This fix allows you to specify EC as the default replication type for a file uploaded through the Hadoop FileSystem API to Ozone through a client side configuration option.

Apache Jira: HDDS-10336

Cloudera Runtime 7.3.1.100 CHF 1

There are no fixed issues in this release.

Cloudera Runtime 7.3.1

OPSAPS-71474: In Cloudera Manager UI, the Ozone service Snapshot tab displays label label.goToBucket and it must be changed to Go to bucket.
7.3.1
This issue is now resolved.
OPSAPS-70288: Improvements in master node decommissioning.
7.3.1
This issue is now resolved by making usability and functional improvements to the Ozone master node decommissioning.
CDPD-74756: Update Ratis to 3.1.1
7.3.1
Updated Ratis dependency version from 3.1.0 to 3.1.1.

Apache Jira: HDDS-11504

CDPD-74241: OmSnapshotPurge should be in a different Ozone manager double buffer batch.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-11453

CDPD-74200: Recon UI shows incorrect data about volume, bucket, and keys. Recon is unable to sync its data with OM DB.
7.3.1
This issue is now resolved.
CDPD-74074: The /v1/triggerdbsync/om api is working with non-admin user even if security is enabled.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-11436

CDPD-73775: Replace solr.version with solr_lkgb_jar_version for Ozone to use downstream version of Solr.
7.3.1
Replaced pom solr.version with solr_lkgb_jar_version for Ozone to use downstream version of Solr.
CDPD-73447: Incorrect number of deleted containers shown in Recon UI.
7.3.1
The addition of the EMPTY_MISSING state to the retainOrUpdateRecord method enables Recon to correctly identify and manage the lifecycle of these containers, ensuring that no stale or deleted containers remain in memory or in Recon's records.
CDPD-73330: The namespace quota and namespace dist commands fail and display the Path not found in the system error for an existing volume or bucket.
7.3.1
Fixed the Ozone admin namespace summary.

Apache Jira: HDDS-10581

CDPD-72142: Keys from DeletedTable and DeletedDirTable of Active Object Store (AOS) should be deleted on batch operation while creating a Snapshot.
7.3.1
On snapshot creation, the DeletedTable and DeletedDirTable of AOS are cleared. Previously, this operation was not performed in the same transaction as snapshot creation, which could cause orphaned block objects in cases of bootstrapping and a lagging follower. This issue is now resolved: snapshot creation and the clearing of the DeletedTable and DeletedDirTable on AOS are now a single batch operation.

Apache Jira: HDDS-11183

CDPD-72076: The OMDoubleBuffer error is displayed when handling OMRequest: cmdType: SnapshotMoveDeletedKeys.
7.3.1
This fixes the OM crash issue when the follower is lagging and it executes purgeKeys or snapshotMoveDeletedKeys for the Snapshot in one transaction.

Apache Jira: HDDS-11152

CDPD-72019: Remove the locks from SnapshotPurge and SnapshotSetProperty APIs.
7.3.1
This fixes the OM crash issue when the follower is lagging and it executes purgeKeys or snapshotMoveDeletedKeys for the Snapshot in one transaction.

Apache Jira: HDDS-11137

CDPD-71702: Ozone Manager is down due to Snapshot Chain Corruption.
7.3.1
Previously, SSTFilteringService directly updated the snapshotInfoTable, which could corrupt the snapshot chain if OM crashed before the DB was flushed for a snapshot purge and SSTFilteringService had already updated the next snapshot in the chain. This issue is now resolved.

Apache Jira: HDDS-11068

CDPD-71584: Ozone Recon DecomissioningInfo API displays the NPE error.
7.3.1
This issue is resolved by fixing the NullPointerException when running DecomissioningInfo API.

Apache Jira: HDDS-11045

CDPD-71502: Ozone Recon - Decommissioned datanodes show up even after removing it from the Recon Datanodes page.
7.3.1
Previously, Recon allowed decommissioned datanodes to be removed and deleted them from the Recon RocksDB nodes table. However, decommissioned datanodes continue to send heartbeats until they are shut down. These heartbeats were registered and the datanodes were added back to the Recon in-memory map, which made them reappear on the Datanodes page. This issue is now resolved: Recon allows only decommissioned datanodes to be removed and skips datanodes in any other node status or node operational status.

Apache Jira: HDDS-11032

CDPD-70469: Ozone Recon - Handle startup failure and log reasons as error because SCM non-HA is enabled.
7.3.1
This issue is now resolved by fixing the Recon startup failure when SCM runs in non-HA mode.

Apache Jira: HDDS-10937

CDPD-68912: Ozone Recon - Improve Recon startup failure handling.
This issue is now resolved. Recon now recovers from runtime or unexpected failures during startup and provides information on the Recon UI. Recon can fail to start for several reasons, including:
  • Failure to register datanodes, or an invalid topology.
  • Failure to initialize pipelines.

Apache Jira: HDDS-10702

CDPD-67668: Ozone Recon - Provide DN decommissioning detailed status and information inline with current CLI command output.
7.3.1
This issue is resolved by adding an API in Recon for DN decommissioning. The status and information are now in line with the current CLI command output.

Apache Jira: HDDS-10514

CDPD-67460: Container Balancer should only move containers with size greater than 0 bytes.
7.3.1
This issue is now resolved by introducing a check on the size of the containers allowed to leave the source node during the balancing process.

Apache Jira: HDDS-10483

CDPD-67278: Fix the DN links on the Ozone SCM UI. This is a backport of KNOX-3012.
7.3.1
A change in Ozone affected Knox on the Ozone SCM UI. The links for the datanodes did not route through Knox. This issue is now resolved and the DN links will redirect to the correct Knox URLs.
CDPD-67095: DN URL in SCM Page through Knox redirects to non-Knox URL.
A change in Ozone affected Knox on the Ozone SCM UI. The links for the datanodes did not route through Knox. With CDPD-67278 and CDPD-69143, this issue is now resolved and the DN links will redirect to the correct Knox URLs.
CDPD-64874: Intermittent failure in TestOzoneRpcClientAbstract.testListSnapshot.
7.3.1
This issue is now resolved by fixing intermittent wrong data returned by the listSnapshot API. The listSnapshot API uses org.apache.hadoop.ozone.om.ListIterator.MinHeapIterator, which internally uses both CacheIterator and DBIterator. DBIterator checked whether a RocksDB key was present in the cache in org.apache.hadoop.ozone.om.ListIterator.DbTableIter#getNextKey. This check read the table cache, which can be flushed intermittently, and so could add a duplicate entry to org.apache.hadoop.ozone.om.ListIterator.MinHeapIterator. The fix uses the pre-loaded keys in org.apache.hadoop.ozone.om.ListIterator.CacheIter#cacheKeyMap instead.

Apache Jira: HDDS-9967

CDPD-64815: NSSummary commands should close OzoneClient.
7.3.1
NSSummaryAdmin creates an OzoneClient for some bucket-related checks. This issue is now resolved with the following changes:
  • The client is closed when it is no longer needed.
  • The client (or even the bucket after lookup) is reused for all checks.

Apache Jira: HDDS-9944

CDPD-64209: Ozone Recon - Potential memory overflow in Container Health Task.
7.3.1
This issue is now resolved by fixing the potential memory overflow in the Container Health Task of Recon.

Apache Jira: HDDS-9819

CDPD-63596: Do not include SpotBugs at compile scope.
7.3.1
This issue is now resolved by removing spotbugs-annotation, an LGPL third-party dependency, from the Ozone package.

Apache Jira: HDDS-9692

CDPD-62991: Recon UI - Bucket Drop down filter is not getting disabled when more than 1 volume is selected. This is a backport of HDDS-9556.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-9556

CDPD-62931: Incorrect pipeline ID for closed container.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-9544

CDPD-62925: Ozone debug chunkinfo command shows incorrect number of entries.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-9542

CDPD-62471: Recon UI - Disk Usage page should reflect the information it displays.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-9465

CDPD-62466: Improve thread names in Recon.
7.3.1
This issue is resolved by improving thread naming in the Recon process:
  1. Pass Recon as a thread name prefix in Recon.
  2. Ensure all other threads created in Recon code also include Recon in their name.

Apache Jira: HDDS-9470

CDPD-61700: Ozone debug chunkinfo shows incorrect block path for some nodes in a phatcat cluster.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-9356

CDPD-60647: Snapshot purge should be an atomic operation.
7.3.1
This issue is resolved by fixing the OM crash issue when the follower is lagging and it executes purgeKeys or snapshotMoveDeletedKeys for the Snapshot in one transaction.

Apache Jira: HDDS-9198

CDPD-51724: SCM should avoid sending delete transactions for under-replicated containers.
7.3.1
This issue is now resolved.

Apache Jira: HDDS-4368