Known Issues in Apache Ozone

Learn about the known issues in Ozone, the impact or changes to the functionality, and the workaround.

After you upgrade a cluster from CDP Private Cloud Base 7.1.8 to CDP Private Cloud Base 7.1.9 with Ozone in a non-HA environment, an exception message appears during the finalization of the Ozone upgrade.
During the finalization of the upgrade, a ClassNotFoundException for the org.cloudera.log4j.redactor.RedactorAppender class is logged. The error message is harmless because the upgrade succeeds. The error existed previously and does not affect the Ozone service or its operation.
CDPD-68951: In 7.1.9 CHF2 and lower versions, the command ozone sh key list <bucket_path> displays the isFile flag in a key's metadata as false even when the key is a file. This issue is fixed in 7.1.9 CHF3. However, the metadata of keys that existed before the upgrade cannot be changed.
When using S3A committer with fs.s3a.committer.staging.conflict-mode=replace to write to FSO buckets, the client fails with the following error.
DIRECTORY_NOT_FOUND Failed to find parent directory of xxxxxxxx
    at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(
    at java.base/java.util.concurrent.CompletableFuture$
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.base/java.util.concurrent.ThreadPoolExecutor$
    at java.base/
This occurs because S3A uses multipart upload (MPU) to commit job results in a batch. The staging committer's replace mode deletes the target directory before completing the MPU. FSO does not create intermediate directories during MPU; it creates them only for regular file, directory, and key requests.
Use for ** affected versions.
CDPD-64398: In SCM with a non-HA configuration, the secret key manager is not initialized. As a result, OM and Datanode fail to start because they cannot get a secret key. This key is used when security is enabled, for block token and container token verification in communication between the Ozone client, OM, and Datanode.
This issue does not occur in SCM with HA configuration.
You must force SCM to exit safe mode, which triggers the initialization of the secret key manager. The options are:
  • Exit safe mode manually whenever SCM is started.
  • Disable safe mode by setting hdds.scm.safemode.enabled=false in the safety valve for the SCM configuration.
Safe mode exists to ensure the following, which is also what disabling it can affect:
  • Writes should not fail once the cluster is out of safe mode.
  • Reads of existing data should not fail after the cluster is out of safe mode.
  • Unnecessary re-replication should be avoided during cluster restart.

Disabling safe mode therefore does not have any major impact on Ozone.
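The manual option can be sketched as a small shell helper. This is a minimal sketch, assuming the ozone CLI is on the PATH of the host and that the user running it is allowed to administer SCM; the function name scm_force_exit_safemode is hypothetical.

```shell
# Sketch: force SCM out of safe mode after SCM has started.
# Assumes the ozone CLI is on PATH and the caller has SCM admin rights.
scm_force_exit_safemode() {
    # Print the current state first so the action is visible in the output.
    ozone admin safemode status || return 1
    # Ask SCM to leave safe mode.
    ozone admin safemode exit
}
```

Call scm_force_exit_safemode after each SCM start for as long as the workaround is needed.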

HDDS-9512: Ozone Datanode's new client port conflicts with HDFS Datanode's web port if both Ozone and HDFS Datanode roles are placed on the same host.
You must set hdds.datanode.client.port to any unused port, for example 19864, through the Ozone Datanode safety valve.
CDPD-52412: keyManagerImpl#listStatus exceeds the maximum RPC length.
To reduce the payload, you must increase the part size.
OPSAPS-68159: If you did not deactivate and undistribute the Ozone parcel 718.1.0 on Cloudera Manager 7.7.1 + CDH 7.1.8 before upgrading to Cloudera Manager 7.11.3 + CDH 7.1.9, the "Error when distributing parcel to host, Ignoring non-compliant parcel manifest" error is displayed after Cloudera Manager upgrade to 7.11.3.
If you encounter the error, perform the following steps:
  1. Deactivate and undistribute the Ozone parcel 718.1.0 on Cloudera Manager 7.11.3.
  2. Restart the cluster with a delay of 10 minutes.
  3. Continue with the CDH 7.1.8 upgrade to CDH 7.1.9.
OPSAPS-68159: If you did not deactivate the Ozone parcel 718.2.x on Cloudera Manager 7.7.1 + CDH 7.1.8 before upgrading to Cloudera Manager 7.11.3 + CDH 7.1.9, the Ozone roles go down during the CDH 7.1.8 upgrade to CDH 7.1.9.
If you encounter the error, perform the following steps:
  1. Deactivate the Ozone parcel 718.2.x.
  2. Restart the Ozone service.
  3. Perform Finalize Upgrade for Ozone service.

Step result: The Ozone roles will come up green.

CDPD-60989: The packaging version for Apache Ozone points to the older 1.2.0 version. This is a version string problem and not a packaging issue. The version of the Apache Ozone binary is closest to 1.4.0.
None. This only affects the jar versioning.
CDPD-60366: Native library loader fails when system property native.lib.tmp.dir is not set. It fails because the library is copied to / instead of the cwd.
Set native.lib.tmp.dir to a path such as /tmp. To set this system property, add -Dnative.lib.tmp.dir=/tmp to ozone_java_opts in the Cloudera Manager configuration of the Ozone cluster.
CDPD-60489: Jackson-dataformat-yaml 2.12.7 and Snakeyaml 2.0 are not compatible.
You must not use Jackson-dataformat-yaml through the platform for YAML parsing.
CDPD-60598: Datanode can consume up to the JVM heap size worth of direct memory buffers under specific failure scenarios. This can lead to a Datanode crash. This is due to netty allocating and not freeing the direct memory buffers.
Restarting Datanode will resolve this issue.
CDPD-60466: The Ozone client is missing in the Cloudera Manager's Service Monitor node. This causes Ozone Canary's health check to fail.
Ensure you install Ozone CLI on the same node as Cloudera Manager's Service Monitor.
CDPD-60012: The Ozone SCM may be stuck in safe mode after upgrade to 7.1.9. This is due to a bug in how the SCM accounts for open storage containers.
The workaround is to temporarily set the Ozone configuration hdds.scm.safemode.threshold.pct to a lower value like 0.90 and restart the SCM.
CDPD-59679: In some scenarios, excess UNHEALTHY replicas of EC containers may not be removed from the cluster.
CDPD-50116: Topology-aware reads can provide better performance by routing applications to the closest replica of a file, if one is available on the same physical node or the same rack. However, this feature is disabled by default in CDP 7.1.9.
This feature can be enabled by setting to true via a Cloudera Manager safety valve.
CDPD-57165: The DataNode disk usage thread may abort due to Ratis log tailing.
Use org.apache.hadoop.hdds.fs.DedicatedDiskSpaceUsageFactory to calculate disk usage. In ozone-site.xml, set the property hdds.datanode.du.factory to the value org.apache.hadoop.hdds.fs.DedicatedDiskSpaceUsageFactory.
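As an illustration, the following shell sketch prints the property in the XML form that ozone-site.xml and safety valve snippets use. The helper name ozone_site_property is hypothetical, and the property name and value are taken as given from the workaround text.

```shell
# Print an ozone-site.xml <property> element.
# $1 = property name, $2 = property value
ozone_site_property() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Property name and value as stated in the workaround.
ozone_site_property hdds.datanode.du.factory \
    org.apache.hadoop.hdds.fs.DedicatedDiskSpaceUsageFactory
```

Paste the printed element into the relevant ozone-site.xml safety valve and restart the affected roles.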
OPSAPS-63510: When Ozone Container Balancer is started using Activate Container Balancer from Cloudera Manager, it will run on the Storage Container Manager (SCM) host which is the RATIS leader. However, the link to the Full Log File under Role Log in the Cloudera Manager command window for the Activate Container Balancer command may not link to the leader SCM's logs.
  1. Find the leader SCM using one of the following methods:
     • Using Cloudera Manager and the SCM web UI: Go to Clusters > Ozone > Web UI and open any of the Storage Container Manager web UIs. In the Status section, look for SCM Roles (HA). The leader SCM's hostname is listed there.
     • Using a terminal: Log in to any Ozone host, run ozone admin scm roles, and note the leader.
  2. After finding the leader SCM, search the leader host's logs for ContainerBalancer-related entries.
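The terminal method and the log search can be sketched in shell as follows. This is a rough sketch under two assumptions: that the leader's line in the output of ozone admin scm roles contains the word LEADER, and that you know the path of the SCM role log on the leader host (it varies by deployment). Both function names are hypothetical.

```shell
# Print the lines of `ozone admin scm roles` that mention the leader.
# Assumption: the leader's line contains the word LEADER.
find_leader_scm() {
    ozone admin scm roles | grep 'LEADER'
}

# Print Container Balancer messages from a log file.
# $1 = path to the leader SCM's role log (deployment specific).
grep_balancer_logs() {
    grep 'ContainerBalancer' "$1"
}
```

Run find_leader_scm on any Ozone host, then run grep_balancer_logs on the leader host with the path of its SCM log.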
OPSAPS-67373: Toggling the Enable Ozone S3 Multi-Tenancy feature configuration in the Cloudera Manager Ozone service configuration page affects more service roles than actually needed.
Enabling multi-tenancy only requires restarting the Ozone Managers.
CDPD-55513: Hive external table replication policy does not migrate Hive tables in HDFS storage to Ozone.
To replicate data from HDFS to Ozone, see Migrating your data from HDFS to Ozone. You can use HMS Mirror to replicate the metadata.
OPSAPS-67757: Hive external tables in Ozone storage cannot be replicated using Hive external table replication policies.
To replicate the Hive external tables' data, consider using DistCp. To replicate the metadata of Hive external tables, consider using HMS Mirror.
Removing a bucket recursively with the AWS S3 rb --force command does not work for FSO buckets.
Use the Ozone shell command ozone sh bucket delete -r <Bucket address>
CDPD-59126: An INFO log message "noexec permission on /tmp/liborg_apache_ratis_thirdparty_netty_transport_native_epoll_x86" is displayed on the client when a command is executed while /tmp is mounted with noexec.
To suppress INFO log related to liborg_apache_ratis_thirdparty_netty_transport_native_epoll_x86 library: Export OZONE_OPTS environment variable on the client terminal by running the command export OZONE_OPTS=" $OZONE_OPTS"
OPSAPS-67650: Ozone uses RocksDB as a library to persist metadata locally.
By default, RocksDB puts some executables in /tmp and therefore encounters errors when /tmp is mounted with noexec.
The workaround is to configure RocksDB to put executables at another location. On a PhatCat node, the steps are:
  1. Go to Cloudera Manager UI > OZONE > Configuration.
  2. Find Ozone Service Environment Advanced Configuration Snippet (Safety Valve) and set the following environment variable: ROCKSDB_SHAREDLIB_DIR=/var/tmp
  3. Restart Ozone.
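To check whether a host is affected before applying the steps above, a shell sketch such as the following can help. It assumes a Linux host where /proc/mounts lists mount options in the fourth field; the function name is_noexec and its optional second argument are hypothetical.

```shell
# Return 0 if the given mount point is mounted with noexec, 1 otherwise.
# $1 = mount point, $2 = mount table to read (default /proc/mounts, Linux).
is_noexec() {
    mount_point=$1
    table=${2:-/proc/mounts}
    # Field 2 is the mount point, field 4 the comma-separated options.
    awk -v mp="$mount_point" '
        $2 == mp { opts = $4 }   # keep the last matching entry
        END {
            if (("," opts ",") ~ /,noexec,/) exit 0
            exit 1
        }
    ' "$table"
}
```

For example, `is_noexec /tmp && echo "set ROCKSDB_SHAREDLIB_DIR=/var/tmp"` prints a reminder only on hosts where /tmp is a separate mount with noexec.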
CDPD-49137: The OM Kerberos token expires for SCM communication and OM does not log in again.
Sometimes, OM's Kerberos token is not updated and OM stops communicating with SCM. When this occurs, writes start failing.
Restart OM or set the safety valve hadoop.kerberos.keytab.login.autorenewal.enabled = true.
CDPD-56684: Keys get deleted when you do not have permission on volume
When a volume is deleted, Ozone recursively deletes the buckets and keys inside it and only then deletes the volume. The volume delete ACL check is done only at the end, so you may end up deleting all the data inside the volume without having delete permission on the volume.
CDPD-50610: Large file uploads are slow with OPEN and stream data approach
The Hue file browser uses the append operation for large files. This API is not supported by Ozone in 7.1.9, and therefore large file uploads can be slow or time out in the browser.
Use the native Ozone client to upload large files instead of the Hue file browser.
OPSAPS-64097: Ozone service restart failed at SCM
Stopping the SCM service using Cloudera Manager can sometimes time out and require a retry. The Cloudera Manager API waits for 90 seconds, which is not sufficient under certain circumstances.
Retry the shutdown using Cloudera Manager if the SCM still shows up as running after a refresh of the service status.
OPSAPS-66469: Ozone-site.xml is missing if the host does not contain HDFS roles
The client-side ozone-site.xml (/etc/hadoop/conf/ozone-site.xml) is not generated by Cloudera Manager if the host does not have any HDFS role. Because of this, Ozone commands issued from that host fail because they cannot find the service name to host name mapping. The error message is similar to this:
# ozone sh volume list o3://ozoneabc
23/03/06 18:46:15 WARN ha.OMProxyInfo: OzoneManager address ozoneabc:9862 for serviceID null remains unresolved for node ID null Check your ozone-site.xml file to ensure ozone manager addresses are configured properly.
Add the HDFS gateway role on that host.
OPSAPS-67607: Cloudera Manager FirstRun failure at the “Upload YARN MapReduce Framework JARs” step.
If this failure is attributed to the broken symbolic link, /var/lib/hadoop-hdfs/ozone-filesystem-hadoop3.jar, it is likely due to the presence of the user hdfs on the node prior to CDP parcel activation. As a result, the Cloudera Manager agent skips the initialization related to HDFS, leading to the non-creation of the /var/lib/hadoop-hdfs directory.
Create the /var/lib/hadoop-hdfs directory on all nodes, then deactivate and reactivate the CDP parcel (or the Ozone parcel, if an Ozone parcel is used).
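To confirm on a node whether this condition applies, a check along the following lines may be useful. It is a sketch only: the function name check_ozone_fs_link is hypothetical, and the default paths are the ones named in this issue.

```shell
# Report whether the HDFS library directory is missing or the Ozone
# filesystem jar link is broken. Prints "ok" in every other case,
# including when the link does not exist at all.
# $1 = library directory, $2 = symlink path (both optional).
check_ozone_fs_link() {
    lib_dir=${1:-/var/lib/hadoop-hdfs}
    link=${2:-$lib_dir/ozone-filesystem-hadoop3.jar}
    if [ ! -d "$lib_dir" ]; then
        echo "missing directory: $lib_dir"
    elif [ -L "$link" ] && [ ! -e "$link" ]; then
        echo "broken symlink: $link"
    else
        echo "ok"
    fi
}
```

Running check_ozone_fs_link with no arguments on each node shows which nodes need the directory created before the parcel is reactivated.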
CDPD-50447: When SCM High Availability is enabled, each SCM web UI reports its own host as the HA leader and the other two hosts as followers. This information is incorrect.
To get the correct output, run the ozone admin scm roles --service-id=<ID> command.
OPSAPS-66501: Currently, it is not possible to configure High Availability for SCM roles in Ozone after deployment. It should be possible to change the HA configuration through Cloudera Manager, in line with other services.
At present, you must delete Ozone, manually clean up the Ozone data, and then add Ozone back with the SCM HA configuration in place. For more information, read the KB article.
OPSAPS-66500: Currently, it is not possible to enable Kerberos in Ozone after it has been deployed, despite all the required configuration changes being created when the box is checked in the Ozone configurations in Cloudera Manager.
Ozone must be deleted and redeployed with Kerberos enabled. Due to OPSAPS-66499, this requires manual data cleanup in between. For more information, read the KB article.
OPSAPS-66499: When you delete Ozone from a cluster using Cloudera Manager, Ozone data is not cleaned up. This may cause issues when Ozone is redeployed.
You must clean up the data manually. For more information, read the KB article.
OPSAPS-62327: In an Ozone cluster without any gateway roles, Ozone is unable to deploy client configurations and displays the ConfigGenException error.
You must add the Ozone gateway roles to the cluster.
CDPD-49027: SCM certificates are not renewed automatically
The certificates that ensure encrypted communication and authentication between Ozone internal services are not renewed automatically for Storage Container Managers. The default lifetime of these certificates is 5 years from the initial security bootstrap of the cluster.
Once these certificates expire, a manual re-bootstrap of the internal Ozone certificates is necessary.

Certificate revocation

To revoke a certificate, remove the full trust chain to stop trusting a compromised certificate. To do this, remove the SCM certificates or any other certificates from the system. During the startup of the system, new certificates are created and distributed. The old certificates are no longer trusted because the root CA certificate changes as well.

Procedure to force revoke internal certificates:

  1. Stop the Ozone service and all of its roles, including the SCMs.
  2. Locate the certs folders, including those of the SCMs. Note that the primordial SCM node has two certs folders, one for the root CA and the other for the intermediate CA that the node holds. The remaining SCMs have just one folder, for the intermediate CA role that the node serves. The command is: find / -name ozone-metadata 2>/dev/null | while read line; do find $line -name certs; done
  3. Move these certs directories to a backup location.
  4. Locate the key material. The command is: find / -name ozone-metadata 2>/dev/null | while read line; do find $line -name keys; done
  5. Move these keys directories to a backup location.
  6. Update the SCM VERSION file in the same way as the Ozone Manager VERSION file. To locate both the SCM and OM VERSION files on the hosts, run the following command: find / -name om -o -name scm 2>/dev/null | while read line; do find $line -name VERSION; done | sort | uniq
  7. Back up the VERSION files in case you need to restore them for any reason.
  8. In OM's VERSION file, remove the line starting with omCertSerialId. In SCM's VERSION file, remove the line starting with scmCertSerialId.
  9. Start the stopped Ozone roles. The certificates are regenerated during startup.
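Steps 2 through 8 of this procedure can be sketched as two shell helpers. This is an illustrative sketch, not a supported tool: the function names and the search root and backup directory parameters are hypothetical, the backup location is assumed to lie outside the search root, and the helpers should be run only after all Ozone roles are stopped.

```shell
# Steps 2-5: move every certs and keys directory found under an
# ozone-metadata directory into a backup location, preserving its path.
# $1 = search root (for example /), $2 = backup directory outside $1.
backup_cert_material() {
    search_root=$1
    backup_dir=$2
    list=$(mktemp)
    # Collect the directories first so nothing moves while find is running.
    find "$search_root" -type d -name ozone-metadata 2>/dev/null |
    while read -r meta; do
        find "$meta" -type d \( -name certs -o -name keys \) -prune
    done | sort -u > "$list"
    while read -r dir; do
        mkdir -p "$backup_dir/$(dirname "$dir")"
        mv "$dir" "$backup_dir/$dir"
    done < "$list"
    rm -f "$list"
}

# Steps 7-8: back up a VERSION file, then drop the certificate serial line.
# $1 = path to the VERSION file, $2 = omCertSerialId or scmCertSerialId.
strip_cert_serial() {
    version_file=$1
    key=$2
    cp -p "$version_file" "$version_file.bak"
    sed "/^$key/d" "$version_file.bak" > "$version_file"
}
```

For example, after locating the VERSION files with the command in step 6, call strip_cert_serial with scmCertSerialId for each SCM file and with omCertSerialId for each OM file. The .bak copy next to each file is the backup from step 7.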
CDPD-35632: The default block-level checksum does not work when running distcp from HDFS to Ozone or the other way around, because the two file systems may manage underlying blocks very differently.
Use a file level checksum instead. For example, append `-Ddfs.checksum.combine.mode=COMPOSITE_CRC` to the distcp command.
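A wrapper along these lines shows where the option goes on the command line. It is a minimal sketch: the function name distcp_composite_crc is hypothetical, and it assumes the hadoop CLI is on the PATH.

```shell
# Run distcp with a file-level (composite CRC) checksum.
# $1 = source URI, $2 = destination URI.
distcp_composite_crc() {
    # Generic -D options must come before the source and destination.
    hadoop distcp -Ddfs.checksum.combine.mode=COMPOSITE_CRC "$1" "$2"
}
```

A call might look like `distcp_composite_crc hdfs://ns1/data/src ofs://ozone1/vol1/bucket1/dst`, where both URIs are made-up examples.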
CDPD-43942: Requests to modify an Ozone S3 tenant may fail with the error "Timed out acquiring authorizer write lock. Another multi-tenancy request is in-progress." even if another request is not in progress.
Retry the request.
CDPD-36389: The configurations "datanodes.involved.max.percentage.per.iteration" and "size.moved.max.per.iteration" limit the maximum number of datanodes involved and the maximum size moved in an iteration. Because of this bug, the balancer stops an iteration when it is 2 datanodes or 1 container size (5 GB) away from hitting these limits. However, these datanodes can be considered for balancing again in the next iteration, so the cluster ends up balanced after enough iterations, albeit more slowly. The bug is most apparent in small clusters of around 4 datanodes, where a datanode could be either the source or the target for many moves but the iteration stops when 3 datanodes have been involved. Such a cluster takes more iterations to balance. This is a performance issue; it does not prevent the balancer from ultimately balancing the cluster. To find out whether this bug is being hit, search for "Hit max datanodes to involve limit" and "Hit max size to move limit" in the debug logs.
Speed up balancing by decreasing the interval between iterations using the configuration "balancing.iteration.interval". Note that the value of this configuration must be greater than "hdds.datanode.du.refresh.period". You can also increase "size.moved.max.per.iteration" to allow more data to move in one iteration.
CDPD-22519: The HDFS user is unable to run the Ozone SCM client CLI. As a workaround, run SCM client CLIs as the scm user.
CDPD-34187: This is a usability issue in which warnings that are of no use to the user are displayed on the console while running ozone fs or CLI commands, which degrades the user experience. These messages should be suppressed from the user console but still be printed in the SCM logs so that they can be used for debugging.
Instead of logging to the user console, redirect these log messages to a file, which avoids showing the warnings to the user. Ozone shell commands previously used a similar method of directing messages to a log file. An Apache Jira has been filed for this issue and the issue has been fixed there.
CDPD-35141: Error: Error while compiling statement: FAILED: Execution Error, return code 40000 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to move source <bucket1> to destination <bucket2> (state=08S01,code=40000) java.sql.SQLException: Error while compiling statement: FAILED: Execution Error, return code 40000 from org.apache.hadoop.hive.ql.exec.MoveTask. Unable to move source <bucket1> to destination <bucket2>. This error can occur if the source and target buckets are different in Hive queries. Currently, only copying within the same bucket is supported.
Use the same bucket in the source and target paths.
CDPD-40594: The ozone admin container create command does not work. The command fails at getCAList for the SCM client to create a container.
Avoid using the container create command.
CDPD-40966: df command on ozone returns incorrect result.
CDPD-41184: With LEGACY buckets, FileSystem operations do not interoperate with Ozone shell commands. Cause: The directory key entry is stored in the DB KeyTable as "dir1/", with a trailing slash. However, when performing the operation, the Ozone shell (o3://) normalizes the given path and removes the trailing slash "/" from it, which results in a KEY_NOT_FOUND exception.
There are three workarounds:
  • Use the FileSystem API, rather than the shell command API, to delete the directories.
  • Use FSO buckets instead of LEGACY buckets. With FSO, you can create intermediate directories and delete directories using the Ozone shell commands.
  • Disable and set the configuration to false in order to delete the directories. This is generally not a preferred workaround because the cluster must be restarted again to pick up the new changes.
CDPD-34867: Container Balancer might not balance if only Over-Utilized or only Under-Utilized datanodes are reported. The log line will look like this: "Container Balancer has identified x Over-Utilized and y Under-Utilized Datanodes that need to be balanced" where one of x or y will be 0.
Decrease the threshold using "utilization.threshold". This will allow balancer to find non zero number of both over and under utilized nodes.
CDPD-12966: Ozone du -s -h should report correct values with replication information.
CDPD-12542: Mount of Ozone filesystem with the help of FUSE fails.
CDPD-31910: In a non-Ranger deployment, the owner and group are shown based on the Kerberos user or the sudo user.
To display the correct owner and group, use a Ranger deployment.
CDPD-42691: During the upgrade, all pipelines are closed when the upgrade is finalized on SCM, which temporarily brings the cluster to a read-only state.
When you execute the finalize command, expect the cluster to temporarily go into a read-only state.
CDPD-42945: When many EC buckets are created with different EC chunk sizes, a pipeline is created for each chunk size. As a result, a large number of pipelines are created in the system.
OPSAPS-60721: Ozone SCM Primordial Node ID is a required field that must be set to one of the SCM hostnames during Ozone HA installation. In Cloudera Manager, this field is not mandatory during Ozone deployment, so users can continue with the installation without setting it, which causes Ozone services to fail at startup.
During Ozone HA installation, make sure that Ozone SCM Primordial Node ID is set to one of the SCM hostnames.
HDDS-4209: S3A Filesystem does not work with Ozone S3 in file system compat mode. When you create a directory, the S3A filesystem creates an empty file. When the parameter is enabled, the hdfs dfs -mkdir -p s3a://b12345/d11/d12 command runs successfully. However, running the hdfs dfs -put /tmp/file1 s3a://b12345/d11/d12/file1 command fails with an error: ERROR Key creation failed.
The HDDS-4209 Jira fixes the file system semantics and management in Ozone. On top of the flat name structure, which is Pure Object store, as a workaround the Hierarchical namespace structure is added. This ensures S3A compatibility with Ozone.
CDPD-42897: EC writes are failing with "No enough datanodes to choose" after EC replication config set globally.
EC writes start failing when a large number of pipelines are created as a result of multiple EC configurations with different chunk sizes being used to write keys.
If standard EC configurations (for example, rs-3-2-1024k) are used to write keys, the number of pipelines created per datanode is limited to 5 and this issue does not occur.
The recommendation is to avoid creating too many random chunk sizes. The chunk size is configurable so that users can choose it based on their workload, but do not use a separate chunk size for each file.
CDPD-41539: "No such file or directory" returned when EC file is read using older ofs client.
You must upgrade the client before trying to read the key "vol1/ecbuck1/1GB_ec".
CDPD-40560: File system operations through the Hadoop S3A connector on a FILE_SYSTEM_OPTIMIZED bucket are expected to fail with an error such as: Unable to get file status: volume: s3v bucket: fso key: test/
Do not run Hadoop S3A commands on a FILE_SYSTEM_OPTIMIZED bucket. Use the OBJECT_STORE bucket layout.
CDPD-42832: With this issue, any long-running setup or production server can experience data corruption caused by inconsistency issues. This may result in major issues with the existing LEGACY layout type.
The same OzoneLongRunningTest test suites ran with the FILE_SYSTEM_OPTIMIZED (FSO) bucket layout type for more than 65 hours without any issues. FSO provides atomicity and consistency guarantees for path (directory or file) rename and delete operations, regardless of how many subdirectories and files the path contains. These capabilities have kept the long-running test consistent, without any failures so far. The recommendation is to run big data HCFS workloads using the FSO bucket layout type.
CDPD-43432: Ozone Service in fault state in DataNode - Long Running setup.
Upgraded RocksDB to the latest version.
OPSAPS-63999: In the newly installed cluster, the Finish upgrade option is clickable.
OPSAPS-64648: Ozone nodes fail to start through Cloudera Manager if the default log path /var/log/hadoop-ozone does not exist. If this path does not exist, the restart of any Ozone node (for example, SCM or Datanode) fails.
Run the command sudo -u hdfs mkdir -p /var/log/hadoop-ozone, replacing hdfs with the user that runs the Ozone roles if it is different.
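The same workaround can be wrapped in a small idempotent helper, for example to run on every node before a restart. This is a sketch; the function name ensure_ozone_log_dir is hypothetical, and it must be run as a user that is allowed to create the directory.

```shell
# Create the Ozone log directory if it is missing.
# $1 = directory (defaults to the path named in this issue).
ensure_ozone_log_dir() {
    log_dir=${1:-/var/log/hadoop-ozone}
    if [ -d "$log_dir" ]; then
        echo "exists: $log_dir"
    else
        mkdir -p "$log_dir" && echo "created: $log_dir"
    fi
}
```

Calling the helper a second time only reports that the directory exists, so it is safe to repeat.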
CDPD-45932: Investigate impersonation with "is admin" check in Ozone WebUIs /logLevel servlet endpoint
In a secure Kerberized cluster, due to an impersonation issue, changing log levels through Knox on the corresponding endpoint of the web UI does not work. This applies only when the web UI is accessed through Knox. Other means of changing log levels in Ozone services are not affected by this problem.
There is no workaround for this problem.