Known Issues in Ozone

Learn about the known issues in Ozone, their impact or changes to functionality, and the available workarounds.

CDPD-73292: DataNode is supposed to log messages for slow requests to dn-audit.log. However, due to a counting error, it logs almost every request.
Set the DataNode configuration hdds.datanode.slow.op.warning.threshold = 500ms (500000000 nanoseconds). This configuration change logs only requests that take more than 500 ms to complete.
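A minimal sketch of the corresponding ozone-site.xml property, for example through the DataNode safety valve (the safety valve location is an assumption; the property name is from this issue):
  <property>
    <name>hdds.datanode.slow.op.warning.threshold</name>
    <value>500ms</value>
  </property>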
OPSAPS-71342: Configuring the hdds.x509.max.duration parameter to 0 or any negative value leads to shutdown of SCM, DataNodes, and OM. This misconfiguration disrupts operations across the entire cluster.
To avoid disruption, ensure that hdds.x509.max.duration is set to a positive value.
ENGESC-26990: Standalone Ozone deployment in a cluster is not supported.
A cluster with Ozone but without HDFS is not supported. Setting the fs.defaultFS parameter to Ozone is not certified yet.
None.
The Prometheus binary is not available in CDP Private Cloud Base 7.1.9 SP1 for the Ubuntu operating system.
You can install Prometheus separately and specify the path to the parent directory, for example /usr/bin, in the prometheus.location parameter in Ozone.
In CDP Private Cloud Base 7.1.9 SP1, a block location is missing from the output of the ozone debug chunkinfo command for EC buckets.
None.
In CDP Private Cloud Base 7.1.9 SP1 and CDP Private Cloud Base 7.1.7 SP3, key put operations fail intermittently after the OM leader is stopped.
None.
In CDP Private Cloud 7.1.9 SP1, Recon certificates are not listed in the output of the ozone admin cert list command. This issue is present when upgrading from a lower version of CDP to CDP Private Cloud 7.1.9 SP1.
  1. Ensure that the Recon user is included in the hdds.security.client.datanode.container.protocol.acl configuration in the hadoop-policy.xml file under /var/run/cloudera-scm-agent/process/*-ozone-STORAGE_CONTAINER_MANAGER/ozone-conf. If the Recon user is not listed, SCM rejects Recon's requests for certificates, so complete this step before proceeding.
  2. Remove the keys folder in the Recon metadata directory and restart the cluster. New keys and certificates are then generated from SCM.
On upgrading the cluster from CDP Private Cloud Base 7.1.7 SP3 to CDP Private Cloud Base 7.1.9 SP1, you cannot create snapshots on a pre-upgrade volume or bucket, and a PERMISSION_DENIED error is displayed. This is because CDP Private Cloud Base 7.1.7 SP3 uses the complete Kerberos owner name with the domain, whereas CDP Private Cloud Base 7.1.9 SP1 uses the Kerberos owner short name.
You must update the Kerberos owner name for the bucket with the user short name and then create the snapshot by running the following commands (a concrete example follows the list):
  • ozone sh bucket update --user={user_short_name} {volume}/{bucket}
  • ozone sh snapshot create {volume}/{bucket} {snapshot_name}
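For example, with a hypothetical user jdoe, volume vol1, bucket bucket1, and snapshot name snap1:
  • ozone sh bucket update --user=jdoe vol1/bucket1
  • ozone sh snapshot create vol1/bucket1 snap1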
CDPD-69017: An Ozone Manager fails to retrieve the certificates of other OMs. OM retrieval of certificates is necessary for delegation token verification. This problem occurs after the OM leader changes, for as long as jobs hold a ticket issued by the previous OM leader.
Copy the certificates of other OMs to the certificate directories of all OMs.
The JDK-8292158 and HADOOP-18699 bugs affect the following OpenJDK versions:
  • OpenJDK 11 (versions lower than 11.0.18)
  • OpenJDK 11 (versions lower than 11.0.18-oracle)
  • OpenJDK 15 (versions lower than 15.0.10)
  • OpenJDK 17 (versions lower than 17.0.6)
  • OpenJDK 17 (versions lower than 17.0.6-oracle)
  • OpenJDK 19 (versions lower than 19.0.2)

As a result, Hadoop clients can experience network connection failures under the following conditions:

  • The host is capable of supporting AVX-512 instruction sets.
  • AVX-512 is enabled in the Java Virtual Machine (JVM). This is enabled by default on AVX-512 capable hosts and is equivalent to specifying the JVM argument -XX:UseAVX=3.
  • The Hadoop native library (for example, libhadoop.so) is not available, so the HDFS client falls back to the HotSpot JVM's aesctr_encrypt implementation for AES/CTR/NoPadding.
  • The client uses an affected JDK.
You must either append -XX:UseAVX=2 to the client JVM arguments (see the sketch after the version list) or upgrade to one of the following OpenJDK versions, which contain the fix:
  • OpenJDK 11 release: version 11.0.18 and higher
  • OpenJDK 11 release: version 11.0.18-oracle and higher
  • OpenJDK 15 release: version 15.0.10 and higher
  • OpenJDK 17 release: version 17.0.6 and higher
  • OpenJDK 17 release: version 17.0.6-oracle and higher
  • OpenJDK 19 release: version 19.0.2 and higher
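A minimal sketch of appending the flag for Hadoop CLI clients through HADOOP_CLIENT_OPTS (an assumption; the exact mechanism for passing JVM arguments depends on how the client is launched):
  export HADOOP_CLIENT_OPTS="-XX:UseAVX=2 $HADOOP_CLIENT_OPTS"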
After upgrading the cluster from CDP Private Cloud Base 7.1.8 to CDP Private Cloud Base 7.1.9, if Ozone is in a non-HA environment, an exception message is observed during the finalization of the Ozone upgrade.
During finalization of the upgrade, a ClassNotFoundException message for the org.cloudera.log4j.redactor.RedactorAppender class may be displayed. The error message can be ignored because the upgrade is successful. The error existed previously and does not affect the Ozone service or its operation.
None.
CDPD-68951: In CDP Private Cloud Base 7.1.9 CHF2 version and lower, the command ozone sh key list <bucket_path> displays the isFile flag in a key's metadata as false even when the key is a file. This issue is rectified in CDP Private Cloud Base 7.1.9 CHF3. However, the pre-existing (pre-upgrade) key's metadata cannot be changed.
None.
When using S3A committer fs.s3a.committer.name=directory with fs.s3a.committer.staging.conflict-mode=replace to write to FSO buckets, the client fails with the following error.
DIRECTORY_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Failed to find parent directory of xxxxxxxx
  at org.apache.hadoop.ozone.om.request.file.OMFileRequest.getParentID(OMFileRequest.java:1008)
  at org.apache.hadoop.ozone.om.request.file.OMFileRequest.getParentID(OMFileRequest.java:958)
  at org.apache.hadoop.ozone.om.request.file.OMFileRequest.getParentId(OMFileRequest.java:1038)
  at org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequestWithFSO.getDBOzoneKey(S3MultipartUploadCompleteRequestWithFSO.java:114)
  at org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequest.validateAndUpdateCache(S3MultipartUploadCompleteRequest.java:157)
  at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:378)
  at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:568)
  at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:363)
  at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.base/java.lang.Thread.run(Thread.java:834)
This occurs because S3A uses multipart upload (MPU) to commit job results in a batch. The staging committer's replace mode deletes the target directory before completing the MPU. The problem is that the File System Optimized (FSO) bucket layout does not create intermediate directories during MPU; it does so only for regular file, directory, and key requests.
Use fs.s3a.committer.name=magic for affected versions.
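As a sketch, the committer can be set cluster-wide in core-site.xml (per-job configuration, for example through -D options, also works):
  <property>
    <name>fs.s3a.committer.name</name>
    <value>magic</value>
  </property>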
HDDS-9512: Ozone DataNode's new client port conflicts with HDFS DataNode's web port if both Ozone and HDFS DataNode roles are placed on the same host.
You must set hdds.datanode.client.port to any unused port (for example, 19864) through the Ozone DataNode safety valve.
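A minimal sketch of the corresponding ozone-site.xml property for the DataNode safety valve (the port number is only an example):
  <property>
    <name>hdds.datanode.client.port</name>
    <value>19864</value>
  </property>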
OPSAPS-68159: If you did not deactivate and undistribute the Ozone parcel 718.1.0 on Cloudera Manager 7.7.1 with CDH 7.1.8 before upgrading to Cloudera Manager 7.11.3 with CDH 7.1.9, the "Error when distributing parcel to host, Ignoring non-compliant parcel manifest" error is displayed after Cloudera Manager is upgraded to 7.11.3.
If you encounter the error, perform the following steps:
  1. Deactivate and undistribute the Ozone parcel 718.1.0 on Cloudera Manager 7.11.3.
  2. Restart the cluster with a delay of 10 minutes.
  3. Continue to upgrade CDH 7.1.8 to CDH 7.1.9.
OPSAPS-68159: If you did not deactivate the Ozone parcel 718.2.x on Cloudera Manager 7.7.1 with CDH 7.1.8 before upgrading to Cloudera Manager 7.11.3 with CDH 7.1.9, the Ozone roles fail to start during the CDH 7.1.8 to CDH 7.1.9 upgrade.
If you encounter the error, perform the following steps:
  1. Deactivate the Ozone parcel 718.2.x.
  2. Restart the Ozone service.
  3. Perform Finalize Upgrade for the Ozone service.

Step result: The Ozone roles will be displayed in green.

CDPD-60989: The packaging version for Apache Ozone points to the older 1.2.0 version. This is a version string problem, not a packaging issue. The version of the Apache Ozone binary is closest to 1.4.0.
None. This only affects the JAR versioning.
CDPD-60489: Jackson-dataformat-yaml 2.12.7 and Snakeyaml 2.0 are not compatible.
You must not use Jackson-dataformat-yaml through the platform for YAML parsing.
OPSAPS-63510: When Ozone Container Balancer is started using Activate Container Balancer from Cloudera Manager, it runs on the Storage Container Manager (SCM) host which is the Ratis leader. However, the link to the Full Log File under Role Log in the Cloudera Manager command window for the Activate Container Balancer command may not link to the leader SCM's logs.
  1. Find the leader SCM. Using Cloudera Manager and the SCM Web UI: go to Clusters > Ozone > Web UI and open any Storage Container Manager web UI. In the UI, search for SCM Roles (HA) in the Status section; the leader SCM's hostname is listed there. Alternatively, using a terminal: log in to any Ozone host, run ozone admin scm roles, and note the leader.
  2. After finding the leader SCM, search that host's logs for ContainerBalancer related entries.
OPSAPS-67373: Toggling the Enable Ozone S3 Multi-Tenancy feature configuration in the Cloudera Manager Ozone service configuration page affects more service roles than actually needed.
Enabling multi-tenancy only requires restarting the Ozone Managers.
OPSAPS-67757: Hive external tables in Ozone storage cannot be replicated using Hive external table replication policies.
To replicate the Hive external tables' data, consider using DistCp. To replicate the metadata of Hive external tables, consider using HMS Mirror.
Removing a bucket recursively using the AWS S3 rb --force command does not work for FSO buckets.
Use the Ozone shell command ozone sh bucket delete -r [***BUCKET ADDRESS***].
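For example, for a hypothetical bucket bucket1 in volume vol1:
  • ozone sh bucket delete -r vol1/bucket1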
CDPD-59126: An Info log displays "noexec permission on /tmp/liborg_apache_ratis_thirdparty_netty_transport_native_epoll_x86" on the client while executing commands when /tmp is mounted with noexec.
To suppress the Info log related to the liborg_apache_ratis_thirdparty_netty_transport_native_epoll_x86 library, export the OZONE_OPTS environment variable on the client terminal by running the following command: export OZONE_OPTS="-Dorg.apache.ratis.thirdparty.io.netty.native.workdir=/var/tmp $OZONE_OPTS"
OPSAPS-67650: Ozone uses RocksDB as a library to persist metadata locally.
By default, RocksDB places certain executables in /tmp, and thus encounters errors when /tmp is mounted with noexec.
The workaround is to configure RocksDB to place its executables in another location. On a Cloudera Manager managed node, the steps are:
  1. Go to Cloudera Manager UI > OZONE > Configuration.
  2. Find Ozone Service Environment Advanced Configuration Snippet (Safety Valve) and set the following environment variable: ROCKSDB_SHAREDLIB_DIR=/var/tmp
  3. Restart Ozone.
CDPD-49137: Ozone Manager Kerberos token expires for SCM communication and OM does not log in again.
Sometimes, OM's Kerberos token is not updated and OM stops communicating with SCM. When this occurs, writes start failing.
Restart OM, or set the safety valve hadoop.kerberos.keytab.login.autorenewal.enabled = true.
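A minimal sketch of the property, assuming it is added through the Ozone service's core-site.xml safety valve (the exact safety valve location is an assumption):
  <property>
    <name>hadoop.kerberos.keytab.login.autorenewal.enabled</name>
    <value>true</value>
  </property>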
CDPD-56684: Keys get deleted when you do not have permission on the volume.
When a volume is deleted, it recursively deletes the buckets and keys inside it and only then deletes the volume itself. The volume delete ACL check is done only at the end, so you may end up deleting all the data inside the volume without having delete permission on the volume.
CDPD-50610: Large file uploads are slow with the OPEN and stream data approach.
The Hue file browser uses the append operation for large files. This API is not supported by Ozone in 7.1.9, so large file uploads from the browser can be slow or time out.
Use the native Ozone client to upload large files instead of the Hue file browser.
OPSAPS-66469: ozone-site.xml is missing if the host does not contain HDFS roles.
The client-side ozone-site.xml (/etc/hadoop/conf/ozone-site.xml) is not generated by Cloudera Manager if the host does not have any HDFS role. Because of this, issuing Ozone commands from that host fails because the service name to host name mapping cannot be found. The error message is similar to the following:
# ozone sh volume list o3://ozoneabc
23/03/06 18:46:15 WARN ha.OMProxyInfo: OzoneManager address ozoneabc:9862 for serviceID null remains unresolved for node ID null Check your ozone-site.xml file to ensure ozone manager addresses are configured properly.
Add the HDFS gateway role on that host.
OPSAPS-67607: Cloudera Manager FirstRun failure at the “Upload YARN MapReduce Framework JARs” step.
If this failure is attributed to a broken symbolic link, /var/lib/hadoop-hdfs/ozone-filesystem-hadoop3.jar, it is likely due to the presence of the hdfs user on the node prior to CDP parcel activation. As a result, the Cloudera Manager agent skips the initialization related to HDFS, and the /var/lib/hadoop-hdfs directory is not created.
Create the /var/lib/hadoop-hdfs directory on all nodes, then deactivate and reactivate the CDP parcel (deactivate and activate the Ozone parcel instead, in case the Ozone parcel is used).
OPSAPS-66501: Currently, it is not possible to configure High Availability for SCM roles in Ozone after deployment. You should be able to change the HA configuration through Cloudera Manager, bringing it in line with other services.
At present, this requires deleting Ozone, adding it back with the SCM HA configuration in place, and manually cleaning up the Ozone data in between. For more information, read the KB article.
OPSAPS-66500: Currently, it is not possible to enable Kerberos in Ozone after it has been deployed, despite all the required configuration changes being created when the box is checked in the Ozone configurations in Cloudera Manager.
Ozone must be deleted and redeployed with Kerberos enabled. Due to OPSAPS-66499, this requires manual data cleanup in between. For more information, read the KB article.
OPSAPS-66499: When you delete Ozone from a cluster using Cloudera Manager, Ozone data is not cleaned up. This may cause issues when Ozone is redeployed.
You must clean up the data manually. For more information, read the KB article.
CDPD-49027: SCM certificates are not renewed automatically.
The certificates that ensure encrypted communication and authentication between Ozone internal services are not renewed automatically for Storage Container Managers.
Certificate revocation

Once these certificates expire, a manual re-bootstrap of the internal Ozone certificates is necessary.

To revoke a certificate, remove the full trust chain to stop trusting a compromised certificate. To do this, remove the SCM certificates or any other certificates from the system. During startup of the system, new certificates are created and distributed. The old certificates are no longer trusted because the root CA certificate changes as well.

Procedure to force revoke internal certificates:

  1. Stop the Ozone service and all of its roles, including SCMs.
  2. Locate SCM's certs folders. Note that the primordial SCM node has two certs folders, one for the root CA and the other for the intermediate CA that the node holds. The rest of the SCMs have just one folder, for the intermediate CA role that the node serves. The following command locates them: find / -name ozone-metadata 2>/dev/null | while read line; do find $line -name certs; done
  3. Move these certs directories to a backup location.
  4. Locate the key material with the following command: find / -name ozone-metadata 2>/dev/null | while read line; do find $line -name keys; done
  5. Move these keys directories to a backup location.
  6. The VERSION file of SCM has to be updated, similarly to Ozone Manager's VERSION file. To locate both the SCM and OM VERSION files on the hosts, execute the following command: find / -name om -o -name scm 2>/dev/null | while read line; do find $line -name VERSION; done | sort | uniq
  7. Back up the VERSION files (in case you need to restore them for any reason).
  8. In OM's VERSION file, remove the line starting with omCertSerialId; in SCM's VERSION file, remove the line starting with scmCertSerialId.
  9. Start the stopped Ozone roles; certificates are regenerated during startup.
CDPD-35632: The default block level checksum does not work when running distcp from HDFS to Ozone or the other way around, because the two file systems manage underlying blocks very differently.
Use a file-level checksum instead. For example, append `-Ddfs.checksum.combine.mode=COMPOSITE_CRC` to the distcp command.
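For example, a hypothetical HDFS-to-Ozone copy (the service IDs and paths are assumptions):
  hadoop distcp -Ddfs.checksum.combine.mode=COMPOSITE_CRC hdfs://ns1/data/src ofs://ozone1/vol1/bucket1/dst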
CDPD-43942: Requests to modify an Ozone S3 tenant may fail with the error "Timed out acquiring authorizer write lock. Another multi-tenancy request is in-progress." even if another request is not in progress.
Retry the request.
CDPD-22519: HDFS user is unable to run Ozone SCM client CLI.
SCM client CLIs must be run as the SCM user.
CDPD-34187: This is a usability issue where warnings are displayed on the console while running Ozone fs and CLI commands, which degrades the user experience.
Instead of logging to the user console, these log messages are directed to a log file through the ozone-shell-log4j.properties configuration, which avoids showing the warnings to the user. Ozone shell commands already use a similar method of directing messages to a log file.
CDPD-40594: The ozone admin container create command does not work. The command fails at getCAList for the SCM client while creating a container.
Avoid using the create container command.
CDPD-40966: The df command on Ozone returns incorrect results.
None.
CDPD-34867: Container Balancer might not balance if only over-utilized or only under-utilized DataNodes are reported. The log displays the "Container Balancer has identified x Over-Utilized and y Under-Utilized DataNodes that need to be balanced" message, where one of x or y is 0.
Decrease the threshold using utilization.threshold. This allows the balancer to find a non-zero number of both over- and under-utilized nodes.
CDPD-12966: The ozone du -s -h command does not report correct values that account for replication.
None.
CDPD-31910: In a non-Ranger deployment, the owner/group is shown based on the Kerberos user or sudo user.
For correct owner/group information, you need a Ranger deployment.
CDPD-42691: During the upgrade, all pipelines are closed when the upgrade is finalized on SCM, temporarily bringing the cluster to a read-only state.
When you execute the finalize command, the cluster temporarily goes into a read-only state.
CDPD-42945: When many EC buckets are created with different EC chunk sizes, a pipeline is created for each chunk size. As a result, a large number of pipelines are created in the system.
None.
OPSAPS-60721: Ozone SCM Primordial Node ID is a required field that must be set to one of the SCM hostnames during Ozone HA installation. In Cloudera Manager this field is not mandatory during Ozone deployment, so end users can continue with the installation, which causes startup to fail in Ozone services.
During Ozone HA installation, make sure that Ozone SCM Primordial Node ID is set to one of the SCM hostnames.
HDDS-4209: S3A filesystem does not work with Ozone S3 in file system compatibility mode. When you create a directory, the S3A filesystem creates an empty file. When the ozone.om.enable.filesystem.paths parameter is enabled, the hdfs dfs -mkdir -p s3a://b12345/d11/d12 command runs successfully. However, running the hdfs dfs -put /tmp/file1 s3a://b12345/d11/d12/file1 command fails with an error: ERROR org.apache.hadoop.ozone.om.request.key.OMKeyCreateRequest: Key creation failed.
The HDDS-4209 Jira fixes file system semantics and management in Ozone. As a workaround, the hierarchical namespace structure is added on top of the flat name structure, which is a pure object store. This ensures S3A compatibility with Ozone.
CDPD-41539: The "No such file or directory" imessage is s displayed when EC file is read using older ofs client.
You must upgrade the client before trying to read the key: vol1/ecbuck1/1GB_ec".
CDPD-42832: With this issue, any long-running setup or production server can encounter data corruption due to inconsistency issues. This may result in major issues with the existing Legacy layout type.
FSO provides atomicity and consistency guarantees for path (directory or file) rename and delete operations, irrespective of the large number of sub-directories or files contained in them. These capabilities help long-running tests stay consistent without any failures so far. The recommendation is to run big data HCFS workloads using the FSO bucket layout type.
OPSAPS-63999: In a newly installed cluster, the Finish upgrade option is clickable.
None.
CDPD-45932: Investigate impersonation with "is admin" in Ozone WebUIs /logLevel servlet endpoint
In a secure Kerberized cluster, due to an impersonation issue, changing log levels using Knox on the corresponding WebUI endpoint does not work. Note that this is only true when the WebUI is accessed using Knox. Other means of changing log levels in Ozone services are not affected by this problem.
None.
CDPD-74201: ozone sh key list prints the --all and --length options twice. This is a display issue in the option listing.
None.
CDPD-74016: Running ozone sh token print on a token generated using the ozone dtutil command results in a NullPointerException.
Print tokens generated by dtutil using dtutil.
CDPD-74013: ozone dtutil get token fails when using the o3 or ofs schemes.
Use ozone sh token get to obtain a token for the Ozone file system.
CDPD-63144: Key rename inside an FSO bucket fails and displays the Failed to get parent dir error. This happens when running Impala workloads with Ozone.
None.
CDPD-74045: The destination cluster's Ozone services stop running when 30 or more replication policies are run concurrently with a large number of files.
None.
CDPD-74331: Key put fails and displays the Failed to write chunk error when there is a volume failure during configuration.
None.
CDPD-74475: YCSB tests with HBase on Ozone show degraded performance.
None.
CDPD-74483: Ozone replication from source to destination fails when ozone.client.bytes.per.checksum is set to 16KB.
None.
CDPD-74884: The exclusive size of a snapshot is always 0 when you run the info command on Ozone snapshots. This is a statistics issue and does not impact the functionality of Ozone snapshots.
None.
CDPD-75042: The AWS CLI rm or delete command fails to delete all files and directories in an Ozone FSO bucket; it deletes only the leaf node.
None.
CDPD-75204: NameNode restart fails after dfs.namenode.fs-limits.max-component-length is set to a lower value while existing data exceeds the length limit.
Increase the value of the dfs.namenode.fs-limits.max-component-length parameter and restart the NameNode.
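A minimal sketch of the property in hdfs-site.xml (the value is only an example; in HDFS, a value of 0 disables the length check):
  <property>
    <name>dfs.namenode.fs-limits.max-component-length</name>
    <value>1024</value>
  </property>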
CDPD-75635: Ozone writes fail intermittently while SCM remains in safemode.
You must wait for SCM to come out of safemode, or exit safemode through the CLI options.
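For reference, safemode can be checked and, if necessary, force-exited with commands like the following (verify the subcommands against your release's ozone admin help):
  ozone admin safemode status
  ozone admin safemode exit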
CDPD-74257: For the 7.3.1 release, the Heatmap feature is disabled and removed from the Recon UI and the Solr health check is removed from the Recon Overview page.
If called, the heatmap APIs block threads when the javax.security.auth.useSubjectCredsOnly property is set to true.

To check whether javax.security.auth.useSubjectCredsOnly is set to true, run the sudo -u hdfs /usr/java/jdk1.8.0_232-cloudera/bin/jcmd <PID> VM.system_properties | grep -i subject command, where <PID> must be replaced with the Recon process ID.

To identify the PID, run ps -ef and grep for Recon to find the Java process running Recon.

CDPD-75958/CDPD-75954: The ozone debug ldb command and ozone auditparser fail with java.lang.UnsatisfiedLinkError.
For the workaround, see the Changing /tmp directory for CLI tools documentation.
OPSAPS-72144: During the Finalize Upgrade process, the SCM user and ozone.keytab are used by default. This causes access issues when a custom Kerberos principal (for example, scmFoo) is used, resulting in an Access denied error because the SCM superuser privilege is required.
The Finalize Upgrade command fails with an access denied error for scm/host@DOMAIN. Logs indicate the use of kinit with the default SCM user rather than the custom Kerberos principal.
You must add the Custom Kerberos Principal to Ozone Administrators.
  1. Log in to Cloudera Manager UI.
  2. Navigate to Clusters.
  3. Select the Ozone service.
  4. Go to Configurations.
  5. Search for ozone.administrators.
  6. Add the short form of the custom Kerberos principal, for example, scmFoo without the domain suffix.
  7. Click Save Changes.
  8. Restart the Ozone service.
  9. Run the Finalize Upgrade command.
OPSAPS-71878: Ozone fails to restart during cluster restart and displays the error message: Service has only 0 Storage Container Manager roles running instead of minimum required 1.
  1. You must open Cloudera Manager in a second browser and restart the Ozone service separately.
  2. After the Ozone service restarts, resume the cluster restart from the first browser.