Known issues in Ozone parcel 718.1.0

You must be aware of the known issues and limitations, the areas of impact, and workaround in Ozone parcel.

Tez Configuration Changes

The following configuration changes have to be made to pick up the latest Ozone FS jar from the Ozone parcel (when installed):

For tez.cluster.additional.classpath.prefix, the value is /var/lib/hadoop-hdfs/* (Tez Additional Classpath)
For tez.user.classpath.first, the value is true (Tez Client Advanced Configuration Snippet (Safety Valve) for tez-conf/tez-site.xml)
For tez.cluster.additional.classpath.prefix the value is /var/lib/hadoop-hdfs/* (Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml)
For tez.user.classpath.first, the value is true (Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml)

Restart the TEZ and HIVE_ON_TEZ service as prompted by Cloudera Manager.

Update Yarn to use updated Ozone FS Jar

CDPD-48500: Ozone parcel activation or install should handle redeployment of YARN jars and clean-up cache.
Perform the following steps:
  1. Log in to Cloudera Manager UI
  2. Navigate to Clusters
  3. Select the YARN service
  4. Click Actions
  5. Click Install YARN Service Dependencies
  6. Click YARN MapReduce Framework JARs
  7. Restart the CDP 7.1.8 cluster

Other issues

CDPD-56006: On providing an incorrect hostname/service ID in the Ozone URI, the filesystem client instead of failing, retries till exhaustion and the default retry is too high.
Configure ozone.client.failover.max.attempts to a lower the value to avoid long endless retries.
CDPD-49137: Sometimes OM's kerberos token is not updated and it stops being able to communicate with SCM. When this occurs, writes will start to fail.
Restarting OM or setting the safety valve hadoop.kerberos.keytab.login.autorenewal.enabled = true will fix the issue.
CDPD-49808: Spark jobs against Ozone intermittently fail with ERROR spark.SparkContext: [main]: Error initializing No filter named.
This is an intermittent failure which can be retried.
CDPD-49918: ECContainerReconstructionThread hits a precondition failure for checksum validations. This results in Reconstruction thread failing.
CDPD-50678: Deleting containers which have one or more replicas which are not empty on the Datanode can cause the container to be stuck in a deleting state indefinitely. Containers in this state can also block decommission or maintenance operations completing.
CDPD-50665: Due to HDDS-8171, if a datanode is IN_MAINTENANCE and stopped, and another node is decommissioned which is hosting an EC replica of the same index as the one on the IN_MAINTENANCE node, decommission can get blocked.
Start the datanode process on the IN_MAINTENANCE node so its replicas are available again.
CDPD-48932: Datanode needs off heap memory due to direct buffers and rocks DB usage. This can lead to the memory footprint to be larger than the configured physical memory on the node.
Remove other memory intensive services from the node, make sure the physical memory is sufficient for the planned load and scale.