Known issues in Ozone parcel 718.2.4

You must be aware of the known issues and limitations, the areas of impact, and the workarounds in the Ozone parcel.

SSL Handshake fails between Ozone DataNodes if the two DataNodes have their certificates signed by different Ozone Storage Container Managers.

Ozone DataNode certificates are signed by the leader Storage Container Manager (SCM). Due to an issue in creating the TrustStore for DataNode-to-DataNode connections, trust cannot be established between two DataNodes whose certificates were signed by different Storage Container Managers. Such connections fail with an SSL Handshake Exception. This affects Pipeline creation and container replication, including EC container reconstruction. The symptoms vary with the number of nodes whose certificates have a different signer: these DataNodes either do not participate in any Ratis-3 Pipeline, or form Pipelines only within groups that share the same signer. Over time, this can lead to an imbalance in DataNode usage. It can also cause the decommissioning of a DataNode to get stuck if its data has to be replicated to a node whose certificate has a different signer.

This problem affects all the 7.1.8 Ozone Parcel releases.

To identify whether the problem is present on a cluster, examine the output of the ozone admin cert list command. Use the -c option to request enough certificates to see all the certificates issued in the system.

If there are different Issuers for the latest DataNode certificates, this indicates the cluster is affected.
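The check described above can be sketched in a few lines of Python. This is a minimal illustration and not part of the Ozone tooling: it assumes that the DataNode rows of the ozone admin cert list output have already been extracted into (subject, issuer, valid_from) tuples, and the function names latest_issuers and is_affected are hypothetical.

```python
def latest_issuers(records):
    """Map each certificate subject to the issuer of its newest certificate.

    records: iterable of (subject, issuer, valid_from) tuples for DataNode
    certificates. This input shape is an assumption; valid_from only needs
    to be comparable (for example, ISO-8601 date strings).
    """
    latest = {}
    for subject, issuer, valid_from in records:
        current = latest.get(subject)
        # Keep only the most recently issued certificate per subject.
        if current is None or valid_from > current[1]:
            latest[subject] = (issuer, valid_from)
    return {subject: issuer for subject, (issuer, _) in latest.items()}


def is_affected(records):
    """Return True if the latest DataNode certificates have more than one issuer."""
    return len(set(latest_issuers(records).values())) > 1


# Hypothetical sample data: dn3 was signed by a different SCM.
sample = [
    ("dn1", "scm1", "2024-01-01"),
    ("dn2", "scm1", "2024-01-01"),
    ("dn3", "scm2", "2024-02-01"),
    ("dn1", "scm1", "2024-03-01"),
]
print(is_affected(sample))
```

Only the newest certificate of each DataNode is considered, so an older certificate from a different issuer does not mark the cluster as affected.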

You can avoid the problem by checking the ozone admin scm roles output to see whether the Primordial node is the current leader before you add a new DataNode and start it for the first time. If a different SCM node is the leader, running ozone admin scm transfer to make the Primordial node the leader puts the cluster into the desired state before you add the new DataNode.

Ensure that all the certificates in the cluster are signed by the same Storage Container Manager node.

Perform the procedure below if the cluster is affected and has DataNode certificates signed by different issuers:

  1. Identify the current leader Storage Container Manager by running the command ozone admin scm roles.
  2. If the leader did not sign the majority of DataNode certificates, transfer leadership to the Storage Container Manager that signed the majority of DataNode certificates. To do so, run the ozone admin scm transfer command with the proper Ozone SCM Service ID (set in Cloudera Manager>Ozone>Configuration) and the UUID of the desired leader node.
  3. Stop the minority of DataNodes whose certificates have a different signer than the majority.
  4. Locate the Datanode Metadata Directory (set in Cloudera Manager>Ozone>Configuration) on the host of each stopped DataNode and move the directory to a backup location.
  5. Start the previously stopped DataNodes.
  6. After the certificates are regenerated, check that the DataNodes join Ratis-3 Pipelines.
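The decision made in steps 1 to 3 can be sketched as follows. This is an illustrative helper only, not an Ozone API: plan_remediation is a hypothetical name, and it assumes that the current leader and the certificate issuers have already been mapped to the same kind of identifier (for example, the SCM hostname).

```python
from collections import Counter


def plan_remediation(current_leader, dn_issuers):
    """Decide which SCM should lead and which DataNodes need new certificates.

    current_leader: identifier of the current leader SCM (assumed to be
        comparable with the issuer identifiers).
    dn_issuers: dict mapping each DataNode to the issuer of its latest
        certificate (assumed input shape).
    """
    if not dn_issuers:
        raise ValueError("no DataNode certificates supplied")
    counts = Counter(dn_issuers.values())
    # The SCM that signed the most DataNode certificates; on a tie, the
    # issuer seen first wins (Counter preserves insertion order).
    majority_signer, _ = counts.most_common(1)[0]
    minority = sorted(dn for dn, issuer in dn_issuers.items()
                      if issuer != majority_signer)
    return {
        "target_leader": majority_signer,
        "transfer_needed": current_leader != majority_signer,
        "datanodes_to_reissue": minority,
    }


# Hypothetical example: scm2 leads, but scm1 signed most certificates.
plan = plan_remediation("scm2", {"dn1": "scm1", "dn2": "scm1", "dn3": "scm2"})
print(plan)
```

The sketch only computes the plan. Transferring leadership, stopping the DataNodes, and moving the metadata directories remain manual steps as described in the procedure.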

To avoid service disruptions, you can stop the DataNodes one by one instead of all together. This can still cause a data outage for the duration of a node's downtime if there are files with Ratis-1 replication on the cluster and the single replica of these files resides on the node being restarted. Changing the leader node on an SCM HA-enabled cluster should not disrupt operations.

CDPD-65369: Service Host connectivity error: Target host is not specified
Run again with config hive.server2.transport.mode.
CDPD-56073: Impala-shell client throws DisconnectError when trying to connect.
None.
CDPD-65477: Ozone freon ommg does not work.
None.
CDPD-65491: Recon displaying DELETED container as missing.
None.
CDPD-64844: Hadoop put operation on File System Optimised is broken.
None.
CDPD-56486: [Spark] Ozone delete key failed error during Spark job completion
None.

The list of known issues for the Ozone parcel 718.2.4 also includes the known issues from Ozone parcel 7.1.8.2.2 and 7.1.8.2.3. For more information, see the Known issues in Ozone parcel 718.2.2 documentation.