- OPSAPS-67152: Cloudera Manager does not allow you to update some
configuration parameters.
- Cloudera Manager does not allow you to set the dfs_access_time_precision and dfs_namenode_accesstime_precision configuration parameters to "0". If you try to enter "0" in these configuration input fields, the field is cleared and a validation error is displayed: This field is required.
- To fix this issue, perform the workaround steps described in the KB article. If you need guidance during this process, contact Cloudera Support.
- Cloudera bug: OPSAPS-59764: Memory leak in the
Cloudera Manager agent while downloading the parcels.
- Using the M2Crypto library in the Cloudera Manager agent to download parcels causes a memory leak.
The Cloudera Manager server requires parcels to install a cluster. If any of the parcel URLs are modified, the server provides the updated information to all the Cloudera Manager agent processes installed on each cluster host.
Each Cloudera Manager agent then regularly checks for updates by downloading the manifest file available under each of the URLs. However, if a URL is invalid or unreachable, the Cloudera Manager agent reports a 404 error message and the memory of the Cloudera Manager agent process keeps increasing because of a memory leak in the agent's file downloader code.
- To prevent this memory leak, ensure that all parcel URLs configured in Cloudera Manager are reachable. To achieve this, delete all unused and unreachable parcels from the Cloudera Manager Parcels page.
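To check reachability before removing anything, you can probe the manifest file that the agent downloads under each parcel repository URL. The following is only a sketch: the repository URLs are placeholders for whatever is configured on your Parcels page, and a 404 or timeout indicates a URL worth removing.
# Probe manifest.json under each configured parcel repository URL (placeholder URLs shown);
# HTTP 200 means the agent can reach the repository, 404 or 000 means it cannot.
for url in https://archive.cloudera.com/example/parcels \
           http://internal-repo.example.com/custom/parcels; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url/manifest.json")
  echo "$url -> HTTP $code"
done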
- OPSAPS-65365: Do not use the $ character in the password for the custom Docker Repository for ECS installations.
- Ensure that the
$ character is not part of the Docker
Repository password.
- OPSX-2713: PVC ECS Installation:
Failed to perform First Run of services.
- If an issue is encountered during the Install Control Plane step of the ECS Cluster First Run, the installation is re-attempted indefinitely rather than the command failing.
- Since the control plane is installed
and uninstalled in a continuous cycle, it is often possible to
address the cause of the failure while the command is still running,
at which point the next attempted installation should succeed. If
this is not successful, abort the First Run command, delete the
Containerized Cluster, address the cause of the failure, and retry
from the beginning of the Add Cluster wizard. Any nodes that are
re-used must be cleaned before re-attempting installation.
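While the install loop is still cycling, standard kubectl checks run from the ECS Server node can help identify the failing component. The following is only a sketch of common diagnostics, not a documented procedure.
# List pods across all namespaces that are not Running or Completed
kubectl get pods -A | grep -Ev 'Running|Completed'
# Show the most recent cluster events to surface scheduling, image pull, or storage failures
kubectl get events -A --sort-by=.lastTimestamp | tail -n 20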
- OPSX-3359: ECS Upgrade
failure
- You may see the following error message during the Upgrade Cluster > Reapplying all settings > kubectl-patch step:
kubectl rollout status deployment/rke2-ingress-nginx-controller -n kube-system --timeout=5m
error: timed out waiting for the condition
- If you see this error, do the
following:
- Check whether all the Kubernetes nodes are ready for
scheduling. Run the following command from the ECS Server
node:
kubectl get nodes
You will see output similar to the following:
NAME STATUS ROLES AGE VERSION
<node1> Ready,SchedulingDisabled control-plane,etcd,master 103m v1.21.11+rke2r1
<node2> Ready <none> 101m v1.21.11+rke2r1
<node3> Ready <none> 101m v1.21.11+rke2r1
<node4> Ready <none> 101m v1.21.11+rke2r1
- Run the following command from the ECS Server node for the node showing a status of SchedulingDisabled:
kubectl uncordon <node1>
You will see output similar to the following:
node/<node1> uncordoned
- Scale down and scale up the
rke2-ingress-nginx-controller pod by
running the following command on the ECS Server
node:
kubectl delete pod rke2-ingress-nginx-controller-<pod number> -n kube-system
- Resume the upgrade.
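Optionally, before resuming you can confirm the ingress controller recovered by re-running the rollout command from the original error message; this verification is not part of the documented steps.
# The same command that originally timed out; it should now complete successfully
kubectl rollout status deployment/rke2-ingress-nginx-controller -n kube-system --timeout=5m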
- OPSX-3547: ECS upgrade takes 10+ hours to complete on a 25-node cluster
- In the worst case, the rolling restart performed during the upgrade takes around 24 minutes per node on a 25-node cluster.
- During the upgrade, if the stop operation on a single node takes longer than 25 minutes, or starting a node takes longer than 10 minutes, you can speed up the upgrade by configuring Cloudera Manager to reduce the default timeouts: decrease the values of the timeout parameters listed below. (To change a parameter in the Cloudera Manager Admin Console, go to the ECS service, click the Configuration tab, and search for the parameter. A sketch of making the same change through the Cloudera Manager API follows the list below.)
The stop operation on a single node has the following steps:
- Graceful drain of the node. This process has a default timeout of 10 minutes, controlled by the Cloudera Manager configuration parameter DRAIN_NODE_TIMEOUT.
- Non-graceful drain of the node. This process has a default timeout of 10 minutes, controlled by the Cloudera Manager configuration parameter DRAIN_NODE_TIMEOUT.
- Wait for the workloads to spawn on other nodes in the cluster. This process has a default timeout of 10 minutes, controlled by the Cloudera Manager configuration parameter WAIT_TIME_FOR_NODE_READINESS.
The start operation has the following steps:
- Uncordon the node. There is no timeout parameter for this step.
- Wait for the workloads to spawn on the node. This process has a default timeout of 10 minutes, controlled by the Cloudera Manager configuration parameter WAIT_TIME_FOR_NODE_READINESS.
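If you prefer to script the change, the Cloudera Manager REST API can update service-level configuration. The following is only a sketch: the host, credentials, API version, cluster and service names, the API-side property names, and the assumption that the values are in seconds are all placeholders to confirm against your own deployment; the Admin Console path described above is the documented method.
# Assumptions: CM host/port, admin credentials, API version v51, cluster and ECS service names,
# lowercase API property names, and timeout values expressed in seconds (300 = 5 minutes).
curl -u admin:changeme -X PUT -H 'Content-Type: application/json' \
  -d '{"items":[{"name":"drain_node_timeout","value":"300"},{"name":"wait_time_for_node_readiness","value":"300"}]}' \
  'http://cm-host.example.com:7180/api/v51/clusters/<cluster name>/services/<ECS service name>/config'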
- OPSX-3550: Incorrect status on the CDP Private Cloud Data Services services page in the Cloudera Manager Admin Console while ECS is upgrading
- The Cluster page might show that the Upgrade
failed while an upgrade is in progress.
- Please check the Upgrade Command for
the status of the upgrade. The Cluster page will reflect the new
version once the upgrade command is complete.
- OPSX-735: Kerberos service should
handle Cloudera Manager downtime
- The Cloudera Manager Server in the base
cluster must be running in order to generate Kerberos principals for
Private Cloud. If there is downtime, you may observe
Kerberos-related errors.
- Resolve downtime on Cloudera Manager.
If you encountered Kerberos errors, you can retry the operation
(such as retrying creation of the Virtual Warehouse).
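To confirm that the base cluster's Cloudera Manager Server is back up before retrying, a quick check on the Cloudera Manager Server host can help; this assumes a standard systemd-managed installation and the default non-TLS Admin Console port.
# On the Cloudera Manager Server host in the base cluster: verify the server process is active
systemctl status cloudera-scm-server
# Optionally confirm the Admin Console is answering (7180 is the default non-TLS port)
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:7180/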