Known Issues in Cloudera Data Services on premises 1.5.4 SP2
The following are the known issues in the 1.5.4 SP2 release of Cloudera Data Services on premises.
- OPSX-5776 and OPSX-5747 - ECS - Some of the rke2-canal DaemonSet pods in the kube-system namespace are stuck in the Init state, causing Longhorn volume attach issues
- In a few cases, after upgrading from Cloudera Data Services on premises 1.5.3 or 1.5.3-CHF1 to 1.5.4 SP2, a pod that belongs to the rke2-canal DaemonSet is stuck in the Init status. This causes some pods in the kube-system and longhorn-system namespaces to be in the Init or CrashLoopBackOff status. This manifests as a volume attach failure in the embedded-db-0 pod in the CDP namespace, and causes some pods in the CDP namespace to be in the CrashLoopBackOff state.
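- Before applying the workaround, you can confirm that the cluster is hitting this issue by checking the affected pods directly. The following checks are an optional sketch using standard kubectl commands and are not part of the documented procedure:
kubectl get pods -n kube-system | grep rke2-canal
kubectl get pods -n longhorn-system | grep -E 'Init|CrashLoopBackOff'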
- Perform a rolling restart of the rke2-canal DaemonSet by running the following command:
kubectl rollout restart ds rke2-canal -n kube-system
- Monitor the DaemonSet restart status by running the following command:
kubectl get ds -n kube-system
- After the rke2-canal DaemonSet restart is complete, if any pods in DaemonSets within the longhorn-system namespace remain in the Init or CrashLoopBackOff state, perform a rolling restart of those DaemonSets. Choose the appropriate command based on the specific DaemonSet that is failing. If more than one DaemonSet requires a restart, restart them sequentially, one at a time.
kubectl rollout restart ds longhorn-csi-plugin -n longhorn-system
kubectl rollout restart ds longhorn-manager -n longhorn-system
kubectl rollout restart ds engine-image-ei-6b4330bf -n longhorn-system
kubectl rollout restart ds engine-image-ei-ea8e2e58 -n longhorn-system
- Monitor the DaemonSet restart status by running the following command:
kubectl get ds -n longhorn-system
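- After the DaemonSets report all pods as ready, you can optionally verify that the affected workloads have recovered. This is a sketch using standard kubectl commands; it assumes the control plane namespace is cdp, as referenced in the description above:
kubectl get pods -n longhorn-system
kubectl get pods -n cdp | grep embedded-db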
- OPSX-5903 - 1.5.3 to 1.5.4 SP2 ECS - The upgrade fails when the rke2-ingress-nginx-controller deployment exceeds its progress deadline.
- The upgrade fails while running the following command:
kubectl rollout status deployment/rke2-ingress-nginx-controller -n kube-system --timeout=5m
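- If the timeout occurs, you can inspect the deployment and its pods to see why the rollout is not progressing. This is a hedged diagnostic sketch using standard kubectl commands, not a documented workaround:
kubectl describe deployment rke2-ingress-nginx-controller -n kube-system
kubectl get pods -n kube-system | grep ingress-nginx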
- OPSAPS-72270 - Start ECS command fails on uncordon nodes step
- In an ECS HA cluster, the server node sometimes restarts during startup, which causes the uncordon step to fail.
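- As a hedged diagnostic sketch (not a documented workaround), you can check whether any nodes remain cordoned after the cluster is back up and, if so, uncordon them manually. Cordoned nodes show a SchedulingDisabled status, and the node name below is a placeholder:
kubectl get nodes
kubectl uncordon <node-name>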
- OPSAPS-72964 and OPSAPS-72769: Unseal Vault command fails after restarting the ECS service
- The Unseal Vault command sometimes fails after the ECS service is restarted.
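- As a hedged check, assuming Vault runs in the vault-system namespace of the ECS cluster, you can verify that the Vault pod is running before retrying the Unseal Vault command:
kubectl get pods -n vault-system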
- OPSX-5986: ECS fresh install failing with helm-install-rke2-ingress-nginx pod failing to come into Completed state
- ECS fresh install fails at the "Execute command Reapply All Settings to Cluster on service ECS" step due to a timeout waiting for helm-install. To confirm the issue, run the following kubectl command on the ECS server host to check whether the pod is stuck in the Running state:
kubectl get pods -n kube-system | grep helm-install-rke2-ingress-nginx
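- If the pod is stuck, its logs can show why the Helm install does not complete. This is a hedged diagnostic sketch with a placeholder pod name, not a documented workaround:
kubectl logs -n kube-system <helm-install-rke2-ingress-nginx-pod-name>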