Known Issues in Cloudera Data Services on premises 1.5.4 SP2
The following are the known issues in the 1.5.4 SP2 release of Cloudera Data Services on premises.
- OPSX-5776 and OPSX-5747 - ECS - Some of the rke2-canal DaemonSet pods in the kube-system namespace are stuck in the Init state, causing Longhorn volume attach issues
- In a few cases, after upgrading from Cloudera Data Services on premises 1.5.3 or 1.5.3-CHF1 to 1.5.4 SP2, a pod that belongs to the rke2-canal DaemonSet is stuck in the Init status. This causes some pods in the kube-system and longhorn-system namespaces to be in the Init or CrashLoopBackOff status. This manifests as a volume attach failure in the embedded-db-0 pod in the CDP namespace, and causes some pods in the CDP namespace to be in the CrashLoopBackOff state.
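- Before applying the workaround, you can confirm that the cluster is hitting this issue by checking the affected pods directly. The following checks are an optional sketch using standard kubectl commands and are not part of the documented procedure:
kubectl get pods -n kube-system | grep rke2-canal
kubectl get pods -n longhorn-system | grep -E 'Init|CrashLoopBackOff'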
- Perform a rolling restart of the rke2-canal DaemonSet by running the following command:
kubectl rollout restart ds rke2-canal -n kube-system
- Monitor the DaemonSet restart status by running the following command:
kubectl get ds -n kube-system
- After the rke2-canal DaemonSet restart is complete, if any pods in DaemonSets within the longhorn-system namespace remain in the Init or CrashLoopBackOff state, perform a rolling restart of those DaemonSets. Choose the appropriate command based on the specific DaemonSet that is failing. If more than one DaemonSet requires a restart, restart them sequentially, one at a time.
kubectl rollout restart ds longhorn-csi-plugin -n longhorn-system
kubectl rollout restart ds longhorn-manager -n longhorn-system
kubectl rollout restart ds engine-image-ei-6b4330bf -n longhorn-system
kubectl rollout restart ds engine-image-ei-ea8e2e58 -n longhorn-system
- Monitor the DaemonSet restart status by running the following command:
kubectl get ds -n longhorn-system
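- After the DaemonSets report all pods as ready, you can optionally verify that the affected workloads have recovered. This is a sketch using standard kubectl commands; it assumes the control plane namespace is cdp, as referenced in the description above:
kubectl get pods -n longhorn-system
kubectl get pods -n cdp | grep embedded-db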
- OPSX-5903 - 1.5.3 to 1.5.4 SP2 ECS - The upgrade fails when the rke2-ingress-nginx-controller deployment exceeds its progress deadline.
- The upgrade fails while running the following command:
kubectl rollout status deployment/rke2-ingress-nginx-controller -n kube-system --timeout=5m
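- If the timeout occurs, you can inspect the deployment and its pods to see why the rollout is not progressing. This is a hedged diagnostic sketch using standard kubectl commands, not a documented workaround:
kubectl describe deployment rke2-ingress-nginx-controller -n kube-system
kubectl get pods -n kube-system | grep ingress-nginx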
- OPSAPS-72270 - Start ECS command fails on uncordon nodes step
- In an ECS HA cluster, the server node sometimes restarts during startup, which causes the uncordon step to fail.
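- As a hedged diagnostic sketch (not a documented workaround), you can check whether any nodes remain cordoned after the cluster is back up and, if so, uncordon them manually. Cordoned nodes show a SchedulingDisabled status, and the node name below is a placeholder:
kubectl get nodes
kubectl uncordon <node-name>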
- OPSAPS-72964 and OPSAPS-72769: Unseal Vault command fails after restarting the ECS service
- The Unseal Vault command sometimes fails after the ECS service is restarted.
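- As a hedged check, assuming Vault runs in the vault-system namespace of the ECS cluster, you can verify that the Vault pod is running before retrying the Unseal Vault command:
kubectl get pods -n vault-system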
- OPSX-5986: ECS fresh install failing with helm-install-rke2-ingress-nginx pod failing to come into Completed state
- ECS fresh install fails at the "Execute command Reapply All Settings to Cluster on service ECS" step due to a timeout waiting for helm-install. To confirm the issue, run the following kubectl command on the ECS server host to check whether the pod is stuck in the Running state:
kubectl get pods -n kube-system | grep helm-install-rke2-ingress-nginx
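- If the pod is stuck, its logs can show why the Helm install does not complete. This is a hedged diagnostic sketch with a placeholder pod name, not a documented workaround:
kubectl logs -n kube-system <helm-install-rke2-ingress-nginx-pod-name>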