How to fix errors detected by Pre-Upgrade Checks

This section provides troubleshooting steps for common errors detected during the pre-upgrade checks.

Download Upgrade Validator

Problem: "Failed to Distribute to an ecs host"

Steps to resolve:

  1. Verify whether the Cloudera Manager agent on the failed host has any issues communicating with the Cloudera Manager server (see the example after this list).
  2. Check the Cloudera Manager server logs for any errors.
  3. Click View Command Details and retry the command once the issues found in the logs from the previous steps are resolved.
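
A quick way to check the agent and the relevant logs, assuming the default Cloudera Manager log locations (adjust the paths if your installation uses custom log directories):

    # On the failed ECS host: confirm the Cloudera Manager agent is running
    systemctl status cloudera-scm-agent
    # On the failed ECS host: look for recent communication errors in the agent log
    tail -n 200 /var/log/cloudera-scm-agent/cloudera-scm-agent.log
    # On the Cloudera Manager server host: look for errors around the time of the failure
    grep -i error /var/log/cloudera-scm-server/cloudera-scm-server.log | tail -n 50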

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––--

Control Plane Health Checks

Vault

Problem: "Failed to get Vault pod name/No Running Vault Pod Found"

No running Vault pod found in the specified namespace.

Steps to resolve:
  1. Check if Vault pods are running:
    kubectl get pods -n vault-system -l app.kubernetes.io/name=vault
    
  2. If no pods are running, restart Vault:
    kubectl rollout restart statefulset vault -n vault-system
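
After the restart, you can confirm that the StatefulSet has rolled out and that a Vault pod is Running (the namespace and label are the same ones used in the commands above):

    # Wait for the restarted StatefulSet to finish rolling out
    kubectl rollout status statefulset vault -n vault-system
    # Confirm that a Vault pod is now Running
    kubectl get pods -n vault-system -l app.kubernetes.io/name=vault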
    

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––--

Problem: "Failed to Execute Command in Vault Pod"

Failed to execute command in pod:
<error_message>, stderr: <stderr_output>

Steps to resolve:

  1. Restart Vault by executing the following command:
    kubectl rollout restart statefulset vault -n vault-system
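
If the restart does not help, the Vault pod logs may show why commands cannot be executed. A possible check, using the same namespace and label as the other Vault steps in this section:

    kubectl logs -n vault-system -l app.kubernetes.io/name=vault --tail=100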

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––--

Problem: "Vault is sealed"

Vault is in a sealed state.

Steps to resolve:

  1. Log in to the Cloudera Manager Admin Console.
  2. In the Cloudera Manager UI, navigate to and select the ECS cluster.
  3. Click the Actions menu.
  4. Select Unseal Vault from the Actions menu.
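
To confirm that the unseal succeeded, you can query the seal status directly from the Vault pod. This is a sketch that assumes the pod is named vault-0 in the vault-system namespace; substitute the pod name reported by kubectl get pods:

    # "Sealed" should report false after a successful unseal
    kubectl exec -n vault-system vault-0 -- vault status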

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––--

Longhorn

Problem: "Failed to list Longhorn volumes"

The Longhorn volume resource could not be retrieved due to API server issues or Longhorn component failures.

Steps to resolve:

  1. Verify that the Longhorn volumes are accessible:
    kubectl get volumes.longhorn.io -n longhorn-system
  2. Check if the Longhorn manager and related pods are running:
    kubectl get pods -n longhorn-system
  3. If any Longhorn pods are in CrashLoopBackOff or Pending states, restart them:
    kubectl delete pod <pod-name> -n longhorn-system
  4. If the issue persists, restart ECS services.
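
If pods keep failing after being deleted, their logs usually indicate the cause. A possible check, assuming the standard app=longhorn-manager label used by Longhorn:

    kubectl logs -n longhorn-system -l app=longhorn-manager --tail=100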

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––--

Problem: Unhealthy Longhorn volumes detected

Some Longhorn volumes are in a degraded, detached, or error state.

Steps to resolve:
  1. Check the status of Longhorn volumes:
    kubectl get volumes.longhorn.io -n longhorn-system -o jsonpath='{range .items[*]}{.metadata.name} {.status.robustness}{"\n"}{end}'
  2. If a volume is degraded or faulted, check volume details:
    kubectl describe volumes.longhorn.io <volume-name> -n longhorn-system
  3. If a volume is detached, attempt to attach it:
    kubectl patch volumes.longhorn.io <volume-name> -n longhorn-system --type='merge' -p '{"spec":{"frontend":"blockdev"}}'
  4. Restart the Longhorn manager if necessary:
    kubectl rollout restart daemonset longhorn-manager -n longhorn-system
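
To narrow the output of step 1 to only the problematic volumes, you can filter out the healthy ones (Longhorn reports robustness values such as Healthy, Degraded, and Faulted):

    kubectl get volumes.longhorn.io -n longhorn-system -o jsonpath='{range .items[*]}{.metadata.name} {.status.robustness}{"\n"}{end}' | grep -v Healthy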

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––--

Problem: Longhorn PVCs are not bound

PersistentVolumeClaims (PVCs) using Longhorn are not in a Bound state.

Steps to resolve:

  1. List all PVCs in the cluster:
    kubectl get pvc --all-namespaces
  2. If a PVC is stuck in Pending state, check the corresponding PersistentVolume (PV):
    kubectl get pv
  3. Describe the problematic PVC to identify binding issues:
    kubectl describe pvc <pvc-name> -n <namespace>
  4. If the PV is Released but not Available, manually delete it:
    kubectl delete pv <pv-name>
  5. Restart the Longhorn services if necessary:
    kubectl rollout restart daemonset longhorn-manager -n longhorn-system
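
To list only the PVCs that are not yet Bound (a convenience filter over the command in step 1):

    kubectl get pvc --all-namespaces --no-headers | grep -v Bound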

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––--

RKE2

Problem: "Failed to check API server"

API server is unreachable.

Steps to resolve:

  1. Verify the API server endpoint:
    kubectl cluster-info
  2. Check if the API server pods are running:
    kubectl get pods -n kube-system -l component=kube-apiserver
  3. Stop Cloudera Embedded Container Service.
  4. Reboot hosts.
  5. Start Cloudera Embedded Container Service.

    If the start command fails with the following error message:

    Timed out waiting for kube-apiserver to be ready

    Option 1:

    Start each master role instance individually without waiting for each node to be up and running.

    Option 2:

    If Option 1 does not work, follow the steps from SUSE to recover the cluster: https://docs.rke2.io/backup_restore#cluster-reset
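
If the API server is still unreachable, checking the RKE2 server service on each ECS master host can help pinpoint the cause. A sketch, assuming the standard rke2-server systemd unit:

    # On each ECS master host
    systemctl status rke2-server
    journalctl -u rke2-server --since "30 minutes ago" | grep -i error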

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––--

Problem: One or more nodes are in a NotReady state.

Steps to resolve:

  1. Check node status:
    kubectl get nodes -o wide
  2. Describe the problematic node:
    kubectl describe node <node-name>
  3. Stop all roles on the affected host.
  4. If the node is not required, remove it from the cluster.
  5. Reboot the host.
  6. Restart all roles on the host.
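
If the node keeps returning to a NotReady state, the RKE2 service logs on the affected host often show the underlying cause (for example, disk pressure or a stopped kubelet). A possible check, assuming the standard RKE2 systemd units (rke2-server on master hosts, rke2-agent on worker hosts):

    # On the affected host (use rke2-server on master hosts)
    systemctl status rke2-agent
    journalctl -u rke2-agent --since "30 minutes ago" | grep -i -E "error|notready|pressure"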

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––--

Problem: Pods in kube-system are unhealthy

Critical control plane or system pods are failing.

Steps to resolve:

  1. Check pod statuses in kube-system:
    kubectl get pods -n kube-system
  2. If any pods are in CrashLoopBackOff, check logs:
    kubectl logs <pod-name> -n kube-system
  3. Restart unhealthy pods:
    kubectl delete pod <pod-name> -n kube-system
  4. If the issue persists, restart ECS services.
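
If a pod is restarting repeatedly, the logs from its previous run and recent namespace events often reveal the cause:

    # Logs from the previous (crashed) container instance
    kubectl logs <pod-name> -n kube-system --previous
    # Recent events in kube-system, newest last
    kubectl get events -n kube-system --sort-by=.lastTimestamp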

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––--

Problem: Control Plane Pod XXX Readiness, containers with unready status

Steps to resolve:

  1. Scale the pod's controller (typically a Deployment or ReplicaSet) down and back up (see the example after this list).
  2. If the problem persists, look for the cause of the failure with the following commands:
    kubectl get events -n <pod-namespace>
    kubectl describe pod <pod-name> -n <pod-namespace>
    kubectl logs <pod-name> -n <pod-namespace> -c <container-name>
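
A minimal sketch of step 1, assuming the pod is managed by a Deployment; the deployment name and namespace are placeholders, and a replica count of 1 is used here as a placeholder for the original value:

    # Scale the controller down, then back up to recreate the pod
    kubectl scale deployment <deployment-name> -n <pod-namespace> --replicas=0
    kubectl scale deployment <deployment-name> -n <pod-namespace> --replicas=1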

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––--

Docker Registry Health Checks

Problem: Docker registry connection failed, even though the new ecs-toleration-webhook image required for the upgrade has been verified to be present in the Docker registry.

You may see an error message similar to:

2025/02/27 13:11:04 Docker registry connection failed. Unable to obtain ecs-toleration-webhook manifest: Status Code: 400, Response: {"errors":
[{"code":"MANIFEST_INVALID","message":"manifest invalid","detail":{"DriverName":"filesystem","Enclosed":
{"Op":"mkdir","Path":"/var/lib/registry/docker/registry/v2/blobs/sha256/a3/a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4","Err":28}}}]}

This error occurs because the Docker registry host is running out of disk space (the "Err":28 in the response corresponds to the Linux ENOSPC error, "No space left on device"). When an image manifest is requested and the information is not available in the cache, the registry must reconstruct it, and during that process it may need to create temporary files or directories. To resolve this issue, either add more disk space to the Docker registry host, or clean up the registry and re-download the required images.
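
A quick way to confirm the disk-space problem on the Docker registry host, using the storage path shown in the error above:

    # Check free space on the filesystem backing the registry storage
    df -h /var/lib/registry
    # See how much space the stored image blobs consume
    du -sh /var/lib/registry/docker/registry/v2/blobs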