Troubleshooting Cloudera DataFlow upgrade errors

Setting up kubectl to connect to the Kubernetes cluster🔗

It is helpful to have access to the Kubernetes cluster using command line tools such as kubectl when you are troubleshooting deployment or upgrade failures. To set up kubectl access, follow these steps:

In , from the Environments page, select the service for which you want to add or remove user access.
Click Manage DataFlow.
From the Actions menu, click Manage Kubernetes API Server User Access.
Add the AWS IAM role that you will authenticate as to the list of authorized users for DataFlow by entering the ARN in the Add User dialog.
Use the Download Kubeconfig action to retrieve the kubeconfig file for connecting to your cluster.

Set up your kubectl to use the downloaded kubeconfig file:

export KUBECONFIG=[***PATH/TO/DOWNLOADED/KUBECONFIG/FILE***]

Run kubectl get ns and validate that your output looks similar to:

With kubectl being set up correctly, you are able to access NiFi and other logs directly through the CLI.

Understanding upgrade failures🔗

There are various reasons that may prevent an upgrade from happening. This document covers the following scenarios:

The upgrade fails upon initiation.
The upgrade starts but fails, and rollback is possible.
The upgrade starts but fails and rollback is not possible.

Upgrade fails on initiation🔗

Symptom: When the upgrade fails to start, the status of the environment does not change to Upgrading and you receive an error message stating the cause of the failure.

Possible causes: The service is in one of the following states:

The service is in one of the following states:
- Failed to Update
- Failed to Update NiFi version
- Any transitional / in-progress state
Any of the deployments is in one of the following states:
- New
- Failed to Update
- Failed to Update NiFi version
- Any transitional / in progress state

How to fix it: Fix the error indicated in the error message, then retry the upgrade. If the upgrade still fails, contact Cloudera Support.

Upgrade fails and rollback is possible🔗

Symptom: Upgrade rollback process starts. If the rollback is successful, the Status of the service returns to Good Health, Cloudera DataFlow and NiFi versions return to their pre-upgrade values. If the rollback fails, the Status of both the Cloudera DataFlow service and running deployments becomes Failed to Upgrade. The order in which this transition happens depends on where the rollback actually failed.

How to fix it: In the event of an upgrade failure, Cloudera DataFlow determines if rollback is possible and automatically initiates it. No user intervention is required.

If the rollback is successful:

After the rollback ends, from the Environments page, select the Cloudera DataFlow service you were trying to upgrade and click the Alerts tab. Check Event History to see if there is anything obvious in there or something that narrows down the possible cause of the failure.
From the Deployments page select running deployments that failed to upgrade and check Alerts>Event History to see if there is anything obvious in there or something that narrows down the possible cause of the failure.
Check the K8s cluster using CLI and try to fix errors based on the information you obtained from Event History in the previous steps.
Retry the upgrade.
If the upgrade still fails contact Cloudera Support.

If the rollback fails:

The status of both the service and the running deployments changes to Failed to Update. The order in which they transition depends on where the rollback actually failed.

Upgrade fails and rollback is not possible🔗

Symptom: The status of the service, and shortly after the deployments, changes to Failed to Upgrade. The status icon on the service changes to Bad Health.

How to fix it:

If K8s upgrade failed, check the Event History if there is anything obvious in there.
Check on the cloud provider side whether there is anything obvious as a reason for the failure. For example, check activity logs and node pools on Azure, or EKS cluster on AWS.
If the K8s upgrade succeeded, but the upgrade failed somewhere later, then the previous point in the "Upgrade fails and rollback is possible" can follow here (check the events and then the K8s cluster).
If you managed to pinpoint and fix the possible cause of the failure, retry the upgrade.
If you did not find the cause of the failure, or the upgrade still fails upon retry, contact Cloudera Support.

Retrying an upgrade🔗

On the Cloudera DataFlow Overview select Environments from the left navigation pane.
Select the service where you want to retry the upgrade and click Manage DataFlow.
Click Actions > Retry Upgrade DataFlow.