Upgrade from 1.5.3 or 1.5.4 to 1.5.5 on Cloudera Embedded Container Service

You can upgrade your existing Cloudera Data Services on premises version 1.5.3 or 1.5.4 to 1.5.5 without uninstalling the previous version.

  1. In Cloudera Manager, navigate to Cloudera on premises and click the icon, then click Update.
  2. On the Getting Started page, select the Install method (Air Gapped or Internet) and proceed.

    Click Next.

  3. On the Collect Information page, click Next.
  4. On the Install Parcels page, click Next.
  5. Click Next after the upgrade is complete.
  6. After the upgrade is complete, the Summary page appears. You can now Launch Cloudera on premises from here.
    If you see a Longhorn Health Test message about a degraded Longhorn volume, wait for the cluster repair to complete.

    Or you can navigate to the Cloudera Data Services on premises page and click Open Cloudera on premises.

    Cloudera Data Services on premises opens in a new window.
  • If the upgrade stalls, do the following:
    1. Check the status of all pods by running the following commands on the ECS Server node:
      export PATH=$PATH:/opt/cloudera/parcels/ECS/installer/install/bin/linux/:/opt/cloudera/parcels/ECS/docker
      export KUBECONFIG=~/kubeconfig
      
      kubectl get pods --all-namespaces
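
      To focus on problem pods, you can optionally filter out healthy ones. This filter is only a convenience and is not part of the documented procedure:

      kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'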
    2. If any pods are stuck in the "Terminating" state, force-terminate them using the following command:
      kubectl delete pods <NAME OF THE POD> -n <NAMESPACE> --grace-period=0 --force

      If the upgrade still does not resume, continue with the remaining steps.

    3. If any pods are stuck in the "Pending" state, restart the yunikorn-scheduler to reschedule them. Run the following commands:
      # Check the current state of the yunikorn pods and deployment
      kubectl get pods -n yunikorn
      kubectl get deploy -n yunikorn

      # Scale the scheduler down to zero replicas and confirm it has stopped
      kubectl scale --replicas=0 -n yunikorn deployment/yunikorn-scheduler
      kubectl get deploy -n yunikorn

      # Scale the scheduler back up to one replica and confirm it is running again
      kubectl scale --replicas=1 -n yunikorn deployment/yunikorn-scheduler
      kubectl get deploy -n yunikorn
    4. In the Admin Console, go to the service and click Web UI > Storage UI.

      The Longhorn dashboard opens.

    5. Check the "In Progress" section of the dashboard to see whether any volumes are stuck in the attaching/detaching state. If a volume is in that state, reboot its host.

    6. In the Longhorn UI, go to the Volume tab and check whether any volumes are in the "Detached" state. If so, restart the associated pods or reattach the volumes to the host manually.
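
      If you prefer to check volume state from the command line, the same information is exposed through the Longhorn custom resources. This is a sketch that assumes Longhorn runs in its default longhorn-system namespace:

      kubectl get volumes.longhorn.io -n longhorn-system

      Depending on the Longhorn version, the output includes columns such as STATE (attached or detached) and ROBUSTNESS (healthy or degraded).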
  • You may see the following error message during the Upgrade Cluster > Reapplying all settings > kubectl-patch step:
    kubectl rollout status deployment/rke2-ingress-nginx-controller -n kube-system --timeout=5m
    error: timed out waiting for the condition
    If you see this error, do the following:
    1. Check whether all the Kubernetes nodes are ready for scheduling. Run the following command from the ECS Server node:
      kubectl get nodes
      You will see output similar to the following:
      NAME      STATUS                     ROLES                       AGE    VERSION
      <node1>   Ready,SchedulingDisabled   control-plane,etcd,master   103m   v1.21.11+rke2r1
      <node2>   Ready                      <none>                      101m   v1.21.11+rke2r1
      <node3>   Ready                      <none>                      101m   v1.21.11+rke2r1
      <node4>   Ready                      <none>                      101m   v1.21.11+rke2r1
    2. Run the following command from the ECS Server node for the node showing a status of SchedulingDisabled:
      kubectl uncordon <node1>

      Replace <node1> at the end of the command with the name of the node that shows SchedulingDisabled.

      You will see output similar to the following:
      node/<node1> uncordoned
    3. Delete the rke2-ingress-nginx-controller pod so that it is re-created, by running the following command on the ECS Server node:
      kubectl delete pod rke2-ingress-nginx-controller-<pod number> -n kube-system
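
      The exact pod name suffix varies by cluster. If you need to look it up first, this optional listing shows the matching pods:

      kubectl get pods -n kube-system | grep rke2-ingress-nginx-controller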
    4. Resume the upgrade.
  • If a new release-dwx-server pod is unable to start because an existing release-dwx-server pod is failing to start:
    • Delete the pod manually by executing the following command:
      kubectl delete -n cdp pod cdp-release-dwx-server-<pod_id>
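
      The <pod_id> suffix differs per cluster. If you need to find the full pod name first, an optional listing such as the following will show it:

      kubectl get pods -n cdp | grep dwx-server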
    • Resume the upgrade wizard if it had timed out.
  • After upgrading, the Cloudera Manager admin role may be missing the Host Administrators privilege. The cluster administrator should run the following command to manually add this privilege to the role:
    ipa role-add-privilege <cmadminrole> --privileges="Host Administrators"
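
    To confirm that the privilege is now attached to the role (assuming the FreeIPA CLI is available on the same host), you can display the role afterwards:

    ipa role-show <cmadminrole>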
  • If you specified a custom certificate, select the Cloudera Embedded Container Service cluster in Cloudera Manager, then select Actions > Update Ingress Controller. This command copies the cert.pem and key.pem files from the Cloudera Manager server host to the Cloudera Embedded Container Service Management Console host.
  • After upgrading, you can enable the unified time zone feature to synchronize the Cloudera Embedded Container Service cluster time zone with the Cloudera Manager Base time zone. When upgrading from earlier versions of Cloudera Data Services on premises to 1.5.3 and higher, unified time zone is disabled by default to avoid affecting timestamp-sensitive logic. For more information, see Cloudera Embedded Container Service unified time zone.