Data Lake rolling upgrades

The Data Lake rolling upgrade allows you to upgrade the Data Lake Runtime and OS without stopping attached Data Hubs or Data Services, so workloads can continue running during the Data Lake upgrade operation.

As with the classic Data Lake upgrade logic, a Data Lake rolling upgrade first upgrades the Runtime version and then the OS.

To run a Data Lake rolling upgrade, the following requirements must be met:

The Data Lake must be on Runtime version 7.2.17.300 or later to perform a rolling upgrade directly to 7.2.18+.
The Data Lake OS must be RHEL 8. Rolling upgrades are not generally available from Data Lakes on CentOS.
The Data Lake must be an Enterprise Data Lake (EDL). Rolling upgrades to 7.2.18 are not generally available from medium duty or light duty Data Lakes.

If your Data Lake does not meet these requirements, you must first address each unmet requirement: perform a traditional Runtime upgrade, upgrade the OS from CentOS to RHEL, or resize the Data Lake to EDL. Only then can a rolling upgrade be performed.


Current Runtime version	Current OS	Current Data Lake Shape	Rolling upgrade support?
7.2.17.300+	RHEL 8	EDL	Yes, directly to 7.2.18+
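
The requirements above can be expressed as a small pre-check. The following is a minimal sketch, not a Cloudera API: the DataLake class, its field names, and the shape strings are hypothetical names chosen for illustration, and the values would have to come from your own inventory of the environment.

```python
from dataclasses import dataclass

# Minimum source Runtime version for a direct rolling upgrade to 7.2.18+,
# taken from the requirements listed above.
MIN_RUNTIME = (7, 2, 17, 300)


def parse_version(version: str) -> tuple:
    """Turn '7.2.17.300' into (7, 2, 17, 300); pad short versions with zeros."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (4 - len(parts)))


@dataclass
class DataLake:
    # Hypothetical container for illustration only, not a Cloudera object.
    runtime: str  # for example "7.2.17.300"
    os: str       # for example "RHEL 8" or "CentOS 7"
    shape: str    # for example "EDL", "MEDIUM_DUTY", "LIGHT_DUTY"


def rolling_upgrade_blockers(dl: DataLake) -> list:
    """Return the preparatory steps still needed; an empty list means eligible."""
    blockers = []
    if parse_version(dl.runtime) < MIN_RUNTIME:
        blockers.append("perform a traditional Runtime upgrade to 7.2.17.300 or later")
    if dl.os.strip().upper() != "RHEL 8":
        blockers.append("upgrade the OS from CentOS to RHEL 8")
    if dl.shape.strip().upper() != "EDL":
        blockers.append("resize the Data Lake to EDL")
    return blockers
```

An empty result corresponds to the single supported row in the table. The sketch does not model rolling upgrades enabled through entitlement.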

In some circumstances, a rolling upgrade may not be supported for a Data Lake cluster, but can be enabled through entitlement. Some cluster services might become unavailable during this type of upgrade, and running workloads could be impacted. The Data Lake upgrade UI displays information about whether a rolling upgrade is available, unavailable, or may be available under entitlement. For instructions on performing a Data Lake upgrade, including rolling upgrades, see Upgrading a Data Lake. For information about obtaining an entitlement for rolling upgrade, contact Cloudera Customer Support.

Data Lake rolling upgrade limitations and issues

The Data Lake rolling upgrade has the following limitations:

Long-running CDE and CML Spark jobs might lose connectivity to the Hive Catalog and fail during and after a Data Lake rolling upgrade. We recommend stopping these jobs before attempting the upgrade. The jobs work normally again after you resubmit them.
Cloudera recommends performing the upgrade outside of working hours, as user-facing UI/API endpoints may become unstable. Workloads running on Data Hubs and Data Services use different internal endpoints, so they are not affected. As a result, at certain times during the rolling Data Lake upgrade you may be unable to view or edit Ranger permissions and the Ranger audit log, browse Atlas/Data Catalog, or make changes to Atlas (see Known Issues in Apache Atlas for more details on Atlas issues). If you use custom-built applications that interact with the Data Lake through these endpoints, we recommend implementing retry logic in your clients to handle temporary unavailability. This is a best practice, irrespective of rolling upgrades.
Atlas Authorization may return a “403-Access Denied” in response to Atlas REST API calls. After the rolling upgrade finishes and Ranger Admin is back up, these services and their endpoints function normally again.
The Ranger RAZ server becomes unreachable for some time during and after the OS upgrade, and an UnknownHostException may be seen in the RAZ client. During this period all authentication calls to the Ranger RAZ server are expected to fail.
During OS upgrades, attempts to access Knox on the host being upgraded may produce occasional 403 HTTP responses. Wait and retry the failed requests.
If Knox is HA and one of the Knox servers is down, then accessing the service through a Control Plane endpoint URL (that is, through the cloud load balancer) takes approximately 30 seconds to fail over the request to the available Knox instance.
Solr supports rolling upgrades from release 7.2.18 onward. Upgrading from a lower version means that all the Solr Server instances are shut down, parcels are upgraded and activated, and then the Solr Servers are started again. This causes a service interruption of several minutes; the exact duration depends on cluster size. Services like Atlas and Ranger that depend on Solr may face issues because of this service interruption.
Certain workloads may experience downtime during the Data Lake rolling upgrade operations:
- Any workloads configured to use a single HMS endpoint (Hive Warehouse Connector configurations).
- Hue File Browser may be unavailable for a short period during a rolling upgrade in a RAZ-enabled environment.
- Other clients in use in your workloads could be impacted. Cloudera recommends testing how your workloads function during Data Lake rolling upgrades, before you adopt this new feature.
- During the rolling upgrade, Hive and HBase grant and revoke commands will not function.
- Generally, if you have any workload that is using a single Data Lake service endpoint, it will likely experience a temporary outage. This may not necessarily result in a workload failure.
Rolling upgrades for an Enterprise Data Lake take longer than a classic upgrade that requires downtime, because the OS image upgrade is performed sequentially, node by node.
Certain operations (create, upgrade, and resume) for Data Hubs and Data Services are not recommended during a Data Lake rolling upgrade.
When upgrading Data Hub clusters to Runtime 7.2.18.100, you might encounter a stale knox.jwt.client.gateway.address configuration if its value points to the address of the Data Lake node. If this occurs after the upgrade, run Deploy Client Configuration in Cloudera Manager.
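
The retry recommendation for custom-built clients in the limitations above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a Cloudera client. The function name call_with_retry, the set of retryable status codes, and the attempt and delay defaults are assumptions. In particular, treating 403 as retryable is only reasonable because of the transient 403 responses described above; outside an upgrade window a 403 usually indicates a real permission problem.

```python
import random
import time
import urllib.error

# Assumption: status codes treated as transient during an upgrade window.
# 403 is included only because of the occasional Knox/Atlas 403s noted above.
RETRYABLE_STATUS = {403, 502, 503, 504}


def call_with_retry(fetch, attempts=6, base=1.0, cap=30.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts are exhausted.

    Waits with capped exponential backoff plus jitter between attempts.
    Non-retryable HTTP errors are re-raised immediately.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            # HTTPError is a subclass of URLError, so it must be handled first.
            if err.code not in RETRYABLE_STATUS:
                raise
            last_error = err
        except (urllib.error.URLError, TimeoutError) as err:
            # Connection refused, DNS failure, timeout: assume transient.
            last_error = err
        if attempt < attempts - 1:
            delay = min(cap, base * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise last_error
```

A caller would wrap its own request in a function, for example call_with_retry(lambda: urllib.request.urlopen(request, timeout=60).read()), where request targets the endpoint your application already uses. Choosing a request timeout comfortably above the roughly 30-second Knox failover described above avoids abandoning a request that would otherwise have succeeded.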