Service components limitations
You should be aware of certain limitations and considerations related to service components such as Atlas, Kudu, and YARN, to ensure that your cluster can successfully complete the Zero Downtime Upgrade (ZDU) process.
Unsupported ZDU components
- 7.1.7 SP2 to 7.1.9
  - Spark History Server
  - Streams Messaging Manager
  - Schema Registry
  - Ranger RMS
  - Cruise Control
  - KS Indexer
  - Queue Manager
  - Ranger KMS
- 7.1.8 to 7.1.9
  - Ranger RMS
  - Spark History Server
  - Cruise Control
  - KS Indexer
  - Streams Messaging Manager
  - Queue Manager
The Rolling Restart process is unavailable when upgrading from Cloudera Runtime 7.1.7 SP2. The Atlas service is stopped at the beginning of the upgrade process and restarted after the upgrade is complete. You can expect a downtime of approximately 10 minutes during the upgrade process, during which Atlas REST API calls fail.
The upgrade process consists of upgrading Cloudera Manager and Cloudera Runtime. Although Atlas cannot comply with a complete ZDU process, there is no data loss. After the upgrade, all entities created before and during the upgrade process are available without any changes or modifications.
Certain Atlas limitations related to the ZDU process are as follows:
- While Atlas goes through a rolling upgrade, some downtime is expected because Atlas does not support the Active-Active model. The currently supported model is Active-Passive, and failover takes time. While the Passive instance becomes Active, there is a period during which Atlas is not reachable and the messages from clients are queued up in Kafka.
- Solr does not support Rolling Upgrade. As a result, Atlas REST requests fail during the Solr upgrade.
Hive Metastore (HMS) and HiveServer2 (HS2) services support rolling restarts and rolling upgrades if High Availability (HA) is enabled for the services. If you are upgrading from CDP 7.1.7 SP2 or CDP 7.1.8 to CDP 7.1.9, you cannot achieve a ZDU for the HS2 service because running queries are aborted during HS2 service restarts. The aborted queries must be manually submitted again because HS2 cannot automatically resubmit these queries.
HS2 has a graceful shutdown feature to overcome this ZDU limitation. Graceful shutdown ensures that HS2 waits for a specified time period (300 seconds by default) before shutting down, which allows queries that are already running to complete before HS2 stops. This feature aids in achieving a ZDU; however, because the feature is available from CDP 7.1.9 onwards, it can only be used for CDP upgrades from 7.1.9 to higher versions.
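To illustrate the effect of the graceful shutdown wait, the following is a minimal sketch and not HS2 code. It assumes you have an estimate of the remaining runtime of each running query; the function name `split_by_grace_period`, the query IDs, and the sample runtimes are hypothetical. Only the 300-second default comes from the description above.

```python
from typing import Dict, List, Tuple

# Default graceful shutdown wait described above (300 seconds).
DEFAULT_GRACE_SECONDS = 300


def split_by_grace_period(
    remaining_seconds: Dict[str, float],
    grace_seconds: float = DEFAULT_GRACE_SECONDS,
) -> Tuple[List[str], List[str]]:
    """Split query IDs into those expected to finish within the grace
    period and those that would still be running when HS2 stops.

    remaining_seconds maps a (hypothetical) query ID to its estimated
    remaining runtime in seconds.
    """
    completed = sorted(q for q, t in remaining_seconds.items() if t <= grace_seconds)
    aborted = sorted(q for q, t in remaining_seconds.items() if t > grace_seconds)
    return completed, aborted


if __name__ == "__main__":
    # Illustrative values only.
    running = {"q1": 45.0, "q2": 290.0, "q3": 900.0}
    done, lost = split_by_grace_period(running)
    print("expected to complete before shutdown:", done)
    print("would need to be resubmitted:", lost)
```

Queries in the second list are the ones that would still be aborted and would need to be submitted again manually, which is why long-running workloads may call for a longer wait period.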
You must also be aware of certain limitations related to the Hive upgrade process:
- Hive sticky sessions
  - When a client connection to Hive is established through ZooKeeper or Knox, the connection is fixed to one of the HS2 instances. This session is disconnected when that HS2 instance restarts during a Rolling Upgrade. Knox then redirects new connections to another available HS2 instance. As a result, the remaining queries in the earlier session are aborted and are not redirected to the other HS2 instance.
  - The queries can be submitted again after a new connection from Beeline is established with the second HS2 instance.
  - However, during a Rolling Upgrade, the other HS2 instance is also due to restart, and you will encounter the same issue. Therefore, it is recommended that you do not submit new queries until all the HS2 instances are upgraded.
- Hive ODBC connection lost during a Knox ZDU
  - During an upgrade of the Knox service, when the Knox service shuts down, the ODBC connection that is established with a Hive host breaks. The issue occurs because the ODBC driver does not accept the cookies that are passed by Knox.
  - This applies when a client connection to HS2 is established using an ODBC driver through the Knox load balancer and the Knox service stops during a Rolling Upgrade.
  - Workaround: Establish a new client connection.
Hue supports rolling restart and zero-downtime upgrades in High Availability (HA) clusters, where the Hue service downtime is significantly less when compared to non-ZDU upgrades. The total Hue service downtime depends on various factors such as the number of Hue service role instances, network latency, hardware and connectors used, and more.
This release supports rolling restart of the Impala service. However, zero-downtime upgrade is not yet supported because Impala is not an HA service and has singleton components such as the catalog and statestore. Rolling restart has nevertheless been enhanced to run faster by restarting half of the cluster together.
During the restart process, some operations can degrade depending on which service (coordinator, catalog, or executors) is currently being restarted. During a catalog restart, metadata operations such as creating new tables, adding new partitions, and invalidating metadata do not work until the restart process is complete. Starting the catalog server can delay some new queries because metadata must be loaded into the cache again.
During coordinator restarts, running queries are killed and new queries can only be submitted to alternative coordinators. Impala therefore enables tolerance of service failures but does not guarantee it. While the statestore restarts, you may notice some degradation of admission control and cluster membership updates.
If you have a read-only workload that you want to keep running across a rolling restart, tune the shutdown_grace_period_s and graceful_shutdown_deadline parameters:
- shutdown_grace_period_s specifies the minimum time the daemon waits before exiting. The default value of 120 seconds means that a rolling restart may take a long time if you have many daemons. A change to this parameter does not take effect until after the next restart.
- graceful_shutdown_deadline specifies the maximum time that Cloudera Manager waits for a daemon to exit, after the queries have finished. It should be large enough to allow queries to drain, and should also be larger than shutdown_grace_period_s.
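The relationship between the two Impala shutdown parameters can be checked before a rolling restart. The following is a minimal sketch under stated assumptions: the function `check_shutdown_settings` and the `longest_expected_query_s` input are hypothetical, and all three values are assumed to be expressed in seconds (convert first if your configuration uses another unit). It only encodes the two rules stated above and does not query Cloudera Manager.

```python
from typing import List


def check_shutdown_settings(
    shutdown_grace_period_s: int,
    graceful_shutdown_deadline_s: int,
    longest_expected_query_s: int,
) -> List[str]:
    """Return a list of warnings. An empty list means the values are
    consistent with the two rules described in the text."""
    warnings = []
    # Rule 1: the deadline should be larger than the grace period.
    if graceful_shutdown_deadline_s <= shutdown_grace_period_s:
        warnings.append(
            "graceful_shutdown_deadline should be larger than shutdown_grace_period_s"
        )
    # Rule 2: the deadline should leave enough time for queries to drain.
    # longest_expected_query_s is your own estimate (an assumption here).
    if graceful_shutdown_deadline_s < longest_expected_query_s:
        warnings.append(
            "graceful_shutdown_deadline may be too small for running queries to drain"
        )
    return warnings


if __name__ == "__main__":
    # 120 is the default grace period from the text; other values are illustrative.
    for message in check_shutdown_settings(120, 60, 300):
        print("WARNING:", message)
```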
You can use the retry_failed_queries option to automatically resubmit SELECT queries that fail because executor daemons restart during an upgrade. This option supports only SELECT queries; it does not support INSERT, DROP, TRUNCATE, or CREATE queries.
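Because automatic retry covers only SELECT queries, a client-side workload can be split into statements that may be retried and statements that need manual resubmission. The sketch below is an illustration only: `is_retry_eligible` is a hypothetical helper that inspects the first keyword of a statement, so it conservatively reports statements that begin with a comment or a WITH clause as not eligible. In impala-shell, query options of this kind are typically enabled with a SET statement such as `SET RETRY_FAILED_QUERIES=true;`.

```python
import re

# Matches statements whose first keyword is SELECT (case-insensitive).
_SELECT_PREFIX = re.compile(r"\s*select\b", re.IGNORECASE)


def is_retry_eligible(statement: str) -> bool:
    """Return True if the statement starts with SELECT.

    This is a simplification for illustration: it does not parse SQL,
    and anything that does not start with SELECT is treated as not
    eligible for automatic retry.
    """
    return _SELECT_PREFIX.match(statement) is not None


if __name__ == "__main__":
    workload = [
        "SELECT count(*) FROM sales",
        "INSERT INTO sales_copy SELECT * FROM sales",
        "DROP TABLE tmp_sales",
    ]
    for sql in workload:
        label = "auto-retry possible" if is_retry_eligible(sql) else "resubmit manually"
        print(f"{label}: {sql}")
```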
Cloudera recommends that you do not enable the following Kudu features until the upgrade is finalized:
- Data encryption
This is because the downgrade process is not supported if any of these features have been enabled.
YARN Queue Manager
Although Queue Manager does not support rolling upgrade or rolling restart, the amount of time that Queue Manager is down during a rolling upgrade is minimal. When YARN is not available during an upgrade, the Queue Manager UI functions in read-only mode. This means that the Queue Manager UI displays the last-known state of the system, and you cannot change queue settings, placement rules, scheduling rules, and so on. Additionally, you cannot make changes through the Queue Manager APIs; therefore, you cannot change the YARN Capacity Scheduler configuration.
- When you upgrade from a lower version of CDP Private Cloud Base to CDP Private Cloud Base 7.1.9, the quota-related information is repaired during the cluster upgrade. The time this takes depends on the number of keys present in the system. This is a one-time activity that corrects the quota and usage information for space and namespace.
- If you have the Ozone HttpFS role added to the Ozone service on your 7.1.8 cluster, you must stop and delete the Ozone HttpFS role from the Ozone service before upgrading the Cloudera Runtime cluster to 7.1.9. After you upgrade the Cloudera Runtime cluster to 7.1.9, you can add the HttpFS role back to the Ozone service.
OMID depends on the ZooKeeper service while in HA mode and therefore constantly checks the service status with ZooKeeper. This increases the network traffic and might impact the service performance.
The Schema Registry service supports rolling restarts and rolling upgrades if it is running in HA mode. This means that the individual service role instances (the Schema Registry server roles) are stopped and started in a rolling fashion during an upgrade. This allows the service to experience no downtime during an upgrade.
However, to guarantee zero downtime, all clients that connect to Schema Registry must be configured with proper retry and failover mechanisms. Otherwise, you might experience service outages on the client side. For example, a client might try to connect to a server instance that is temporarily down during the upgrade. If retry and failover are not configured, the client times out and fails, resulting in service disruptions.
- Upgrade from Cloudera Runtime 7.1.7 or lower
  - Schema Registry does not support ZDU for upgrading from Cloudera Runtime 7.1.7 or lower to a higher version.
- Upgrade from Cloudera Runtime 7.1.8 to 7.1.9 or higher
  - If you have services or components (for example, Kafka and Kafka Connect) that run within the CDP cluster and connect to Schema Registry, zero downtime cannot be ensured for this upgrade path. This is because cluster internal services use the Cloudera distributed Schema Registry Java client, and the client does not properly support retry and failover in versions 7.1.8 or lower. Additionally, the client version for cluster internal components cannot be upgraded separately before upgrading the cluster.
- Upgrade from Cloudera Runtime 7.1.9 to a higher version
  - Schema Registry provides full guarantees for zero downtime as long as the service is in HA mode and all external applications implement proper retry and failover behavior.
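The retry and failover behavior expected from Schema Registry clients can be sketched generically. The example below is not the Cloudera Schema Registry client and does not use its configuration properties; `call_with_failover`, the `request` callable, the example host names, and the backoff values are all hypothetical. It only shows the idea of trying each server instance in turn and retrying the whole list after a pause, so that an instance that is restarting does not cause the client to fail.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(
    servers: Sequence[str],
    request: Callable[[str], T],
    rounds: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each server in order; if all fail, wait and retry the list.

    `request` is any callable that takes a server URL and raises
    ConnectionError when that instance is unreachable.
    """
    if not servers:
        raise ValueError("at least one server is required")
    last_error: Optional[Exception] = None
    for attempt in range(rounds):
        for server in servers:
            try:
                return request(server)
            except ConnectionError as err:
                last_error = err  # instance is down; fail over to the next one
        if attempt < rounds - 1:
            # Exponential backoff between rounds: 1x, 2x, 4x, ...
            sleep(backoff_seconds * (2 ** attempt))
    raise ConnectionError(f"all servers failed after {rounds} rounds") from last_error


if __name__ == "__main__":
    # Hypothetical instances; the first one is pretended to be restarting.
    restarting = {"http://sr-1.example.com:7788"}

    def fake_request(server: str) -> str:
        if server in restarting:
            raise ConnectionError(f"{server} is unavailable")
        return f"response from {server}"

    print(call_with_failover(
        ["http://sr-1.example.com:7788", "http://sr-2.example.com:7788"],
        fake_request,
    ))
```

In a rolling upgrade only some role instances are down at any moment, so failing over to another instance usually succeeds immediately; the retry rounds cover the short window in which a request hits an instance just as it stops.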
Spark History Server
Spark and Spark 3 applications can continue to run during rolling upgrades, but the old parcels (CDH/CDS) must not be removed while such an application runs. The old parcels can be removed once all the Spark applications have restarted.
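One way to decide when the old parcels can be removed is to check whether any running application was started before the upgrade completed. The following sketch is illustrative only: `apps_blocking_parcel_removal` is a hypothetical helper, the application IDs and timestamps are made up, and it assumes that an application started after the upgrade completed no longer uses the old parcels. How you obtain the list of running applications and their start times is left open.

```python
from datetime import datetime
from typing import Dict, List


def apps_blocking_parcel_removal(
    running_app_start_times: Dict[str, datetime],
    upgrade_completed_at: datetime,
) -> List[str]:
    """Return IDs of running applications that started before the
    upgrade completed and may therefore still use the old parcels."""
    return sorted(
        app_id
        for app_id, started in running_app_start_times.items()
        if started < upgrade_completed_at
    )


if __name__ == "__main__":
    # Illustrative data only.
    upgrade_done = datetime(2024, 1, 10, 12, 0)
    running = {
        "app_0001": datetime(2024, 1, 9, 8, 30),   # started before the upgrade
        "app_0002": datetime(2024, 1, 10, 13, 0),  # started after the upgrade
    }
    blocking = apps_blocking_parcel_removal(running, upgrade_done)
    if blocking:
        print("keep old parcels; still running:", blocking)
    else:
        print("no pre-upgrade applications are running")
```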
Starting from CDP 7.1.9, Spark History Server supports HA. Do not enable HA until after the upgrade is finalized, to avoid issues during the downgrade.
Livy does not require data backups. Starting from CDP 7.1.9, Livy supports HA. Do not enable HA until after the upgrade is finalized, to avoid issues during the downgrade.