Service component limitations

You should be aware of certain limitations and considerations related to service components such as Atlas, Kudu, and YARN, to ensure that your cluster can successfully complete the Zero Downtime Upgrade (ZDU) process.

Atlas

Certain Atlas limitations related to the ZDU process are as follows:

  • During the Atlas rolling upgrade, some downtime is expected because Atlas does not support the Active-Active model; Active-Passive is the currently supported model, so failover takes time. While the Passive instance becomes Active, Atlas is unreachable and messages from clients are queued up in Kafka.
  • Solr does not support rolling upgrades, so Atlas REST requests fail while Solr is being upgraded.

Hive

Hive Metastore (HMS) and HiveServer (HS2) services support rolling restarts and rolling upgrades if High Availability (HA) is enabled for the services. If you are upgrading from CDP 7.1.7 SP2 or CDP 7.1.8 to CDP 7.1.9, you cannot achieve a ZDU for the HS2 service because running queries are aborted during HS2 service restarts. The aborted queries must be resubmitted manually because HS2 cannot resubmit them automatically.

HS2 has a graceful shutdown feature to overcome this ZDU limitation. Graceful shutdown ensures that HS2 waits for a specified time period (300 seconds by default) before shutting down, thereby allowing queries that are already running to complete before HS2 stops. This feature aids in achieving a ZDU; however, because the feature is available only from CDP 7.1.9 onwards, it can be used only for CDP upgrades from 7.1.9 to higher versions.

You must also be aware of certain limitations related to the Hive upgrade process:

Hive sticky sessions
When a client connection is established to Hive through ZooKeeper or Knox, a fixed connection is established with one of the HS2 instances. This session is disconnected when that HS2 instance restarts during a rolling upgrade, and Knox redirects fresh connections to another available HS2 instance. As a result, the pending queries in the earlier session are aborted; queries are not redirected to the other HS2 instance.
The queries can be submitted again after a fresh connection from Beeline is established with the second HS2 instance (see the JDBC sketch at the end of this section).
However, during a rolling upgrade, the other HS2 instance is also bound to restart and you will encounter the same issue. Therefore, it is recommended that you do not submit new queries until all the HS2 instances are upgraded.
Hive ODBC connection lost during a Knox ZDU
If a client connection to HS2 is established using an ODBC driver through the Knox load balancer, the Hive ODBC connection is lost when the Knox service stops during a rolling upgrade. The issue occurs because the ODBC driver does not accept the cookies that are passed by Knox.

To recover, establish a new client connection.
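
The following minimal Java sketch illustrates the recovery pattern described above for JDBC clients: when the HS2 instance behind a session is restarted, the application opens a fresh connection (here through ZooKeeper service discovery) and resubmits its query. The ZooKeeper quorum, credentials, retry counts, and query are placeholders; adjust them to your environment and assume the Hive JDBC driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveReconnectExample {

    // Placeholder ZooKeeper quorum; service discovery lets the driver pick
    // an HS2 instance that is currently registered (that is, not restarting).
    private static final String JDBC_URL =
            "jdbc:hive2://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/"
            + "default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2";

    public static void main(String[] args) throws Exception {
        int attempts = 0;
        while (true) {
            try (Connection conn = DriverManager.getConnection(JDBC_URL, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT current_database()")) {
                rs.next();
                System.out.println("Connected, current database: " + rs.getString(1));
                break;
            } catch (SQLException e) {
                // The HS2 instance backing this session may have been restarted
                // during the rolling upgrade; open a fresh connection and retry.
                if (++attempts >= 5) {
                    throw e;
                }
                System.err.println("Connection lost, retrying: " + e.getMessage());
                Thread.sleep(10_000L);
            }
        }
    }
}
```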

Hue

Hue supports rolling restart and zero-downtime upgrades in High Availability (HA) clusters, where the Hue service downtime is significantly less when compared to non-ZDU upgrades. The total Hue service downtime depends on various factors such as the number of Hue service role instances, network latency, hardware and connectors used, and more.

Impala

Impala supports rolling restarts and zero-downtime upgrades (ZDU) in High Availability clusters, significantly reducing Impala service downtime compared to non-ZDU upgrades.

During the restart process, some operations can degrade depending on which service (coordinator, catalog, or executors) is currently being restarted. While the catalog server restarts, metadata operations such as creating new tables, adding new partitions, and invalidating metadata do not work until the restart is complete. Starting the catalog server can also delay some new queries because metadata must be loaded into the cache again. During ZDU, there might be a brief window of downtime during Catalog or StateStore failover. For details, see Failure detection for Catalog and StateStore.

During coordinator restarts, running queries can be killed and new queries can be submitted only to the remaining coordinators. Impala therefore enables service failure tolerance but does not guarantee it. To minimize potential failures during ZDU, ensure that the coordinators are managed by a load balancer. While the StateStore restarts, you might notice some degradation of admission control and cluster membership updates.

If you have a read-only workload that you want to keep running across a rolling restart, tune the shutdown_grace_period_s and graceful_shutdown_deadline parameters. The shutdown_grace_period_s parameter specifies the minimum time a daemon waits before exiting; its default value of 120 seconds means that a rolling restart can take a long time if you have many daemons. Changes to shutdown_grace_period_s do not take effect until the next restart. The graceful_shutdown_deadline parameter specifies the maximum time that Cloudera Manager waits for a daemon to exit after queries have finished. It should be large enough to allow queries to drain, and it must be larger than shutdown_grace_period_s.

During the graceful shutdown period, any new requests receive a SERVER_SHUTTING_DOWN error. If the load balancer cannot handle this error during the ZDU, the request fails, and the load balancer may not be able to detect that the coordinator is unavailable.

You can use the retry_failed_queries query option to automatically resubmit SELECT queries that fail because executor daemons restart during an upgrade. This option applies only to SELECT queries; INSERT, DROP, TRUNCATE, and CREATE queries are not retried.
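
As a rough illustration, the following Java JDBC sketch enables the option for a session before running a SELECT. The JDBC URL, driver, and table name are placeholders; whether you set the option per session (as here), in the connection string, or through coordinator-level default query options depends on your environment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaRetryExample {

    // Placeholder URL: point it at your load-balanced Impala coordinators.
    // The exact URL format depends on the JDBC driver you use.
    private static final String JDBC_URL =
            "jdbc:impala://impala-lb.example.com:21050/default";

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(JDBC_URL);
             Statement stmt = conn.createStatement()) {

            // Ask Impala to transparently retry SELECT queries that fail
            // because an executor daemon is restarted mid-query.
            stmt.execute("SET RETRY_FAILED_QUERIES=true");

            // Only SELECT statements benefit from this option; INSERT, DROP,
            // TRUNCATE, and CREATE statements are not retried.
            try (ResultSet rs = stmt.executeQuery("SELECT count(*) FROM web_logs")) {
                rs.next();
                System.out.println("Row count: " + rs.getLong(1));
            }
        }
    }
}
```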

Knox

During ZDU, some client requests might fail with HTTP 503 responses. Knox starts listening on port 8443 before it has finished activating all topologies, and during that time the endpoints associated with the inactive topologies are temporarily unavailable. Any component proxied by Knox is affected (for example, Schema Registry).

As a workaround, clients should respond to HTTP 503 responses with retries.
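
The following Java sketch shows one form such a retry could take, using the JDK's built-in HTTP client. The Knox gateway URL is a placeholder and TLS trust configuration is omitted; the point is only that a 503 response is backed off and retried rather than treated as a hard failure.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class KnoxRetryExample {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Placeholder Knox endpoint for a service proxied by Knox.
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://knox.example.com:8443/gateway/cdp-proxy-api/some-service/api/resource"))
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();

        int maxRetries = 5;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 503) {
                // Knox is listening, but the topology behind this endpoint is
                // not active yet; back off and retry instead of failing.
                System.err.println("Got 503, retrying (attempt " + attempt + ")");
                Thread.sleep(5_000L);
                continue;
            }
            System.out.println("Status: " + response.statusCode());
            System.out.println(response.body());
            return;
        }
        throw new IllegalStateException("Endpoint still unavailable after retries");
    }
}
```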

Kudu

Cloudera recommends that you do not enable the following Kudu features until the upgrade is finalized:

  • Auto-increment
  • Data encryption

This is because the downgrade process is not supported if any of these features have been enabled.

YARN Queue Manager

Although Queue Manager does not support rolling upgrade or rolling restart, the amount of time that Queue Manager is down during a rolling upgrade is minimal. When YARN is not available during an upgrade, the Queue Manager UI functions in read-only mode: it displays the last-known state of the system, and you cannot change queue settings, placement rules, scheduling rules, and so on. Additionally, you cannot make changes through the Queue Manager APIs; therefore, you cannot change the YARN Capacity Scheduler configuration.

Ozone

  • When you upgrade from a lower version of CDP Private Cloud Base to CDP Private Cloud Base 7.1.9, quota-related information is repaired during the cluster upgrade. The time this takes depends on the number of keys present in the system. This is a one-time activity that corrects the quota and usage information for space and namespace usage.
  • If you have the Ozone HttpFS role added to the Ozone service on your 7.1.8 cluster, you must stop and delete the Ozone HttpFS role from the Ozone service before upgrading the Cloudera Runtime cluster to 7.1.9. After you upgrade the Cloudera Runtime cluster to 7.1.9, you can add the HttpFS role back to the Ozone service.

OMID

OMID depends on the ZooKeeper service while in HA mode and therefore constantly checks the service status with ZooKeeper. This increases network traffic and might impact service performance.

Schema Registry

The Schema Registry service supports rolling restarts and rolling upgrades if it is running in HA mode. This means that the individual service role instances (the Schema Registry server roles) are stopped and started in a rolling fashion during an upgrade. This allows the service to experience no downtime during an upgrade.

However, to guarantee zero downtime, all clients that connect to Schema Registry must be configured with proper retry and failover mechanisms. Otherwise, you might experience service outages on the client side. For example, clients might try to connect to a server instance that is temporarily down during the upgrade; if retry and failover are not configured, the client times out and fails, resulting in service disruptions.
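
For external applications that cannot rely on the built-in retry and failover of the 7.1.9 or higher Java client, a generic wrapper such as the following hypothetical Java sketch can be placed around Schema Registry calls. The wrapped call and the retry parameters are placeholders, not the actual client API.

```java
import java.util.concurrent.Callable;

public final class RetryingCall {

    // Generic retry helper that an application can wrap around its Schema
    // Registry requests so that a request hitting a server instance that is
    // temporarily down during the rolling upgrade is retried instead of
    // surfacing as an outage.
    public static <T> T withRetries(Callable<T> call, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                System.err.printf("Attempt %d/%d failed: %s%n", attempt, maxAttempts, e.getMessage());
                Thread.sleep(backoffMillis);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder: replace the body with your Schema Registry client call,
        // for example fetching the latest version of a schema.
        String result = withRetries(() -> "schema-text", 5, 3_000L);
        System.out.println(result);
    }
}
```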

The Cloudera distributed Schema Registry Java client supports retry and failover only in CDP versions 7.1.9 or higher. As a result, the zero downtime guarantees that Schema Registry provides during upgrades differ depending on your upgrade path as well as your deployment and use case.
Upgrade from Cloudera Runtime 7.1.7 or lower
Schema Registry does not support ZDU for upgrading from Cloudera Runtime 7.1.7 or lower to a higher version.
Upgrade from Cloudera Runtime 7.1.8 to 7.1.9 or higher
If you have services or components (for example, Kafka and Kafka Connect) that run within the CDP cluster and connect to Schema Registry, zero downtime cannot be ensured for this upgrade path. This is because cluster internal services use the Cloudera distributed Schema Registry Java client and the client does not properly support retry and failover in versions 7.1.8 or lower. Additionally, the client version for cluster internal components cannot be upgraded separately before upgrading the cluster.
Upgrade from Cloudera Runtime 7.1.9 to a higher version
Schema Registry provides full guarantees for zero downtime as long as the service is in HA mode and all external applications implement proper retry and failover behavior.

Solr

Upgrade from Cloudera Runtime 7.1.8 or lower
Solr does not support ZDU for upgrading from Cloudera Runtime 7.1.8 or lower to a higher version.
Upgrade from Cloudera Runtime 7.1.9 or higher
Solr supports rolling restart when upgrading from Cloudera Runtime 7.1.9 or higher, provided it is running in HA mode. This means that the individual service nodes are stopped and started in a rolling fashion during an upgrade.

Spark History Server

Spark and Spark 3 applications can continue to run during rolling upgrades, but the old parcels (CDH/CDS) must not be removed while such an application runs. The old parcels can be removed once all the Spark applications have restarted.

Starting from CDP 7.1.9, Spark History Server supports HA. Do not enable HA until after the upgrade is finalized, to avoid issues during a downgrade.

Livy

Livy does not require data backups. Starting from CDP 7.1.9, Livy supports HA. Do not enable HA until after the upgrade is finalized, to avoid issues during the downgrade.