ZDU known issues
If the ZDU process fails, the upgrade is paused rather than aborted. You must review and fix the cause of the failure, and then resume the upgrade.
Cloudera Manager
- When upgrading from CDP Private Cloud Base 7.1.7 SP2 to CDP Private Cloud Base 7.1.9 without agent TLS encryption, the Distribute Parcels step does not work with Python 3.8.
- If Cloudera Manager is running with TLS on port 7183, you must switch agent communication to TLS by setting use_tls=1 in the agent config.ini file and setting Use TLS Encryption for Agents to true in Cloudera Manager.
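The agent-side change above can be sketched as follows. This is an illustrative excerpt only; the section name and file path reflect the usual Cloudera Manager Agent layout and should be confirmed against your deployment:

```ini
; /etc/cloudera-scm-agent/config.ini (illustrative excerpt)
[Security]
; Enable TLS for agent-to-server communication
use_tls=1
```

Restart the agent after editing the file so the setting takes effect, and make the corresponding Use TLS Encryption for Agents change in the Cloudera Manager Admin Console.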
- OPSAPS-67953: Downgrading Cloudera Runtime 7.1.9 version to 7.1.8 CHF 10 fails with errors when a previous unfinished upgrade command is found.
- Error Message:
A previous unfinished upgrade command was found. To continue upgrading: perform a 'Retry' on the original command; to return to an earlier consistent state: restore a Cloudera Manager backup.
- OPSAPS-68279: When upgrading CDP 7.1.7 SP2 to CDP 7.1.9, the command step DeployClientConfig may fail due to the following error:
- Error Message:
Client configuration generation requires the following additional parcels to be activated:[cdh]
- OPSAPS-67929: When upgrading from CDP 7.1.7 SP2 to CDP 7.1.9, if the upgrade fails partway through the process, the Resume option is not available.
- You must contact Cloudera Support.
HBase
- CDPD-58755: In a small cluster (for example, five RegionServers or DataNodes or fewer), HBase tests fail intermittently because there are not enough DataNodes to fail over to.
- Error Message:
INFO:cm_server.py:2943:Final list of failed state checks: Health check: HBASE_REGION_SERVERS_HEALTHY failed on HBASE-1 with status BAD, HBASE-1-REGIONSERVER-f3a7628ea499295abb44816cddf04854 (3c605548-280d-4e59-8425-e623ee070a54) has undesired health: BAD from failed check(s): REGION_SERVER_MASTER_CONNECTIVITY: BAD, REGION_SERVER_SCM_HEALTH: BAD
- CDPD-606480: During HBase rolling upgrade, HBase commands fail.
- Error Message:
ServerNotRunningYetException: Server is not running yet
HBase retries DDL operations submitted while the Master is initializing until the Master is fully initialized and can serve the request. However, the default number of retries or the retry interval might be insufficient for a client operation to complete. This can happen when the Master crashes and the standby Master performs initialization after taking over, or when the Master is restarted for any reason, including upgrades that take a long time.
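If client operations time out while the Master initializes, increasing the client-side retry budget can help. A minimal sketch using the standard HBase client properties hbase.client.retries.number and hbase.client.pause; the values shown are illustrative, not recommendations:

```xml
<!-- hbase-site.xml on the client (illustrative values) -->
<property>
  <name>hbase.client.retries.number</name>
  <!-- Raise the retry count so the client outlasts a slow Master initialization -->
  <value>35</value>
</property>
<property>
  <name>hbase.client.pause</name>
  <!-- Base pause between retries, in milliseconds -->
  <value>300</value>
</property>
```

The effective wait is roughly the pause multiplied by a backoff schedule across the retry count, so tune both values together against how long your Master typically takes to initialize.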
Knox
- CDPD-60146: During ZDU, an API call fails with a 404 error
- Error Message:
logInfo()|perform knox api call https://example.root.hwx.site:5003/gateway/cdp-proxy-api/cm-api/v40/clusters at time: 2023/08/10 04:00:00,000000 with response as { "message":"Not Found", "url":"/gateway/cdp-proxy-api/cm-api/v40/clusters", "status":"404" }
Ozone
- OPSAPS-67340: Service Monitor is in a bad health state after restarting the Cloudera Manager server, reporting problems with descriptor and metric schema age, when Kerberos and Cloudera Manager SPNEGO authentication are enabled.
- Error message:
Health check: MGMT_SERVICE_MONITOR_HEALTH failed on mgmt with status BAD, mgmt-SERVICEMONITOR-02d43fad954612b08a0ea1a5df8f42aa (a4efb7ff-4a4d-4291-9dfe-88a394bf9c23) has undesired health: BAD from failed check(s): SERVICE_MONITOR_METRIC_SCHEMA_FETCH: BAD, SERVICE_MONITOR_SCM_DESCRIPTOR_FETCH: BAD
- CDPQE-25023: When upgrading from CDP 7.1.8 or 7.1.7 SP2 to CDP 7.1.9, HBase post-upgrade validation fails for HBASE_REGION_SERVERS_HEALTHY.
- Error message:
INFO:cm_server.py:2943:Final list of failed state checks: Health check: HBASE_REGION_SERVERS_HEALTHY failed on HBASE-1 with status BAD, HBASE-1-REGIONSERVER-f9673e2d17e546ddec44e7724865d4f5 (043f7f4a-c46d-42af-8d82-061cba1c3f7c) has undesired health: BAD from failed check(s): REGION_SERVER_MASTER_CONNECTIVITY: BAD, REGION_SERVER_SCM_HEALTH: BAD, Health check: IMPALA_IMPALADS_HEALTHY failed on IMPALA-1 with status BAD, IMPALA-1-IMPALAD-f9673e2d17e546ddec44e7724865d4f5 (043f7f4a-c46d-42af-8d82-061cba1c3f7c) has undesired health: BAD from failed check(s): IMPALAD_QUERY_MONITORING_STATUS: BAD, Health check: SCHEMAREGISTRY_SCHEMA_REGISTRY_SERVER_HEALTHY failed on SCHEMAREGISTRY-1 with status BAD, SCHEMAREGISTRY-SCHEMA_REGISTRY_SERVER-2 (e0e64fc1-21ee-457f-ba12-0b94fbe9b6ba) has undesired health: BAD from failed check(s): SCHEMAREGISTRY_SCHEMA_REGISTRY_SERVER_SCM_HEALTH: BAD, SCHEMAREGISTRY-SCHEMA_REGISTRY_SERVER-1 (043f7f4a-c46d-42af-8d82-061cba1c3f7c) has undesired health: BAD from failed check(s): SCHEMAREGISTRY_SCHEMA_REGISTRY_SERVER_SCM_HEALTH: BAD
- CDPD-56498: The Ozone service is in an unhealthy state and the basic canary test fails after Cloudera Manager is upgraded from 7.7.1 to 7.11.3 or lower.
- Error message:
The health test result for OZONE_BASIC_HEALTH_CHECK has become bad: PERMISSION_DENIED USER <***username***> doesn't have READ permission to access volume Volume:s3v
Ranger
- CDPD-58860: As part of OPSAPS-67480 in CDP 7.1.9, a default Ranger policy is added for the cdp-proxy-token topology, so that after a new installation of CDP 7.1.9, the knox-ranger policy includes cdp-proxy-token. However, upgrades do not automatically add cdp-proxy-token to the cm_knox policies.
- Error message:
None
YARN/MapReduce
- CDPD-59179: During a rolling upgrade from CDP 7.1.7 SP2 to CDP 7.1.9, the MapReduce Sleep job's reduce task might fail.
- After upgrading from CDP 7.1.7 SP2 to 7.1.9, the map task completes as "mapreduce.Job: map 100% reduce 0%", but when the reduce task runs, an error occurs. This happens only if the Resource Manager is delayed in starting.
Data Definition Language (DDL) - Impala, Hive (using HiveQL), Spark (using SparkSQL), HBase, Phoenix, Flink, and Kafka
- CDPD-60862: Rolling restart fails during ZDU when DDL operations are in progress
During a Zero Downtime Upgrade (ZDU), the rolling restart of services that support Data Definition Language (DDL) statements might fail if DDL operations are in progress during the upgrade. Therefore, ensure that you do not run DDL statements during ZDU.
The following services support DDL statements:
- Impala
- Hive – using HiveQL
- Spark – using SparkSQL
- HBase
- Phoenix
- Kafka
Data Manipulation Language (DML) statements are not impacted and can be used during ZDU. After the upgrade completes successfully, you can resume running DDL statements.