ZDU known issues

If the ZDU process fails, the upgrade is not aborted but paused. You must review the issue, fix the cause of the failure, and then resume the upgrade.

Cloudera Manager

When upgrading from CDP Private Cloud Base 7.1.7 SP2 to CDP Private Cloud Base 7.1.9 without agent TLS encryption, the Distribute Parcels step does not work with Python 3.8.
If Cloudera Manager is running with TLS on port 7183, you must switch agent communication to TLS by setting use_tls=1 in the agent's config.ini file and setting Use TLS Encryption for Agents to true in Cloudera Manager.
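On the agent side, the setting belongs in the [Security] section of config.ini; the path below assumes a standard package installation:

```ini
# /etc/cloudera-scm-agent/config.ini (default location on a package install)
[Security]
# Enable TLS for agent-to-Cloudera Manager communication
use_tls=1
```

Restart the agent after editing the file so the change takes effect.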
OPSAPS-67953: Downgrading Cloudera Runtime 7.1.9 version to 7.1.8 CHF 10 fails with errors when a previous unfinished upgrade command is found.
Error Message: A previous unfinished upgrade command was found. To continue upgrading: perform a 'Retry' on the original command; to return to an earlier consistent state: restore a Cloudera Manager backup.

Perform the following steps to roll back to the previous version:

  1. Delete records from the UPGRADE_STATE table.
  2. Restart Cloudera Manager.
  3. Follow the rollback / downgrade steps.
OPSAPS-68279: When upgrading CDP 7.1.7 SP2 to CDP 7.1.9, the command step DeployClientConfig may fail due to the following error:
Error Message: Client configuration generation requires the following additional parcels to be activated: [cdh]
Verify that the error is caused by the failure to activate the 7.1.9 parcels:
  1. Navigate to the parcels page.
  2. See if the following error is displayed: Error when distributing to <hostname>: Src file /opt/cloudera/parcels/.flood/CDH-7.1.9-1.cdh7.1.9.p0.43968053-el7.parcel/CDH-7.1.9-1.cdh7.1.9.p0.43968053-el7.parcel does not exist.
  3. Identify the host from the error message and ssh into the host by running the ssh <hostname> command.
  4. Navigate to the agent directory by running the cd /var/log/cloudera-scm-agent command.
  5. Find the following pattern in the agent log files: Exception: Untar failed with return code: 2, with tar output: stdout: [b''], stderr: [b'\ngzip: stdin: invalid compressed data--format violated\ntar: Unexpected EOF in archive\ntar: Unexpected EOF in archive\ntar: Error is not recoverable: exiting now\n'].
  1. If the above exception appears, restart the agent on that host by running the systemctl restart cloudera-scm-agent command. After the agent restarts, the parcel distribution should succeed.
  2. After the parcel distribution is successful, click Resume to continue with the upgrade.
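When many hosts are involved, the log check in steps 4 and 5 can be scripted. This is a minimal sketch: it only detects the untar failure signature in log text you pass in (in practice you would read the files under /var/log/cloudera-scm-agent), and the restart itself is still done with systemctl as described above.

```python
import re

# Signature of the parcel-corruption failure quoted in the agent log above.
UNTAR_FAILURE = re.compile(r"Exception: Untar failed with return code: \d+")

def needs_agent_restart(log_text: str) -> bool:
    """Return True if the untar/unexpected-EOF parcel failure is present."""
    return bool(UNTAR_FAILURE.search(log_text))

sample = (
    "Exception: Untar failed with return code: 2, with tar output: "
    "stdout: [b''], stderr: [b'gzip: stdin: invalid compressed data']"
)
print(needs_agent_restart(sample))                          # → True
print(needs_agent_restart("INFO all parcels distributed"))  # → False
```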
OPSAPS-67929: While upgrading from CDP 7.1.7 SP2 to CDP 7.1.9, if the upgrade fails in the middle of the process, the Resume option is not available.
You must reach out to Cloudera Support.

HBase

CDPD-58755: In a small cluster (for example, five RegionServers or DataNodes, or fewer), HBase tests fail intermittently because there are not enough DataNodes to fail over to.
Error Message: INFO:cm_server.py:2943:Final list of failed state checks: Health check: HBASE_REGION_SERVERS_HEALTHY failed on HBASE-1 with status BAD, HBASE-1-REGIONSERVER-f3a7628ea499295abb44816cddf04854 (3c605548-280d-4e59-8425-e623ee070a54) has undesired health: BAD from failed check(s): REGION_SERVER_MASTER_CONNECTIVITY: BAD, REGION_SERVER_SCM_HEALTH: BAD

Add the following configurations using Cloudera Manager to alleviate this issue.

  1. Go to Cloudera Manager > HDFS > Configuration > HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.
  2. Add the following configurations with the suggested values.
    dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
    dfs.client.block.write.replace-datanode-on-failure.best-effort=true 
    dfs.client.block.write.replace-datanode-on-failure.enable=true
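If you enter the snippet in XML form instead of the editor's name/value fields, the equivalent hdfs-site.xml fragment for the three settings above would be:

```xml
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>ALWAYS</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
```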
CDPD-606480: During HBase rolling upgrade, HBase commands fail.
Error Message: ServerNotRunningYetException: Server is not running yet

HBase retries DDL operations submitted while the master is initializing until the master is fully initialized to serve the request. However, the default number of retries or the retry interval might prove insufficient for a client-submitted operation to complete. Such a scenario can occur when the active master crashes and the standby master performs initialization after taking over, or when the master is restarted for any reason, including upgrades that take a long time.

The initialization time of the master can differ depending on the size of the cluster; however, the following configuration adjustments in your client application allow the client to tolerate a master initialization period of up to approximately 10 minutes.

Add the following configuration to the client application.

<property>
    <name>hbase.client.pause</name>
    <value>300</value>
</property>
<property>
    <name>hbase.client.retries.number</name>
    <value>20</value>
</property>

However, if you observe a longer or shorter master initialization period, you can adjust these values accordingly. These retry settings apply to all types of calls to the HBase service, including GET, SCAN, MUTATE, and DDL operations.
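To see why a 300 ms pause with 20 retries covers roughly a 10-minute window: the HBase client multiplies hbase.client.pause by an internal backoff table (HConstants.RETRY_BACKOFF, {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200}), and attempts beyond the end of the table reuse the last multiplier. A quick calculation:

```python
# HBase client backoff multipliers (HConstants.RETRY_BACKOFF); retries past
# the end of the table keep using the last multiplier (200).
RETRY_BACKOFF = [1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200]

def total_retry_window(pause_ms: int, retries: int) -> float:
    """Approximate total client wait in seconds across all retries."""
    multipliers = [RETRY_BACKOFF[min(i, len(RETRY_BACKOFF) - 1)]
                   for i in range(retries)]
    return sum(multipliers) * pause_ms / 1000.0

seconds = total_retry_window(pause_ms=300, retries=20)
print(round(seconds / 60, 1))  # ~11.4 minutes with the recommended values
```

This ignores the time each attempt itself takes, so the real window is slightly longer; scale hbase.client.retries.number up or down to match the initialization time you observe.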

Knox

CDPD-60146: During ZDU, API call fails with 404
Error Message: logInfo()|perform knox api call https://example.root.hwx.site:5003/gateway/cdp-proxy-api/cm-api/v40/clusters at time: 2023/08/10 04:00:00,000000 with response as { "message":"Not Found", "url":"/gateway/cdp-proxy-api/cm-api/v40/clusters", "status":"404" }
During a rolling upgrade or rolling restart, Knox instances cannot handle requests while in the stopping/starting transition, so the load balancer may direct requests to a "down" Knox instance. To work around this:

Retry the request.
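A generic client-side retry wrapper is usually enough. In this sketch the request function, delay, and attempt count are illustrative assumptions, not Knox requirements; the fake call stands in for an HTTP GET to the gateway URL:

```python
import time

def retry(call, attempts=5, delay_s=2.0):
    """Invoke call(); on failure, wait and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_s)

# Simulated Knox call that fails twice (as if the instance is restarting),
# then succeeds once the gateway is back up.
state = {"calls": 0}
def fake_knox_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("404 Not Found: gateway still starting")
    return {"status": "200"}

print(retry(fake_knox_call, attempts=5, delay_s=0.01))  # → {'status': '200'}
```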

Ozone

OPSAPS-67340: Service Monitor is in a bad health state after restarting the Cloudera Manager server, reporting problems with descriptor and metric schema age, when Kerberos and Cloudera Manager SPNEGO authentication are enabled.
Error message: Health check: MGMT_SERVICE_MONITOR_HEALTH failed on mgmt with status BAD, mgmt-SERVICEMONITOR-02d43fad954612b08a0ea1a5df8f42aa (a4efb7ff-4a4d-4291-9dfe-88a394bf9c23) has undesired health: BAD from failed check(s): SERVICE_MONITOR_METRIC_SCHEMA_FETCH: BAD, SERVICE_MONITOR_SCM_DESCRIPTOR_FETCH: BAD

Restart Service Monitor each time the Cloudera Manager server is restarted:

  1. Log in to Cloudera Manager.
  2. Stop Service Monitor.
  3. Restart Cloudera Manager server.
  4. Start Service Monitor.
CDPQE-25023: Upgrading from CDP 7.1.8 or 7.1.7 SP2 to CDP 7.1.9, HBase post-upgrade validation failed for HBASE_REGION_SERVERS_HEALTHY.
Error message: INFO:cm_server.py:2943:Final list of failed state checks: Health check: HBASE_REGION_SERVERS_HEALTHY failed on HBASE-1 with status BAD, HBASE-1-REGIONSERVER-f9673e2d17e546ddec44e7724865d4f5 (043f7f4a-c46d-42af-8d82-061cba1c3f7c) has undesired health: BAD from failed check(s): REGION_SERVER_MASTER_CONNECTIVITY: BAD, REGION_SERVER_SCM_HEALTH: BAD, Health check: IMPALA_IMPALADS_HEALTHY failed on IMPALA-1 with status BAD, IMPALA-1-IMPALAD-f9673e2d17e546ddec44e7724865d4f5 (043f7f4a-c46d-42af-8d82-061cba1c3f7c) has undesired health: BAD from failed check(s): IMPALAD_QUERY_MONITORING_STATUS: BAD, Health check: SCHEMAREGISTRY_SCHEMA_REGISTRY_SERVER_HEALTHY failed on SCHEMAREGISTRY-1 with status BAD, SCHEMAREGISTRY-SCHEMA_REGISTRY_SERVER-2 (e0e64fc1-21ee-457f-ba12-0b94fbe9b6ba) has undesired health: BAD from failed check(s): SCHEMAREGISTRY_SCHEMA_REGISTRY_SERVER_SCM_HEALTH: BAD, SCHEMAREGISTRY-SCHEMA_REGISTRY_SERVER-1 (043f7f4a-c46d-42af-8d82-061cba1c3f7c) has undesired health: BAD from failed check(s): SCHEMAREGISTRY_SCHEMA_REGISTRY_SERVER_SCM_HEALTH: BAD
  1. Log in to Cloudera Manager.
  2. Navigate to Clusters.
  3. Click HDFS.
  4. Go to the Configurations tab.
  5. In the Search tab, search for the HDFS Client Advanced Configuration Snippet (Safety Valve) parameter for hdfs-site.xml.
  6. Add the following:
    dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
    dfs.client.block.write.replace-datanode-on-failure.best-effort=true
    dfs.client.block.write.replace-datanode-on-failure.enable=true
CDPD-56498: The Ozone service is in an unhealthy state and the basic canary test fails after Cloudera Manager upgrades from 7.7.1 to 7.11.3 or lower.
Error message: The health test result for OZONE_BASIC_HEALTH_CHECK has become bad: PERMISSION_DENIED USER <***username***> doesn't have READ permission to access volume Volume:s3v
You must add the following Ranger policies manually after you upgrade Cloudera Manager.
  1. "name": "S3_VOLUME_POLICY_FOR_OZONE_CANARY", "resource.volume": "s3v", "users": "hue", "accessTypes": "create,write,read,list"
  2. "name": "S3_BUCKET_POLICY_FOR_OZONE_CANARY", "resource.volume": "s3v", "resource.bucket": "cloudera-health-monitoring-ozone-basic-canary-bucket", "users": "hue", "accessTypes": "create,write,read,list"
  3. "name": "S3_KEY_POLICY_FOR_OZONE_CANARY", "resource.volume": "s3v", "resource.bucket": "cloudera-health-monitoring-ozone-basic-canary-bucket", "resource.key": "cloudera-health-monitoring-ozone-basic-canary-key", "users": "hue", "accessTypes": "create,write,read,list,delete"

The Ozone canary is enabled in the Ozone CSD supporting CDP 7.1.7, 7.1.8, and 7.1.9. Therefore, after upgrading from an older Cloudera Manager version to Cloudera Manager 7.11.3, you must add the above policies to the CDP Private Cloud Base cluster on the upgraded Cloudera Manager. If you have a custom principal name or system user for Hue, add the policies for both the Hue system user and the Hue principal name. This works seamlessly with or without an ATL rule. The default value for users is "users": "hue". The SMON principal must be the same as the Hue principal.
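If you script the policy creation (for example, against Ranger's public REST API), the three policies can be generated from one template. The field names below mirror the policy definitions listed above; the exact JSON shape your Ranger endpoint expects may differ, so treat this as a sketch of the payloads, not the final request bodies:

```python
import json

# Fields shared by all three canary policies (from the list above).
BASE = {"resource.volume": "s3v", "users": "hue"}
CANARY_BUCKET = "cloudera-health-monitoring-ozone-basic-canary-bucket"

policies = [
    {**BASE, "name": "S3_VOLUME_POLICY_FOR_OZONE_CANARY",
     "accessTypes": "create,write,read,list"},
    {**BASE, "name": "S3_BUCKET_POLICY_FOR_OZONE_CANARY",
     "resource.bucket": CANARY_BUCKET,
     "accessTypes": "create,write,read,list"},
    {**BASE, "name": "S3_KEY_POLICY_FOR_OZONE_CANARY",
     "resource.bucket": CANARY_BUCKET,
     "resource.key": "cloudera-health-monitoring-ozone-basic-canary-key",
     "accessTypes": "create,write,read,list,delete"},
]

for p in policies:
    print(json.dumps(p, indent=2))  # payload to submit to your Ranger endpoint
```

Note that only the key-level policy grants delete, matching the definitions above.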

Ranger

CDPD-58860: As part of OPSAPS-67480 in CDP 7.1.9, a default Ranger policy is added for the cdp-proxy-token topology, so that after a new installation of CDP 7.1.9, the knox-ranger policy includes cdp-proxy-token. However, upgrades do not add cdp-proxy-token to cm_knox policies automatically.
Error message: None

Manually add cdp-proxy-token to the knox policy, using Ranger Admin Web UI.

  1. Log in to Cloudera Manager > Ranger > Ranger Admin Web UI, as a Ranger administrator.
  2. On Ranger Admin Web UI > Service Manager > Resource > Knox, click cm_knox.
  3. In Knox Policies, open the CDP Proxy UI, API and Token policy.
  4. In Knox Topology*, add cdp-proxy-token.
  5. Click Save.
  6. Restart Ranger.

YARN/MapReduce

CDPD-59179: During a rolling upgrade from CDP 7.1.7 SP2 to CDP 7.1.9, the MapReduce Sleep job's reduce task could fail.
After upgrading from CDP 7.1.7 SP2 to 7.1.9, the map task completes ("mapreduce.Job: map 100% reduce 0%"), but the reduce task fails with an error, and only if the Resource Manager is delayed in starting.
None.

Data Definition Language (DDL) - Impala, Hive (using HiveQL), Spark (using SparkSQL), HBase, Phoenix, Flink, and Kafka

CDPD-60862: Rolling restart fails during ZDU when DDL operations are in progress

During a Zero Downtime Upgrade (ZDU), the rolling restart of services that support Data Definition Language (DDL) statements might fail if DDL operations are in progress during the upgrade. As a result, ensure that you do not run DDL statements during ZDU.

The following services support DDL statements:
  • Impala
  • Hive – using HiveQL
  • Spark – using SparkSQL
  • HBase
  • Phoenix
  • Flink
  • Kafka

Data Manipulation Language (DML) statements are not impacted and can be used during ZDU. Following the successful upgrade, you can resume running DDL statements.

Cloudera recommends modifying applications to not use DDL statements for the duration of the upgrade. If the upgrade is already in progress and you have experienced a service failure, you can remove the in-flight DDLs and resume the upgrade from the point of failure.