ZDU known issues

Cloudera Manager

When you upgrade from CDP Private Cloud Base 7.1.7 SP2 to CDP Private Cloud Base 7.1.9 without agent TLS encryption, the Distribute Parcel step does not work with Python 3.8.

If Cloudera Manager is running with TLS on port 7183, you must switch agent communication to TLS by setting use_tls=1 in the agent config.ini file and setting Use TLS Encryption for Agents to true in Cloudera Manager.
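
As a minimal sketch of the agent-side change, the following Python helper rewrites the use_tls setting in the text of an agent config.ini. It assumes that the setting belongs to a [Security] section and that the file is typically found at /etc/cloudera-scm-agent/config.ini; both are assumptions to confirm on your hosts, and the helper name enable_agent_tls is hypothetical.

```python
import re


def enable_agent_tls(config_text: str) -> str:
    """Return config_text with use_tls set to 1.

    An existing use_tls line (even a commented-out one) is replaced.
    Otherwise the setting is added under an assumed [Security] section.
    """
    setting = re.compile(r"^[#; \t]*use_tls[ \t]*=.*$", re.MULTILINE)
    if setting.search(config_text):
        return setting.sub("use_tls=1", config_text, count=1)

    section = re.compile(r"^\[Security\][ \t]*$", re.MULTILINE)
    if section.search(config_text):
        # Insert the setting directly below the existing section header.
        return section.sub("[Security]\nuse_tls=1", config_text, count=1)

    # No section at all: append one at the end of the file.
    separator = "" if config_text == "" or config_text.endswith("\n") else "\n"
    return f"{config_text}{separator}[Security]\nuse_tls=1\n"


if __name__ == "__main__":
    sample = "[General]\nserver_host=cm.example.com\n\n[Security]\nuse_tls=0\n"
    print(enable_agent_tls(sample))
```

The helper only transforms text, so it can be tried on a copy of the file before anything on the host is changed. The Use TLS Encryption for Agents setting in Cloudera Manager still has to be set separately.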

OPSAPS-67953: Downgrading Cloudera Runtime from 7.1.9 to 7.1.8 CHF 10 fails with an error when a previous unfinished upgrade command is found.

Error Message:

A previous unfinished upgrade command was found. To continue upgrading: perform a 'Retry' on the original command; to return to an earlier consistent state: restore a Cloudera Manager backup.

Perform the following steps to roll back to the previous version:

Delete records from the UPGRADE_STATE table.
Restart Cloudera Manager.
Follow the rollback / downgrade steps.

OPSAPS-68279: When upgrading CDP Private Cloud Base 7.1.7 SP2 to CDP Private Cloud Base 7.1.9, the command step DeployClientConfig may fail due to the following error:

Error Message:

Client configuration generation requires the following additional parcels to be activated:[cdh]

Verify that the error is because of the 7.1.9 parcels activation failure.

Navigate to the parcels page.
Check whether the following error is displayed: Error when distributing to <hostname>: Src file /opt/cloudera/parcels/.flood/CDH-7.1.9-1.cdh7.1.9.p0.43968053-el7.parcel/CDH-7.1.9-1.cdh7.1.9.p0.43968053-el7.parcel does not exist.
Identify the host from the error message and ssh into the host by running the ssh <hostname> command.
Navigate to the agent directory by running the cd /var/log/cloudera-scm-agent command.
Find the following pattern in the agent log files: Exception: Untar failed with return code: 2, with tar output: stdout: [b''], stderr: [b'\ngzip: stdin: invalid compressed data--format violated\ntar: Unexpected EOF in archive\ntar: Unexpected EOF in archive\ntar: Error is not recoverable: exiting now\n'].

If the above exception appears, restart the agent on that host by running the systemctl restart cloudera-scm-agent command. After the agent restarts, the parcel distribution should succeed.
After the parcel distribution succeeds, click Resume to continue with the upgrade.
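
The log check described above can be sketched in Python as follows. This is an illustrative helper, not a Cloudera tool: the function names are hypothetical, and the *.log* file pattern is an assumption about how agent log files are named in /var/log/cloudera-scm-agent.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# Substring taken from the exception text quoted in the steps above.
UNTAR_FAILURE = "Untar failed with return code"


def find_untar_failures(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, line text) for every line that reports an untar failure."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if UNTAR_FAILURE in line
    ]


def scan_agent_logs(log_dir: str = "/var/log/cloudera-scm-agent") -> Dict[str, List[Tuple[int, str]]]:
    """Scan every file matching *.log* in log_dir and collect the matching lines."""
    results = {}
    for path in sorted(Path(log_dir).glob("*.log*")):
        with path.open(errors="replace") as handle:
            hits = find_untar_failures(handle)
        if hits:
            results[str(path)] = hits
    return results


if __name__ == "__main__":
    for log_file, hits in scan_agent_logs().items():
        for number, text in hits:
            print(f"{log_file}:{number}: {text}")
```

If the scan reports any match on a host, that host is a candidate for the agent restart described above.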

OPSAPS-67929: If an upgrade from CDP Private Cloud Base 7.1.7 SP2 to CDP Private Cloud Base 7.1.9 fails in the middle of the process, the Resume option is not available.

You must reach out to Cloudera Support.

HBase

CDPD-58755: In a small cluster (for example, five or fewer RegionServers or DataNodes), HBase tests fail intermittently because there are not enough DataNodes to fail over to.

Error Message:

INFO:cm_server.py:2943:Final list of failed state checks: Health check: HBASE_REGION_SERVERS_HEALTHY failed on HBASE-1 with status BAD, HBASE-1-REGIONSERVER-f3a7628ea499295abb44816cddf04854 (3c605548-280d-4e59-8425-e623ee070a54) has undesired health: BAD from failed check(s): REGION_SERVER_MASTER_CONNECTIVITY: BAD, REGION_SERVER_SCM_HEALTH: BAD

Add the following configurations using Cloudera Manager to alleviate this issue.

Go to Cloudera Manager > HDFS > Configuration > HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.

Add the following configurations with the suggested values.
dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
dfs.client.block.write.replace-datanode-on-failure.best-effort=true
dfs.client.block.write.replace-datanode-on-failure.enable=true
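
If you prefer to paste the safety valve content as XML, the three settings above can be rendered as hdfs-site.xml property elements. The following Python sketch only formats the values listed above; the function name to_safety_valve_xml is hypothetical.

```python
from xml.sax.saxutils import escape

# The three client settings listed above, with their suggested values.
REPLACE_DATANODE_SETTINGS = {
    "dfs.client.block.write.replace-datanode-on-failure.policy": "ALWAYS",
    "dfs.client.block.write.replace-datanode-on-failure.best-effort": "true",
    "dfs.client.block.write.replace-datanode-on-failure.enable": "true",
}


def to_safety_valve_xml(settings: dict) -> str:
    """Render name/value pairs as <property> elements in hdfs-site.xml style."""
    blocks = []
    for name, value in settings.items():
        blocks.append(
            "<property>\n"
            f"  <name>{escape(name)}</name>\n"
            f"  <value>{escape(value)}</value>\n"
            "</property>"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_safety_valve_xml(REPLACE_DATANODE_SETTINGS))
```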

CDPD-606480: During HBase rolling upgrade, HBase commands fail.

Error Message:

ServerNotRunningYetException: Server is not running yet

HBase retries DDL operations submitted while the master is initializing until the master is fully initialized and can serve the request. However, the default number of retries or the default retry interval might be insufficient for an operation submitted by the client to complete. This can happen when the master crashes and the standby master initializes after the takeover, or when the master is restarted for any reason, including upgrades that take a long time.

The initialization time of the master depends on the size of the cluster. However, the following configuration adjustments in your client application allow for a master initialization time of up to 10 minutes.

Add the following configuration to the client application.

<property>
    <name>hbase.client.pause</name>
    <value>300</value>
</property>
<property>
    <name>hbase.client.retries.number</name>
    <value>20</value>
</property>

If you have observed a longer or shorter master initialization period, you can modify these values accordingly. These retry settings apply to all types of calls to the HBase service, including GET, SCAN, MUTATE, and DDL operations.
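
To see how the two values combine into a retry window, the following Python sketch adds up the sleep time across retries. It assumes that the client multiplies hbase.client.pause by the backoff table found in HBase's HConstants.RETRY_BACKOFF and it ignores jitter and per-call timeouts, so treat the result as an estimate only. The function name total_retry_wait_ms is hypothetical.

```python
# Backoff multipliers, assumed to match HBase's HConstants.RETRY_BACKOFF.
RETRY_BACKOFF = (1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200)


def total_retry_wait_ms(pause_ms: int, retries: int) -> int:
    """Sum the sleep time over all retries, without jitter.

    Each retry sleeps pause_ms times a multiplier from RETRY_BACKOFF.
    Retries beyond the end of the table reuse the last multiplier.
    """
    total = 0
    for attempt in range(retries):
        index = min(attempt, len(RETRY_BACKOFF) - 1)
        total += pause_ms * RETRY_BACKOFF[index]
    return total


if __name__ == "__main__":
    wait_ms = total_retry_wait_ms(pause_ms=300, retries=20)
    print(f"{wait_ms} ms, about {wait_ms / 60000:.1f} minutes")
```

Under these assumptions, the values above give a window that is in the same range as the 10 minutes mentioned earlier. Raising either value widens the window, and because the settings apply to every call type, a larger window also delays failure reporting for ordinary reads and writes.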

Ozone

OPSAPS-67340: Service Monitor is in a bad health state after restarting the Cloudera Manager server, reporting problems with descriptor and metric schema age, when Kerberos and Cloudera Manager SPNEGO authentication are enabled.

Error Message:

Health check: MGMT_SERVICE_MONITOR_HEALTH failed on mgmt with status BAD, mgmt-SERVICEMONITOR-02d43fad954612b08a0ea1a5df8f42aa (a4efb7ff-4a4d-4291-9dfe-88a394bf9c23) has undesired health: BAD from failed check(s): SERVICE_MONITOR_METRIC_SCHEMA_FETCH: BAD, SERVICE_MONITOR_SCM_DESCRIPTOR_FETCH: BAD

Restart Service Monitor each time the Cloudera Manager server is restarted:

Log in to Cloudera Manager.

Stop Service Monitor.

Restart Cloudera Manager server.

Start Service Monitor.

Note: If the health status is already bad, restart Service Monitor.

CDPQE-25023: When upgrading from CDP Private Cloud Base 7.1.8 or 7.1.7 SP2 to CDP Private Cloud Base 7.1.9, the HBase post-upgrade validation fails for HBASE_REGION_SERVERS_HEALTHY.

Error Message:

INFO:cm_server.py:2943:Final list of failed state checks: Health check: HBASE_REGION_SERVERS_HEALTHY failed on HBASE-1 with status BAD, HBASE-1-REGIONSERVER-f9673e2d17e546ddec44e7724865d4f5 (043f7f4a-c46d-42af-8d82-061cba1c3f7c) has undesired health: BAD from failed check(s): REGION_SERVER_MASTER_CONNECTIVITY: BAD, REGION_SERVER_SCM_HEALTH: BAD, Health check: IMPALA_IMPALADS_HEALTHY failed on IMPALA-1 with status BAD, IMPALA-1-IMPALAD-f9673e2d17e546ddec44e7724865d4f5 (043f7f4a-c46d-42af-8d82-061cba1c3f7c) has undesired health: BAD from failed check(s): IMPALAD_QUERY_MONITORING_STATUS: BAD, Health check: SCHEMAREGISTRY_SCHEMA_REGISTRY_SERVER_HEALTHY failed on SCHEMAREGISTRY-1 with status BAD, SCHEMAREGISTRY-SCHEMA_REGISTRY_SERVER-2 (e0e64fc1-21ee-457f-ba12-0b94fbe9b6ba) has undesired health: BAD from failed check(s): SCHEMAREGISTRY_SCHEMA_REGISTRY_SERVER_SCM_HEALTH: BAD, SCHEMAREGISTRY-SCHEMA_REGISTRY_SERVER-1 (043f7f4a-c46d-42af-8d82-061cba1c3f7c) has undesired health: BAD from failed check(s): SCHEMAREGISTRY_SCHEMA_REGISTRY_SERVER_SCM_HEALTH: BAD

Log in to Cloudera Manager.

Navigate to Clusters.

Click HDFS.

Go to the Configuration tab.

In the Search field, search for the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml parameter.

Add the following:
dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
dfs.client.block.write.replace-datanode-on-failure.best-effort=true
dfs.client.block.write.replace-datanode-on-failure.enable=true

CDPD-56498: The Ozone service is in an unhealthy state and the basic canary test fails after Cloudera Manager is upgraded from 7.7.1 to 7.11.3 or lower.

Error Message:

The health test result for OZONE_BASIC_HEALTH_CHECK has become bad: PERMISSION_DENIED USER <***username***> doesn't have READ permission to access volume Volume:s3v

You must add the following Ranger policies manually after you upgrade Cloudera Manager.

1. "name": "S3_VOLUME_POLICY_FOR_OZONE_CANARY", "resource.volume": "s3v", "users": "hue", "accessTypes": "create,write,read,list"
2. "name": "S3_BUCKET_POLICY_FOR_OZONE_CANARY", "resource.volume": "s3v", "resource.bucket": "cloudera-health-monitoring-ozone-basic-canary-bucket", "users": "hue", "accessTypes": "create,write,read,list"
3. "name": "S3_KEY_POLICY_FOR_OZONE_CANARY", "resource.volume": "s3v", "resource.bucket": "cloudera-health-monitoring-ozone-basic-canary-bucket", "resource.key": "cloudera-health-monitoring-ozone-basic-canary-key", "users": "hue", "accessTypes": "create,write,read,list,delete"

The Ozone canary is enabled in the Ozone CSD that supports CDP Private Cloud Base 7.1.7, 7.1.8, and 7.1.9. Therefore, after upgrading from an older Cloudera Manager version to Cloudera Manager 7.11.3, you must add the above policies to the Cloudera Base on premises cluster on the upgraded Cloudera Manager. If you have a custom principal name or system user for Hue, you must add the above policies for both the Hue system user and the Hue principal name. This works with or without an ATL rule. The default value for users is "users": "hue". The SMON principal must be the same as the Hue principal.
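
For reference, the three policies above can be expressed as structured data before you enter them in Ranger. The following Python sketch builds dictionaries in the general shape of a Ranger policy (service, name, resources, policyItems). The service name cm_ozone, the exact field layout, and the function names are assumptions for illustration, so compare the output with an existing policy exported from your own Ranger instance before relying on it.

```python
import json

CANARY_BUCKET = "cloudera-health-monitoring-ozone-basic-canary-bucket"
CANARY_KEY = "cloudera-health-monitoring-ozone-basic-canary-key"


def build_policy(name, users, access_types, volume, bucket=None, key=None,
                 service="cm_ozone"):
    """Build one policy dictionary. The service name is an assumed default."""
    resources = {"volume": {"values": [volume]}}
    if bucket is not None:
        resources["bucket"] = {"values": [bucket]}
    if key is not None:
        resources["key"] = {"values": [key]}
    return {
        "service": service,
        "name": name,
        "resources": resources,
        "policyItems": [{
            "users": list(users),
            "accesses": [{"type": t, "isAllowed": True} for t in access_types],
        }],
    }


def canary_policies(users=("hue",)):
    """Return the volume, bucket, and key policies listed above."""
    base = ["create", "write", "read", "list"]
    return [
        build_policy("S3_VOLUME_POLICY_FOR_OZONE_CANARY", users, base, "s3v"),
        build_policy("S3_BUCKET_POLICY_FOR_OZONE_CANARY", users, base, "s3v",
                     CANARY_BUCKET),
        build_policy("S3_KEY_POLICY_FOR_OZONE_CANARY", users, base + ["delete"],
                     "s3v", CANARY_BUCKET, CANARY_KEY),
    ]


if __name__ == "__main__":
    print(json.dumps(canary_policies(), indent=2))
```

If Hue uses a custom principal name or system user, pass both names in users, for example canary_policies(users=("hue", "custom_hue")), where custom_hue is a placeholder.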

Ranger

CDPD-58860: As part of OPSAPS-67480 in CDP Private Cloud Base 7.1.9, a default Ranger policy is added from the cdp-proxy-token topology, so that after a new installation of CDP Private Cloud Base 7.1.9, the knox-ranger policy includes cdp-proxy-token. However, upgrades do not add cdp-proxy-token to cm_knox policies automatically.

Error Message: None

Manually add cdp-proxy-token to the knox policy using the Ranger Admin Web UI.

Log in to Cloudera Manager > Ranger > Ranger Admin Web UI, as a Ranger administrator.

On Ranger Admin Web UI > Service Manager > Resource > Knox, click cm_knox.

In Knox Policies, open the CDP Proxy UI, API and Token policy.

In Knox Topology*, add cdp-proxy-token.

Click Save.

Restart Ranger.

YARN/MapReduce

CDPD-59179: After a rolling upgrade from CDP Private Cloud Base 7.1.7 SP2 to CDP Private Cloud Base 7.1.9, the MapReduce Sleep job's reduce task might fail.

After upgrading from CDP Private Cloud Base 7.1.7 SP2 to 7.1.9, the map task completes as "mapreduce.Job: map 100% reduce 0%", but an error occurs when the reduce task is executed. The error occurs only if the Resource Manager is delayed in starting.

Workaround: None.

COMPX-18506: Queue Manager fails to restart after parcel downgrade

Downgrading from 7.3.1 to CDP Private Cloud Base 7.1.8 fails for Queue Manager, citing an incorrect username/password for Config Store.

SSH into the Queue Manager instance.

Navigate to the path where the config store H2 database is located (default path: /var/lib/hadoop-yarn).

Rename the following two DB files:

config-service.mv.db

config-service.trace.db

After renaming the files, restart Queue Manager or continue the downgrade.

Note: This will result in losing user version change history.
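
The rename step can be sketched in Python as follows. The helper name set_aside_config_store and the backup suffix are hypothetical; the two file names and the default directory are taken from the steps above. Run something like this only on the Queue Manager host, with Queue Manager stopped.

```python
import time
from pathlib import Path

# The two H2 config store files named in the steps above.
H2_FILES = ("config-service.mv.db", "config-service.trace.db")


def set_aside_config_store(db_dir="/var/lib/hadoop-yarn", suffix=None):
    """Rename the H2 config store files by appending a backup suffix.

    Returns the new paths of the files that were renamed. Files that do
    not exist are skipped.
    """
    suffix = suffix or time.strftime(".bak-%Y%m%d%H%M%S")
    renamed = []
    for name in H2_FILES:
        source = Path(db_dir) / name
        if source.exists():
            target = source.with_name(source.name + suffix)
            source.rename(target)
            renamed.append(target)
    return renamed


if __name__ == "__main__":
    for path in set_aside_config_store():
        print(f"renamed to {path}")
```

Renaming rather than deleting keeps the old files available in case you need to inspect them later.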

CDPQE-36036: Change Queue Manager backup and restore process during rollback above CDP Private Cloud Base 7.1.9

Downgrading from Cloudera Base on premises 7.3.1 to CDP Private Cloud Base 7.1.9 SP1 fails for Queue Manager.

Perform a YARN Queue Manager database restore instead of a Queue Manager config restore. Starting with CDP Private Cloud Base 7.1.9.0, Queue Manager uses a PostgreSQL database instead of H2, so the backup and restore process is updated accordingly.

Data Definition Language (DDL) - Impala, Hive (using HiveQL), Spark (using SparkSQL), HBase, Phoenix, Flink, and Kafka

CDPD-60862: Rolling restart fails during ZDU when DDL operations are in progress

During a Zero Downtime Upgrade (ZDU), the rolling restart of services that support Data Definition Language (DDL) statements might fail if DDL operations are in progress during the upgrade. As a result, ensure that you do not run DDL statements during ZDU.

The following services support DDL statements:

Impala
Hive – using HiveQL
Spark – using SparkSQL
HBase
Phoenix
Kafka

Data Manipulation Language (DML) statements are not impacted and can be used during ZDU. After the upgrade completes successfully, you can resume running DDL statements.

None. Cloudera recommends modifying applications so that they do not use DDL statements for the duration of the upgrade. If the upgrade is already in progress and you have experienced a service failure, you can remove the in-flight DDL operations and resume the upgrade from the point of failure.
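
One way an application could hold back DDL during the upgrade is to classify each SQL statement before submitting it. The following Python sketch is a simple illustration of that idea, not a Cloudera feature: the helper name is_ddl is hypothetical, and the keyword list is an assumption that you would adapt to the statements your application actually issues.

```python
import re

# Leading keywords treated as DDL. This list is an assumption, not exhaustive.
DDL_KEYWORDS = ("CREATE", "ALTER", "DROP", "TRUNCATE", "RENAME")

# Skips leading whitespace and SQL comments, then captures the first word.
_FIRST_WORD = re.compile(
    r"^\s*(?:--[^\n]*\n\s*|/\*.*?\*/\s*)*([A-Za-z]+)", re.DOTALL
)


def is_ddl(statement: str) -> bool:
    """Return True if the statement starts with one of the DDL keywords."""
    match = _FIRST_WORD.match(statement)
    return bool(match) and match.group(1).upper() in DDL_KEYWORDS


if __name__ == "__main__":
    for sql in ("CREATE TABLE t (id INT)", "INSERT INTO t VALUES (1)"):
        action = "defer until after ZDU" if is_ddl(sql) else "run now"
        print(f"{sql!r}: {action}")
```

An application could queue the statements flagged by such a check and replay them after the upgrade completes, while DML statements continue to run.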
