Known issues for the CDP Private Cloud Data Services Management Console
This section lists known issues that you might run into while using the CDP Private Cloud Management Console service.
Known Issues in Management Console 1.5.3
- DOCS-20088/OPSX-4781: Vault pods may take a long time to be ready during upgrades from 1.5.2 to 1.5.3
- In some upgrade cases, the 'vault-0' pod takes longer than usual to attach its volume, and the cluster upgrade may fail as a result. The volume usually attaches automatically within 15 minutes and the pod starts running; in that case, you can resume the upgrade.
- OPSX-4777: [151h2-153] Post ECS upgrade, Longhorn health test failing - helm-install-longhorn pod in crashloop state
- After upgrading CDP Private Cloud Data Services on ECS, the Longhorn health test fails with the helm-install-longhorn pod in a crashloop state.
- OPSX-4754 [ECS Restart Stability] DaemonSet rollout process is stuck post rolling restart where DaemonSet kube-system/rke2-canal has not finished or progressed for at least 15 minutes
- On RHEL 9.x, an ECS service DaemonSet rollout health alert appears in Cloudera Manager after an ECS installation and a rolling restart.
- DOCS-19913: OCP upgrade – OCP namespace name must be 29 characters or less
- Before upgrading on OCP, ensure that the OpenShift namespace name is 29 characters or less. Do not proceed with the upgrade if the namespace name is 30 or more characters in length.
- OPSAPS-69892: kube-proxy failure causing issues with cluster
- After rebooting or restarting an ECS agent node, the kube-proxy Linux process may not start due to a race condition in the kubelet. When this happens, ECS cluster networking and other services, such as Vault, DNS, authentication, and Longhorn storage, are affected. At the Kubernetes pod level, errors such as "connection refused", "connection timed out", and "i/o timeout" may be observed. If you suspect networking issues in your ECS cluster, checking kube-proxy is a good first step; see the sketch below.
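A minimal first-pass check, assuming standard RKE2 paths (/var/lib/rancher/rke2/bin/kubectl and /etc/rancher/rke2/rke2.yaml); adjust for your deployment:
# Is the kube-proxy process running on the affected node?
ps -ef | grep -v grep | grep kube-proxy
# Are the kube-proxy static pods healthy?
sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods -n kube-system -o wide | grep kube-proxy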
- OPSX-4754 [ECS Restart Stability] DaemonSet rollout is stuck post rolling restart - DaemonSet kube-system/rke2-canal has not finished or progressed for at least 15 minutes
- On RHEL 9.1, an ECS service DaemonSet rollout health alert appears after a rolling restart.
- OPSX-4766: [ECS Restart | Host Reboot] Start command failed with error - "Timed out waiting for kube-apiserver to be ready"
- In an ECS cluster with HA enabled, ECS Start fails with an error after stopping the cluster and rebooting the hosts.
Steps to reproduce:
- Stop ECS.
- Reboot hosts.
- Start ECS.
The start command fails with the following error message:
"Timed out waiting for kube-apiserver to be ready"
- GPU Support: RHEL 8.8 only
- GPU support is only offered with RHEL 8.8.
Known Issues in Management Console 1.5.2
- OPSX-4650: CM - OCP pvc install Wizard - fails if route name is too long
- If the cluster name given during Data Services installation on OpenShift is too long, you may encounter an installation failure, and the log will contain the following error: "The Route ... is invalid: spec.host: Invalid value: ...: must be no more than 63 characters"
- OPSX-4369: FreeIPA generated krb5.conf must use a file-based cache
- Private Cloud Data Services requires Kerberos to be enabled in the Private Cloud Base cluster. Furthermore, the /etc/krb5.conf file must be configured to use a file-based cache. Using a keyring-based cache or a KCM-based cache is not supported. When using FreeIPA, the krb5.conf file may be configured to use a KCM-based cache by default; this should be changed, as illustrated below.
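An illustrative excerpt of the relevant /etc/krb5.conf setting; the exact layout of the file generated in your environment may differ:
[libdefaults]
  # Supported: file-based credential cache
  default_ccache_name = FILE:/tmp/krb5cc_%{uid}
  # Not supported for Private Cloud Data Services:
  # default_ccache_name = KCM:
  # default_ccache_name = KEYRING:persistent:%{uid}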
- OPSAPS-68923: CM - After CM upgrade from 7.9.5 to 7.11.3.x ECS cluster showing stale config
- After a Cloudera Manager upgrade from 7.9.5 to 7.11.3.x, an ECS 1.5.0 cluster may show a stale configuration prompting you to add "limit_fds": 1048576.
- COMPX-15475: [CM ECS UPG][150-152] post upgrade prometheus-node-exporter-1.6.0 pod stuck in pending state
- As part of the upgrade, all nodes in the cluster are restarted. If a set of pods remains in the Pending state after the node restart, a YuniKorn restart is required; see the sketch below.
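A sketch of one way to restart YuniKorn, assuming the scheduler runs as a Deployment named yunikorn-scheduler in a yunikorn namespace; verify the actual object names in your cluster first:
# Locate the YuniKorn pods before restarting anything
kubectl get pods -A | grep -i yunikorn
# Hypothetical names -- substitute what the previous command reports
kubectl -n yunikorn rollout restart deployment yunikorn-scheduler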
- OPSX-4594: [ECS Restart Stability] Post rolling restart few volumes are in detached state (vault being one of them)
- After a rolling restart, some volumes may be in a detached state.
- OPSAPS-68558: [7.9.5->7.11.3.2] CM upgrade failed with BeanCreationException: Error creating bean with name 'com.cloudera.server.cmf.TrialState'
- After upgrading the Cloudera Manager package, the Cloudera Manager Server does not start. An error about "Active Commands" is shown in the Cloudera Manager Server log. This may happen when the Private Cloud Data Services Control Plane is actively issuing requests to Cloudera Manager while an upgrade is being performed.
- OPSX-4392: Getting the real client IP address in the application
- CML has a feature for adding an audit event for each user action (Monitoring User Events). In Private Cloud, the internal IP address is logged to the internal database instead of the client IP address.
- OPSX-4446: Duplicate Entries in cdp-pvc-truststore
- Sometimes the cdp-pvc-truststore contains duplicate entries, causing a "Request Entity Too Large" error (the request exceeds the 3M limit); see the check below.
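A hedged way to look for duplicate Common Names, assuming the truststore is a JKS file whose path and password you know (both are placeholders here):
keytool -list -v -keystore /path/to/cdp-pvc-truststore.jks -storepass <password> | grep '^Owner:' | sort | uniq -c | sort -rn | head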
- OPSX-4552: [ECS Restart] One of the docker servers failed to come up after starting the cluster post hosts reboot
- At times the Docker server may fail to come up and return the following error message:
/var/run/docker.sock: Is a directory
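A diagnostic sketch only, not an official workaround: confirm whether the socket path was created as a directory, remove it if it is an empty directory, and then restart the Docker Server role from Cloudera Manager:
ls -ld /var/run/docker.sock
sudo rmdir /var/run/docker.sock   # succeeds only if it is an empty directory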
- CDPVC-1137, CDPAM-4388, COMPX-15083, and COMPX-15418: OpenShift Container Platform version upgrade from 4.10 to 4.11 fails due to a Pod Disruption Budget (PDB) issue
- A PDB can prevent a node from draining, which makes the nodes report the Ready,SchedulingDisabled state. As a result, the node is not updated to the correct Kubernetes version when you upgrade OCP from 4.10 to 4.11.
- PULSE-944 and PULSE-941 Observability namespace not created after platform upgrade from 1.5.1 to 1.5.2
- The Cloudera Observability namespace is not created after a platform upgrade from PvC DS 1.5.1 to PvC DS 1.5.2.
During the creation of the resource pool, the Cloudera Observability namespace is provided by the CDP Private Cloud Service. If the provisioning flow is not completed, for example due to a timing difference between the start of the computeAPI pod and the call to the computeAPI pod by the service, the namespace is not created.
- PULSE-921 Observability namespace has no pods
- The Cloudera Observability namespace should have the same number of pods and nodes. When the Cloudera Observability namespace has no pods, the prometheus-node-exporter-1.6.0 helm release state becomes invalid and the CDP Private Cloud Service is unable to uninstall and reinstall the namespace. Also, because the Node Exporter is not installed in the Cloudera Observability namespace, its metrics, for example node_cpu_seconds_total, are unavailable when querying Prometheus in the control plane; see the query sketch below.
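A sketch for confirming whether the metric is being scraped, using the standard Prometheus HTTP API; the Prometheus URL is a placeholder for the control plane Prometheus endpoint:
curl -s 'http://<control-plane-prometheus>:9090/api/v1/query?query=node_cpu_seconds_total' | head
An empty result set suggests that the Node Exporter pods are missing from the Cloudera Observability namespace.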
- PULSE-697 Add node-exporter to PvC DS
- When expanding a cluster with new nodes and there are insufficient CPU and memory resources, the Node Exporter will encounter difficulties deploying new pods on the additional nodes.
- PULSE-935 Longhorn volumes are over 90% of the capacity alerts on Prometheus volumes
- Cloudera Manager displays the following alert about your Prometheus volumes: Concerning: Firing alerts for Longhorn: The actual used space of Longhorn volume is over 90% of the capacity.
Longhorn stores historical data as snapshots that are calculated with the active data for the volume's actual size. This size is therefore greater than the volume's nominal data value.
- PULSE-937 Private-Key field change in Update Remote Write request does not reflect in enabling the metric flow
- When using the Management Console UI for Remote Storage, the Disable option does not deactivate the remote write configuration, even when the action returns a positive result message. Therefore, when disabling a remote storage configuration, use the CLI client to disable it directly through the API.
- PULSE-841 Disabling the remote write configuration logs an error in both cp prometheus and env prometheus
- When a metric replication is set up between the cluster and Cloudera Observability and the connection is disabled or deleted, Prometheus writes an error message that states that it cannot replicate the metrics.
- PULSE-895 Disabling the remote write config in the UI is broken in cdp-pvc
- The Remote Write Enable and Disable options in the Management Console’s User Interface do not work when a Remote Storage configuration is created with a requestSignerAuth type from either the HTTP API or using the CDP-CLI tool.
- PULSE-936 No Alert to prompt the metric flow being affected b/c of wrong private key configuration
- A remote write alert was not triggered when the wrong private key was used in a Remote Storage configuration.
Known Issues in Management Console 1.5.1
- External metadata databases are no longer supported on OCP
- As of CDP Private Cloud Data Services 1.5.1, external Control Plane metadata databases are no longer supported. New installs require the use of an embedded Control Plane database. Upgrades from CDP Private Cloud Data Services 1.4.1 or 1.5.0 to 1.5.1 are supported, but there is currently no migration path from a previous external Control Plane database to the embedded Control Plane database. Upgrades from 1.4.1 or 1.5.0 with external Control Plane metadata databases also require additional steps, which are described in the CDP Private Cloud Data Services 1.5.1 upgrade topics.
- DOCS-18031: Nodes are in "Not Ready" status during Rolling Restart of ECS
- During a rolling restart of ECS, nodes are in a "Not Ready" state, and the dmesg command returns the following error message on the applicable nodes:
[Tue Aug 8 16:30:50 2023] nfs: server 10.46.157.145 not responding, timed out
[Tue Aug 8 16:31:16 2023] nfs: server 10.46.157.145 not responding, timed out
[Tue Aug 8 16:31:30 2023] nfs: server 10.46.157.145 not responding, timed out
Also, the df command may hang on these hosts.
- OPSAPS-67214: Single Node | Restart Stability | Rolling start is failing with "global timeout reached: 10m0s, error when evicting pods"
- For the ECS service, rolling restart is not applicable to a single node cluster.
- CDPVC-1098: How to refresh the YuniKorn configuration
- Sometimes it is possible for the scheduler state to go out of sync with the cluster state. This may result in pods in Pending and ApplicationRejected states, with pod events showing Placement Rule related errors. To recover from this, you may need to refresh the YuniKorn configuration.
- OPSAPS-67340: L1 runs failing as service monitor is in bad health state
- Service Monitor (SMON) ends up in a bad health state after restarting the Cloudera Manager (CM) server, reporting problems with descriptor and metric schema age, when Kerberos and CM SPNEGO authentication are both enabled.
- DOCS-15855: Networking API is deprecated after upgrade to CDP Private Cloud Data Services 1.5.1 (K8s 1.24)
- When the control plane is upgraded from 1.4.1 to 1.5.1, the Kubernetes version changes to 1.24. The Livy pods running in existing Virtual Clusters (VCs) use a deprecated networking API for creating ingress for Spark driver pods. Because the old networking API is deprecated and does not exist in Kubernetes 1.24, any new job run will not work for the existing VCs.
- CDPQE-24295: Update docker client on docker.lab.eng.hortonworks machine
- When you attempt to execute the Docker command to fetch the Cloudera-provided images into your air-gapped environment, Docker may pull an incorrect version of the HAProxy image, especially if you are using an outdated Docker client. This happens because the Cloudera registry contains images with multiple platform versions, and older Docker clients may lack the capability to retrieve the appropriate architecture version, such as amd64; see the sketch below.
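With a recent Docker client you can request the amd64 variant explicitly; the image reference below is a placeholder for the Cloudera-provided HAProxy image:
docker pull --platform linux/amd64 <registry>/<haproxy-image>:<tag>
docker image inspect <registry>/<haproxy-image>:<tag> --format '{{.Architecture}}'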
- OPSX-4326: OCP upgrade from 1.5.0 to 1.5.1 – Restore is unsuccessful after upgrade
- After upgrading CDP Private Cloud Data Services on OCP from 1.5.0 to 1.5.1, a restore using a 1.5.0 backup cannot be performed successfully.
- OPSX-4266: ECS upgrade from 1.5.0 to 1.5.1 is failing in Cadence schema update job
- When upgrading from ECS 1.5.0 to 1.5.1, the CONTROL_PLANE_CANARY fails with the following error:
Firing alerts for Control Plane: Job did not complete in time, Job failed to complete.
In addition, the cdp-release-cdp-cadence-schema-update job fails.
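A sketch for inspecting the failing job; the namespace is a placeholder for your Control Plane namespace:
kubectl -n <control-plane-namespace> get job cdp-release-cdp-cadence-schema-update
kubectl -n <control-plane-namespace> logs job/cdp-release-cdp-cadence-schema-update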
- OPSX-4076:
- When you delete an environment after the backup event, the restore operation for the backup does not bring up the environment.
- OPSX-4295:
- The logs for the backups created in CDP Private Cloud Data Services 1.5.0 version do not appear after you upgrade to version 1.5.1.
- OPSX-4024: CM truststore import into unified truststore should handle duplicate CommonNames
- If multiple CA certificates with the exact same value for the Common Name field are present in the Cloudera Manager truststore when a Private Cloud Data Services cluster is installed, only one of them may be imported into the Data Services truststore. This may cause certificate errors if an incorrect/old certificate is imported.
- OPSX-2713: PVC ECS Installation: Failed to perform First Run of services
- If an issue is encountered during the Install Control Plane step of the Containerized Cluster First Run, installation will be re-attempted infinitely rather than the command failing.
- OPSX-3666: mlx_crud_app DB connection fails with error "unable to create connection: x509: certificate relies on legacy Common Name field, use SANs instead"
- After upgrade, the mlx-crud-app fails with the error "unable to create connection: x509: certificate relies on legacy Common Name field, use SANs instead".
- OPSAPS-66166: FreeIPA cmadminrole needs more privileges for PvC+ after upgrade
- After upgrade, the Cloudera Manager admin role may be missing the Host Administrators privilege in an upgraded cluster.
- COMOPS-2822: OCP error x509: certificate signed by unknown authority
- The error x509: certificate signed by unknown authority usually means that the Docker daemon that is used by Kubernetes on the managed cluster does not trust the self-signed certificate.
- OPSX-4225: Upgrade failed as cadence pods are crashlooping post upgrade
- When doing a fresh install of CDP Private Cloud Data Services 1.5.1, external metadata databases are no longer supported. Instead, the CDP Private Cloud Data Services installer will create an embedded database pod by default, which runs inside the Kubernetes cluster to host the databases required for installation.
If you are upgrading from CDP Private Cloud Data Services 1.4.1 or 1.5.0 to 1.5.1, and you were previously using an external database, you must run the following psql commands to create the required databases. You should also ensure that the two new databases are owned by the common database users known by the control plane.
CREATE DATABASE db-cadence;
CREATE DATABASE db-cadence-visibility;
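A sketch of running these statements with psql against the external database; the host and user are placeholders, and note that hyphenated database names must be double-quoted in PostgreSQL:
psql -h <db-host> -U <db-user> -d postgres -c 'CREATE DATABASE "db-cadence";'
psql -h <db-host> -U <db-user> -d postgres -c 'CREATE DATABASE "db-cadence-visibility";'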
- OPSAPS-67046: Docker Server role fails to come up and returns a connection error during ECS upgrade
- When upgrading from 1.4.1 to 1.5.1, a Docker server role can sometimes fail to come up and return the following error:
grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout". Reconnecting... module=grpc failed to start containerd: timeout waiting for containerd to start
This error appears in the stderr file of the command, and can be caused by a mismatch in the pid of containerd; see the check below.
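A diagnostic sketch for checking the pid mismatch; the pid file path is an assumption based on the socket path in the error above:
cat /var/run/docker/containerd/containerd.pid
ps -ef | grep -v grep | grep containerd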
- Longhorn-4212 Somehow the Rebuilding field inside volume.meta is set to true causing the volume to get stuck in attaching/detaching loop
- This is a condition that can occur in ECS Longhorn storage.
- OPSX-3073 [ECS] First run command failed at setup storage step with error "Timed out waiting for local path storage to come up"
- A pod is stuck in the Pending state on a host for a long time, with an error in the Role log related to the CNI plugin:
Warning FailedCreatePodSandBox 3m5s (x269 over 61m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "70427e9b26fb014750dfe4441fdfae96cb4d73e3256ff5673217602d503e806f": failed to find plugin "calico" in path [/opt/cni/bin]
- OPSX-3528: [Pulse] Prometheus config reload fails if multiple remote storage configurations exist with the same name
- It is possible to create multiple remote storage configurations with the same name. However, if such a situation occurs, the metrics will not flow to the remote storage, because the configuration reload of the original Prometheus will fail.
- OPSX-1405: Able to create multiple CDP PVC Environments with the same name
- If two users try to create an environment with the same name at the same time, it might result in an unusable environment.
- OPSX-1412: Creating a new environment through the CDP CLI reports intermittently that "Environment name is not unique" even though it is unique
- When multiple users try to create the same environment at the same time or use automation to create an environment with retries, create environment may fail on collision with a previous request to create an environment.
- OPSX-3323: Custom Log Redaction | String is not getting redacted from all places in diagnostic bundle
- Custom redaction rule for URLs does not work.
- Cloudera Data Engineering service fails to start due to Ozone
- If the Ozone service is missing, misconfigured, or having other issues when an Environment is registered in the Management Console, CDE fails to start.
Known Issues in Management Console 1.5.0
- Longhorn-4212 Somehow the Rebuilding field inside volume.meta is set to true causing the volume to get stuck in attaching/detaching loop
- This is a condition that can occur in ECS Longhorn storage.
- COMPX-13185 Upgrade from 1.4.1 to 1.5.0 failed - error: timed out waiting for the condition on jobs/helm-install-longhorn
- Before ECS upgrade, you must update a specific ECS server node toleration explicitly to ensure a cleaner upgrade process.
- OPSX-3073 [ECS] First run command failed at setup storage step with error "Timed out waiting for local path storage to come up"
- A pod is stuck in the Pending state on a host for a long time, with an error in the Role log related to the CNI plugin:
Warning FailedCreatePodSandBox 3m5s (x269 over 61m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "70427e9b26fb014750dfe4441fdfae96cb4d73e3256ff5673217602d503e806f": failed to find plugin "calico" in path [/opt/cni/bin]
- OPSX-3528: [Pulse] Prometheus config reload fails if multiple remote storage configurations exist with the same name
- It is possible to create multiple remote storage configurations with the same name. However, if such a situation occurs, the metrics will not flow to the remote storage, because the configuration reload of the original Prometheus will fail.
- OPSX-2062: Platform not shown on the Compute Cluster UI tab
- On the CDP Private Cloud Management Console UI in ECS, when listing the compute clusters, the Platform field is empty (null) instead of displaying RKE as the Platform.
- OPSX-1405: Able to create multiple CDP PVC Environments with the same name
- If two users try to create an environment with the same name at the same time, it might result in an unusable environment.
- OPSX-1412: Creating a new environment through the CDP CLI reports intermittently that "Environment name is not unique" even though it is unique
- When multiple users try to create the same environment at the same time or use automation to create an environment with retries, create environment may fail on collision with a previous request to create an environment.
- OPSX-3323: Custom Log Redaction | String is not getting redacted from all places in diagnostic bundle
- Custom redaction rule for URLs does not work.
- Cloudera Data Engineering service fails to start due to Ozone
- If the Ozone service is missing, misconfigured, or having other issues when an Environment is registered in the Management Console, CDE fails to start.
Known Issues in Management Console 1.4.1
- INSIGHT-2469: COE Insight from case 922848: Not able to connect to bit bucket
- After installing CML on an ECS cluster, users were not able to connect to the internal Bitbucket repository.
- OPSAPS-67046: Docker Server role fails to come up and returns a connection error
- A Docker server role can sometimes fail to come up and return the following error:
grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. Err :connection error: desc = "transport: error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout". Reconnecting... module=grpc failed to start containerd: timeout waiting for containerd to start
This error appears in the stderr file of the command, and can be caused by a mismatch in the pid of containerd.
- OPSX-1405: Able to create multiple CDP PVC Environments with the same name
- If two users try to create an environment with the same name at the same time, it might result in an unusable environment.
- OPSX-1412: Creating a new environment through the CDP CLI reports intermittently that "Environment name is not unique" even though it is unique
- When multiple users try to create the same environment at the same time or use automation to create an environment with retries, create environment may fail on collision with a previous request to create an environment.
- OPSX-3323: Custom Log Redaction | String is not getting redacted from all places in diagnostic bundle
- Custom redaction rule for URLs does not work.
- Cloudera Data Engineering service fails to start due to Ozone
- If the Ozone service is missing, misconfigured, or having other issues when an Environment is registered in the Management Console, CDE fails to start.
Known Issues in Management Console 1.4.0
- Cloudera Data Engineering service fails to start due to Ozone
- If the Ozone service is missing, misconfigured, or having other issues when an Environment is registered in the Management Console, CDE fails to start.
- OPSX-2062: Platform not shown on the Compute Cluster UI tab
- On the CDP Private Cloud Management Console UI in ECS, when listing the compute clusters, the Platform field is empty (null) instead of displaying RKE as the Platform.
- OPSX-2713: ECS Installation: Failed to perform First Run of services.
- If an issue is encountered during the Install Control Plane step of Containerized Cluster First Run, installation will be re-attempted infinitely rather than the command failing.
- OPSX-735: Kerberos service should handle CM downtime
- The Cloudera Manager Server in the base cluster must be running in order to generate Kerberos principals for Private Cloud. If there is downtime, you may observe Kerberos-related errors.
- OPSX-1405: Able to create multiple CDP PVC Environments with the same name
- If two users try to create an environment with the same name at the same time, it might result in an unusable environment.
- OPSX-1412: Creating a new environment through the CDP CLI intermittently reports that "Environment name is not unique" even though it is unique
- When multiple users try to create the same environment at the same time or use automation to create an environment with retries, create environment may fail on collision with a previous request to create an environment.
- OPSX-2484: FileAlreadyExistsException during timestamp filtering
- The timestamp filtering may result in a FileAlreadyExistsException when a file with the same name already exists in the /tmp directory.
- OPSX-2772: For Account Administrator user, update roles functionality should be disabled
- An Account Administrator user holds the largest set of privileges and cannot be modified through the current UI. Even if a user tries to modify the permissions, the system does not support changing them for an Account Administrator.
Known Issues for Management Console 1.3.x and lower
- Recover fast in case of node failures with ECS HA
- When a node is deleted from the cloud or made unavailable, it can take more than two minutes for the pods to be rescheduled on another node.
- Cloudera Manager 7.6.1 is not compatible with CDP Private Cloud Data Services version 1.3.4.
- You must use Cloudera Manager version 7.5.5 with this release.
- CDP Private Cloud Data Services ECS Installation: Failed to perform First Run of services.
- If an issue is encountered during the Install Control Plane step of Containerized Cluster First Run, installation will be re-attempted infinitely rather than the command failing.
- Environment creation through the CDP CLI fails when the base cluster includes Ozone
- Problem: An attempt to create an environment using the CDP command-line interface fails in a CDP Private Cloud Data Services deployment when the Private Cloud Base cluster is in a degraded state and includes the Ozone service.
- Filtering the diagnostic data by time range might result in a FileAlreadyExistsException
- Problem: Filtering the collected diagnostic data might result in a FileAlreadyExistsException if the /tmp directory already contains a file by that name.
- Full cluster name does not display in the Register Environment Wizard
- Kerberos service does not always handle Cloudera Manager downtime
- Problem: The Cloudera Manager Server in the base cluster must be running to generate Kerberos principals for CDP Private Cloud. If there is downtime, you might observe Kerberos-related errors.
- Management Console allows registration of two environments of the same name
- Problem: If two users attempt to register environments of the same name at the same time, this might result in an unusable environment.
- Not all images are pushed during upgrade
- A retry of a failed upgrade intermittently fails at the Copy Images to Docker Registry step due to images not being found locally.
- The Environments page on the Management Console UI for an environment in a deployment using ECS does not display the platform name
- Problem: When you view the details of an environment using the Management Console UI in a CDP Private Cloud Data Services deployment using ECS, the Platform field appears blank.
- Updating user roles for the admin user does not update privileges
- In the Management Console, changing roles on the User Management page does not change privileges of the admin user.
- Upgrade applies values that cannot be patched
- If the size of a persistent volume claim in a Containerized Cluster is manually modified, subsequent upgrades of the cluster will fail.
- Incorrect warning about stale Kerberos client configurations
- If Cloudera Manager is configured to manage krb5.conf, ECS clusters may display a warning that they have stale Kerberos client configurations. Clicking on the warning may show an "Access denied" error.
- Vault becomes sealed
- If a host in an ECS cluster fails or restarts, the Vault may become sealed. (You may see a Health Test alert in Cloudera Manager for the ECS service stating that the Vault instance is sealed.) A status-check sketch follows.
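A sketch for checking the seal status; the namespace and pod name below are typical for an ECS deployment but may differ in yours:
kubectl -n vault-system exec vault-0 -- vault status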