Known issues for the CDP Private Cloud Data Services Management Console

This section lists known issues that you might run into while using the CDP Private Cloud Management Console service.

Known Issues in Management Console 1.5.0

OPSX-4560: ECS server restart failed in an air gap environment as it requires "yum install" from repo
ECS service restart may fail in an air gap environment with a yum download error.
Workaround:

Open the /opt/cloudera/cm-agent/service/ecs/rke.sh file with a text editor and remove the following line:

yum -y install iscsi-initiator-utils nfs-utils
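
The same edit can be scripted. The following is a minimal sketch, assuming the line appears in rke.sh exactly as shown above (optionally indented). The function name remove_yum_install_line is hypothetical, and the original file is kept as a backup with a .bak suffix.

```shell
# Hypothetical helper: delete the "yum -y install ..." line from the given file.
# sed keeps a backup of the original next to it as <file>.bak.
remove_yum_install_line() {
  file="$1"
  sed -i.bak -E \
    '/^[[:space:]]*yum -y install iscsi-initiator-utils nfs-utils[[:space:]]*$/d' \
    "$file"
}
```

For example: remove_yum_install_line /opt/cloudera/cm-agent/service/ecs/rke.sh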
Longhorn-4212 Somehow the Rebuilding field inside volume.meta is set to true causing the volume to get stuck in attaching/detaching loop

This is a condition that can occur in ECS Longhorn storage.

Because the volume has only one replica in this case, you can recover it as follows:

1. Scale down the workload. The Longhorn volume will be detached.

2. Wait for the Longhorn volume to be detached.

3. SSH into the node that has the replica.

4. cd into the replica folder (for example, /longhorn/replicas/pvc-126d40e2-7bff-4679-a310-e444e84df267-1a5dc941).

5. Change the "Rebuilding" field from true to false in the volume.meta file.

6. Scale up the workload to attach the volume.
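
Step 5 can be done with a one-line substitution instead of a manual edit. This is a sketch only, assuming volume.meta is a JSON file in which the key is written as "Rebuilding": true (with or without spaces around the colon). The function name clear_rebuilding_flag is hypothetical.

```shell
# Hypothetical helper: set "Rebuilding" to false in a replica's volume.meta.
# Only the Rebuilding key is touched; a backup is saved as <file>.bak.
clear_rebuilding_flag() {
  meta="$1"
  sed -i.bak -E \
    's/("Rebuilding"[[:space:]]*:[[:space:]]*)true/\1false/' \
    "$meta"
}
```

Run it from inside the replica folder while the volume is detached, for example: clear_rebuilding_flag volume.meta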

COMPX-13185 Upgrade from 1.4.1 to 1.5.0 failed - error: timed out waiting for the condition on jobs/helm-install-longhorn
Before ECS upgrade, you must update a specific ECS server node toleration explicitly to ensure a cleaner upgrade process.

Delete the cni directory on the host failing to launch pods:

ssh root@ecs-ha1-p-7.vpc.cloudera.com rm -rf /var/lib/cni

Before ECS upgrade, run the following commands on the ECS Server hosts:

TOLERATION='{"spec": { "template": {"spec": { "tolerations": [{ "effect": "NoSchedule","key": "node-role.kubernetes.io/control-plane","operator": "Exists" }]}}}}'

kubectl patch deployment/yunikorn-admission-controller -n yunikorn -p "$TOLERATION"

kubectl patch deployment/yunikorn-scheduler -n yunikorn -p "$TOLERATION"
OPSX-3073 [ECS] First run command failed at setup storage step with error "Timed out waiting for local path storage to come up"
A pod remains stuck in the Pending state on a host for a long time, and the role log shows an error related to the CNI plugin:

Events:

   Type     Reason                  Age                   From     Message
  ----     ------                  ----                  ----     -------
  Warning  FailedCreatePodSandBox  3m5s (x269 over 61m)  kubelet  (combined from similar events): 
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox 
"70427e9b26fb014750dfe4441fdfae96cb4d73e3256ff5673217602d503e806f": 
failed to find plugin "calico" in path [/opt/cni/bin] 

Delete the cni directory on the host failing to launch pods:

ssh root@ecs-ha1-p-7.vpc.cloudera.com rm -rf /var/lib/cni

Restart the canal pod running on that host:

kubectl get pods -n kube-system -o wide | grep  ecs-ha1-p-7.vpc.cloudera.com
kube-proxy-ecs-ha1-p-7.vpc.cloudera.com                 1/1     Running     0          11h   10.65.52.51    ecs-ha1-p-7.vpc.cloudera.com   <none>           <none>
rke2-canal-llkc9                                        2/2     Running     0          11h   10.65.52.51    ecs-ha1-p-7.vpc.cloudera.com   <none>           <none>
rke2-ingress-nginx-controller-dqtz8                     1/1     Running     0          11h   10.65.52.51    ecs-ha1-p-7.vpc.cloudera.com   <none>           <none>
kubectl delete pod rke2-canal-llkc9 -n kube-system
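
Picking the canal pod out of the listing can be automated. The sketch below assumes the pod name starts with rke2-canal- and that the host name appears as its own column, as in the output above. The function name canal_pod_on_host is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods -n kube-system -o wide` output
# on stdin and print the name of each rke2-canal pod that lists the given host.
canal_pod_on_host() {
  awk -v host="$1" '
    $1 ~ /^rke2-canal-/ {
      for (i = 2; i <= NF; i++) if ($i == host) { print $1; break }
    }'
}
```

A possible use, with the host name from the example above:

pod=$(kubectl get pods -n kube-system -o wide | canal_pod_on_host ecs-ha1-p-7.vpc.cloudera.com)
kubectl delete pod "$pod" -n kube-system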
OPSX-3528: [Pulse] Prometheus config reload fails if multiple remote storage configurations exist with the same name
It is possible to create multiple remote storage configurations with the same name. If this happens, metrics do not flow to the remote storage because the configuration reload of the original Prometheus fails.
Make sure that no two remote storage configurations ever share the same name.
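
If you keep a list of the remote storage configuration names, duplicates can be spotted mechanically. This is a sketch under the assumption that the names are available as plain text, one per line; how the list is obtained is not covered here. The function name duplicate_names is hypothetical.

```shell
# Hypothetical helper: read configuration names on stdin (one per line)
# and print every name that occurs more than once.
duplicate_names() {
  sort | uniq -d
}
```

Empty output means that all names in the list are unique.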
OPSX-3716: Certificates updated against key "undefined" from control plane UI
When you upload an additional CA certificate, the form does not require you to choose a certificate type, so the certificate can be uploaded with the type 'undefined'.
Choose the CA certificate type before you upload the certificate.
OPSX-2062: Platform not shown on the Compute Cluster UI tab
On the CDP Private Cloud Management Console UI in ECS, when listing the compute clusters, the Platform field is empty (null) instead of displaying RKE as the Platform.
OPSX-3792
The data recovery service lets you delete the backup while the restore event is in progress.
OPSX-3794
You cannot restore a backup if the CDP User Management System (UMS) is not running.
OPSX-1405: Able to create multiple CDP PVC Environments with the same name
If two users try to create an environment with the same name at the same time, it might result in an unusable environment.
Delete the environment and try again with only one user trying to create the environment.
OPSX-1412: Creating a new environment through the CDP CLI reports intermittently that "Environment name is not unique" even though it is unique
When multiple users try to create the same environment at the same time, or when automation creates an environment with retries, environment creation might fail because it collides with a previous request to create an environment.
Delete the existing environment, wait 5 minutes, and try again.
OPSX-3323: Custom Log Redaction | String is not getting redacted from all places in diagnostic bundle
Custom redaction rule for URLs does not work.
Cloudera Data Engineering service fails to start due to Ozone
If the Ozone service is missing, misconfigured, or having other issues when an Environment is registered in the Management Console, CDE fails to start.
Workaround:
  1. Correct the issues with the Ozone service.
  2. Ensure that Ozone is running as expected.
  3. Re-create the environment.
  4. Create a new Cloudera Data Engineering service.

Known Issues in Management Console 1.4.1

INSIGHT-2469: COE Insight from case 922848: Not able to connect to bit bucket
After installing CML on an ECS cluster, users were not able to connect to the internal Bitbucket repository.
Workaround:

In this case, the MTU of the ECS virtual network interfaces was larger than that of the host's external interface, which can cause network requests from ECS containers to be truncated.

The Container Network Interface (CNI) is a framework for dynamically configuring networking resources. CNI integrates smoothly with Kubernetes to enable the use of an overlay or underlay network to automatically configure the network between pods. Cloudera ECS uses Calico as the CNI network provider.

The MTU of the pods’ virtual network interface can be seen by running the ifconfig command.

The default MTU of the virtual network interfaces is 1450.

The MTU setting of the virtual interfaces is stored as a configmap in the kube-system namespace. To modify the MTU, edit the rke2-canal-config configmap.

$ /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml --namespace kube-system edit cm rke2-canal-config

Find the veth_mtu parameter in the YAML content. Modify the default value of 1450 to the required MTU size.

Next, restart the rke2-canal pods in the kube-system namespace. Each ECS node runs an rke2-canal pod.

After the pods are restarted, all subsequent new pods will use the new MTU setting. However, existing pods that are already running will remain on the old MTU setting. Restart all of the pods to apply the new MTU setting.
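
To decide whether the MTU needs lowering, compare the MTU of a pod's virtual interface with that of the host's external interface. The sketch below assumes interface output in which the word "mtu" is followed by the value as a separate word, as printed by ip link and by recent versions of ifconfig. Both function names are hypothetical.

```shell
# Hypothetical helper: print the first MTU value found in interface output
# read from stdin (expects "... mtu <number> ...").
mtu_from_link_output() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit } }'
}

# Hypothetical helper: succeed (exit 0) when the pod-side MTU ($1) is larger
# than the host-side MTU ($2), which is the condition described above.
pod_mtu_too_large() {
  [ "$1" -gt "$2" ]
}
```

For example, with the default veth MTU of 1450 and a host interface named eth0 (an assumed name):

host_mtu=$(ip -o link show eth0 | mtu_from_link_output)
pod_mtu_too_large 1450 "$host_mtu" && echo "lower veth_mtu to at most $host_mtu"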

OPSX-1405: Able to create multiple CDP PVC Environments with the same name
If two users try to create an environment with the same name at the same time, it might result in an unusable environment.
Delete the environment and try again with only one user trying to create the environment.
OPSX-1412: Creating a new environment through the CDP CLI reports intermittently that "Environment name is not unique" even though it is unique
When multiple users try to create the same environment at the same time, or when automation creates an environment with retries, environment creation might fail because it collides with a previous request to create an environment.
Delete the existing environment, wait 5 minutes, and try again.
OPSX-3323: Custom Log Redaction | String is not getting redacted from all places in diagnostic bundle
Custom redaction rule for URLs does not work.
Cloudera Data Engineering service fails to start due to Ozone
If the Ozone service is missing, misconfigured, or having other issues when an Environment is registered in the Management Console, CDE fails to start.
Workaround:
  1. Correct the issues with the Ozone service.
  2. Ensure that Ozone is running as expected.
  3. Re-create the environment.
  4. Create a new Cloudera Data Engineering service.

Known Issues in Management Console 1.4.0

Cloudera Data Engineering service fails to start due to Ozone
If the Ozone service is missing, misconfigured, or having other issues when an Environment is registered in the Management Console, CDE fails to start.
Workaround:
  1. Correct the issues with the Ozone service.
  2. Ensure that Ozone is running as expected.
  3. Re-create the environment.
  4. Create a new Cloudera Data Engineering service.
OPSX-2062: Platform not shown on the Compute Cluster UI tab
On the CDP Private Cloud Management Console UI in ECS, when listing the compute clusters, the Platform field is empty (null) instead of displaying RKE as the Platform.
None.
OPSX-2713: ECS Installation: Failed to perform First Run of services.
If an issue is encountered during the Install Control Plane step of Containerized Cluster First Run, the installation is retried indefinitely instead of the command failing.
Since the control plane is installed and uninstalled in a continuous cycle, it is often possible to address the cause of the failure while the command is still running, at which point the next attempted installation should succeed. If this is not successful, abort the First Run command, delete the Containerized Cluster, address the cause of the failure, and retry from the beginning of the Add Cluster wizard. Any nodes that are re-used must be cleaned before re-attempting installation.
OPSX-735: Kerberos service should handle CM downtime
The Cloudera Manager Server in the base cluster must be running in order to generate Kerberos principals for Private Cloud. If there is downtime, you may observe Kerberos-related errors.
Resolve downtime on Cloudera Manager. If you encountered Kerberos errors, you can retry the operation (such as retrying creation of the Virtual Warehouse).
OPSX-1405: Able to create multiple CDP PVC Environments with the same name
If two users try to create an environment with the same name at the same time, it might result in an unusable environment.
Delete the environment and try again with only one user trying to create the environment.
OPSX-1412: Creating a new environment through the CDP CLI intermittently reports that "Environment name is not unique" even though it is unique
When multiple users try to create the same environment at the same time, or when automation creates an environment with retries, environment creation might fail because it collides with a previous request to create an environment.
Delete the existing environment, wait 5 minutes, and try again.
OPSX-2484: FileAlreadyExistsException during timestamp filtering
Timestamp filtering may result in a FileAlreadyExistsException if a file with the same name already exists in the tmp directory.
OPSX-2772: For Account Administrator user, update roles functionality should be disabled
An Account Administrator user holds the largest set of privileges, and these privileges cannot be modified through the current UI. Even if a user tries to modify the permissions, the system does not support changing them for an Account Administrator.

Known Issues for Management Console 1.3.x and lower

Recover fast in case of node failures with ECS HA
When a node is deleted from the cloud or made unavailable, it takes more than two minutes for the pods to be rescheduled on another node.
It takes some time for the nodes to recover. Failure detection and pod-transitioning are not instantaneous.
Cloudera Manager 7.6.1 is not compatible with CDP Private Cloud Data Services version 1.3.4.
You must use Cloudera Manager version 7.5.5 with this release.
CDP Private Cloud Data Services ECS Installation: Failed to perform First Run of services.
If an issue is encountered during the Install Control Plane step of Containerized Cluster First Run, the installation is retried indefinitely instead of the command failing.
Workaround: Since the control plane is installed and uninstalled in a continuous cycle, it is often possible to address the cause of the failure while the command is still running, at which point the next attempted installation should succeed. If this is not successful, abort the First Run command, delete the Containerized Cluster, address the cause of the failure, and retry from the beginning of the Add Cluster wizard. Any nodes that are re-used must be cleaned before re-attempting installation.
Environment creation through the CDP CLI fails when the base cluster includes Ozone
Problem: Attempt to create an environment using the CDP command-line interface fails in a CDP Private Cloud Data Services deployment when the Private Cloud Base cluster is in a degraded state and includes Ozone service.
Workaround: Stopping the Ozone service temporarily in the Private Cloud Base cluster during environment creation prevents the control plane from using Ozone as a logging destination, and avoids this issue.
Filtering the diagnostic data by time range might result in a FileAlreadyExistsException
Problem: Filtering the collected diagnostic data might result in a FileAlreadyExistsException if the /tmp directory already contains a file by that name.
There is currently no workaround for this issue.
Full cluster name does not display in the Register Environment Wizard
None
Kerberos service does not always handle Cloudera Manager downtime
Problem: The Cloudera Manager Server in the base cluster must be running to generate Kerberos principals for CDP Private Cloud. If there is downtime, you might observe Kerberos-related errors.
Resolve downtime issues on Cloudera Manager. If you encounter Kerberos errors, you can retry the affected operation, such as creating Virtual Warehouses.
Management Console allows registration of two environments of the same name
Problem: If two users attempt to register environments of the same name at the same time, this might result in an unusable environment.
Delete the environment and ensure that only one user attempts to register a new environment.
Not all images are pushed during upgrade
A retry of a failed upgrade intermittently fails at the Copy Images to Docker Registry step due to images not being found locally.
The failed images can be loaded manually (with a docker load), and the upgrade can then be resumed. To identify which images need to be loaded, check the stderr file. The downloaded images are present in the Docker Data Directory.
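
Loading several images by hand can be wrapped in a small loop. This is a sketch only: it assumes you have already identified the image archive files from the stderr file and pass their paths as arguments. The function name load_images and the DOCKER override variable are hypothetical.

```shell
# Hypothetical helper: run `docker load -i` for each image archive path given.
# Set DOCKER to another command (for example DOCKER=echo) for a dry run.
load_images() {
  for archive in "$@"; do
    "${DOCKER:-docker}" load -i "$archive" || return 1
  done
}
```

The loop stops at the first archive that fails to load, so the remaining files can be retried after the cause is fixed.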
The Environments page on the Management Console UI for an environment in a deployment using ECS does not display the platform name
Problem: When you view the details of an environment using the Management Console UI in a CDP Private Cloud Data Services deployment using ECS, the Platform field appears blank.
Use the relevant CDP CLI command from the environments module to view the required details.
Updating user roles for the admin user does not update privileges
In the Management Console, changing roles on the User Management page does not change privileges of the admin user.
None
Upgrade applies values that cannot be patched
If the size of a persistent volume claim in a Containerized Cluster is manually modified, subsequent upgrades of the cluster will fail.
Incorrect warning about stale Kerberos client configurations

If Cloudera Manager is configured to manage krb5.conf, ECS clusters may display a warning that they have stale Kerberos client configurations. Clicking on the warning may show an "Access denied" error.

No action is needed. ECS clusters do not require Kerberos client configurations to be deployed on those hosts.
Vault becomes sealed
If a host in an ECS cluster fails or restarts, the Vault may become sealed. (You may see a Health Test alert in Cloudera Manager for the ECS service stating Vault instance is sealed.)
Unseal the Vault. In the Cloudera Manager Admin Console, go to the ECS service and click Actions > Unseal.