Known issues for Cloudera Data Services on premises 1.5.5
Review the known issues and limitations, the areas of impact, and workarounds in the Cloudera Data Services on premises 1.5.5 release.
Known Issues in Cloudera Data Services on premises 1.5.5
- OBS-8038: When using the Grafana Dashboard URL shortener, the shortened URL defaults to localhost:3000. This behaviour happens because the URL shortener uses the local server address instead of the actual domain name of the Cloudera Observability instance. As a result, users cannot access the shortened URL.
- You must not use the shortened URL. To ensure users can access the URL, update it to use the correct Cloudera Observability instance domain name, such as cp_domain/{shorten_url}.
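As an illustration, the host portion of a shortened URL can be rewritten before sharing it. This is a generic sketch; `observability.example.com` is a placeholder for your actual Cloudera Observability instance domain (cp_domain):

```shell
# Rewrite a shortened Grafana URL that points at localhost:3000 so it
# uses the real Cloudera Observability domain instead.
short_url="http://localhost:3000/goto/AbCdEf123"   # example shortened URL
cp_domain="observability.example.com"              # placeholder domain
fixed_url=$(echo "$short_url" | sed "s#^http://localhost:3000#https://${cp_domain}#")
echo "$fixed_url"
```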
- DWX-20809: Cloudera Data Services on premises installations on RHEL 8.9 or lower versions may encounter issues
- You may notice issues when installing Cloudera Data Services on premises on Cloudera Embedded Container Service clusters running on RHEL 8.9 or lower versions. Pod crashloops are noticed with the following error:
Warning FailedCreatePodSandBox 1s (x2 over 4s) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown
The issue is due to a memory leak with 'seccomp' (Secure Computing Mode) in the Linux kernel. If your kernel version is not 6.2 or higher, or if it is not part of the list of versions mentioned here, you may face issues during installation.
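A quick way to check whether a host's kernel predates 6.2 is a version comparison. This is a minimal sketch assuming a GNU userland (`sort -V`); it only checks the version number, not whether a vendor kernel carries a backported fix:

```shell
# Check whether the running kernel version is at least 6.2 (the version
# from which the seccomp memory leak is fixed upstream).
kernel_at_least() {
  # $1: kernel release, e.g. "4.18.0-477.el8"; $2: required "major.minor"
  local have=${1%%-*} want=$2
  # sort -V puts the smaller version first; if $want sorts first (or equal),
  # the running kernel is new enough.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

if kernel_at_least "$(uname -r)" "6.2"; then
  echo "kernel version is 6.2 or higher"
else
  echo "kernel may be affected by the seccomp leak"
fi
```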
- COMPX-20705: [153CHF-155] Post ECS upgrade pods are stuck in ApplicationRejected State
- After upgrading the Cloudera installation, pods on Kubernetes could be left in a failure state showing "ApplicationRejected". This is caused by a delay in settings being applied to Kubernetes as part of the post-upgrade steps.
- OPSX-6303 - ECS server went down - 'etcdserver: mvcc: database space exceeded'
- ECS server may fail with the error message "etcdserver: mvcc: database space exceeded" in large clusters.
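When etcd hits its storage quota, the usual recovery, per the upstream etcd maintenance runbook rather than a Cloudera-specific procedure, is to compact old revisions, defragment, and clear the NOSPACE alarm. The endpoint and credential flags are placeholders for your ECS etcd setup:

```shell
# Recovery sketch based on the upstream etcd maintenance documentation.
# Point ETCDCTL at your ECS etcd endpoint with the appropriate certs first.
rev=$(etcdctl endpoint status --write-out=json \
      | egrep -o '"revision":[0-9]+' | egrep -o '[0-9]+' | head -n1)
etcdctl compact "$rev"    # discard history before the current revision
etcdctl defrag            # reclaim the freed space on disk
etcdctl alarm disarm      # clear the NOSPACE alarm so writes resume
```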
- OPSX-6295 - Control Plane upgrade failing with cadence-matching and cadence-history
- If extra cadence-matching and cadence-history pods are stuck in the Init:CreateContainerError state, the Cloudera Embedded Container Service upgrade to 1.5.5 will be stuck in a retry loop because the all-pods-running validation fails.
- OPSX-4391 - External docker cert not base64 encoded
- When using Cloudera Data Services on premises on ECS, in some rare situations, the CA certificate for the Docker registry in the cdp namespace is incorrectly encoded, resulting in TLS errors when connecting to the Docker registry.
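One way to confirm the symptom is to check whether the stored CA value is valid base64 that decodes to a PEM certificate. This is a generic sketch; it does not use the exact secret name ECS stores the certificate under:

```shell
# Returns success if the argument is valid base64 that decodes to PEM data.
is_b64_pem() {
  printf '%s' "$1" | base64 -d 2>/dev/null | grep -q 'BEGIN CERTIFICATE'
}

# A raw (unencoded) PEM value fails the check; its base64 form passes.
raw_pem='-----BEGIN CERTIFICATE-----'
enc_pem=$(printf '%s' "$raw_pem" | base64 -w0)

is_b64_pem "$enc_pem" && echo "value is correctly base64 encoded"
is_b64_pem "$raw_pem" || echo "value is raw PEM; re-encode it with: base64 -w0 ca.crt"
```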
- OPSX-6245 - Airgap | Multiple pods are in pending state on rolling restart
- Performing back-to-back rolling restarts on ECS clusters can intermittently fail during the Vault unseal step. During rapid consecutive rolling restarts, the kube-controller-manager pod may not return to a ready state promptly. This can cause a cascading effect where other critical pods, including Vault, fail to initialize properly. As a result, the unseal Vault step fails.
- OPSX-4684 - Start ECS command shows green (finished) even though start docker server failed on one of the hosts
- The Docker service starts, but one or more Docker roles fail to start because the corresponding host is unhealthy.
- OPSX-5986 - ECS fresh install failing with helm-install-rke2-ingress-nginx pod failing to come into Completed state
- ECS fresh install fails at the "Execute command Reapply All Settings to Cluster on service ECS" step due to a timeout waiting for helm-install.
- OPSX-6298 - Issue on service namespace cleanup
- Uninstalling services from the Cloudera Data Services on premises UI might fail for various reasons.
- OPSX-6265 - Setting inotify max_user_instances config
- An exact value for the inotify max_user_instances setting cannot be recommended; it depends on all the workloads that run on a given node.
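For reference, the setting is adjusted with sysctl. The value below is purely illustrative and must be sized for the node's actual workloads:

```shell
# Illustrative only: choose a value appropriate for the node's workloads.
sysctl -w fs.inotify.max_user_instances=8192

# Persist the setting across reboots.
echo "fs.inotify.max_user_instances=8192" > /etc/sysctl.d/99-inotify.conf
sysctl --system
```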
- COMPX-20362 - Use API to create a pool that has a subset of resource types
- The Resource Management UI supports displaying only three resource types: CPU, memory, and GPU. When creating a quota, the UI always sets all three resource types it knows about: CPU, memory, and GPU (the Kubernetes resource nvidia.com/gpu). If no value is chosen for a resource type, a value of 0 is set, blocking the use of that resource type.
- OPSX-6952: Upgrade failure during post ECS upgrade control plane validation step
- During the upgrade process, the cluster operates with reduced resource capacity while the cluster restart job restarts all the nodes in the cluster. In this scenario, pod placement might get rejected and a few pods might end up in the "Error" state due to preemption. This results in an upgrade failure during the post ECS upgrade control plane validation step.
The following sample snippet shows the description of a pod in the "Error" state:
Reason: Preempting
Message: Preempted in order to admit critical pod
Known issues from previous releases carried in Cloudera Data Services on premises 1.5.5
Known Issues identified in 1.5.4
- DOCS-21833: Orphaned replicas/pods are not getting auto cleaned up leading to volume fill-up issues
- By default, Longhorn does not automatically delete the orphaned replica directory. You can enable automatic deletion by setting orphan-auto-deletion to true.
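The setting can be enabled through the Longhorn UI or, as sketched below, by patching the Longhorn Setting resource. The namespace and resource names assume a default Longhorn installation:

```shell
# Sketch assuming a default Longhorn install in the longhorn-system namespace.
kubectl -n longhorn-system patch settings.longhorn.io orphan-auto-deletion \
  --type=merge -p '{"value":"true"}'

# Verify the setting took effect.
kubectl -n longhorn-system get settings.longhorn.io orphan-auto-deletion
```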
- OPSX-5310: Longhorn engine images were not deployed on ECS server nodes
- Longhorn engine images were not deployed on ECS server nodes due to missing tolerations for Cloudera Control Plane taints. This caused the engine DaemonSet to schedule only on ECS agent nodes, preventing deployment on Cloudera Control Plane nodes.
- OPSX-5155: OS Upgrade | Pods are not starting after the OS upgrade from RHEL 8.6 to 8.8
- After an OS upgrade and a start of Cloudera Embedded Container Service, pods fail to come up due to stale state.
- OPSX-5055: Cloudera Embedded Container Service upgrade failed at Unseal Vault step
- During a Cloudera Embedded Container Service upgrade from the 1.5.2 to the 1.5.4 release, the vault pod fails to start due to an error caused by the Longhorn volume being unable to attach to the host. The error is as follows:
Warning FailedAttachVolume 3m16s (x166 over 5h26m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-0ba86385-9064-4ef9-9019-71976b4902a5" : rpc error: code = Internal desc = volume pvc-0ba86385-9064-4ef9-9019-71976b4902a5 failed to attach to node host-1.cloudera.com with attachmentID csi-7659ab0e6655d308d2316536269de47b4e66062539f135bf6012bfc8b41fc345: the volume is currently attached to different node host-2.cloudera.com
- OPSX-4684: Start Cloudera Embedded Container Service command shows green(finished) even though start docker server failed on one of the hosts
- The Docker service starts, but one or more Docker roles fail to start because the corresponding host is unhealthy.
- OPSX-735: Kerberos service should handle Cloudera Manager downtime
- The Cloudera Manager Server in the base cluster generates Kerberos principals for Cloudera on premises. If it experiences downtime, you may observe Kerberos-related errors.
Known Issues identified in 1.5.2
- OPSX-4594: [ECS Restart Stability] Post rolling restart few volumes are in detached state (vault being one of them)
- After a rolling restart, there may be some volumes in a detached state.
- OPSX-4392: Getting the real client IP address in the application
- CML has a feature that adds an audit event for each user action (Monitoring User Events). In Private Cloud, the internal IP address is logged to the internal database instead of the real client IP.
- CDPVC-1137, CDPAM-4388, COMPX-15083, and COMPX-15418: OpenShift Container Platform version upgrade from 4.10 to 4.11 fails due to a Pod Disruption Budget (PDB) issue
- A PDB can prevent a node from draining, which makes the nodes report the Ready,SchedulingDisabled state. As a result, the node is not updated to the correct Kubernetes version when you upgrade OpenShift Container Platform from 4.10 to 4.11.
