Known issues for Cloudera Data Services on premises 1.5.5 SP3

Review the known issues for Cloudera Data Services on premises, the impact or changes to the functionality, and the applicable workaround.

Known issues identified in 1.5.5 SP3 release

OPSAPS-77943: Cloudera Control Plane installation fails with an error message cdp-embedded-db-0 is not Ready
Longhorn does not support using a symlink for the default data path/storage directory. The path must be a real directory on the host; symlinked paths can cause disk detection failures, replica scheduling issues, and volume attach/mount failures.
None
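To check ahead of installation whether a candidate Longhorn data path is a symlink, a minimal sketch follows. The function name is_real_directory and the example path are illustrative assumptions, not part of the product, and the check covers only the final path component (a symlinked parent directory is not detected).

```shell
# Hypothetical helper: succeeds only if the path is a directory
# and is not itself a symbolic link.
is_real_directory() {
    # Strip one trailing slash so that "link/" is still tested as the link itself.
    case "$1" in
        /) candidate=/ ;;
        *) candidate="${1%/}" ;;
    esac
    # -d follows symlinks, so test -L first.
    if [ -L "$candidate" ]; then
        echo "$candidate is a symlink" >&2
        return 1
    fi
    [ -d "$candidate" ]
}

# Example (the path is an assumption; substitute the directory
# configured as the Longhorn data path):
# is_real_directory /longhorn && echo "real directory"
```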
OPSAPS-78075: In Cloudera Manager, selecting Use Default Configuration does not show the input for Administrator credentials
In the Cloudera Embedded Container Service environment, when the Use Default Configuration checkbox is selected on the Cluster Basic UI, the Configure default login credentials for Control Plane panel is not displayed in the Configure Data Services UI. If the Administrator credentials are not configured, the Cloudera Embedded Container Service installation for the Cloudera Data Services on premises 1.5.5 SP3 release fails.
Note the following process to overcome this scenario.
  • Proceed without selecting the Use Default Configuration checkbox.
  • On the Configure Data Services page, the Configure Default Admin Credentials for Control Plane panel is displayed.
OPSX-7794: The docker copy script fails checksum validation on Docker version 29.5.2
While using the Custom Docker Repository Installation and Upgrade process, selecting Docker version 27+ for the copy docker script (copy-docker.txt) causes checksum validation to fail. The script reports a "does not match" error for each image it tries to copy.
Switch to Docker Version 26.
OPSAPS-77489: Calling the rotateEcsCertificates API command directly returns a 500 server error
Invoking the Cloudera Manager API endpoint /clusters/{clusterName}/services/{serviceName}/commands/rotateEcsCertificates results in a 500 server error.
Invoke the following Cloudera Manager API endpoint to run the same command:
/clusters/{clusterName}/services/{serviceName}/commands/RotateEcsCerts
OPSX-6950: Data Recovery Service based Restore job fails because the services in the cert-manager namespace are not created

The Data Recovery Service Restore job fails due to a ClusterIP allocation conflict. During a Data Recovery Service restore process, the operation may fail with a Kubernetes error indicating that a Service ClusterIP is already allocated. This occurs when the restore process attempts to recreate a service using a ClusterIP that is currently in use by another existing service in the cluster. A typical error message looks like:

Service "cdp-release-cert-manager-cainjector" is invalid: spec.clusterIPs: failed to allocate IP <IP_ADDRESS>: provided IP is already allocated

  1. Identify which service is currently using the conflicting IP address:
     kubectl get svc -A -o wide | grep <IP_ADDRESS>
  2. Retry the Data Recovery Service restore. On retry, the restore process cleans up existing resources (including the service currently holding the IP) and then proceeds with the restore.

If the Cloudera Control Plane UI is not accessible, contact Cloudera Support for assistance.
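A plain grep on the IP address also matches longer addresses that contain it as a substring (for example, an address ending in .1 also matches one ending in .10). As a minimal sketch, the filter below compares the CLUSTER-IP column exactly. The function name find_svc_by_cluster_ip is hypothetical, and the sketch assumes the column layout printed by kubectl get svc -A -o wide, where CLUSTER-IP is the fourth column.

```shell
# Hypothetical helper: reads `kubectl get svc -A -o wide` output on stdin and
# prints the header row plus every row whose CLUSTER-IP column (field 4)
# equals the given IP address exactly.
find_svc_by_cluster_ip() {
    awk -v ip="$1" 'NR == 1 || $4 == ip'
}

# Example (replace <IP_ADDRESS> with the address from the error message):
# kubectl get svc -A -o wide | find_svc_by_cluster_ip <IP_ADDRESS>
```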
OPSX-7767: While performing a Cloudera Embedded Container Service upgrade from Cloudera Data Services on premises 1.5.5 SP2 to 1.5.5 SP3, the upgrade process may fail during the restart step with the message: failed to reconcile with local datastore: context deadline exceeded
During a Cloudera Embedded Container Service upgrade from Cloudera Data Services on premises 1.5.5 SP2 to 1.5.5 SP3, the upgrade process may fail in the restart step with the following error: failed to bootstrap cluster data: failed to reconcile with local datastore: context deadline exceeded.

When the ECS_SERVER role is stopped as part of the restart during the upgrade workflow, not all RKE2-related processes are terminated. Some of the orphaned processes continue to run, including:

  • etcd
  • kube-proxy
  • kube-apiserver

These processes are left running as containerd static pods. Because no rke2 server process is active while ports 2379 and 6443 are still in use, the subsequent start of the rke2 server from Cloudera Manager fails during bootstrap reconciliation with the error: failed to reconcile with local datastore: context deadline exceeded.

To overcome this situation:
  1. Run rke2-killall on the affected Cloudera Embedded Container Service server host to stop the orphaned processes.
  2. Resume the upgrade process from Cloudera Manager.
  3. The ECS_SERVER role (RKE2) then starts without errors, and the upgrade process completes successfully.
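To confirm whether ports 2379 and 6443 are still held before and after running rke2-killall, the listening sockets on the host can be filtered as in the following minimal sketch. The function name find_rke2_listeners is hypothetical, and the sketch assumes the column layout of ss -ltn, where the local address and port are in the fourth column.

```shell
# Hypothetical helper: reads `ss -ltn` output on stdin and prints the listening
# sockets whose local port is 2379 (etcd) or 6443 (kube-apiserver).
find_rke2_listeners() {
    # Field 4 is "Local Address:Port"; anchor on the final ":port".
    awk '$4 ~ /:(2379|6443)$/'
}

# Example, run on the affected Cloudera Embedded Container Service server host:
# ss -ltn | find_rke2_listeners
```

No output suggests that neither port is in use; any printed line identifies a socket that is still listening.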
OPSX-6858: Cloudera Embedded Container Service first run is stuck in loop at install-cp step (mke2fs command fails) on KCloud

In some Cloudera Embedded Container Service installations, pods get stuck in the creating state because the associated Longhorn mount on those pods fails.

This scenario is observed when Longhorn PVC block devices contain stale filesystem or partition metadata from previous use. It can be verified by running lsof /dev/longhorn/pvc-2e2dc23b-82d6-45cd-9348-b40eba0fb4e1.

As a result, the mke2fs command, which is required for setting up volume mounts on the pod, fails.

Manually run the wipefs command on the node where the pod is running. For example:
  wipefs /dev/longhorn/pvc-2e2dc23b-82d6-45cd-9348-b40eba0fb4e1
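Because wipefs operates on whatever block device it is given, it may be worth guarding against a mistyped path before running it. The following is a minimal sketch; the function name is_longhorn_pvc_device is hypothetical, and the only assumption about device naming is the /dev/longhorn/pvc-<id> form shown in the example above.

```shell
# Hypothetical guard: succeeds only for paths of the form /dev/longhorn/pvc-<id>,
# with no further path components after the device name.
is_longhorn_pvc_device() {
    name="${1#/dev/longhorn/}"
    # If nothing was stripped, the expected prefix was not present.
    [ "$name" != "$1" ] || return 1
    case "$name" in
        */*)    return 1 ;;  # reject nested or traversal paths
        pvc-?*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Example, run on the node where the pod is running:
# dev=/dev/longhorn/pvc-2e2dc23b-82d6-45cd-9348-b40eba0fb4e1
# is_longhorn_pvc_device "$dev" && lsof "$dev"
# is_longhorn_pvc_device "$dev" && wipefs "$dev"
```

Note that in util-linux, wipefs run without options only lists the signatures it detects; erasing them requires an explicit option such as --all, so check which invocation is intended before relying on it.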