Fixed Issues for the CDP Private Cloud Data Services Management Console

This section lists the issues that have been fixed since the last release of the CDP Private Cloud Management Console service.

Fixed Issues in Management Console 1.5.3

PULSE-697: Add node-exporter to PvC DS

Fixed the issue where, when a cluster was expanded with new nodes that had insufficient CPU and memory resources, the Node Exporter encountered difficulties deploying new pods on the additional nodes.

OPSAPS-69556: 1.5.1->1.5.2 Upgrade[Public Registry+public bits] - fails with ImagePullErrors, registry modified to point to docker-private during upgrade

Previously, when upgrading using the Cloudera public registry with public bits, the Docker registry would incorrectly change to point to docker-private.infra.cloudera.com. This has now been fixed to point to the correct registry.

Fixed Issues in Management Console 1.5.2

OPSX-2062: Platform not shown on the Compute Cluster UI tab

Fixed the issue where, on the CDP Private Cloud Console UI in ECS, the Platform field was empty (null) when listing the compute clusters instead of displaying RKE as the platform.

OPSX-4397: The domain name parameter from environment service has an additional 'apps' subdomain

Previously, the domain name of any environment in an OpenShift installation contained an additional "apps." subdomain. This has now been fixed.

OPSX-4407, OPSAPS-67046: Docker Server role fails to come up and returns a connection error during ECS upgrade
Fixed the ECS issue where, during an upgrade, a Docker Server role would sometimes fail to come up and return the following error:
grpc: addrConn.createTransport failed to connect to {unix:///var/run/docker/containerd/containerd.sock <nil> 0 <nil>}. 
Err :connection error: desc = "transport: error while dialing: dial unix:///var/run/docker/containerd/containerd.sock: timeout". 
Reconnecting... module=grpc 
failed to start containerd: timeout waiting for containerd to start

This error appeared in the stderr file of the command and could be caused by a mismatch in the containerd PID.

Fixed Issues in Management Console 1.5.1

OPSAPS-65551: Increase default fd limit for ECS
The default maximum process file descriptor (FD) limit for the ECS Agent and Server roles has been set to 1048576 to avoid "too many open files" errors. Changes have been made to EcsParams, EcsAgentRoleHandler, and EcsServerRoleHandler.
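The following minimal sketch (not part of the product) shows one way to verify the effective limit on an ECS host by reading the soft "Max open files" value from /proc for a given role process; the PID passed on the command line is a placeholder for the ECS Agent or Server process ID.

import re
import sys

def max_open_files(pid: int) -> int:
    # Return the soft "Max open files" limit for the given process.
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Columns: limit name, soft limit, hard limit, units
                fields = re.split(r"\s{2,}", line.strip())
                return int(fields[1])
    raise RuntimeError("Max open files entry not found")

if __name__ == "__main__":
    pid = int(sys.argv[1])  # PID of the ECS Agent or Server process
    print(f"Soft FD limit for PID {pid}: {max_open_files(pid)}")
    # After the fix, this is expected to report 1048576.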
OPSAPS-65852: Any stop of an ECS role should include a drain
OPSAPS-65852: Any stop of an ECS role should include a drain
Previously, stopping and starting an ECS role only stopped and started the role itself, which caused issues in Kubernetes and caused Longhorn volume health to degrade. Now, when a user stops an ECS role (Server or Agent), the node is first cordoned and then drained, after which the ECS role on the node is stopped. When starting an ECS role, the role is started first and the node is then uncordoned so that Kubernetes can reuse it for workloads. A restart of the ECS service performs a rolling restart, which applies the same stop and start steps one node at a time.
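As a rough illustration of that ordering (not the actual Cloudera Manager implementation), the sketch below drives the cordon, drain, and uncordon steps with kubectl from Python; the node names and the role stop/start hooks are placeholders.

import subprocess

def kubectl(*args: str) -> None:
    # Run a kubectl command and fail loudly if it does not succeed.
    subprocess.run(["kubectl", *args], check=True)

def stop_ecs_role(node: str) -> None:
    kubectl("cordon", node)                      # stop scheduling new pods on the node
    kubectl("drain", node, "--ignore-daemonsets",
            "--delete-emptydir-data")            # evict pods so volumes detach cleanly
    print(f"stopping ECS role on {node}")        # placeholder for the actual role stop

def start_ecs_role(node: str) -> None:
    print(f"starting ECS role on {node}")        # placeholder for the actual role start
    kubectl("uncordon", node)                    # let Kubernetes reuse the node

def rolling_restart(nodes: list[str]) -> None:
    # A restart applies the same stop and start steps, one node at a time.
    for node in nodes:
        stop_ecs_role(node)
        start_ecs_role(node)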
OPSX-3716: Certificates updated against key "undefined" from control plane UI
Previously, users were able to upload certificates without choosing a certificate type, which caused certificates to be saved as undefined. This fix now requires users to choose a certificate type before they can save the certificate.
OPSAPS-58019: krb5.conf had includedir DIRNAME that caused krb5.conf to not get copied into CML and CDW
Fixed the issue where Kerberos-related failures sometimes occurred if the /etc/krb5.conf file on the Cloudera Manager host contained include or includedir directives. The content referenced by these two directives is now expanded inline into the krb5.conf content before it is returned to the user, so the referenced files no longer need to be dereferenced by the user.
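The sketch below illustrates the expansion idea only (it is not Cloudera's implementation): include and includedir directives are replaced by the contents of the files they point to, producing a self-contained krb5.conf.

import os

def expand_krb5_conf(path: str) -> str:
    # Return krb5.conf content with include/includedir directives expanded inline.
    lines = []
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if stripped.startswith("includedir "):
                directory = stripped.split(None, 1)[1]
                # krb5 only reads files whose names are alphanumeric, dashes, or
                # underscores; this sketch simply reads every file in the directory.
                for name in sorted(os.listdir(directory)):
                    lines.append(expand_krb5_conf(os.path.join(directory, name)))
            elif stripped.startswith("include "):
                lines.append(expand_krb5_conf(stripped.split(None, 1)[1]))
            else:
                lines.append(line.rstrip("\n"))
    return "\n".join(lines)

For example, expand_krb5_conf("/etc/krb5.conf") returns the fully expanded configuration as a single string.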
OPSX-3942, ENGESC-19665: CP logs occupies large amount of disk space
Fixed the issue where control plane logs were taking up a large amount of disk space:

1. Clean up the files created under the /tmp directory after the bundle collection.

2. Include control plane logs while purging. Control plane logs are present in the /data/cp directory; a sketch of the purge idea follows this list.
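Purely as an illustration of that purge idea (not the actual purge job), the sketch below removes control plane log files under /data/cp that are older than a retention window; the directory and the retention period are assumptions for the example.

import os
import time

CP_LOG_DIR = "/data/cp"   # directory where control plane logs are collected
RETENTION_DAYS = 7        # assumed retention window for this sketch

def purge_old_logs(root: str, retention_days: int) -> None:
    # Delete regular files under root that are older than retention_days.
    cutoff = time.time() - retention_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                os.remove(path)

if __name__ == "__main__":
    purge_old_logs(CP_LOG_DIR, RETENTION_DAYS)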

OPSX-3619: Installer exits even with pending pods in single node installation
Fixed the issue with a single-node ECS deployment where the installer exited prematurely while pods were still in a pending state.
OPSX-4010: [UI Issue] Deletion Response sent immediately but deletion happens in 1 Min
Fixed the issue where, after a backup deletion request from the UI and a confirmation pop-up, the actual deletion did not occur until approximately one minute later.
ENGESC-20112: Unable to progress ECS Upgrade
Fixed the issue where the Starting ECS Agent command failed during upgrade, but did not support retry, so the upgrade could not be resumed.
OPSX-2062: Platform not shown on the Compute Cluster UI tab
Fixed the issue on the CDP Private Cloud Console UI in ECS where, when listing the compute clusters, the Platform field was empty (null) instead of displaying RKE as the platform.
OPSAPS-66433: Support rolling upgrade for Docker service
Added support for rolling restart of the embedded Docker service. Rolling upgrade of the Docker service is now supported when upgrading Private Cloud.
OPSAPS-66559: Create command to clear pending pods in the cluster
The Refresh ECS command restarts pods that have been in the Pending state for 10 minutes. This value can be configured using the WAIT_TIME_FOR_POD_READINESS parameter.
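As a rough sketch of that behavior (using the public Kubernetes Python client rather than the actual Refresh ECS implementation), the example below deletes pods that have been Pending longer than a configurable wait time so their controllers recreate and reschedule them; reading the threshold from an environment variable is an assumption of the sketch.

import os
from datetime import datetime, timezone

from kubernetes import client, config

# Wait time before a Pending pod is restarted; mirrors the idea of the
# WAIT_TIME_FOR_POD_READINESS parameter, defaulting to 10 minutes.
WAIT_MINUTES = int(os.environ.get("WAIT_TIME_FOR_POD_READINESS", "10"))

def restart_stuck_pending_pods() -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    now = datetime.now(timezone.utc)
    pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
    for pod in pending.items:
        age_minutes = (now - pod.metadata.creation_timestamp).total_seconds() / 60
        if age_minutes >= WAIT_MINUTES:
            # Deleting the pod lets its controller recreate and reschedule it.
            v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)

if __name__ == "__main__":
    restart_stuck_pending_pods()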
OPSAPS-65753: Upgrade CP before upgrading K8S
When upgrading an ECS cluster, the control plane will be upgraded before Kubernetes is upgraded.
PULSE-498: Alerts for Ozone health tests are not reported on the Control Plane dashboard
The Cloudera Monitoring Grafana Control Plane dashboard now displays alerts for the Ozone service.
PULSE-77: schemaVersion should be updated to 3
The topology schema version has been upgraded to version 3, which stops the "invalid schema version: 2" error message from appearing in the log files.
PULSE-53: nil pointer reference on calling createSilence API via CDP CLI
You can now silence your alerts from the CDP CLI, which avoids repeated alert pings when troubleshooting issues.

Fixed Issues in Management Console 1.5.0

COMPX-13184 Yunikorn-admission-controller not getting scheduled after restart due to lack of tolerations
Fixed the issue where ecs-webhooks from the ECS platform failed to update the YuniKorn namespace. Because of this missing toleration update from ECS, YuniKorn now inserts the toleration from the Liftie deployment as an interim update.

Fixed Issues in Management Console 1.4.0

OPSX-2697 Not all images are pushed in upgrade
Fixed the issue where a retry of an upgrade failed at the Copy Images to Docker Registry step because images were not found locally.

Fixed Issues in Management Console 1.3.x

CVE-2021-44228 (Apache Log4j 2 vulnerability) has been addressed in CDW on CDP Private Cloud Management Console version 1.4.0
Log4j 2 has been upgraded to version 2.17.
Fix copy-docker-template
Fixed the issue where a retry of only the Push Images to Docker Registry step failed because the image was not available locally.
NFS provisioner fails on cluster with more than ~10 nodes
Fixed the issue where the Longhorn nfs_provisioner failed to start on clusters with more than 10 nodes.
Longhorn for Kubernetes is upgraded to version 1.2.x
Longhorn has been upgraded from version 1.1.2 to 1.2.x.
ECS High Availability fails during installation
Fixed an issue where selecting multiple ECS Server hosts during installation would randomly result in an installation failure.