Known issues for the Cloudera Data Services on premises 1.5.5 SP1
You must be aware of the known issues and limitations, the areas of impact, and
the workarounds in the Cloudera Data Services on premises 1.5.5 SP1 release.
Existing known issues from Cloudera Data Services on premises 1.5.5 are carried
forward into Cloudera Data Services on premises 1.5.5 SP1. For more details, see Known Issues.
Known issues identified in 1.5.5 SP1
The following are the known issues identified in 1.5.5 SP1:
OPSAPS-70612: Invalid URL error while installing Cloudera Data Services on premises 1.5.5 SP1 from Cloudera Manager 7.13.1.501
When attempting to install Cloudera Data Services on premises using a Cloudera Manager 7.13.1.501 hotfix, the installation may fail.
In this specific scenario, the installer does not process the repository URL correctly,
resulting in an "Invalid URL" error that blocks the installation.
Perform the following steps to mitigate this issue:
Log in to Cloudera Manager and click Hosts > Add Hosts.
On the Add Hosts page, select the Add hosts to
Cloudera Manager option and then click
Continue.
Follow the instructions on the Setup Auto-TLS page and then
click Continue.
On the Specify Hosts page, enter a host name or pattern to
search for new hosts to add to the cluster, and then click
Continue. You can click the Patterns
link for more information.
A list of matching hosts is displayed.
Select the hosts that you want to add and click
Continue.
Select the repository location from which Cloudera Manager can access the software
required for installation on the new hosts. Choose either the Public Cloudera
Repository or a Custom Repository; if you choose a custom repository, provide the
URL of the repository available on your local network.
Click Continue.
After completing the above tasks, you can proceed with the Cloudera Data Services on premises installation.
COMPX-20437 - DB connection failures causing RPM and CAM pods to
CrashLoopBackOff
During an upgrade from version 1.5.5 to any 1.5.5 hotfix
release, the cluster-access-manager (CAM) and resource-pool-manager (RPM)
pods can enter a CrashLoopBackOff state if they are not
automatically restarted during the upgrade.
After the upgrade, manually restart the CAM pod first, and
then restart the RPM pod (the restart order is important).
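A minimal restart sketch, assuming the CAM and RPM components run as Kubernetes Deployments named cluster-access-manager and resource-pool-manager in the Cloudera Control Plane namespace (the Deployment names and the namespace placeholder are assumptions; confirm them in your environment before running the commands):
# Deployment names and <control-plane-namespace> are assumptions; verify them first.
kubectl rollout restart deployment/cluster-access-manager -n <control-plane-namespace>
kubectl rollout status deployment/cluster-access-manager -n <control-plane-namespace>
# Restart RPM only after the CAM pod is healthy again.
kubectl rollout restart deployment/resource-pool-manager -n <control-plane-namespace>
kubectl rollout status deployment/resource-pool-manager -n <control-plane-namespace>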
OBS-9491 - Prometheus configuration exceeds size limit in large
environments
In environments with a large number of namespaces (approximately
300 or more per environment), the Prometheus configuration for Cloudera Monitoring might exceed the 1 MB
Kubernetes Secret size limit. If the total size, which depends on factors such as the
number of namespaces, the length of namespace names, their variability, and the size of
the certificate store, exceeds 1 MB, the new Prometheus configuration will not be
applied, and new namespaces will not be monitored. As a result, the telemetry data will
not be collected from those namespaces and will not be reflected on the corresponding
Grafana charts.
To resolve this issue, you must enable Prometheus configuration compression
at the control plane level.
Upgrade to Cloudera Data Services on premises 1.5.5 SP1
or a higher version.
Set the environment variable
ENABLE_ENVIRONMENT_PROMETHEUS_CONFIG_COMPRESSION to
"true" on the
cdp-release-monitoring-pvcservice deployment in the Cloudera Control Plane namespace.
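One way to apply this setting, sketched with a placeholder for the Control Plane namespace (the namespace name is environment-specific and an assumption here):
# <control-plane-namespace> is a placeholder for the namespace hosting the deployment.
kubectl set env deployment/cdp-release-monitoring-pvcservice -n <control-plane-namespace> ENABLE_ENVIRONMENT_PROMETHEUS_CONFIG_COMPRESSION="true"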
OPSX-6618 - During Cloudera Embedded Container Service upgrade, not
all volumes are upgraded to the latest Longhorn version
During the restart of the Cloudera Embedded Container Service
cluster while upgrading from 1.5.5 to 1.5.5 SP1, the upgrade can fail due to Longhorn
health issues. This happens when one or more of the volumes are degraded.
Follow these steps to resolve the issue:
Identify the problematic volumes (degraded state).
Set the value of spec.numberOfReplicas for the volume to the
number of active replicas. For example, set the value to 2 if
two replicas are active (see the example commands after these steps).
Apply the fix before or during the upgrade as per the workaround instructions
(refer to Longhorn issue #11825). Longhorn Engineering is addressing this in v1.11.0 and
will backport the fix. The workaround is included as part of the upgrade. However,
if the issue is noticed even after the upgrade, follow the workaround steps
documented here: https://github.com/longhorn/longhorn/issues/11825
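A sketch of identifying and patching degraded volumes, assuming Longhorn runs in the longhorn-system namespace; the volume name is a placeholder and the replica count shown is only an example:
# List Longhorn volumes and check their robustness; degraded volumes are the ones to fix.
kubectl get volumes.longhorn.io -n longhorn-system
# <volume-name> is a placeholder; set numberOfReplicas to the number of active replicas.
kubectl patch volumes.longhorn.io <volume-name> -n longhorn-system --type merge -p '{"spec":{"numberOfReplicas":2}}'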
OPSX-6566 - Cloudera Embedded Container Service restart fails
with etcd connectivity issues
Restart of the Cloudera Embedded Container Service server fails
with the etcd error: "error reading from server".
Identify the server role that failed.
Restart only the Cloudera Embedded Container Service server role which
failed.
Once the server role is restarted and healthy, proceed with the remainder of the
Cloudera Embedded Container Service server role restart sequence.
OPSX-6401 - Istio ingress-default-cert is not created in the
upgrade scenario
After upgrading to 1.5.5 SP1, the Secret
ingress-default-cert is not created in the
istio-ingress namespace. Because this certificate is expected,
its absence causes components such as CAII and MR provisioning to fail.
To resolve this issue:
After upgrading from an older version to 1.5.5 SP1, check for the secret called
ingress-default-cert in the istio-ingress
namespace (ns).
For example:
kubectl get secret ingress-default-cert -n istio-ingress
If the secret is missing, copy the identical secret from the kube-system namespace
(or from the pre-upgrade environment) into the istio-ingress namespace
with the same name and contents, as shown in the sketch after these steps.
This restores the expected certificate and allows CAII and MR provisioning to
proceed.
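A minimal sketch of copying the secret, assuming it exists under the same name in the kube-system namespace; the sed expressions rewrite the namespace and drop metadata fields that would block re-creation:
# Assumes the source secret lives in kube-system; adjust if copying from a pre-upgrade environment.
kubectl get secret ingress-default-cert -n kube-system -o yaml \
  | sed -e 's/namespace: kube-system/namespace: istio-ingress/' -e '/resourceVersion:/d' -e '/uid:/d' -e '/creationTimestamp:/d' \
  | kubectl apply -f -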
OPSX-6645 - Cloudera Embedded Container Service upgrade failure
at restart step
When the Cloudera Embedded Container Service role fails to start
after a node reboot or service restart, the root cause can be that the etcd
defragmentation process, which runs on startup, takes longer than the component timeout
thresholds. As a result:
The kubelet service may fail to start or time out.
The kube-apiserver,
kube-scheduler or
kube-controller-manager roles may remain in a
NotReady state.
etcd may perform automatic defragmentation at startup.
The API server may fail to connect to etcd (connection refused or timeout).
To fix this issue:
Resume the Cloudera Embedded Container Service start/restart from the Cloudera Manager UI once etcd has come up.
Ensure that etcd meets production hardware and configuration requirements. For more
information, see etcd requirements and the Knowledge Base.
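A diagnostic sketch for confirming that etcd has come back up before resuming the restart; the Kubernetes API server readiness report includes an etcd check:
# Requires a working kubeconfig for the cluster; look for the etcd entries reporting ok.
kubectl get --raw='/readyz?verbose' | grep etcd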
OPSX-6767 - Cloudera Embedded Container Service cluster has
stale configuration after Cloudera Manager upgrade from
7.11.3.24 to 7.13.1.501-b2
After upgrading Cloudera Manager to
version 7.13.1.501, the Cloudera Embedded Container Service shows a
staleness indicator. This occurs due to configuration
changes applied by the upgrade:
worker-shutdown-timeout: reduced from 24 hours (86,400 s) to 15
minutes (900 s).
smon_host: a new monitoring host configuration is added.
smon_port: a new monitoring port configuration (9997) is added.
No action is required. The staleness is expected following this upgrade and can be
safely ignored. The indicator clears automatically once Cloudera Embedded Container Service is upgraded to version 1.5.5 SP1 or later.
If you
prefer to clear the staleness indicator right away, you may manually refresh the Cloudera Embedded Container Service service through the Cloudera Manager UI.
OPSX-6638 - After a rolling restart, many pods are stuck in Pending
state
Pods remain in the Pending state and fail to
schedule, accompanied by etcd performance warnings.
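A diagnostic sketch for locating the affected pods; the pod name and namespace in the second command are placeholders:
# List all pods stuck in Pending across namespaces.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# Inspect the scheduling events of one of the pending pods.
kubectl describe pod <pod-name> -n <namespace>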