Troubleshooting DRS
The scenarios in this topic help you troubleshoot issues that might appear for DRS in the Cloudera Control Plane. The Backup and Restore Manager in the Cloudera Data Services on premises Management Console leverages the Data Recovery Service (DRS) capabilities to back up and restore Kubernetes namespaces and resources.
Cloudera Control Plane UI or the Backup and Restore Manager becomes inaccessible after a failed restore event
Problem
Cloudera Control Plane UI does not come up, or the Backup and Restore Manager (or the drscp commands) becomes inaccessible, after a failed restore event.
Cause
Some configurations take longer to restore than others. For example, in a heavily loaded shared cluster (OCP), the restore event might surpass the set timeout limit. In this scenario, you can either wait or rerun the restore event.
Solution
- Wait for a minimum of 15 minutes. This might resolve the issue automatically if it was caused by a timeout. You can verify this in the logs, as shown in the sketch after this list.
- Run the restore again. This might resolve the issue if the failure was temporary, such as a restore event that ran during cluster maintenance.
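A minimal sketch for checking the DRS provider logs for timeout messages; the "cdp" namespace prefix is an assumption, substitute your Cloudera installation namespace:

    # Check the DRS provider logs for timeout or restore errors
    # (assumes the installation namespace is "cdp"; replace as needed)
    kubectl logs -n cdp-drs deployment/cdp-release-thunderhead-drsprovider --since=30m | grep -i -E 'timeout|error'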
If the Cloudera Control Plane is not restored successfully even after you follow the steps, contact Cloudera Support for further assistance.
Timeout error appears in Backup and Restore Manager
Problem
A timeout error appears in the Backup and Restore Manager or in the CDP CLI (drscp) during a restore event.
Solution
When a restore event exceeds the time set in the POD_CREATION_TIMEOUT environment property of the cdp-release-thunderhead-drsprovider deployment in the [***CLOUDERA INSTALLATION NAMESPACE***]-drs namespace, a timeout error appears. By default, the property is set to 900 seconds. In this scenario, you must manually verify whether the pods are up.
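A minimal sketch for inspecting the configured timeout and the pod status; the "cdp" namespace prefix is an assumption, substitute your Cloudera installation namespace:

    # Read the current POD_CREATION_TIMEOUT value (assumes the installation namespace is "cdp")
    kubectl get deployment cdp-release-thunderhead-drsprovider -n cdp-drs \
      -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="POD_CREATION_TIMEOUT")].value}'

    # Manually verify whether the Control Plane pods came up after the restore
    kubectl get pods -n cdp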
Timeout error during backup of OCP clusters
Problem
"The execution of the sync command has timed out" error appears during a backup event for OCP clusters.
Cause
This scenario is observed when the cluster is heavily used and the backup event is initiated during peak hours.
Solution
You can restart the nodes. A restart unmounts the disks and forces the operating system to write any data in its cache to disk. After the restart is complete, initiate another backup. If any warnings appear, scrutinize them to verify whether they are serious; otherwise, the generated backup is safe to use. The only drawback in this scenario is the downtime impact, that is, backing up the OCP clusters takes longer than usual. Therefore, it is recommended that you back up the clusters during off-peak hours.
If the sync errors continue to appear, contact your IT department to check whether there is an issue with the storage infrastructure which might be preventing the sync command from completing on time.
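As a lighter-weight alternative to a full restart, a minimal sketch for flushing cached data to disk on each node before retrying the backup; it assumes SSH access to the OCP nodes:

    # Flush filesystem buffers to disk on each node, then retry the backup
    sudo sync

    # If sync alone does not help, reboot the node to force a clean unmount
    sudo systemctl reboot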
Stale configurations in Cloudera Manager after a restore event
Problem
Stale configurations appear in Cloudera Manager after a restore event.
Cause
This scenario appears when you take a backup of the Cloudera Data Services on premises Cloudera Control Plane, upgrade the Data Services, and then perform a restore. During the upgrade process, new parcels are activated and configurations in Cloudera Manager might have changed.
Solution
It is recommended that you restart Cloudera Manager after the upgrade process is complete and then initiate the restore event.
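A minimal sketch, assuming Cloudera Manager runs as the standard cloudera-scm-server service on its host:

    # Restart Cloudera Manager on the Cloudera Manager host after the upgrade completes
    sudo systemctl restart cloudera-scm-server

    # Confirm the service is back up before initiating the restore event
    sudo systemctl status cloudera-scm-server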
Existing namespaces are not deleted automatically after the restore event
Problem
The existing Liftie and monitoring namespaces of an environment are not deleted automatically in the following scenario:
- You take a backup of an environment (env1).
- After the backup event is complete, you create another environment (env2).
- You restore the previously taken backup, after which the Control Plane has only env1.
The Liftie and monitoring namespaces of the env2 environment remain in the cluster after the Control Plane backup is restored.
Solution
Ensure that you manually delete the Liftie and monitoring namespaces of the env2 environment after the env1 backup is restored.
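A minimal sketch for finding and removing the leftover namespaces; the namespace names shown are hypothetical placeholders, verify them before deleting:

    # List candidate namespaces left over from env2
    kubectl get namespaces | grep -i -E 'liftie|monitoring'

    # Delete the leftover namespaces (hypothetical names; replace with the ones you found)
    kubectl delete namespace <env2-liftie-namespace> <env2-monitoring-namespace>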
Backup event fails during volume snapshot creation process
Problem
The backup event fails during the volume snapshot creation process with an error similar to "Failed to check and update snapshot content: failed to take snapshot of the volume pvc-9d66b458-e10d-4d9c-a".
Cause
This issue might appear when multiple parallel jobs are running to take the volume snapshots of the same volume or might be because of latency issues.
Solution
Retry the backup event after a few minutes.
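Before retrying, a minimal sketch for checking whether a snapshot of the same volume is still in flight; it assumes the CSI external-snapshotter CRDs are installed in the cluster:

    # List volume snapshots across all namespaces and check their READYTOUSE state
    kubectl get volumesnapshots -A

    # Inspect the snapshot content events for the failure details
    kubectl describe volumesnapshotcontents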
Restore event for an environment backup fails with an exception
Problem
When you delete an environment after the backup event, the restore operation for the environment fails and the "Not able to fetch details from Cluster:..." exception appears.
Cause
During the environment creation process, the environment service creates an internal Cloudera Manager user with the Full Administrator role. The username is stored in the Cloudera Control Plane database, and the password is stored in the vault. When you delete an environment, the internal Cloudera Manager user is deleted. The exception appears only if the password is no longer valid or is missing. For example, the password might go missing if the vault was rebuilt to fix a vault corruption without restoring the Cloudera Manager credentials.
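A minimal sketch for checking whether a Cloudera Manager user's credentials are still valid through the Cloudera Manager REST API; the host, port, and API version shown are assumptions for your deployment:

    # Verify the internal Cloudera Manager user can still authenticate
    # (host, port, and API version are assumptions; adjust for your deployment)
    curl -k -u '<internal-cm-user>:<password>' \
      'https://<cloudera-manager-host>:7183/api/v51/clusters'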
