Troubleshooting DRS
The troubleshooting scenarios in this topic help you resolve issues that might appear for DRS in the Cloudera Control Plane. The “Backup and Restore Manager” in the Cloudera Data Services on premises Management Console leverages the Data Recovery Service (DRS) capabilities to back up and restore Kubernetes namespaces and resources.
Cloudera Control Plane UI or the Backup and Restore Manager becomes inaccessible after a failed restore event
Cloudera Control Plane UI does not come up or the Backup and Restore Manager (or drscp options) becomes inaccessible after a failed restore event.
Some configurations take longer to restore than others. For example, in a heavily loaded shared cluster (OCP), the restore event might exceed the configured timeout limit. In this scenario, you can either wait or rerun the restore event.
- Wait for a minimum of 15 minutes. This might resolve the issue automatically if it was caused by a timeout. You can verify this in the logs.
- Run the restore again. This might resolve the issue if it was temporary, for example, if the restore event ran during cluster maintenance.
If the Cloudera Control Plane is not restored successfully even after you follow the steps, contact Cloudera Support for further assistance.
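The wait-and-check approach described above can be sketched as a small polling loop. This is only an illustration, not part of DRS: the `is_ready` callable is a hypothetical check that you supply (for example, a probe of the Control Plane UI), and the 30-second polling interval is an assumption. The 15-minute default mirrors the minimum wait mentioned above and can be raised.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    max_wait_s: float = 15 * 60,   # minimum wait suggested above
    poll_interval_s: float = 30,   # assumed interval, adjust as needed
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it returns True or max_wait_s has elapsed.

    Returns True if the check passed within the window, False otherwise.
    The clock and sleep parameters exist only so the loop can be tested
    without real waiting.
    """
    deadline = clock() + max_wait_s
    while True:
        if is_ready():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, remaining))
```

If the function returns False, the issue was probably not a simple timeout, and rerunning the restore is the next step.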
Timeout error appears in Backup and Restore Manager
A timeout error appears in the Backup and Restore Manager (or drscp options) during a restore event.
When the restore event exceeds the time set in the POD_CREATION_TIMEOUT environment property of the cdp-release-thunderhead-drsprovider deployment in the drs namespace, a timeout error appears. By default, the property is set to 900 seconds. In this scenario, you must manually verify whether the pods are up.
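One possible way to perform this manual check is sketched below. It assumes that `kubectl` is on the PATH and has access to the cluster; the helper names are hypothetical, and the namespace whose pods you inspect is a parameter because it depends on your installation. The sketch reads POD_CREATION_TIMEOUT from the deployment's JSON and lists pods that are neither Ready nor completed.

```python
import json
import subprocess
from typing import Dict, List, Optional

DRS_NAMESPACE = "drs"
DRS_DEPLOYMENT = "cdp-release-thunderhead-drsprovider"


def kubectl_json(args: List[str]) -> Dict:
    """Run a kubectl command with JSON output and return the parsed result."""
    result = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def pod_creation_timeout(deployment: Dict) -> Optional[str]:
    """Return the POD_CREATION_TIMEOUT value from a Deployment object, if set."""
    for container in deployment["spec"]["template"]["spec"]["containers"]:
        for env in container.get("env", []):
            if env.get("name") == "POD_CREATION_TIMEOUT":
                return env.get("value")
    return None


def pods_not_up(pod_list: Dict) -> List[str]:
    """Return names of pods that are not Ready (completed pods are skipped)."""
    not_up = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # finished Job pods are never Ready; that is expected
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            not_up.append(pod["metadata"]["name"])
    return not_up


def report(namespace: str) -> None:
    """Print the configured timeout and any pods in `namespace` that are not up."""
    dep = kubectl_json(["get", "deployment", DRS_DEPLOYMENT, "-n", DRS_NAMESPACE])
    print("POD_CREATION_TIMEOUT:", pod_creation_timeout(dep))
    pods = kubectl_json(["get", "pods", "-n", namespace])
    print("Pods not up:", pods_not_up(pods) or "none")
```

Calling `report("<namespace>")` with the namespace that was restored prints the timeout value and the pods that still need attention. An empty list suggests that the restore completed despite the timeout error.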
Stale configurations in Cloudera Manager after a restore event
This scenario appears when you take a backup of the Cloudera Data Services on premises Cloudera Control Plane, upgrade Data Services, and then perform a restore. During the upgrade process, new parcels are activated and configurations in Cloudera Manager might have changed.
It is recommended that you restart Cloudera Manager after the upgrade process is complete and then initiate the restore event.
Timeout error during backup of OCP clusters
The “The execution of the sync command has timed out” error appears during a backup event for OCP clusters.
This scenario is observed when the cluster is heavily used and the backup event is initiated during peak hours.
You can restart the nodes. The restart causes the disk to unmount and forces the operating system to write any data in its cache to the disk. After the restart is complete, initiate another backup. If any warnings appear, review them to check whether any are serious; otherwise, the generated backup is safe to use. The only drawback in this scenario is the downtime impact, that is, the time taken to back up the OCP clusters is longer than usual. Therefore, it is recommended that you back up the clusters during non-peak hours.
If the sync errors continue to appear, contact your IT department to check whether an issue with the storage infrastructure is preventing the sync command from completing on time.
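When you escalate to your IT department, it can help to show how long a sync actually takes on an affected node. The following sketch times an arbitrary command with a deadline. The helper name and the timeout you pass are assumptions for illustration; the limit that DRS itself applies to the sync command is not stated here.

```python
import subprocess
import time
from typing import List, Tuple


def time_command(cmd: List[str], timeout_s: float) -> Tuple[bool, float]:
    """Run cmd and report (finished_in_time, elapsed_seconds).

    If the deadline passes, the child process is killed. A process that is
    blocked on slow storage may take a while to exit even after being
    killed, so the elapsed time can exceed timeout_s.
    """
    start = time.monotonic()
    try:
        subprocess.run(cmd, check=True, timeout=timeout_s)
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.monotonic() - start
```

For example, running `time_command(["sync"], timeout_s=60)` on a node during peak and non-peak hours gives two numbers to compare; the 60-second value is only a placeholder.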