Troubleshooting DRS

The troubleshooting scenarios in this topic help you resolve issues that might appear for DRS in the Cloudera Control Plane. The Backup and Restore Manager in the Cloudera Data Services on premises Management Console leverages the Data Recovery Service (DRS) capabilities to back up and restore Kubernetes namespaces and resources.

Cloudera Control Plane UI or the Backup and Restore Manager becomes inaccessible after a failed restore event

Problem

The Cloudera Control Plane UI does not come up, or the Backup and Restore Manager (or the drscp options) becomes inaccessible after a failed restore event.

Cause

Some configurations take more time to restore. For example, in a heavily loaded shared cluster (OCP), the restore event might exceed the set timeout limit. In this scenario, you can either wait or rerun the restore event.

Solution

You can perform one of the following steps after a failed restore event:
  • Wait for a minimum of 15 minutes. This might resolve the issue automatically if it was caused by a timeout. You can verify this in the logs (a sketch for checking the logs follows at the end of this solution).
  • Run the restore event again. This might resolve the issue if it was temporary, such as a restore event that ran during cluster maintenance.

If the Cloudera Control Plane is not restored successfully even after you follow the steps, contact Cloudera Support for further assistance.
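
For example, to confirm whether the failure was caused by a timeout, you can inspect the DRS pods and their logs. The following is a minimal sketch; the namespace placeholders and the deployment name follow the conventions used later in this topic, and the grep pattern is an assumption:

  # List the DRS and Control Plane pods and confirm that they are Running.
  kubectl get pods -n [***CLOUDERA INSTALLATION NAMESPACE***]-drs
  kubectl get pods -n [***CLOUDERA CONTROL PLANE NAMESPACE***]

  # Search the DRS provider logs for timeout messages.
  kubectl logs deployment/cdp-release-thunderhead-drsprovider -n [***CLOUDERA INSTALLATION NAMESPACE***]-drs | grep -i timeout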

Timeout error appears in Backup and Restore Manager

Problem

A timeout error appears in the Backup and Restore Manager or in the CDP CLI (drscp) during a restore event.

Solution

When the restore event exceeds the time set in the POD_CREATION_TIMEOUT environment property of the cdp-release-thunderhead-drsprovider deployment in the [***CLOUDERA INSTALLATION NAMESPACE***]-drs namespace, a timeout error appears. By default, the property is set to 900 seconds. In this scenario, you must manually verify whether the pods are up.
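
To check the configured timeout and whether the pods have come up, you can use a minimal sketch such as the following; the namespace placeholders follow the conventions used in this topic, and you should verify the deployment name in your cluster:

  # Show the current POD_CREATION_TIMEOUT value (900 seconds by default).
  kubectl describe deployment cdp-release-thunderhead-drsprovider -n [***CLOUDERA INSTALLATION NAMESPACE***]-drs | grep POD_CREATION_TIMEOUT

  # Verify whether the restored pods are up.
  kubectl get pods -n [***CLOUDERA CONTROL PLANE NAMESPACE***]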

Timeout error during backup of OCP clusters

Problem

"The execution of the sync command has timed out" error appears during a backup event for OCP clusters.

Cause

This scenario is observed when the cluster is heavily used and the backup event is initiated during peak hours.

Solution

You can restart the nodes; this causes the disks to unmount and forces the operating system to write any cached data to disk. After the restart is complete, initiate another backup. If any warnings appear, scrutinize them to verify whether any of them are serious; otherwise, the generated backup is safe to use. The only drawback in this scenario is the downtime impact, that is, the time taken to back up the OCP clusters is longer than usual. Therefore, it is recommended that you back up the clusters during non-peak hours.

If the sync errors continue to appear, contact your IT department to check whether there is an issue with the storage infrastructure which might be preventing the sync command from completing on time.
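
Before escalating, you can run a quick manual check on a cluster node to see whether flushing cached writes is unusually slow. This is an assumption-level diagnostic sketch, not part of the documented backup flow:

  # Run as root on a cluster node; a sync that takes minutes rather than seconds
  # suggests a slow or overloaded storage backend.
  time sync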

Stale configurations in Cloudera Manager after a restore event

Cause

This scenario appears when you take a backup of the Cloudera Control Plane of Cloudera Data Services on premises, upgrade the Data Services, and then perform a restore. During the upgrade process, new parcels are activated and configurations in Cloudera Manager might change.

Solution

It is recommended that you restart Cloudera Manager after the upgrade process is complete and then initiate the restore event.
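
For example, assuming the Cloudera Manager Server runs as a systemd service on its host, a restart sketch:

  # Restart the Cloudera Manager Server (run as root on the Cloudera Manager host).
  systemctl restart cloudera-scm-server

  # Confirm that the service is running again before initiating the restore event.
  systemctl status cloudera-scm-server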

Existing namespaces are not deleted automatically after the restore event

Problem

The existing Liftie and monitoring namespaces in the environment (env2) are not deleted automatically when you perform the following steps:

  1. You take a backup of an environment (env1).
  2. After the backup event is complete, you create another environment (env2).
  3. You restore the previously taken backup, after which the Control Plane contains only env1.

    The existing Liftie and monitoring namespaces in the env2 environment are not deleted after the Control Plane backup is restored.

Solution

Ensure that you manually delete the Liftie and monitoring namespaces of the env2 environment after the env1 backup is restored.
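
The following is a minimal sketch for finding and removing the stale namespaces. The grep pattern is an assumption about how the Liftie and monitoring namespaces are named in your cluster, so confirm that each namespace belongs to env2 before deleting it:

  # List candidate namespaces left over from env2.
  kubectl get namespaces | grep -i -e liftie -e monitoring

  # Delete a stale namespace after confirming it belongs to env2.
  kubectl delete namespace [***STALE NAMESPACE NAME***]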

Backup event fails during volume snapshot creation process

Problem

The backup event fails during the volume snapshot creation process due to an error similar to Failed to check and update snapshot content: failed to take snapshot of the volume pvc-9d66b458-e10d-4d9c-a.

Cause

This issue might appear when multiple parallel jobs are taking volume snapshots of the same volume, or because of latency issues.

Solution

Retry the backup event after a few minutes.
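
Before retrying, you can check whether earlier snapshot objects are still being processed. This sketch assumes that the Kubernetes VolumeSnapshot CRDs are installed, which is typically the case when CSI volume snapshots are in use:

  # List VolumeSnapshots across namespaces; READYTOUSE=false rows indicate snapshots still in progress or stuck.
  kubectl get volumesnapshot -A

  # Inspect the events of a snapshot that is not ready (namespace and name are placeholders).
  kubectl describe volumesnapshot [***SNAPSHOT NAME***] -n [***NAMESPACE***]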

Restore event for an environment backup fails with an exception

Problem

When you delete an environment after the backup event, the restore operation for the environment fails and the Not able to fetch details from Cluster:... exception appears.

Cause

During the environment creation process, the environment service creates an internal Cloudera Manager user with the Full Administrator role. The username is stored in the Cloudera Control Plane database, and the password is stored in the vault. When you delete an environment, the internal Cloudera Manager user is deleted. The exception appears only if the password is no longer valid or is missing. One reason the password might go missing is that, while fixing a vault corruption, the vault might have been rebuilt without restoring the Cloudera Manager credentials.

Solution

  1. Get the internal Cloudera Manager username using the following commands to determine whether the credential is valid.
    1. Log in to the embedded database pod using the kubectl exec -it cdp-embedded-db-0 -n [***CLOUDERA CONTROL PLANE NAMESPACE***] psql command.
    2. Connect to the environment database using the \c db-env; command.
    3. Run the following SQL query in the cdp-embedded-db-0 pod:

      SELECT e.environment_crn, c.value FROM environments e JOIN configs c ON e.environment_crn = c.environment_crn WHERE e.environment_name = '[***YOUR ENV NAME***]' AND c.attr = 'cmUser';

      Sample output:
        environment_crn | value
      ------------------------------------------------------
        crn:altus:environments:us-west-1:60eed1-46de-992-90b5-0ff943dae1c8:environment:test-saml2-env-1/48e9fcf-9620-4c8f-bc7d-caa76b1834f5 | __cloudera_internal_user__test-saml2-env-1-798414fe-faa6-43e1-ac9c-75c4d33ec294

      In this sample output, __cloudera_internal_user__test-saml2-env-1-798414fe-faa6-43e1-ac9c-75c4d33ec294 is the internal Cloudera Manager username.
  2. Get the internal Cloudera Manager password using the following commands:
    1. Run the following commands to get the root token for the embedded vault:
      1. If you are using OCP:

        $ kubectl get secret vault-unseal-key -n [***VAULT-NAMESPACE***] -o jsonpath="{.data.init\.json}" | base64 -d

        Sample output:
        {"keys":["[***VALUE***]"],"keys_base64":["[***value***]="],"recovery_keys":null,"recovery_keys_base64":null,"root_token":"[***VALUE***]"}

        The command returns the vault root token in the root_token field.

      2. If you are using ECS:
          • [root@cm_server_db_host ~]# psql -U cm cm
          • select * from CONFIGS where attr like '%vault_root%';
          Sample output:
          config_id  | role_id | attr       | value                        | service_id | host_id | config_container_id | optimistic_lock_version | role_config_group_id | context | external_account_id | key_id
          -----------+---------+------------+------------------------------+------------+---------+---------------------+-------------------------+----------------------+---------+---------------------+--------
          1546337327 |         | vault_root | hvs.SvIrIhhffYEmVPEWN3TSEzks | 1546337154 |         |                     | 0                       |                      | NONE    |                     |
          

          The hvs.SvIrIhhffYEmVPEWN3TSEzks value in the above sample output is the vault token.

    2. kubectl exec -it vault-0 -n [***VAULT_NAMESPACE***] /bin/sh
    3. export VAULT_TOKEN=[***VAULT ROOT TOKEN***]
    4. ~ $ vault secrets list -detailed -tls-skip-verify
      Sample output:
      Path          Plugin       Accessor              Default TTL    Max TTL    Force No Cache    Replication    Seal Wrap    External Entropy Access    Options           Description                                                UUID                                    Version    Running Version          Running SHA256    Deprecation Status
      ----          ------       --------              -----------    -------    --------------    -----------    ---------    -----------------------    -------           -----------                                                ----                                    -------    ---------------          --------------    ------------------
      cubbyhole/    cubbyhole    cubbyhole_35ff7854    n/a            n/a        false             local          false        false                      map[]             per-token private secret storage                           f2fa15ec-49-cea2-88f6-e6807c30fba3    n/a        v1.13.1+builtin.vault    n/a               n/a
      identity/     identity     identity_b7aa2294     system         system     false             replicated     false        false                      map[]             identity store                                             17990faa-e0-727a-92a3-aaaa1ff43393    n/a        v1.13.1+builtin.vault    n/a               n/a
      kv/           kv           kv_2ba3b77c           system         system     false             replicated     false        false                      map[version:2]    key/value secret storage                                   98b14495-b6-6958-04bc-1ca7c55d4590    n/a        v0.14.2+builtin          n/a               supported
      secret/       kv           kv_218f4379           system         system     false             replicated     false        false                      map[version:2]    key/value secret storage                                   06371963-e6-56c1-7ab3-d6c438720dbf    n/a        v0.14.2+builtin          n/a               supported
      sys/          system       system_46e657a4       n/a            n/a        false             replicated     true         false                      map[]             system endpoints used for control, policy and debugging    8ca5d96f-a45e-155a-cfc1-25a56b6a0de5    n/a        v1.13.1+builtin.vault    n/a               n/a
      

      In this command output, kv/ is the secret path.

    5. ~ $ vault kv list -tls-skip-verify kv
      Sample output:
      Keys
      ----
      [***CLOUDERA CONTROL PLANE NAMESPACE***]
      
    6. ~ $ vault kv list -tls-skip-verify kv/[***CLOUDERA CONTROL PLANE NAMESPACE***]
      Sample output:
      Keys
      ----
      data/
      liftie/
      test
      
    7. ~ $ vault kv list -tls-skip-verify kv/[***CLOUDERA CONTROL PLANE NAMESPACE***]/data
      Sample output:
      Keys
      ----
      [***ENV NAME1***] 
      [***ENV NAME2***]

      Identify the environment for which the exception appeared.

    8. ~ $ vault kv list -tls-skip-verify kv/[***CLOUDERA CONTROL PLANE NAMESPACE***]/[***ENTER THE ENV NAME WITH THE EXCEPTION***]
      Sample output:
      Keys
      ----
      [***RANDOM UUID***]
      
    9. ~ $ vault kv list -tls-skip-verify kv/[***CLOUDERA CONTROL PLANE NAMESPACE***]/[***ENTER THE ENV NAME WITH THE EXCEPTION***]/[***RANDOM UUID***]
      Sample output:
      Keys
      ----
      cmPassword
      dockerConfigJson
      kubeconfig
      
    10. ~ $ vault kv get -tls-skip-verify kv/[***CLOUDERA CONTROL PLANE NAMESPACE***]/[***ENTER THE ENV NAME WITH THE EXCEPTION***]/[***RANDOM UUID***]/cmPassword
      Sample output:
      ================ Secret Path ======================
      kv/[***CLOUDERA CONTROL PLANE NAMESPACE***]/[***ENV NAME***]/[***RANDOM UUID***]/cmPassword
      
      
      ======= Metadata =======
      Key                Value
      ---                -----
      created_time       2023-11-15T04:32:36.477837897Z
      custom_metadata    <nil>
      deletion_time      n/a
      destroyed          false
      version            1
      
      ==== Data ====
      Key      Value
      ---      -----
      value    ae4cff8a-fcee-48e9-b381-4a16e883694a88c8d2
      

      The value is the cmPassword (Cloudera Manager password).

  3. Log in to Cloudera Manager using the internal Cloudera Manager username and the cmPassword that you obtained in the previous steps.
  4. Run the following bash commands to regenerate the internal Cloudera Manager credentials:
    1. The [root@user ~]# uuidgen command creates the first universally unique identifier (UUID), which you use in the new Cloudera Manager username.
      Sample output:
      dc7c7dd7-5a58-497a-a1d1-46cd
    2. Run [root@user ~]# uuidgen again to create another UUID, which is the new Cloudera Manager password.
      Sample output:
      9a863dc4-be61-430f-ac87-a4eba0
  5. Assemble the new Cloudera Manager username using the information from the previous commands in the "__cloudera_internal_user__" + [***ENTER THE ENV NAME WITH THE EXCEPTION***] + "-" + [***FIRST_UUID***] format.
    For example, __cloudera_internal_user__cldrienv1-dc7c7dd7-5a58-497a-a1d1-46cd. In this assembled Cloudera Manager username, the prefix __cloudera_internal_user__ is followed by a string that contains the name of the environment with the exception (cldrienv1) and the generated UUID (dc7c7dd7-5a58-497a-a1d1-46cd), separated by "-".

    The new Cloudera Manager password is the second UUID, for example, 9a863dc4-be61-430f-ac87-a4eba0.

  6. Go to the Cloudera Manager > Support > API Explorer > UsersResource > POST /users REST API, and perform the following steps:
    1. Click Try it out, and substitute the Cloudera Manager username and password in the following JSON string:
      {
        "items": [
          {
            "name": "[***NEW_CM_INTERNAL_USER***]",
            "password": "[***NEW_CM_INTERNAL_USER_PASSWORD***]",
            "authRoles": [
              {
                "displayName": "Full Administrator",
                "name": "ROLE_ADMIN"
              }
            ]
          }
        ]
      }
      
    2. Copy the JSON string into the REQUEST BODY, and click Execute.
      You get a 200 response code.
  7. Verify whether you can use the new username and password to log in to Cloudera Manager (a curl-based check is sketched at the end of this procedure).
  8. Replace the stale Cloudera Manager username with the new username using the following commands:
    1. kubectl exec -it cdp-embedded-db-0 -n [***CLOUDERA CONTROL PLANE NAMESPACE***] psql
    2. \c db-env;
    3. Run the following SQL queries in the cdp-embedded-db-0 pod:
      1. SELECT e.environment_crn, c.value FROM environments e JOIN configs c ON e.environment_crn = c.environment_crn WHERE e.environment_name = '[***YOUR ENV NAME***]' AND c.attr = 'cmUser';
      2. The UPDATE configs SET value='[***NEW CLOUDERA MANAGER INTERNAL USER***]' WHERE environment_crn='[***ENVIRONMENT CRN OF ENV WITH THE EXCEPTION***]' AND attr='cmUser'; command replaces the old Cloudera Manager username.
  9. Replace the stale Cloudera Manager password with the new password:
    1. Run the steps in Step 2 to find the Cloudera Manager user password credential path in the vault which should be in kv/[***CLOUDERA CONTROL PLANE NAMESPACE***]/[***ENV-NAME***]/[***RANDOM UUID***]/cmPassword format.
    2. Run $ vault kv patch -tls-skip-verify kv/[***CLOUDERA CONTROL PLANE NAMESPACE***]/[***ENV NAME WITH THE EXCEPTION***]/[***RANDOM UUID***]/cmPassword value=[***NEW_CM_INTERNAL_USER_PASSWORD***]
    3. Verify whether the cmPassword is changed using the $ vault kv get -tls-skip-verify kv/[***CLOUDERA CONTROL PLANE NAMESPACE***]/[***ENV NAME WITH THE EXCEPTION***]/[***RANDOM UUID***]/cmPassword command.
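
As a final end-to-end check, you can call the Cloudera Manager REST API with the regenerated credentials. The following is a minimal sketch; the host, port, and TLS options are assumptions about your deployment:

  # Returns the highest supported API version if the new credentials are accepted.
  curl -k -s -u "[***NEW_CM_INTERNAL_USER***]:[***NEW_CM_INTERNAL_USER_PASSWORD***]" \
    "https://[***CLOUDERA MANAGER HOST***]:7183/api/version"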