Troubleshooting Data Lake backup operations

Possible issues with Data Lake backups and suggested resolutions.

"failureReason": "[Gateway Timeout]"

This probably caused by a network or process timeout issue. If this doesn't resolve itself after a few minutes, check the messages at the environment and Data Lake levels to make sure there's not some larger issue happening.

"failureReason": "[HBase service HBase does not have a running Master.]"

This happens when the HBase service is not running. Check the HBase service page in Cloudera Manager to resolve any problems and restart the service.

"failureReason": "[Unable to get user data from UMS for CRN...

"failureReason": "[Unable to get user data from UMS for CRN crn:altus:iam:us-west-1:7d24tin4-1ced-47d2-v375-8ccf3ndjj71d:user:7e4e753v-8n6t-4bj6-49op-g60894bc063y]"

This error appears when the user is not authorized to start a backup. The user needs to be an environment admin role. (Go to User Management in the Management Console)

"failureReason": "Failed to backup core=ranger_audits_[...]

(cdpclienv) [Wed Jun  3 16:32:27 CDT 2020 - smith@smith-7681-mbp15:/Users/smith/git]$cdp datalake backup-datalake --datalake-name smith-dr-7  --backup-location hdfs://smith-dr-7-master0.smith.xcu2-8y8x.dev.cldr.work:8020/smithback --backup-name "smith-test-7-6"
{
    "accountId": "7d24tin4-1ced-47d2-v375-8ccf3ndjj71d",
    "backupId": "32732sa2-1c95-4e33-a957-16d7fb645807",
    "internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=IN_PROGRESS, DATABASE=SUCCESSFUL, FULLTEXT_INDEX_COLLECTION=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS}",
    "status": "IN_PROGRESS",
    "startTime": "2020-06-04 14:35:38.195",
    "endTime": "Thu Jun 04 14:35:47 GMT 2020",
    "backupLocation": "hdfs://smith-dr-7-master0.smith.xcu2-8y8x.dev.cldr.work:8020/smithback/",
    "failureReason": ""
}
(cdpclienv) [Thu Jun  4 09:35:47 CDT 2020 - smith@smith-7681-mbp15:/Users/smith/git]$cdp datalake backup-datalake-status --datalake-name smith-dr-7  --backup-name "smith-test-7-6"
{
    "accountId": "7d24tin4-1ced-47d2-v375-8ccf3ndjj71d",
    "userCrn": "crn:altus:iam:us-west-1:7d24tin4-1ced-47d2-v375-8ccf3ndjj71d:user:7e4e753v-8n6t-4bj6-49op-g60894bc063y",
    "internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=SUCCESSFUL, EDGE_INDEX_COLLECTION=SUCCESSFUL, FULLTEXT_INDEX_COLLECTION=SUCCESSFUL, DATABASE=SUCCESSFUL, RANGER_AUDITS_COLLECTION=FAILED, ATLAS_JANUS_TABLE=SUCCESSFUL, VERTEX_INDEX_COLLECITON=SUCCESSFUL}",
    "status": "FAILED",
    "startTime": "2020-06-04 14:35:38.195",
    "endTime": "2020-06-04 14:35:58.833",
    "backupLocation": "hdfs://smith-dr-7-master0.smith.xcu2-8y8x.dev.cldr.work:8020/smithback/",
    "failureReason": "Failed to backup core=ranger_audits_shard1_replica_n21 because org.apache.solr.core.SolrCoreInitializationException: SolrCore 'ranger_audits_shard1_replica_n21' is not available due to init failure: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RetriableException): NameNode still not started\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:2239)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setSafeMode(NameNodeRpcServer.java:1225)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setSafeMode(ClientNamenodeProtocolServerSideTranslatorPB.java:853)\n\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)\n\tat org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:984)\n\tat org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:912)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)\n"
}

When the Data Lake is shut down and then restarted, sometimes the Solr service starts incorrectly, causing backups and restore to fail.

To resolve, restart Solr service after the Data Lake is started.

{{Failed to check the existance of s3a://eng-sdx-datalake/smith-perf-1/backup_01/. Is it valid?}} or Could not verify the existence of s3a://eng-sdx-datalake/smith-perf-1/backup_01/ -- Is it accessible?

These error messages can be caused by either a permissions issue (the Data Lake/environment does not have access to the bucket), or you have designated a bucket that does not exist.

"Solr failure: Could not find a backup repository with name ..."

If you receive this error and you are attempting a backup on Runtime version 7.2.0 or earlier, then backup and restore operations are not supported on your current version.