Troubleshooting Data Lake backup operations

Possible issues with Data Lake backups and suggested resolutions.

"failureReason": "[Gateway Timeout]"

This probably caused by a network or process timeout issue. If this doesn't resolve itself after a few minutes, check the messages at the environment and Data Lake levels to make sure there's not some larger issue happening.

"failureReason": "[HBase service HBase does not have a running Master.]"

This happens when the HBase service is not running. Check the HBase service page in Cloudera Manager to resolve any problems and restart the service.

"failureReason": "[Unable to get user data from UMS for CRN...

"failureReason": "[Unable to get user data from UMS for CRN crn:altus:iam:us-west-1:9d74eee4-1cad-45d7-b654-7ccf9
edbb73d:user:c44ac52c-625b-410c-a46c-8db204de4d92]"

This error appears when the user is not authorized to start a backup. The user needs to be an environment admin role. (Go to User Management in the Management Console)

"failureReason": "Failed to backup core=ranger_audits_[...]

(cdpclienv) [Wed Jun  3 16:32:27 CDT 2020 - biglauer@biglauer-7681-mbp15:/Users/biglauer/git]$cdp --endpoint-url https://cloudera.dps.mow-dev.cloudera.com datalake backup-datalake --datalake-name biglauer-dr-7  --backup-location hdfs://biglauer-dr-7-master0.biglauer.xcu2-8y8x.dev.cldr.work:8020/bigback --backup-name "biglauer-test-7-6"
{
    "accountId": "9d74eee4-1cad-45d7-b645-7ccf9edbb73d",
    "backupId": "51642dc2-1c95-4c44-a957-16d7fb615107",
    "internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=IN_PROGRESS, DATABASE=SUCCESSFUL, FULLTEXT_INDEX_COLLECTION=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECITON=IN_PROGRESS}",
    "status": "IN_PROGRESS",
    "startTime": "2020-06-04 14:35:38.195",
    "endTime": "Thu Jun 04 14:35:47 GMT 2020",
    "backupLocation": "hdfs://biglauer-dr-7-master0.biglauer.xcu2-8y8x.dev.cldr.work:8020/bigback/",
    "failureReason": ""
}
(cdpclienv) [Thu Jun  4 09:35:47 CDT 2020 - biglauer@biglauer-7681-mbp15:/Users/biglauer/git]$cdp --endpoint-url https://cloudera.dps.mow-dev.cloudera.com datalake backup-datalake-status --datalake-name biglauer-dr-7  --backup-name "biglauer-test-7-6"
{
    "accountId": "9d74eee4-1cad-45d7-b645-7ccf9edbb73d",
    "userCrn": "crn:altus:iam:us-west-1:9d74eee4-1cad-45d7-b645-7ccf9edbb73d:user:8d1e890c-8a2e-4cf9-96bd-d50478bc027e",
    "internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=SUCCESSFUL, EDGE_INDEX_COLLECTION=SUCCESSFUL, FULLTEXT_INDEX_COLLECTION=SUCCESSFUL, DATABASE=SUCCESSFUL, RANGER_AUDITS_COLLECTION=FAILED, ATLAS_JANUS_TABLE=SUCCESSFUL, VERTEX_INDEX_COLLECITON=SUCCESSFUL}",
    "status": "FAILED",
    "startTime": "2020-06-04 14:35:38.195",
    "endTime": "2020-06-04 14:35:58.833",
    "backupLocation": "hdfs://biglauer-dr-7-master0.biglauer.xcu2-8y8x.dev.cldr.work:8020/bigback/",
    "failureReason": "Failed to backup core=ranger_audits_shard1_replica_n21 because org.apache.solr.core.SolrCoreInitializationException: SolrCore 'ranger_audits_shard1_replica_n21' is not available due to init failure: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RetriableException): NameNode still not started\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:2239)\n\tat org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setSafeMode(NameNodeRpcServer.java:1225)\n\tat org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setSafeMode(ClientNamenodeProtocolServerSideTranslatorPB.java:853)\n\tat org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)\n\tat org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:984)\n\tat org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:912)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)\n"
}

When the Data Lake is shut down and then restarted, sometimes the Solr service starts incorrectly, causing backups and restore to fail.

To resolve, restart Solr service after the Data Lake is started.