Troubleshooting Data Lake restore operations

Possible issues with Data Lake restore and suggested resolutions.

Principal services running during restore

The most likely errors in restoring data from backup is that a service is in a state that is incompatible with the restore. Principal services (see Principal services) must be stopped before running the restore. Dependent services (see Dependent services) must be running to allow the restore to recreate their data. The restore checks the status of the principal services; however, if one of the dependent services is stopped and cannot be accessed to perform the restore operation, the restore operation will fail.

"failureReason": "[Datalake database restore failed.]"

If the principal services are running on the datalake during a restore operation, restore will fail with the following error message:
{
    "accountId": "8y63idy3-2ygn-98h6-j630-7uie9renq93e",
    "restoreId": "7c5c92c7-e3d3-408c-b18f-03bcfe0c9369",
    "backupId": "003b9882-e2fa-4fcc-ae8f-528de176c668",
    "userCrn": "crn:altus:iam:us-west-1:8y63idy3-2ygn-98h6-j630-7uie9renq93e:user:c87db52v-639m-613g-j94w-8hy944hn4i64",
    "internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=SUCCESSFUL, EDGE_INDEX_COLLECTION=SUCCESSFUL, DATABASE=FAILED, FULLTEXT_INDEX_COLLECTION=SUCCESSFUL, EDGE_INDEX_COLLECTION_DELETE=SUCCESSFUL, VERTEX_INDEX_COLLECITON_DELETE=SUCCESSFUL, RANGER_AUDITS_COLLECTION_DELETE=SUCCESSFUL, RANGER_AUDITS_COLLECTION=SUCCESSFUL, ATLAS_JANUS_TABLE=SUCCESSFUL, VERTEX_INDEX_COLLECITON=SUCCESSFUL, FULLTEXT_INDEX_COLLECTION_DELETE=SUCCESSFUL}",
    "status": "FAILED",
    "startTime": "2020-08-28 18:27:54.11",
    "endTime": "2020-08-28 18:29:55.507",
    "backupLocation": "s3a://eng-sdx-daily-datalake/smith-br-1/backup_01/",
    "failureReason": "[Datalake database restore failed.]"
}

To correct this scenario, stop the principal services and re-run the restore-datalake operation.

Failed restore renders Data Lake inoperable

If the restore operation fails, the Data Lake will be rendered inoperable. A restore-datalake operation must be re-run and complete successfully for the Data Lake to re-gain functionality