Configure backups for a Data Lake

The Data Lake provides a command line interface for managing Data Lake backup and restore operations. Before starting a new operation, the system checks that no other backup or restore is already in progress.
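
Because only one backup or restore can run at a time, it can help to confirm that nothing is already in progress before you start. As an illustration, assuming a Data Lake named finance-dl and a CLI version that includes the list-datalake-backups subcommand:

    # List recent backup operations; any entry still IN_PROGRESS must
    # complete before a new backup can start.
    $ cdp datalake list-datalake-backups --datalake-name finance-dl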

  • Estimate the time required for the backup.
  • Create the S3 or ABFS backup location before performing the backup.
  • Update the aws-cdp-bucket-access-policy to include the backup bucket as a resource (a sketch of the added statements appears after this list). See “Minimal setup for cloud storage”.
  • Verify that the following roles have full read and write access to the backup location:
    • LOG_ROLE
    • RANGER_AUDIT_ROLE
    • DATALAKE_ADMIN_ROLE
  • Shut down principal services (see “Principal services”). This helps avoid mismatches between Data Lake metadata and the data used by workloads, as well as inconsistencies among the metadata stores within the Data Lake.
  • Stop all Data Hubs attached to the Data Lake before you perform any backup or restore operations.
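
For the aws-cdp-bucket-access-policy update above, the fragment below sketches what the added backup-bucket resources might look like. It is illustrative only: acme-finance-admin-bucket is a placeholder name, and the action list in your actual policy may be broader.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBackupBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::acme-finance-admin-bucket"]
            },
            {
                "Sid": "ReadWriteBackupObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": ["arn:aws:s3:::acme-finance-admin-bucket/*"]
            }
        ]
    }

Attached Data Hubs can be stopped from the Management Console or, assuming the cluster names are known, from the CLI:

    $ cdp datahub stop-cluster --cluster-name <data hub name>
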
  1. Log into a computer that has connectivity to the Data Lake host.
  2. Install the CDP CLI Client.
  3. Switch to a user account that has the environment admin role.
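
    As an illustration, steps 2 and 3 might look like the following on a Linux host, assuming a pip-based installation of the cdpcli package and a CDP access key pair generated for a user that holds the EnvironmentAdmin role:

    # Install the CDP CLI client in an isolated Python environment.
    $ python3 -m venv cdpcli-env && source cdpcli-env/bin/activate
    $ pip install cdpcli

    # Enter the access key ID and private key when prompted.
    $ cdp configure

    # Confirm which user the CLI is authenticated as.
    $ cdp iam get-user
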
  4. Run a backup.
    Use the following command to run the Data Lake backup:

    $ cdp datalake backup-datalake --datalake-name <name> --backup-location <cloud storage location> [--backup-name <label text>]
    Where the options are the following:

      --datalake-name
          Example: finance-dl
          The name of the Data Lake as configured in the CDP environment.

      --backup-location
          Example: s3a://acme-finance-admin-bucket/backup-archive or
          abfs://<container-name>@mydatalakesan.dfs.core.windows.net/backup_01/
          The fully qualified S3 bucket and object or ABFS location where the
          backup operation writes files. For S3, use the s3a file system
          protocol.

      [--backup-name]
          Example: pre-upgrade0420
          An optional label that distinguishes one backup from another. The
          backup name can be used to identify a backup when restoring.

    On AWS:

    $ cdp datalake backup-datalake --datalake-name finance-dl \
         --backup-location s3a://acme-finance-admin-bucket/backup-archive \
         --backup-name pre-upgrade0420

    On Azure:

    $ cdp datalake backup-datalake --datalake-name my-datalake \
          --backup-location abfs://<container-name>@mydatalakesan.dfs.core.windows.net/backup_01/

    The output of the command shows the current status of the operation. The internalState field reports the status of each component backup; if any of the individual backups fail, the overall status is FAILED and the backup cannot be restored.

    {
        "accountId": "9d74eee4-1cad-45d7-b654-7ccf9edbb73d",
        "backupId": "415927d9-9f7d-4d42-8000-71630e5938ca",
        "internalState": "{ATLAS_ENTITY_AUDIT_EVENTS_TABLE=IN_PROGRESS, EDGE_INDEX_COLLECTION=IN_PROGRESS, DATABASE=SUCCESSFUL, FULLTEXT_INDEX_COLLECTION=IN_PROGRESS, ATLAS_JANUS_TABLE=IN_PROGRESS, RANGER_AUDITS_COLLECTION=IN_PROGRESS, VERTEX_INDEX_COLLECTION=IN_PROGRESS}",
        "status": "IN_PROGRESS",
        "startTime": "2021-04-20 20:10:16.567",
        "endTime": "2021-04-20 20:10:45.341",
        "backupLocation": "s3a://acme-finance-admin-bucket/backup-archive/backup-archive",
        "backupName": "pre-upgrade0420",
        "failureReason": ""
    }
To check the status of the backup after the initial command, see “Checking the status of a Data Lake backup”.
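
As a sketch of how the pieces fit together from a shell, the following captures the backupId from the initial command and polls until the operation completes. It assumes jq is installed and that your CLI version includes the backup-datalake-status subcommand covered in that topic:

    # Start the backup and record its ID.
    BACKUP_ID=$(cdp datalake backup-datalake \
          --datalake-name finance-dl \
          --backup-location s3a://acme-finance-admin-bucket/backup-archive \
          --backup-name pre-upgrade0420 | jq -r '.backupId')

    # Poll once a minute until the overall status leaves IN_PROGRESS.
    while true; do
        STATUS=$(cdp datalake backup-datalake-status \
              --datalake-name finance-dl \
              --backup-id "$BACKUP_ID" | jq -r '.status')
        echo "$(date) backup status: $STATUS"
        [ "$STATUS" != "IN_PROGRESS" ] && break
        sleep 60
    done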