Configure backups for a Data Lake

The Data Lake provides a command line interface for managing Data Lake backup and restore operations. Before starting a new operation, the system checks that no other backup or restore is in progress.

  • Create the S3 or ABFS backup location before performing the backup. For Azure, the container where the backup is stored should be in the same storage account as the Data Lake being backed up.
  • Shut down principal services (see Principal services). This helps avoid mismatches between Data Lake metadata and the data used by workloads, as well as inconsistencies among the metadata stored in the Data Lake.
  • Stop all Data Hubs attached to the Data Lake before you perform any backup or restore operations.
  • Stop any Virtual Warehouses that are running.
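
The attached Data Hubs can be stopped from the same CDP CLI. A sketch, assuming a cluster named finance-dh (substitute your own cluster names):

    $ cdp datahub stop-cluster --cluster-name finance-dh

Repeat for each Data Hub attached to the Data Lake before starting the backup.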

For AWS:

  • Apply the IAM policy for Data Lake backup to the following roles:

      For more information on IAM roles, see Minimal setup for cloud storage.

      In the IAM policy for Data Lake backup, be sure to replace the <BACKUP_BUCKET> variable with the backup location used.
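
The substitution can be scripted. A minimal sketch, assuming a policy file named datalake-backup-policy.json and a backup bucket named acme-finance-admin-bucket (the policy fragment below is a placeholder, not the full Data Lake backup policy):

```shell
# Hypothetical policy fragment for illustration; the real IAM policy for
# Data Lake backup contains more statements. File and bucket names are examples.
cat > datalake-backup-policy.json <<'EOF'
{
  "Resource": [
    "arn:aws:s3:::<BACKUP_BUCKET>",
    "arn:aws:s3:::<BACKUP_BUCKET>/*"
  ]
}
EOF

# Replace the <BACKUP_BUCKET> variable with the actual backup bucket name.
sed -i 's|<BACKUP_BUCKET>|acme-finance-admin-bucket|g' datalake-backup-policy.json
cat datalake-backup-policy.json
```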

For Azure:

  • Verify that the following identities have the "Storage Blob Data Contributor" role on the container where the backup is stored:
    • Data Lake Admin identity
    • Ranger Audit identity
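
The role assignments can be checked with the Azure CLI. A sketch, where <principal-id> stands in for the identity's object ID and <storage-account-resource-id> for the storage account's resource ID (both are placeholders):

    $ az role assignment list \
          --assignee <principal-id> \
          --scope <storage-account-resource-id> \
          --role "Storage Blob Data Contributor"

An empty result means the role still needs to be granted to that identity.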
  1. Log into a computer that has connectivity to the Data Lake host.
  2. Install the CDP CLI Client.
  3. Switch to a user account that has the environment admin role.
  4. Run a backup.
    Use the following command to run the Data Lake backup:

    $ cdp datalake backup-datalake --datalake-name <name> --backup-location <cloud storage location> [--backup-name <label text>]

    Where the options are the following:

    --datalake-name
        Example: finance-dl
        The name of the Data Lake as configured in the CDP environment.

    --backup-location
        Example: s3a://acme-finance-admin-bucket/backup-archive or abfs://<container-name>
        The fully qualified name of the S3 bucket and object, or the ABFS location, where the backup operation writes files. For S3, use the "s3a" file system protocol.

    [--backup-name]
        Example: pre-upgrade0420
        An optional label that helps distinguish one backup from another. The backup name can be used to identify a backup for restoring.

    On AWS:

    $ cdp datalake backup-datalake --datalake-name finance-dl \
         --backup-location s3a://acme-finance-admin-bucket/backup-archive \
         --backup-name pre-upgrade0420

    On Azure:

    $ cdp datalake backup-datalake --datalake-name my-datalake \
          --backup-location abfs://<container-name>

    The output of the command shows the current status of the operation. Note that the internal state shows the status of each separate backup operation. If any of the individual backups fails, the overall status is FAILED and the backup cannot be restored. (Line breaks added for readability.)

        {
            "accountId": "9d74eee4-1cad-45d7-b654-7ccf9edbb73d",
            "backupId": "415927d9-9f7d-4d42-8000-71630e5938ca",
            "status": "IN_PROGRESS",
            "startTime": "2021-04-20 20:10:16.567",
            "endTime": "2021-04-20 20:10:45.341",
            "backupName": "pre-upgrade0420",
            "failureReason": ""
        }

To see the status of the backup after the initial command, see Checking the status of a Data Lake backup.
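
As a sketch of that follow-up check, assuming the backup ID returned by the backup command above:

    $ cdp datalake backup-datalake-status --datalake-name finance-dl \
          --backup-id 415927d9-9f7d-4d42-8000-71630e5938ca

The command returns the same status structure, so it can be polled until the status changes from IN_PROGRESS.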