Configure Backups for a Data Lake

The Data Lake provides a command-line interface for managing backup and restore operations. Before it starts a backup, the system verifies that no other backup or restore is in progress.

  • Estimate the interval required for the backup.
  • Shut down principal services (see “Backup and Restore for the Data Lake” > “Principal services”).
  • Update the aws-cdp-bucket-access-policy to include the backup bucket as a resource. See “Minimal setup for cloud storage”.
  1. Log into a computer that has connectivity to the Data Lake host.
  2. Install the CDP CLI Client.
  3. Switch to a user account that has the environment admin role.
  4. Run a backup.
    Use the following command to run the Data Lake backup:

    $ cdp --endpoint-url <environment> datalake backup-datalake --datalake-name <name> --backup-location <HDFS path> [--backup-name <label text>]

    The options are as follows:
    Option            Example                                         Description
    --datalake-name   finance-dl                                      The name of the Data Lake as configured in the CDP environment.
    --endpoint-url                                                    The URL associated with the environment that the Data Lake is part of.
    --backup-location s3a://acme-finance-admin-bucket/backup-archive  The fully qualified name of the S3 bucket and object where the backup operation writes files. Use the "s3a" file system protocol.
    [--backup-name]   pre-upgrade0420                                 An optional label that helps you distinguish one backup from another. The backup name can be used to identify a backup for restoring.

    (Line breaks added for readability)

    $ cdp --endpoint-url 
         datalake backup-datalake --datalake-name finance-dl 
         --backup-location s3a://acme-finance-admin-bucket/backup-archive 
         --backup-name pre-upgrade0420
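As a minimal sketch, the backup command can be assembled from variables so that each value is set in one place and the optional --backup-name is added only when a label is supplied. The function name backup_command and its argument order are hypothetical and not part of the CDP CLI; only the flags shown in the syntax above are used. The sketch prints the command as a dry run rather than executing it.

```shell
#!/bin/sh
# Sketch only: backup_command is an illustrative helper, not a CDP CLI feature.
# It prints the backup command so it can be reviewed before it is run.

backup_command() {
    # $1 = endpoint URL, $2 = Data Lake name,
    # $3 = backup location, $4 = optional backup name
    cmd="cdp --endpoint-url $1 datalake backup-datalake"
    cmd="$cmd --datalake-name $2 --backup-location $3"
    if [ -n "${4:-}" ]; then
        # Append the label only when one was given.
        cmd="$cmd --backup-name $4"
    fi
    printf '%s\n' "$cmd"
}

# Dry run with the example values; "<environment>" is left as a placeholder.
backup_command "<environment>" finance-dl \
    s3a://acme-finance-admin-bucket/backup-archive pre-upgrade0420
```

Printing the command first makes it easy to check the bucket path and label before starting an operation that blocks other backups and restores.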

    The output of the backup-datalake command shows the current status of the operation. The internal state shows the status of each separate backup operation. If any individual backup fails, the overall status is failed and the backup cannot be restored. (Line breaks added for readability.)

        "accountId": "9d74eee4-1cad-45d7-b654-7ccf9edbb73d",
        "backupId": "415927d9-9f7d-4d42-8000-71630e5938ca",
        "status": "IN_PROGRESS",
        "startTime": "2021-04-20 20:10:16.567",
        "endTime": "2021-04-20 20:10:45.341",
        "backupName": "pre-upgrade0420",
        "failureReason": ""
To see the status of the backup after the initial command, see “Checking the status of a Data Lake backup”.
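Because a backup whose overall status is failed cannot be restored, a script may want to read the status and backupId fields from the command output. The following is a rough sketch, assuming the output has been captured as text with one "key": "value" pair per line, as in the sample output. The helper json_field is hypothetical and not part of the CDP CLI, and the sample response is abridged from the output shown earlier.

```shell
#!/bin/sh
# Sketch only: json_field is an illustrative helper, not a CDP CLI feature.
# It assumes each "key": "value" pair sits on its own line.

json_field() {
    # $1 = field name; reads the captured command output on stdin
    sed -n 's/.*"'"$1"'": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Sample response, abridged from the backup command output.
response='"backupId": "415927d9-9f7d-4d42-8000-71630e5938ca",
"status": "IN_PROGRESS",
"backupName": "pre-upgrade0420",
"failureReason": ""'

printf '%s\n' "$response" | json_field status      # prints IN_PROGRESS
printf '%s\n' "$response" | json_field backupId    # prints the backup ID
```

If the full output is valid JSON, a JSON-aware tool is a more robust choice than line-based pattern matching.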