Backup and restore for the Data Lake

You can back up and restore the metadata maintained in the Data Lake services. The backup and restore operation creates a comprehensive backup, which improves the likelihood that the data in the backup is synchronized across all the services running in the Data Lake.

Use Data Lake backup and restore:
  • to prepare for a Data Lake upgrade.
  • to archive Data Lake content for later reference.
  • to promote Data Lake content from test to production.

In these scenarios, you should run the Data Lake backup only after stopping the principal services (see "Principal services" below) so that metadata is not updated while the backup runs.

At this time, you can trigger a Data Lake backup through a CDP command-line operation from any host with connectivity to the Data Lake.
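
For example, a backup can be started and monitored with the CDP CLI. The following is a minimal sketch, assuming the CDP CLI is installed and configured; the Data Lake name and backup location are placeholders, and the exact command names and options should be verified against your CDP CLI version.

    # Start a backup of the Data Lake metadata to a cloud storage location.
    # "my-datalake" and the s3a:// path are placeholders.
    cdp datalake backup-datalake \
        --datalake-name my-datalake \
        --backup-location s3a://my-backup-bucket/datalake-backups

    # Check the status of the backup operation.
    cdp datalake backup-datalake-status \
        --datalake-name my-datalake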

Note that the Data Lake content includes metadata about your cluster workloads but does not back up the data itself. If you restore a backup, there can be a mismatch between the restored metadata and the current state of the workload data, because events may have occurred on the cluster since the backup was taken. To reduce this gap, run backups frequently and make sure that they complete successfully.

What's backed up?

The backup operation saves a full snapshot of data from all SDX services:

  • Atlas:
    • Audit events, saved in HBase tables
    • Lineage data, saved as Janus graph data in HBase tables
    • Edge, vertex, and full text indexes, saved in Solr collections
  • Ranger:
    • Audit logs, saved as a Solr collection
    • Permissions and tags, saved in RDBMS tables
  • HMS Metadata, saved in RDBMS tables

How do I keep the backup secure?

The backup files are stored on AWS S3 or ADLS with encryption enabled.

To ensure that your backup content is secure, set the policies for accessing the bucket or container location such that the user running the backup can write to the location but not read or decrypt the individual files. The restore user need not have access to the backup location: the service account instance profile for the Data Lake will perform the actual read operation.
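
As an illustration only, the following sketch uses the AWS CLI to attach a bucket policy along these lines. The bucket name, prefix, policy file, and IAM user ARN are hypothetical placeholders; adapt the policy to your own account, key management setup, and security model.

    # Sketch: restrict the backup user (placeholder ARN) to write-only access
    # on the backup prefix. The file backup-bucket-policy.json is assumed to
    # contain two statements for arn:aws:iam::111111111111:user/datalake-backup-user:
    #   1. Allow s3:PutObject on arn:aws:s3:::my-backup-bucket/datalake-backups/*
    #   2. Deny  s3:GetObject on arn:aws:s3:::my-backup-bucket/datalake-backups/*
    aws s3api put-bucket-policy \
        --bucket my-backup-bucket \
        --policy file://backup-bucket-policy.json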

How often should backups run?

You can run backups as part of these events:

  • Upgrades: perform a backup before an upgrade. If the upgrade fails in a way that requires you to rebuild the original environment, you can use the backup to restore the existing environment or to create a new one.
  • Moving the Data Lake metadata (Atlas lineages, Ranger policies and audit information, and HMS metadata) from one environment to another.

When backups are taken, there is downtime for principal services (see "Principal services" below), because the HMS service must be shut down on any attached Data Hubs.

Data Lake Restore vs Repair

Data Lake repair handles the replacement or repair of the compute resources for the Data Lake. Data Lake restore, by contrast, replaces the Data Lake content: the metadata managed by each of the Data Lake services.
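
As a sketch of the restore side, a restore can be triggered and monitored with the CDP CLI, much like a backup. The Data Lake name and backup ID below are placeholders; verify the exact command names and options against your CDP CLI version.

    # Restore the Data Lake metadata from a previously completed backup.
    # "my-datalake" and the backup ID are placeholders.
    cdp datalake restore-datalake \
        --datalake-name my-datalake \
        --backup-id 415927d9-0000-0000-0000-123456789abc

    # Check the status of the restore operation.
    cdp datalake restore-datalake-status \
        --datalake-name my-datalake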

Principal services

The following principal services affect backup and restore operations:
  • On the Data Lake:
    • Atlas
    • Ranger
    • HMS
  • HMS services on any attached Data Hubs.
These services cannot run during a backup or restore operation.

Dependent services

Backup and restore operations are dependent on the following services:
  • HBase
  • Solr
  • ZooKeeper
  • Database services (for example, Postgres)
These services must be running during a backup or restore operation.
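
If you want to confirm that the dependent services are up before starting, one option is to query the Cloudera Manager REST API for service state. This is only a sketch: the Cloudera Manager host, credentials, cluster name, and API version are placeholders and may differ in your deployment.

    # List the services in the Data Lake cluster and their states as seen by
    # Cloudera Manager. Host, credentials, cluster name, and API version are
    # placeholders; add -k if Cloudera Manager uses a self-signed certificate.
    curl -s -u admin:admin \
        "https://cm-host.example.com:7183/api/v41/clusters/my-datalake-cluster/services" \
        | jq -r '.items[] | "\(.name)\t\(.serviceState)"'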

Other considerations

Do not stop or restart Data Lake services in Cloudera Manager if you plan to run backup or restore operations. If you stop or restart Data Lake services from Cloudera Manager (for example, by restarting all services, or restarting the HBase or Solr services), backup and restore operations run from the CLI are still allowed to proceed but may fail.

You must stop all Data Hubs attached to a Data Lake before performing backup and restore operations. A running Data Hub prevents a restore from functioning and can result in inconsistent data in a backup.
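
As a minimal sketch, an attached Data Hub can be stopped and later restarted with the CDP CLI; the cluster name below is a placeholder.

    # Stop an attached Data Hub before running a Data Lake backup or restore.
    cdp datahub stop-cluster \
        --cluster-name my-datahub

    # Start it again once the backup or restore operation has completed.
    cdp datahub start-cluster \
        --cluster-name my-datahub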

Limitations in the backup and restore system

  • With Cloudera Runtime 7.2.1, the initial backup on AWS may be written to any S3 bucket, but all subsequent backups must be written to the same bucket as the initial backup. This limitation does not exist on either AWS or Azure for Runtime 7.2.2.
  • Restoring a backup of an AWS Data Lake to an Azure Data Lake, or an Azure Data Lake to an AWS Data Lake is not supported.
  • Restoring the Data Lake to a point in time in the past is not supported, because the data and metadata may be out of sync. If the data and metadata are out of sync, the restore operation will leave the Data Lake in a non-functional state.
  • Restoring a backup taken from a different version of CDP Runtime is not supported. For example, restoring a backup taken from Cloudera Runtime 7.2.1 to a Data Lake with Cloudera Runtime 7.2.2 is not supported.
  • If you perform a restore to a Data Lake with Runtime 7.2.6, the restore operation will fail.

    Workaround: Run the following script on the master node before you perform a restore operation on a Data Lake with Runtime 7.2.6:

    #!/usr/bin/env bash
    # Workaround for restoring a Data Lake with Runtime 7.2.6: import the
    # backup location into the S3Guard metadata store before the restore.

    if [ $# -ne 1 ]; then
        echo "Invalid inputs provided"
        echo "Script needs one input:"
        echo "  1. Backup location (s3a://...)"
        exit 1
    fi

    # Locate the hdfs keytab managed by the Cloudera Manager agent and
    # authenticate as the hdfs principal.
    NAME=hdfs
    KEYTAB=$(find /run/cloudera-scm-agent/process -name "${NAME}.keytab" | tail -n 1)
    PRINCIPAL=$(klist -kt "$KEYTAB" | tail -n 1 | awk '{print $4}')
    kinit -kt "$KEYTAB" "$PRINCIPAL"

    # Import the contents of the backup location into the S3Guard metadata store.
    hadoop s3guard import "$1"
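
    For example, assuming the script is saved as s3guard-import.sh (a hypothetical file name) and the backup location below is a placeholder, you might run it as follows. Root access is typically needed to read the keytab under /run/cloudera-scm-agent/process.

    # Make the script executable and run it against the backup location.
    chmod +x s3guard-import.sh
    sudo ./s3guard-import.sh s3a://my-backup-bucket/datalake-backups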