Backup and Restore for the Data Lake

You can back up and restore the metadata maintained in the Data Lake services. Data Lake backup and restore is supported in Cloudera Runtime 7.2.1 and later.

A single backup operation captures all of the services running in the Data Lake together, which improves the likelihood that the data in the backup is synchronized across those services. Use Data Lake backup and restore:
  • in preparation for a Data Lake upgrade.
  • for archiving Data Lake content for later reference.
  • to promote Data Lake content from test to production.

In these scenarios, you should run the Data Lake backup only after stopping the principal services (see "Principal services" below) so that metadata is not updated while the backup runs.

At this time, you can trigger a Data Lake backup through a CDP command-line operation from any host with connectivity to the Data Lake. This functionality requires an entitlement on your Cloudera account. The entitlement name is DATA_LAKE_BACKUP_AND_RESTORE.
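
As a rough sketch of how such a command-line backup might be scripted, the Python snippet below assembles a CDP CLI invocation as an argument list. The subcommand `backup-datalake`, the flags `--datalake-name` and `--backup-location`, and the example Data Lake name and bucket path are assumptions made for illustration; confirm them against `cdp datalake help` for your CLI version before relying on them.

```python
import shlex


def build_backup_command(datalake_name, backup_location):
    """Return the argument list for a Data Lake backup request.

    Assumption: the CDP CLI exposes `cdp datalake backup-datalake` with
    `--datalake-name` and `--backup-location`; verify this with the CLI help.
    """
    if not datalake_name:
        raise ValueError("datalake_name is required")
    if not backup_location:
        raise ValueError("backup_location is required")
    return [
        "cdp", "datalake", "backup-datalake",
        "--datalake-name", datalake_name,
        "--backup-location", backup_location,
    ]


if __name__ == "__main__":
    # Hypothetical names; replace with your own Data Lake and bucket path.
    cmd = build_backup_command("my-datalake", "s3a://my-backup-bucket/datalake-backups")
    print(shlex.join(cmd))
```

To run the command rather than print it, the list could be passed to `subprocess.run(cmd, check=True)` on a host where the CDP CLI is installed and configured.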

What's backed up?

The backup operation saves a full snapshot of data from all SDX services:

  • Atlas:
    • Audit events, saved in HBase tables
    • Lineage data, saved as Janus graph data in HBase tables
    • Edge, vertex, and full text indexes, saved in Solr collections
  • Ranger:
    • Audit logs, saved as a Solr collection
    • Permissions and tags, saved in RDBMS tables
  • HMS Metadata, saved in RDBMS tables

How do I keep the backup secure?

The backup files are stored on AWS S3 with encryption enabled.

To ensure that your backup content is secure, set the policies for accessing the bucket location such that the user running the backup can write to the location but not read or decrypt the individual files. The restore user need not have access to the backup location: the service account instance profile for the Data Lake will perform the actual read operation.
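
One way to picture the "write but not read" arrangement is as an IAM-style policy document for the user who runs the backup. The sketch below builds such a document in Python. The bucket name, prefix, and statement IDs are hypothetical, and the policy covers only the S3 object actions `s3:PutObject` and `s3:GetObject`; the permissions needed for encryption keys depend on how encryption is configured on your bucket and are not shown.

```python
import json


def write_only_backup_policy(bucket, prefix):
    """Build a policy document that allows object writes and denies object reads.

    Illustrative only: `bucket` and `prefix` are placeholders, and a real
    policy may need further statements (for example, for encryption keys).
    """
    resource = f"arn:aws:s3:::{bucket}/{prefix.strip('/')}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowBackupWrite",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": resource,
            },
            {
                "Sid": "DenyBackupRead",
                "Effect": "Deny",
                "Action": ["s3:GetObject"],
                "Resource": resource,
            },
        ],
    }


if __name__ == "__main__":
    policy = write_only_backup_policy("my-backup-bucket", "datalake-backups")
    print(json.dumps(policy, indent=2))
```

The explicit deny is a deliberate choice in this sketch: it keeps the backup user from reading the files even if another policy attached to that user grants broader S3 access.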

How often should backups run?

You can run backups as part of these events:

  • Upgrades: take a backup before you perform an upgrade. If the upgrade fails in a way that requires you to rebuild the original environment, you can use this backup to restore the existing environment or to create a new one.
  • Moving the Data Lake metadata (Atlas lineages, Ranger policies and audit information, and HMS metadata) from one environment to another.

Taking a backup causes downtime for the principal services (see "Principal services" below), including the HMS services on any attached Data Hubs, because these services must be shut down while the backup runs.

Data Lake Restore vs Repair

Data Lake repair handles the replacement or repair of the compute resources for the Data Lake. The Data Lake restore replaces the Data Lake data content: the metadata managed by each of the Data Lake services.

Principal services

The following principal services are dependencies during backup and restore operations:
  • On the Data Lake:
    • Atlas
    • Ranger
    • HMS
  • HMS Services on any attached Data Hub.
They cannot run during a backup or restore operation.
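
A pre-flight check for this rule might look like the following minimal sketch. It assumes you have already collected service states from your cluster manager into a dictionary keyed by (cluster, service). The service labels, the cluster names, and the "STOPPED" state string are hypothetical placeholders, not values defined by the backup tooling.

```python
# Hypothetical labels for the principal services listed above.
PRINCIPAL_SERVICES = {"atlas", "ranger", "hms"}


def principal_services_still_running(states):
    """Return the (cluster, service) pairs that would block a backup or restore.

    `states` maps (cluster_name, service_name) to a state string. Any
    principal service whose state is not exactly "STOPPED" is reported,
    so unknown or transitional states are treated as blocking.
    """
    blocking = []
    for (cluster, service), state in states.items():
        if service.lower() in PRINCIPAL_SERVICES and state != "STOPPED":
            blocking.append((cluster, service))
    return sorted(blocking)


if __name__ == "__main__":
    example_states = {
        ("datalake", "atlas"): "STOPPED",
        ("datalake", "ranger"): "STARTED",
        ("datalake", "hms"): "STOPPED",
        ("datahub-1", "hms"): "STOPPING",
    }
    for cluster, service in principal_services_still_running(example_states):
        print(f"Stop {service} on {cluster} before continuing")
```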

Dependent services

The following are dependent services:
  • HBase
  • Solr
  • ZooKeeper
  • Databases
They cannot run during a backup or restore operation.

Limitations in the backup and restore system

  • To avoid mismatches between Data Lake metadata and the data used by workloads, and mismatches among the metadata stored in the Data Lake itself, run the Data Lake backup only when the principal services (see "Principal services" above) are stopped.
  • With CDP Runtime 7.2.1, the initial backup may be written to an S3 bucket. All subsequent backups must be written to the same bucket as the initial backup.
  • Backup-restore of the Data Lake to ADLS is not supported.
  • Restoring the Data Lake to a point in time in the past is not supported because the data and metadata may be out of sync. If the data and metadata are out of sync, the restore operation will leave the Data Lake in a non-functional state.
  • Restoring a backup taken from a different version of CDP Runtime is not supported. For example, restoring a backup taken from CDP Runtime 7.2.1 to a Data Lake with CDP Runtime 7.2.2 is not supported.
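
Two of the limitations above (matching runtime versions and reusing the initial bucket) lend themselves to a simple check before a restore or a follow-up backup. The sketch below is illustrative only: the function names are hypothetical, versions are assumed to be dotted numeric strings such as "7.2.1", and backup locations are assumed to be URLs whose host part is the bucket name.

```python
from urllib.parse import urlparse

MIN_RUNTIME = (7, 2, 1)  # earliest runtime with backup and restore support


def parse_runtime(version):
    """Turn a dotted version string such as '7.2.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def restore_problems(backup_runtime, target_runtime):
    """Return a list of reasons why a restore would be unsupported."""
    problems = []
    if parse_runtime(backup_runtime) < MIN_RUNTIME:
        problems.append(f"backup runtime {backup_runtime} predates 7.2.1")
    if parse_runtime(backup_runtime) != parse_runtime(target_runtime):
        problems.append(
            f"backup runtime {backup_runtime} differs from target {target_runtime}"
        )
    return problems


def same_bucket(initial_location, new_location):
    """Check that two backup locations point at the same bucket."""
    return urlparse(initial_location).netloc == urlparse(new_location).netloc


if __name__ == "__main__":
    print(restore_problems("7.2.1", "7.2.2"))
    print(same_bucket("s3a://backups/first", "s3a://backups/second"))
```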