Backup and restore for the Data Lake

You can back up and restore the metadata maintained in the Data Lake services. The backup and restore operation creates a comprehensive backup, which increases the likelihood that the data in the backup is synchronized across all the services running in the Data Lake.

Required role: EnvironmentAdmin

Use Data Lake backup and restore:
  • in preparation for a Data Lake upgrade.
  • for archiving Data Lake content for later reference.
  • to promote Data Lake content from test to production.

There will be downtime when a Data Lake backup is performed, as some Data Lake services will be stopped. Additionally, access to the HMS/Ranger databases will be blocked for the duration of the backup.

Note that a Data Lake backup includes metadata about your cluster workloads and does not include the data itself.

Important: Data Lake backup and restore operations should be performed while the system is quiescent. Cloudera recommends that you do not allow any updates to the metadata in the time frame between performing a backup and using that backup to restore a Data Lake. This is especially true for HMS: updates made after the backup is taken will cause inconsistency between the HMS metadata and the data.

At this time, you can trigger a Data Lake backup through a CDP command-line operation from any host with connectivity to the Data Lake.
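For example, with the CDP CLI installed and configured, a backup can be triggered along the following lines (the Data Lake name and backup location below are illustrative placeholders, not values from your environment):

    # Trigger a full Data Lake backup to a cloud storage location.
    cdp datalake backup-datalake \
        --datalake-name my-datalake \
        --backup-location s3a://my-backup-bucket/datalake-backups

The operation returns a backup ID, which can be used to monitor progress and to identify the backup during a later restore.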

What's backed up?

The backup operation saves a full snapshot of data from all SDX services:

  • Atlas:
    • Audit events, saved in HBase tables
    • Lineage data, saved as JanusGraph data in HBase tables
    • Edge, vertex, and full text indexes, saved in Solr collections
  • Ranger:
    • Audit logs, saved as a Solr collection
    • Permissions and tags, saved in RDBMS tables
  • HMS Metadata, saved in RDBMS tables

How do I keep the backup secure?

The backup files are stored on AWS S3 or ADLS with encryption enabled.

To ensure that your backup content is secure, set the policies for accessing the bucket or container location so that the user running the backup can write to the location but cannot read or decrypt the individual files. The user running the restore does not need access to the backup location: the service account instance profile for the Data Lake performs the actual read operation.
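As a minimal sketch of this idea on AWS (the bucket, prefix, user, and policy names below are assumptions for illustration), an IAM policy attached to the backup user could allow writes while denying reads on the backup location:

    # Policy document: allows uploading backup objects but denies
    # reading them back. All names below are placeholders.
    cat > backup-writer-policy.json <<'EOF'
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowBackupWrites",
          "Effect": "Allow",
          "Action": "s3:PutObject",
          "Resource": "arn:aws:s3:::my-backup-bucket/datalake-backups/*"
        },
        {
          "Sid": "DenyBackupReads",
          "Effect": "Deny",
          "Action": "s3:GetObject",
          "Resource": "arn:aws:s3:::my-backup-bucket/datalake-backups/*"
        }
      ]
    }
    EOF

    # Attach the policy to the user that runs the backup.
    aws iam put-user-policy \
        --user-name backup-operator \
        --policy-name datalake-backup-writer \
        --policy-document file://backup-writer-policy.json

An explicit Deny takes precedence over any Allow the user may have from other policies, which makes it a reasonable way to withhold read access.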

How often should backups run?

You can run backups as part of these events:

  • Upgrades: take a backup before performing an upgrade (see the sketch after this list). The backup can be used to restore the existing environment, or to create a new environment, if the upgrade fails in a way that requires you to rebuild the original environment.
  • Moving the Data Lake metadata (Atlas lineages, Ranger policies and audit information, and HMS metadata) from one environment to another.
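Before starting an upgrade, wait for the backup to complete. Progress can be checked with the backup ID returned when the backup was triggered; a sketch, again with an illustrative Data Lake name:

    # Poll until the status indicates the backup has completed.
    cdp datalake backup-datalake-status \
        --datalake-name my-datalake \
        --backup-id <backup-id>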

When backups are taken, there is downtime for the principal services (see "Principal services" below), because the HMS service on any attached Data Hubs must be shut down.

Data Lake Restore vs Repair

Data Lake repair replaces the compute resources and reconnects them to the persistent Data Lake storage. Data Lake restore replaces the existing Data Lake content with content from a Data Lake backup.
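A restore is triggered through the CDP CLI in much the same way as a backup. A sketch, assuming an illustrative Data Lake name and the ID of a previously taken backup:

    # Replace the current Atlas, Ranger, and HMS content with
    # the content of the identified backup.
    cdp datalake restore-datalake \
        --datalake-name my-datalake \
        --backup-id <backup-id>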

Principal services

The following principal services are affected by backup and restore operations:
  • On the Data Lake:
    • Atlas
    • Ranger
    • HMS
  • The HMS service on any attached Data Hub
They cannot run during a backup or restore operation.

Dependent services

Backup and restore operations are dependent on the following services:
  • HBase
  • Solr
  • ZooKeeper
  • Database services (e.g., PostgreSQL)
They must be running during a backup or restore operation.

Other considerations

Do not stop or restart Data Lake services in Cloudera Manager if you are planning to run backup or restore operations. If you stop or restart Data Lake services from Cloudera Manager, such as restarting all services or restarting the HBase or Solr services, Data Lake backup and restore operations from the CLI will still be allowed to proceed, but they may fail.

You must stop all Data Hubs attached to a Data Lake before performing backup and restore operations. A running Data Hub will prevent the restore from functioning, and may result in inconsistent data in a backup.
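Attached Data Hubs can be stopped through the CDP CLI before the backup or restore begins; a sketch, with an illustrative cluster name:

    # Stop one attached Data Hub; repeat for each attached cluster.
    cdp datahub stop-cluster --cluster-name my-datahub-cluster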

What's supported?

  • Backup and restore on AWS with Runtime 7.2.1 and above.
  • Backup and restore on Azure with Runtime 7.2.2 and above.
  • Backup and restore across Data Lakes with different shapes (Runtimes 7.2.8 and above). For example, a backup taken on a light duty Data Lake can be used to restore a medium duty Data Lake.
  • Runtimes 7.1.0 and 7.2.0: Data Lake backup only writes to HDFS on the Data Lake. The backup should be moved to and from cloud storage using the provided procedures. Contact Cloudera Support for more information.

What's not supported?

  • Backup and restore across different cloud providers. For example, restoring a backup of an AWS Data Lake to an Azure Data Lake, or an Azure Data Lake to an AWS Data Lake, is not supported.
  • Recovery of individual settings or pieces of metadata. For example, recovering just the Hive SQL Ranger policies from a given backup, or just the HMS metadata of a particular database, is not supported.
  • Point-in-time recovery of data synced with metadata. A restore only recovers HMS metadata and applies it to the existing data used by the Data Hubs of the environment. This means that the HMS metadata may not be in sync with the data.
  • Restoring a backup taken from a different version of CDP Runtime. For example, restoring a backup taken from Cloudera Runtime 7.2.1 to a Data Lake with Cloudera Runtime 7.2.2 is not supported.
  • Backup and restore to GCP cloud storage.

Limitations in the backup and restore system

  • With Cloudera Runtime 7.2.1, the initial backup on AWS can be written to any S3 bucket, but all subsequent backups must be written to the same bucket as the initial backup. This limitation does not exist for later Runtime versions.