Backup and Restore for the Data Lake
You can back up and restore the metadata maintained in the Data Lake services. Data Lake backup and restore is supported in Cloudera Runtime 7.2.1 and later.
You can run a backup:
- in preparation for a Data Lake upgrade.
- for archiving Data Lake content for later reference.
- to promote Data Lake content from test to production.
In these scenarios, you should run the Data Lake backup only after stopping the principal services (see "Principal services" below) so that metadata is not updated while the backup runs.
At this time, you can trigger a Data Lake backup through a CDP command-line operation from any host with connectivity to the Data Lake. This functionality requires an entitlement on your Cloudera account. The entitlement name is
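As a rough sketch of how such a command-line trigger might be scripted, the Python snippet below assembles a `cdp datalake backup-datalake` invocation and, by default, only returns it as a string without running it. The subcommand and flag names are assumptions based on the public CDP CLI and are not stated in this document; the Data Lake name and backup location are hypothetical placeholders. Confirm the options against the built-in help of your installed CLI before using it.

```python
import shlex
import subprocess
from typing import List, Optional


def build_backup_command(datalake_name: str, backup_location: Optional[str] = None) -> List[str]:
    """Assemble the argv for a Data Lake backup request.

    Assumption: the subcommand and flag names follow the public CDP CLI;
    verify them against your installed CLI before relying on this.
    """
    if not datalake_name:
        raise ValueError("datalake_name must not be empty")
    argv = ["cdp", "datalake", "backup-datalake", "--datalake-name", datalake_name]
    if backup_location:
        argv += ["--backup-location", backup_location]
    return argv


def run_backup(datalake_name: str, backup_location: Optional[str] = None, dry_run: bool = True) -> str:
    """Return the shell-quoted command; execute it only when dry_run is False."""
    argv = build_backup_command(datalake_name, backup_location)
    printable = shlex.join(argv)
    if not dry_run:
        # check=True raises CalledProcessError if the CLI exits with a non-zero status.
        subprocess.run(argv, check=True)
    return printable


if __name__ == "__main__":
    # Hypothetical names; a dry run only prints the command that would be executed.
    print(run_backup("my-datalake", "s3a://my-backup-bucket/datalake-backups"))
```

Keeping the dry run as the default makes it easy to review the exact command before any backup is requested.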
What's backed up?
The backup operation saves a full snapshot of data from all SDX services:
- Audit events, saved in HBase tables
- Lineage data, saved as Janus graph data in HBase tables
- Edge, vertex, and full text indexes, saved in Solr collections
- Audit logs, saved as a Solr collection
- Permissions and tags, saved in RDBMS tables
- HMS metadata, saved in RDBMS tables
How do I keep the backup secure?
The backup files are stored on AWS S3 with encryption enabled.
To ensure that your backup content is secure, set the policies for accessing the bucket location such that the user running the backup can write to the location but not read or decrypt the individual files. The restore user need not have access to the backup location: the service account instance profile for the Data Lake will perform the actual read operation.
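The write-but-not-read idea can be illustrated with a minimal sketch that builds an IAM-style policy document for the user who runs the backup. This is not a Cloudera-provided policy: the bucket name, prefix, and statement IDs are hypothetical, and the sketch covers only object writes and reads. Permissions on encryption keys depend on the encryption mode in use and are left out.

```python
import json
from typing import Any, Dict


def backup_writer_policy(bucket: str, prefix: str = "") -> Dict[str, Any]:
    """Build a policy document that allows object writes and denies object reads.

    Assumption: bucket and prefix are placeholders for your own backup location.
    """
    clean_prefix = prefix.strip("/")
    path = f"{clean_prefix}/*" if clean_prefix else "*"
    resource = f"arn:aws:s3:::{bucket}/{path}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowBackupWrite",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": resource,
            },
            {
                # An explicit Deny takes precedence over any Allow granted elsewhere.
                "Sid": "DenyBackupRead",
                "Effect": "Deny",
                "Action": ["s3:GetObject"],
                "Resource": resource,
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix, printed as JSON for review.
    print(json.dumps(backup_writer_policy("my-backup-bucket", "datalake-backups"), indent=2))
```

The explicit deny on reads is a deliberate choice in this sketch: it keeps the backup user from reading the files even if another attached policy grants broader S3 access.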
How often should backups run?
You can run backups as part of these events:
- Upgrades: take a backup before you upgrade. If the upgrade fails in a way that requires you to rebuild the original environment, you can use this backup to restore the existing environment or to create a new one.
- Moving the Data Lake metadata (Atlas lineages, Ranger policies and audit information, and HMS metadata) from one environment to another.
When backups are taken, the principal services (see "Principal services" below) incur downtime because the HMS service on any attached Data Hubs must be shut down.
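One way to keep that downtime bounded is to wrap the backup in a stop/restart guard so that services come back up even if the backup fails. The sketch below shows the ordering only. The `stop` and `start` callables and the service names are hypothetical stand-ins for whatever mechanism you use to control services; they are not part of any CDP API.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def services_stopped(
    stop: Callable[[str], None],
    start: Callable[[str], None],
    services: List[str],
) -> Iterator[None]:
    """Stop the given services, yield for the backup, then restart them.

    Only services that were actually stopped are restarted, in reverse order,
    and the restart runs even if the body raises.
    """
    stopped: List[str] = []
    try:
        for name in services:
            stop(name)
            stopped.append(name)
        yield
    finally:
        for name in reversed(stopped):
            start(name)


if __name__ == "__main__":
    # Hypothetical service names and print-only stand-ins for real service control.
    with services_stopped(
        stop=lambda name: print(f"stopping {name}"),
        start=lambda name: print(f"starting {name}"),
        services=["hms-on-datahub-1", "hms-on-datahub-2"],
    ):
        print("running Data Lake backup")
```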
Data Lake Restore vs Repair
Data Lake repair handles the replacement or repair of the compute resources for the Data Lake. The Data Lake restore replaces the Data Lake data content: the metadata managed by each of the Data Lake services.
Principal services
- On the Data Lake:
- HMS Services on any attached Data Hub.
Limitations in the backup and restore system
- To avoid mismatches between Data Lake metadata and data used by workloads and mismatches among the metadata stored in the Data Lake, the Data Lake backup system should be run only when principal services (see "Principal services" above) are stopped.
- With CDP Runtime 7.2.1, the initial backup may be written to an S3 bucket. All subsequent backups must be written to the same bucket as the initial backup.
- Backup-restore of the Data Lake to ADLS is not supported.
- Restoring the Data Lake to a point in time in the past is not supported because the data and metadata may be out of sync. If the data and metadata are out of sync, the restore operation will leave the Data Lake in a non-functional state.
- Restoring a backup taken from a different version of CDP Runtime is not supported. For example, restoring a backup taken from CDP Runtime 7.2.1 to a Data Lake with CDP Runtime 7.2.2 is not supported.
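Some of the limitations above lend themselves to a pre-flight check in a wrapper script. The following is a minimal sketch, assuming backup locations are given as URIs and runtime versions as plain strings. The accepted URI schemes, function names, and message wording are assumptions made for illustration; the backup system itself may validate these conditions differently.

```python
from typing import List, Optional
from urllib.parse import urlparse

# Assumed URI schemes; adjust to the forms your environment actually uses.
S3_SCHEMES = {"s3", "s3a", "s3n"}
ADLS_SCHEMES = {"abfs", "abfss"}


def check_backup_location(location: str, initial_location: Optional[str] = None) -> List[str]:
    """Return a list of problems with a proposed backup location (empty if none)."""
    problems: List[str] = []
    parsed = urlparse(location)
    scheme = parsed.scheme.lower()
    if scheme in ADLS_SCHEMES:
        problems.append("backup to ADLS is not supported")
    elif scheme not in S3_SCHEMES:
        problems.append(f"unrecognized location scheme: {scheme!r}")
    if initial_location is not None:
        # For S3-style URIs the bucket name is the network-location part of the URI.
        if urlparse(initial_location).netloc != parsed.netloc:
            problems.append("backup must use the same bucket as the initial backup")
    return problems


def check_restore_version(backup_runtime: str, target_runtime: str) -> List[str]:
    """Return a problem if the backup and target Runtime versions differ."""
    if backup_runtime.strip() != target_runtime.strip():
        return [f"backup is from Runtime {backup_runtime}, target is {target_runtime}"]
    return []


if __name__ == "__main__":
    # Hypothetical locations and versions.
    print(check_backup_location("s3a://other-bucket/backups", "s3a://first-bucket/backups"))
    print(check_restore_version("7.2.1", "7.2.2"))
```

Returning a list of problems rather than raising on the first one lets a script report every violated limitation in a single pass.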