Restoring to a RAZ Data Lake

You can restore a Data Lake backup from a non-RAZ Data Lake to a RAZ Data Lake, or from a RAZ Data Lake to a different RAZ Data Lake.

Best Practices

For best results, it’s recommended to use the same Storage Location Base, Logs Location Base, and Backup Location Base between the source Data Lake and the destination RAZ Data Lake. After the restore, the Ranger policies will be replaced with the policies from the original Data Lake. This means that if the Storage Location Base, Log Location Base, and/or Backup Location Base are different between the source Data Lake and the destination Data Lake, the restored Ranger policies will reference the locations from the source Data Lake.

  • If you intend to use the locations from the source Data Lake, make sure the roles associated with the destination RAZ Data Lake have sufficient permissions to access the original storage locations.

  • If you intend to use the destination RAZ Data Lake locations, the Ranger policies (cm_s3 for AWS, cm_adls for Azure) will need to be updated to reference the correct storage locations after the restore.

Preparing the Data Lakes

Add a Ranger policy to allow the backups to be read from the original backup location. For the restore operation, only write permissions are not required. This must be done on both the original source Data Lake before the backup is taken, and on the RAZ destination Data Lake before the restore is done.

AWS

On the source Data Lake:

  1. Open the Ranger UI.
  2. Go to the cm_s3 policy list.
  3. Add a new policy:
    1. Policy name: restore_to_raz
    2. S3 bucket: The bucket where the original backups were written
    3. Path: The path in the bucket where the original backups were written
  4. Add read permissions for the atlas, hbase, hdfs, and solr users under “Allow Conditions”.
  5. Repeat steps on the destination Data Lake.

Azure

Note: The source Data Lake and destination Data Lake must be configured to use the same Storage Account for the Log, Storage, and Backup Location Bases. They do not have to be configured to use the same Storage Container.

On the source Data Lake:

  1. Open the Ranger UI.
  2. Go to the cm_adls policy list.
  3. Add a new policy:
    1. Policy name: restore_to_raz
    2. Storage Account: The storage account where backups will be stored.
    3. Storage Account Container: The container where backups will be stored.
    4. Path: The path(s) in the bucket where backup will be written.

      Note: If more than one storage container will be used for backup, create a separate policy for each container.

  4. Add read and list permissions for the atlas, hbase, hdfs, and solr users under “Allow Conditions.”
  5. Repeat steps on the destination Data Lake.

Take the backup of the non-RAZ data lake

After you prepare the Data Lakes, see Configure backups for a Data Lake for instructions on running the backup.

Run the restore

After you take the Data Lake backup, see Restore Data Lake content for instructions on running the restore. Use the backup-id from the backup taken in the previous step.

After the restore

If the Storage Location Base, Logs Location Base, and Backup Locations Base are the same between the source Data Lake and the destination Data Lake, this section can be skipped.

After the restore, the Ranger policies will be replaced with the policies from the original Data Lake. This means that if the Storage Location Base, Log Location Base, and/or Backup Location Base are different between the original Data Lake and the RAZ Data Lake, the restored Ranger policies will reference the locations from the original Data Lake.

  • If you intend to use the locations from the original Data Lake, make sure the roles associated with the RAZ Data Lake have sufficient permissions to access the original storage locations.
  • If intending to use the RAZ Data Lake locations, the Ranger policies (cm_s3 for AWS, cm_adls for Azure) will need to be updated to reference the correct storage locations after the restore.