Configuring and running Data Lake restore
Data Lake restore replaces the Data Lake content, that is, the metadata managed by each of the Data Lake services.
You might need to restore a Data Lake in situations such as the following:
- You are migrating Data Lake content to a new environment.
- A Data Lake repair fails.
- You need to delete and re-create a Data Lake.
- An upgrade fails or needs to be rolled back.
The Data Lake restore removes existing data stores and recreates them from the specified backup. This includes dropping database tables, dropping HBase tables, and deleting Solr collections.
The system verifies that no other backup or restore is in progress.
A Data Lake restore involves downtime because some Data Lake services are stopped. Access to the HMS and Ranger databases is also blocked for the duration of the restore. Do not run workloads while a Data Lake restore is in progress.
For AWS:
- Apply the IAM policy for Data Lake restore to the following roles:
  - DATALAKE_ADMIN_ROLE
  - RANGER_AUDIT_ROLE
  - LOG_ROLE
For more information on IAM roles, see Minimal setup for cloud storage.
In the IAM policy for Data Lake restore, be sure to replace the <your-backup-bucket> variable with the backup location used.
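The placeholder substitution described above can be scripted so that no occurrence is missed. The following is a minimal sketch, not the actual restore policy: `POLICY_TEMPLATE` is an illustrative fragment, and `fill_backup_location` and the bucket name `my-backup-bucket` are hypothetical names introduced only for this example. Paste the real IAM policy for Data Lake restore in place of the template.

```python
import json

PLACEHOLDER = "<your-backup-bucket>"

# Illustrative fragment only: NOT the actual IAM policy for Data Lake restore.
# Replace this text with the real policy from the product documentation.
POLICY_TEMPLATE = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<your-backup-bucket>",
        "arn:aws:s3:::<your-backup-bucket>/*"
      ]
    }
  ]
}
"""


def fill_backup_location(template: str, backup_location: str) -> str:
    """Replace every placeholder occurrence and confirm the result is valid JSON."""
    location = backup_location.strip()
    if not location:
        raise ValueError("backup location must not be empty")
    filled = template.replace(PLACEHOLDER, location)
    # json.loads raises json.JSONDecodeError (a ValueError) if the text is malformed.
    json.loads(filled)
    return filled


if __name__ == "__main__":
    # "my-backup-bucket" is a hypothetical backup location.
    print(fill_backup_location(POLICY_TEMPLATE, "my-backup-bucket"))
```

Parsing the result as JSON before applying it catches a policy that was damaged while being copied or edited.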
For Azure:
- Verify that the following identities have the "Storage Blob Data Contributor" role on the container where the backup is stored:
  - Data Lake Admin identity
  - Ranger Audit Logger identity
For GCP:
Verify that the Logger Service account has the following required permissions:
- storage.buckets.get
- storage.objects.create
- storage.objects.get
- storage.objects.list
Verify that the Ranger Audit Service account has the following required permissions:
- resourcemanager.projects.get
- resourcemanager.projects.list
- storage.buckets.get
- storage.objects.get
- storage.objects.getIamPolicy
Note that the Ranger Audit service account permissions listed above should be granted to a custom role, not the default Storage Object Admin role.
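The GCP permission lists above can be checked mechanically against the permissions a role actually grants. The following is a minimal sketch: the required sets are copied from the lists above, while the `missing_permissions` helper and the account keys `logger` and `ranger_audit` are hypothetical names chosen for this example. The granted permissions are supplied by hand, for instance copied from the definition of the custom role.

```python
# Required permissions, copied from the lists above.
REQUIRED_PERMISSIONS = {
    "logger": {
        "storage.buckets.get",
        "storage.objects.create",
        "storage.objects.get",
        "storage.objects.list",
    },
    "ranger_audit": {
        "resourcemanager.projects.get",
        "resourcemanager.projects.list",
        "storage.buckets.get",
        "storage.objects.get",
        "storage.objects.getIamPolicy",
    },
}


def missing_permissions(account: str, granted) -> list:
    """Return the required permissions that are absent from `granted`, sorted."""
    return sorted(REQUIRED_PERMISSIONS[account] - set(granted))


if __name__ == "__main__":
    # Hypothetical input: permissions copied from a role definition.
    granted = ["storage.buckets.get", "storage.objects.get"]
    for permission in missing_permissions("logger", granted):
        print("missing:", permission)
```

An empty result means the role covers every permission listed for that service account. The check does not detect extra permissions beyond the required set.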
For Cloudera Data Warehouse:
If you are a CDW user restoring a Data Lake to a new environment, perform the following steps from the CDW UI before running a Data Lake restore:
- Record details of your Database Catalogs and Virtual Warehouses, such as:
  - The number of Virtual Warehouses associated with each Database Catalog, and their names
  - The configurations associated with each Virtual Warehouse and Database Catalog, especially any that were customized
  These configurations are not backed up or restored during the Data Lake backup and restore process. When you activate CDW again in the new CDP environment, you must re-apply the CDW configurations.
- Delete existing Virtual Warehouses and user-created Database Catalogs associated with your CDW environment. The metadata for the Virtual Warehouses associated with the default Database Catalog is preserved. The data for your tables is preserved as long as the cloud storage locations, such as S3 buckets, are intact.
- Deactivate the CDW environment, which deletes the default Database Catalog.
After you perform the Data Lake restore, you can activate the CDW environment from the CDW UI and re-create any Virtual Warehouses.
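The details recorded before the restore can be kept in any form. One option is a small structured inventory saved as JSON, so that Virtual Warehouses and customized configurations can be re-created consistently afterwards. The following is a minimal sketch; every class, field, and value name is hypothetical and does not correspond to any CDW API, and the values are entered by hand from the CDW UI.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class VirtualWarehouse:
    name: str
    # Only the customized configurations need to be recorded.
    custom_configs: Dict[str, str] = field(default_factory=dict)


@dataclass
class DatabaseCatalog:
    name: str
    is_default: bool
    custom_configs: Dict[str, str] = field(default_factory=dict)
    virtual_warehouses: List[VirtualWarehouse] = field(default_factory=list)


def inventory_to_json(catalogs: List[DatabaseCatalog]) -> str:
    """Serialize the recorded catalogs and their warehouses to JSON text."""
    return json.dumps([asdict(c) for c in catalogs], indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical values, typed in manually from the CDW UI.
    catalogs = [
        DatabaseCatalog(
            name="default-catalog",
            is_default=True,
            virtual_warehouses=[
                VirtualWarehouse(name="vw-reporting", custom_configs={"example.setting": "42"})
            ],
        )
    ]
    print(inventory_to_json(catalogs))
```

The number of Virtual Warehouses per Database Catalog follows from the length of each `virtual_warehouses` list, so it does not need a separate field.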
Important considerations regarding CDW metadata and data after Data Lake restore:
For Virtual Warehouses associated with the default Database Catalog:
- Metadata such as databases, tables, and views is restored.
- Data associated with existing tables remains visible as long as the cloud storage locations, such as objects in S3 buckets, were not deleted or modified after steps 2 and 3 above (deleting the Virtual Warehouses and Database Catalogs, and deactivating the CDW environment).
- Query history and saved queries in Hue or DAS are not visible.
- Customizations to Virtual Warehouse or Database Catalog configurations are not retained after the restore.
- The Hive/Impala Runtime version in the Virtual Warehouse and Database Catalog is the latest version in line with the corresponding CDW version.
For Virtual Warehouses associated with a non-default Database Catalog:
- No metadata or data is restored or visible.
As part of the restore operation, the RDS database is restored unless it is specifically excluded. The passwords for internal Ranger users (admin, keyadmin, and so on) are stored in RDS, so restoring the database replaces each internal Ranger user's password with the password that was saved for that user when the backup was taken. Users who log in to Ranger through SSO are not affected.
The user accounts impacted include, but are not limited to:
- admin user
- keyadmin user
- tagsync user
- usersync user
After a restore, a user with Ranger admin access can log in to the Ranger UI and update the passwords of these users if needed.