To replicate Hive metadata from an on-premises cluster to the cloud, you must first set the
required policy in Ranger and then create the Hive replication policy in Replication Manager.
To provide access, perform the following steps:
Log in to Ranger Admin UI.
In the Hadoop_SQL section, grant the hdfs user permission
to the "all-database, table, column" policy.
On the Management Console > Replication Manager > Replication Policies page, click Add Policy.
In the Create Replication Policy wizard, select
Hive.
Enter the Hive replication Policy Name and
Description. Click Next.
Select Source Cluster from the drop-down.
Enter the value for Source Databases and
Tables.
You can click the icon to include additional databases and tables.
Enter the value for Source User. Ensure
that the user has the necessary permissions to replicate data.
Click Next.
Select the Destination Data Lake cluster from the
drop-down.
The Warehouse Path and the Hive External Table Base Directory path for
the Data Lake appear. For example, for S3: S3://bucket_name/path
For ABFS:
abfs://qedevnat-filesystem@dmxabfsaccount.dfs.core.windows.net/cc-dmx-7y4aqf/warehouse/tablespace/external/hive
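The two path formats shown above differ in where the storage container is named. As a minimal sketch, assuming you want to sanity-check a displayed path outside the wizard, the following Python splits a path into its parts with the standard library. The function name `describe_storage_path` and the `KNOWN_SCHEMES` set are hypothetical helpers for illustration and are not part of Replication Manager.

```python
from urllib.parse import urlsplit

# Schemes taken from the two examples in this section (assumption: extend as needed).
KNOWN_SCHEMES = {"s3", "abfs"}


def describe_storage_path(path: str) -> dict:
    """Split a cloud storage path into scheme, container, account host, and path."""
    parts = urlsplit(path)
    scheme = parts.scheme  # urlsplit already lowercases the scheme
    if scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unexpected scheme in {path!r}")
    if scheme == "abfs":
        # ABFS paths have the form abfs://<filesystem>@<account host>/<path>.
        filesystem, _, account_host = parts.netloc.partition("@")
        return {
            "scheme": scheme,
            "container": filesystem,
            "account_host": account_host,
            "path": parts.path,
        }
    # S3 paths have the form s3://<bucket>/<path>; there is no account host.
    return {
        "scheme": scheme,
        "container": parts.netloc,
        "account_host": None,
        "path": parts.path,
    }


if __name__ == "__main__":
    print(describe_storage_path("S3://bucket_name/path"))
```

A path with any other scheme raises `ValueError`, which makes a mistyped or unexpected location easy to spot before you continue with the policy.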
Select Cloud Credential from the drop-down.
Enter the Username.
Click Validate Policy.
Replication Manager verifies the source and destination information
that you entered and displays the validation status.
Click Next to schedule the replication policy.
On the Schedule page, choose one of the following options:
Run Now (Default) - The replication policy is immediately
submitted and processed.
Schedule Run - The replication policy can be scheduled to
run at a specified time interval.
In the Repeat field, you can choose one of the following
options:
Does Not Repeat
Custom - In the Custom Recurrence
dialog box, choose the time, date, and the frequency to run the policy.
Click Next.
On the Additional Settings page, enter the values as
necessary:
YARN Queue Name - If you are using Capacity Scheduler
queues to limit resource consumption, enter the name of the YARN queue for the cluster
to which the replication job is submitted. The default value for this field is
default.
Maximum Maps Slots - Use this option to set the maximum
number of map tasks (simultaneous copies) per replication job. The default value is
20.
Maximum Bandwidth - Use this setting to throttle each map task
so that the bandwidth it consumes tends toward the specified value.
The default value is 100 MB per second for each mapper.
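Because the bandwidth limit applies to each mapper, the job-wide ceiling depends on the number of map slots as well. The following Python is a rough sketch of that arithmetic using the defaults stated above. The class `ResourceSettings` and its field names are hypothetical and do not correspond to any Replication Manager API, and the result is only an upper bound that assumes every map slot runs at its cap at the same time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSettings:
    """Hypothetical container for the three resource settings described above."""

    yarn_queue_name: str = "default"         # default queue name from the text
    max_map_slots: int = 20                  # default number of map tasks per job
    max_bandwidth_mb_per_mapper: int = 100   # default per-mapper limit in MB/s

    def bandwidth_ceiling_mb_per_s(self) -> int:
        """Upper bound on job bandwidth if all mappers run at their cap."""
        return self.max_map_slots * self.max_bandwidth_mb_per_mapper


if __name__ == "__main__":
    defaults = ResourceSettings()
    print(defaults.bandwidth_ceiling_mb_per_s())  # 20 * 100 = 2000
```

With the defaults, the ceiling is 2000 MB per second for one replication job. If the link between the source cluster and the cloud is slower than that, lowering either value reduces the load the job can place on it.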
Choose one of the following Sentry permission options:
Include Sentry Permissions with Metadata - Select this
option to migrate Sentry permissions during the replication job.
Exclude Sentry Permissions from Metadata (Default) - Select
this option if you do not want to migrate Sentry permissions during the replication job.
Skip URI Privileges - Select this option if you do not want
to include URI privileges when you migrate Sentry permissions. During migration, the URI
privileges are translated to point to an equivalent location in S3. If the resources
have a different location in Amazon S3, do not migrate the URI privileges because the
URI privileges might not be valid.
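To decide whether to select Skip URI Privileges, it can help to compare where the translated URIs would point with where the data actually resides. The following Python is an illustrative sketch only: it assumes that "equivalent location" means the same relative path under an S3 base location, which this section does not confirm. The function names, the `s3a://example-bucket/base` location, and the `hdfs://nameservice1` URI are all hypothetical.

```python
from urllib.parse import urlsplit


def translate_uri_to_s3(source_uri: str, s3_base: str) -> str:
    """Map a source URI to the same relative path under an S3 base location.

    Assumption: the translation keeps the source path unchanged.
    """
    relative = urlsplit(source_uri).path.strip("/")
    base = s3_base.rstrip("/")
    return base if not relative else f"{base}/{relative}"


def uri_privileges_still_valid(source_uris, s3_base, actual_locations) -> bool:
    """Return True only if every translated URI matches a real data location."""
    actual = {location.rstrip("/") for location in actual_locations}
    return all(translate_uri_to_s3(uri, s3_base) in actual for uri in source_uris)


if __name__ == "__main__":
    uris = ["hdfs://nameservice1/warehouse/sales"]
    base = "s3a://example-bucket/base"  # hypothetical destination location
    print(translate_uri_to_s3(uris[0], base))
    # Data that was placed elsewhere in S3 makes the translated privilege invalid.
    print(uri_privileges_still_valid(uris, base, ["s3a://example-bucket/other/sales"]))
```

When the check returns `False` for your own list of locations, that matches the case described above in which the URI privileges should not be migrated.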
Click Create.
After the replication policy is created successfully, view
the status of its replication job on the Replication Policies
page. Verify that the job starts and runs as expected.