Creating a Hive replication policy

To replicate Hive metadata from an on-premises cluster to the cloud, you must first set the required policy in Ranger and then create the Hive replication policy in Replication Manager.

To provide access, perform the following steps:
  1. Log in to Ranger Admin UI.
  2. In the Hadoop_SQL section, grant the hdfs user permission to "all-database, table, column".

To create the Hive replication policy, perform the following steps:
  1. On the Management Console > Replication Manager > Replication Policies page, click Add Policy.
  2. In the Create Replication Policy wizard, select Hive.
  3. Enter the Hive replication Policy Name and Description. Click Next.
  4. Select Source Cluster from the drop-down.
  5. Enter the value for Source Databases and Tables.

    You can click the icon to include additional databases and tables.

  6. Enter the value for Source User. Ensure that the user has the necessary permissions to replicate data.
  7. Click Next.
  8. Select the Destination Data Lake cluster from the drop-down.

    The Warehouse Path and the Hive External Table Base Directory path for the Data Lake appear. For example, for S3: S3://bucket_name/path

    For ABFS: abfs://qedevnat-filesystem@dmxabfsaccount.dfs.core.windows.net/cc-dmx-7y4aqf/warehouse/tablespace/external/hive

  9. Select Cloud Credential from the drop-down.
  10. Enter the Username.
  11. Click Validate Policy.
    Replication Manager validates the source and destination information and displays the validation status.
  12. Click Next to schedule the replication policy.
  13. On the Schedule page, choose one of the following options:
    • Run Now (Default) - The replication policy is immediately submitted and processed.
    • Schedule Run - The replication policy is scheduled to run at a specified time interval.
  14. In the Repeat field, you can choose one of the following options:
    • Does Not Repeat
    • Custom - In the Custom Recurrence dialog box, choose the time, date, and the frequency to run the policy.
  15. Click Next.
  16. On the Additional Settings page, enter the values as necessary:
    • YARN Queue Name - If you are using Capacity Scheduler queues to limit resource consumption, enter the name of the YARN queue for the cluster to which the replication job is submitted. The default value for this field is default.
    • Maximum Maps Slots - Use this option to set the maximum number of map tasks (simultaneous copies) per replication job. The default value is 20.
    • Maximum Bandwidth - Use this option to throttle each map task to the specified bandwidth, so that the net bandwidth used tends toward the specified value. The default value is 100 MB per second for each mapper.
  17. Choose one of the following Sentry permissions:
    • Include Sentry Permissions with Metadata - Select this option to migrate Sentry permissions during the replication job.
    • Exclude Sentry Permissions from Metadata (Default) - Select this option if you do not want to migrate Sentry permissions during the replication job.
    • Skip URI Privileges - Select this option if you do not want to include URI privileges when you migrate Sentry permissions. During migration, the URI privileges are translated to point to an equivalent location in S3. If the resources have a different location in Amazon S3, do not migrate the URI privileges because the URI privileges might not be valid.
  18. Click Create.
After the replication policy is created successfully, view the status of the replication job on the Replication Policies page. Verify that the job starts and runs as expected.
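
The Maximum Maps Slots and Maximum Bandwidth settings interact. Assuming the bandwidth limit applies to each map task, as the default description suggests, the product of the two values gives a rough upper bound on the bandwidth a single replication job can consume. The following Python sketch illustrates that arithmetic with the defaults listed above. The AdditionalSettings class and its field names are hypothetical and exist only for this illustration; they are not part of Replication Manager or any Cloudera API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdditionalSettings:
    """Hypothetical container mirroring the Additional Settings fields."""

    yarn_queue_name: str = "default"    # default queue named in the procedure
    max_map_slots: int = 20             # default number of simultaneous map tasks
    max_bandwidth_mb_per_s: int = 100   # default throttle for each mapper, in MB/s

    def aggregate_bandwidth_ceiling(self) -> int:
        """Return the upper bound in MB/s if every map task runs at its limit.

        Assumption: the bandwidth limit is enforced per map task, so the
        job-wide ceiling is the number of map tasks times the per-task limit.
        """
        return self.max_map_slots * self.max_bandwidth_mb_per_s


# With the documented defaults: 20 map tasks at 100 MB/s each.
defaults = AdditionalSettings()
print(defaults.aggregate_bandwidth_ceiling())

# A more conservative, purely illustrative choice for a shared network link.
throttled = AdditionalSettings(max_map_slots=5, max_bandwidth_mb_per_s=40)
print(throttled.aggregate_bandwidth_ceiling())
```

With the defaults, the ceiling works out to 2000 MB per second, which may exceed what the link between the source cluster and the cloud can carry. Lowering either value reduces the ceiling proportionally; actual throughput also depends on the network and storage, so treat the result as an estimate only.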