Migrating Hive Workloads to Cloudera Private Cloud

Setting up the HDP cluster

After replicating one or more databases, and before you verify the replication, you set up the HDP cluster. Setup requires stopping ETL jobs on the HDP source cluster.

Of course, before you can verify a replication, you must have completed one. The first cycle in replicating data is called the bootstrap, and the following cycles are called incremental replications. For each database, a completed replication consists of one bootstrap and at least one incremental replication cycle.
  • Run at least one incremental replication for a database before attempting verification.
  • In the CDP cluster, find the dump directory path using the following query:
    select * from sys.replication_metrics
    where policy_name='<policy name>'
    order by scheduled_execution_id desc limit 1;
  • Find and copy the external table paths listed in the _file_list_external file under the CDP dump directory path. You will use these paths to set up Ranger policies in Ambari.
  1. On the HDP source cluster, stop all ETL jobs.
  2. In Ambari > Ranger Admin > Service Manager > Hive policies, add a Deny policy (no writes) for all users including ‘hive’ on all databases: Database *, Table *, Hive column *
    You need only one policy to deny any writes to managed tables or any access to any external tables.
  3. In Ambari > Ranger Admin > Service Manager > HDFS policies, add a Ranger Deny policy for all external table paths.
  4. In Resource Path, paste the external table paths you copied from the _file_list_external file in the CDP dump directory path.
    You can add single or multiple policies for all the external table paths in all the databases.
  5. Disable the StatsUpdaterThread background thread by setting the hive.metastore.stats.auto.analyze property to none.
  6. Disable the PartitionManagementTask background thread by setting the metastore.partition.management.database.pattern property to ^*.
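
The two property changes in steps 5 and 6 can be pictured as Hadoop-style configuration entries. The fragment below is only a sketch: the property names and values are copied from the steps above, but where you enter them (for example, a custom hive-site section in Ambari) is an assumption, so check it against your cluster before applying anything.

```xml
<!-- Sketch of the overrides from steps 5 and 6.
     Assumption: the properties are applied to the Hive Metastore
     configuration (hive-site) on the HDP source cluster. -->
<property>
  <!-- Step 5: stops StatsUpdaterThread from analyzing tables automatically -->
  <name>hive.metastore.stats.auto.analyze</name>
  <value>none</value>
</property>
<property>
  <!-- Step 6: value taken verbatim from the step above -->
  <name>metastore.partition.management.database.pattern</name>
  <value>^*</value>
</property>
```

If you manage the configuration through Ambari, you would typically enter each name and value as a key and value pair rather than editing XML directly, and then restart the affected Hive services for the change to take effect.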