Creating a Hive replication policy

This step-by-step procedure shows you how to use the Hive scheduler to schedule replication policies.

  • You have set up the environment and the mandatory configurations, and you have reviewed the Hive configuration properties for Hive replication.
  • You have Admin privileges.
  • You have Ranger permissions to create or alter the queries you schedule.
  • As a hive user, you have write access to the replRoot1 directory shown in the example below.
In this task you run scheduled queries to dump and load data. You can configure the following options when you run the scheduled query:
  • Policy Name - Set as the query name in the create scheduled query statement
  • Source - Set in the dump or load part of the create scheduled query statement
  • Destination - Set in the load part of the create scheduled query statement
  • Destination Staging Path - Specify 'hive.repl.rootdir'='<PATH>' as part of the with clause. The path can be on either the source or the target cluster.
  • External Table Base Directory - Use the configuration 'hive.repl.replica.external.table.base.dir'='<full path>'
  1. On the source cluster, schedule a query to dump data for replication at regular intervals.
    Use the following syntax:
    create scheduled query repl_policyname every <frequency> as REPL DUMP <DB NAME> with (config_options);         
    For example,
    create scheduled query repl_pol1 every 10 minutes as repl dump source01 with('hive.repl.rootdir' = '/tmp/replRoot1');
  2. On the target cluster, schedule a query to load data at regular intervals.
    Use the following syntax:
    create scheduled query repl_policyname every <frequency> as REPL LOAD <SOURCE DB NAME> into <TARGET DB NAME> with (config_options);
    where config_options are key-value pairs separated by commas (,). For example: 'hive.repl.rootdir' = '/tmp/aa18', 'hive.repl.include.authorization.metadata' = 'true'
  3. Name the policy with a prefix repl_.
    For example,
    create scheduled query repl_dumppol1 every 1 minutes as repl load source01 into target01 with('hive.repl.rootdir'='/tmp/replRoot1');
    This naming convention is recommended because the Hive scheduler is a generic scheduler and is not used only for replication. The repl_ prefix enables filtering of the replication-related schedules, which simplifies management.
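
The following statements are a minimal sketch of how the repl_ prefix can help you manage replication schedules after you create them. They assume that your Hive version exposes the information_schema.scheduled_queries and information_schema.scheduled_executions views and supports alter and drop for scheduled queries. The policy name repl_pol1 is taken from the dump example above. Verify the column names against your cluster before relying on them.

```sql
-- List the schedules whose names start with the replication prefix.
-- Note: in a LIKE pattern, "_" matches any single character, so this
-- pattern also matches names such as "replX...".
select schedule_name, enabled, schedule, next_execution
from information_schema.scheduled_queries
where schedule_name like 'repl_%';

-- Review recent runs of the replication schedules, including any errors.
select schedule_name, state, start_time, end_time, error_message
from information_schema.scheduled_executions
where schedule_name like 'repl_%'
order by start_time desc;

-- Pause a policy, for example during maintenance, and resume it later.
alter scheduled query repl_pol1 disable;
alter scheduled query repl_pol1 enable;

-- Remove a policy that is no longer needed.
drop scheduled query repl_pol1;
```

Disabling a policy keeps its definition so you can enable it again later. Dropping it removes the schedule.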