Recommended Hive configuration parameters for Hive ACID table replication policies

Before you create a Hive ACID table replication policy, you can configure the recommended parameters for optimum performance.

  1. Configure the following properties on the Cloudera Manager > Clusters > Hive-on-Tez Service > Configuration page, and then restart the Hive-on-Tez service:
    1. Configure the maximum number of replication policies that can run concurrently.
      Search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml property, and then enter the hive.scheduled.queries.max.executors parameter and the required value.

      For example, if you set the value to 30, Replication Manager runs a maximum of 30 replication policies at a time.

    2. Configure the connection pool size. Ensure that the value is equal to or higher than the maximum number of concurrent policies that you configured.
      Search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml property, and then enter the datanucleus.connectionPool.maxPoolSize parameter and the required value.
  2. Enable bootstrap load to run DistCp jobs in parallel from a single replication policy. To do this, set hive.exec.parallel to true in the REPL LOAD command on the source cluster, and set the hive.exec.parallel.thread.number parameter to the number of cores at the session level.
    For example, if the number of available cores in the source cluster is 128 and you want the replication policy to run DistCp jobs in parallel, run the following commands:
    set hive.exec.parallel.thread.number=128;
    REPL LOAD [***database name***] FROM [***directory name***] WITH ('hive.exec.parallel'='true');
  3. Preserve owner or user permissions, group permissions, and HDFS ACLs in source and target clusters during replication.
    To preserve the permissions during Hive ACID table replication, append the DistCp command line options in any combination (u for user, g for group, p for permission, and a for ACL) to the distcp.options parameter. Other DistCp command line options that you can use are r for replication number, b for block size, c for checksum type, x for XAttr, and t for timestamp.

    You can use DistCp options only for the DistCp jobs that are initiated by Hive. To preserve the permissions and ACLs, set the DistCp command line options using the WITH clause in the REPL LOAD and REPL DUMP commands.

    For example, to preserve the owner or user permissions, group permissions, and ACLs, run the REPL LOAD [***database name***] FROM [***directory name***] WITH distcp.options.puga command.
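
The settings from steps 2 and 3 can be combined in a single session. The following is a minimal sketch, not a verified procedure. It assumes that the WITH clause accepts a comma-separated list of quoted key-value pairs, and that a DistCp option is passed as a key that starts with the distcp.options. prefix and has an empty value. Replace the placeholders with your database and directory names, and confirm the exact syntax against the Hive version in your cluster before you use it.

```sql
-- Session-level setting: match the thread count to the number of
-- available cores (128 in the example in step 2).
SET hive.exec.parallel.thread.number=128;

-- Bootstrap load with parallel execution enabled.
-- Assumption: the key 'distcp.options.puga' with an empty value asks
-- DistCp to preserve permission (p), user (u), group (g), and ACL (a).
REPL LOAD [***database name***] FROM [***directory name***]
WITH ('hive.exec.parallel'='true', 'distcp.options.puga'='');
```

Keeping both settings in one WITH clause applies them only to this REPL LOAD run, so other sessions and policies are unaffected.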