Recommended Hive configuration parameters for Hive ACID table replication policies

Before you create a Hive ACID table replication policy, you can configure the recommended parameters for optimum performance.

  1. Configure the maximum concurrent policies to run at a time.
    1. Go to the Cloudera Manager > Clusters > Hive-on-Tez Service > Configuration tab.
    2. Search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml property.
    3. Enter hive.scheduled.queries.max.executors, and the required number of concurrent policies.
      For example, if you set the hive.scheduled.queries.max.executors value to 30, Replication Manager runs a maximum of 30 replication policies at a time.
    4. Restart the Hive-on-Tez service.
  2. Configure the connection pool size. Ensure that the value is equal to or higher than the number of configured maximum concurrent policies.
    1. Go to the Cloudera Manager > Clusters > Hive service running the Hive Metastore instance > Configuration tab.
    2. Search for the hive_metastore_config_safety_valve property.
    3. Enter datanucleus.connectionPool.maxPoolSize, and the required number of connections. The default value is 10.
    4. Restart the Hive service.
  3. Enable bootstrap load to run DistCp jobs in parallel from a single replication policy using the REPL LOAD command on the source cluster to set hive.exec.parallel to true, and then set the hive.exec.parallel.thread.number parameter equal to the number of cores at session level.
    For example, if the number of available cores in the source cluster is 128 and you want to run parallel replication policies, run the following commands:
    set hive.exec.parallel.thread.number=128
    REPL LOAD [***DATABASE NAME***] FROM [***DIRECTORY NAME***] WITH ('hive.exec.parallel'='true'')
  4. Preserve owner or user permissions, group permissions, and HDFS ACLs in source and target clusters during replication.
    You can append the DistCp command line options in any combination (u for user, g for group, p for permission, and a for ACL) to the distcp.options command to preserve the permissions during Hive ACID table replication. The other DistCp command line options that you can use are r for replication number, b for block size, c for checksum-type, x for XAttr, and t for timestamp.

    You can use DistCp options only for the DistCp jobs that are initiated by Hive. To preserve the permissions and ACLs, set the DistCp command line options using the WITH clause in the REPL LOAD and REPL DUMP commands.

    For example, to preserve the owner or user permissions, group permissions, and ACLs, run the REPL LOAD [***DATABASE NAME***] FROM [***DIRECTORY NAME***] WITH distcp.options.puga command.