Recommended Hive configuration parameters for Hive ACID table replication policies
Before you create a Hive ACID table replication policy, you can configure the recommended parameters for optimum performance.
-
Configure the following properties on the Cloudera Manager > Clusters > Hive-on-Tez Service > Configuration page, and then restart the Hive-on-Tez service:
-
Configure maximum concurrent policies to run at a time.
Search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml property, enter hive.scheduled.queries.max.executors parameter, and the required value.
For example, if you set the value to 30, Replication Manager runs a maximum of 30 replication policies at a time.
-
Configure the connection pool size. Ensure that the value is equal to
or higher than the number of configured maximum concurrent
policies.
Search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml property, enter datanucleus.connectionPool.maxPoolSize parameter and the required value.
-
Configure maximum concurrent policies to run at a time.
-
Enable bootstrap load to run DistCp jobs in parallel from a single replication
policy using the REPL LOAD command on the source cluster to set
hive.exec.parallel to true, and then set the
hive.exec.parallel.thread.number parameter equal to
the number of cores at session level.
For example, if the number of available cores in the source cluster is 128 and you want to run parallel replication policies, run the following commands:
set hive.exec.parallel.thread.number=128 REPL LOAD [***database name***] FROM [***directory name***] WITH ('hive.exec.parallel'='true'')
-
Preserve owner or user permissions, group permissions, and HDFS ACLs in source
and target clusters during replication.
You can append the DistCp command line options in any combination (u for user, g for group, p for permission, and a for ACL) to the distcp.options command to preserve the permissions during Hive ACID table replication. The other DistCp command line options that you can use are r for replication number, b for block size, c for checksum-type, x for XAttr, and t for timestamp.
You can use DistCp options only for the DistCp jobs that are initiated by Hive. To preserve the permissions and ACLs, set the DistCp command line options using the WITH clause in the REPL LOAD and REPL DUMP commands.
For example, to preserve the owner or user permissions, group permissions, and ACLs, run the REPL LOAD [***database name***] FROM [***directory name***] WITH distcp.options.puga command.