Parameters to optimize Hive ACID table replication performance

To optimize Hive ACID table replication performance, you can configure Hive configuration parameters.

hive.repl.retry.initial.delay
Configure the first retry delay in seconds.

The default value is 60 seconds.

hive.repl.retry.backoff.coefficient
Configure the exponential delay between retries. (Previous Delay) * (Backoff Coefficient) determines the next retry interval.

The default value is 1.2.

hive.repl.retry.jitter
Configure the random jitter to avoid all retries happening at the same time.

The default value is 30 seconds.

hive.repl.retry.max.delay.between.retries
Configure the maximum allowed retry delay in minutes after including exponential backoff.

The default value is 60 minutes.

hive.repl.retry.total.duration
Configure the total allowed retry duration in hours which is inclusive of all retries. Once this is exhausted, the policy instance is marked as failed and needs manual intervention to restart.

The default value is 24 hrs.

hive.repl.approx.max.load.tasks
Configure an approximate maximum number of tasks to run before the next set of tasks is dynamically generated. This is an approximate value because Hive stops at a slightly higher number as some events lead to a task increment that might cross the specified limit.

The default value is 10000.

hive.repl.partitions.dump.parallelism
Configure the number of threads to dump partition data information during repl dump.

The default value is 100.

hive.repl.run.data.copy.tasks.on.target
Configure the parameter to true so that replication runs the data copy tasks during the repl load operation.

The default value is true.

hive.repl.file.list.cache.size
Configure the threshold for the maximum number of data copy locations to be kept in memory. When the hive.repl.run.data.copy.tasks.on.target parameter is set to true, this parameter is not considered.

The default value is 10000.

hive.repl.load.partitions.batch.size
Configure the maximum number of partitions of a table to batch together during a replication load. All the partitions in a batch makes a single metastore call to update the metadata. The data for these partitions is copied before the metadata batch is copied.

The default value is 10000.

hive.exec.copyfile.maxnumfiles
Configure the maximum number of files that Hive uses to perform sequential HDFS copies between directories. To increase the copy speed for a large number of files, distributed copies (distcp) are used.

The default value is 1L.

hive.exec.copyfile.maxsize
Configure the maximum file size in bytes that Hive uses to perform single HDFS copies between directories. To increase the copy speed for bigger files, distributed copies (distcp) are used.

The default value is 32L * 1024 * 1024.

hive.exec.parallel.thread.number
Maximum number of Hive ACID table replication policies that can run in parallel. The maximum number of parallel policies is equal to the number of available cores in the source cluster. Set this property at session level.
Before you set this value, configure the hive.exec.parallel parameter to true by running the REPL LOAD command using the WITH clause.