Parameters to optimize Hive ACID table replication performance
To optimize Hive ACID table replication performance, you can configure Hive configuration parameters.
hive.repl.retry.initial.delay
- Configure the first retry delay in seconds.
The default value is 60 seconds.
hive.repl.retry.backoff.coefficient
- Configure the exponential delay between retries.
(Previous Delay)
*(Backoff Coefficient)
determines the next retry interval.The default value is 1.2.
hive.repl.retry.jitter
- Configure the random jitter to avoid all retries happening at the same
time.
The default value is 30 seconds.
hive.repl.retry.max.delay.between.retries
- Configure the maximum allowed retry delay in minutes after including
exponential backoff.
The default value is 60 minutes.
hive.repl.retry.total.duration
- Configure the total allowed retry duration in hours which is inclusive of
all retries. Once this is exhausted, the policy instance is marked as failed
and needs manual intervention to restart.
The default value is 24 hrs.
hive.repl.approx.max.load.tasks
- Configure an approximate maximum number of tasks to run before the next set
of tasks is dynamically generated. This is an approximate value because Hive
stops at a slightly higher number as some events lead to a task increment
that might cross the specified limit.
The default value is 10000.
hive.repl.partitions.dump.parallelism
- Configure the number of threads to dump partition data information during
repl dump.
The default value is 100.
hive.repl.run.data.copy.tasks.on.target
- Configure the parameter to true so that replication runs the data copy tasks
during the repl load operation.
The default value is
true
. hive.repl.file.list.cache.size
- Configure the threshold for the maximum number of data copy locations to be
kept in memory. When the
hive.repl.run.data.copy.tasks.on.target
parameter is set totrue
, this parameter is not considered.The default value is 10000.
hive.repl.load.partitions.batch.size
- Configure the maximum number of partitions of a table to batch together
during a replication load. All the partitions in a batch makes a single
metastore call to update the metadata. The data for these partitions is
copied before the metadata batch is copied.
The default value is 10000.
hive.exec.copyfile.maxnumfiles
- Configure the maximum number of files that Hive uses to perform sequential
HDFS copies between directories. To increase the copy speed for a large
number of files, distributed copies (
distcp
) are used.The default value is 1L.
hive.exec.copyfile.maxsize
- Configure the maximum file size in bytes that Hive uses to perform single
HDFS copies between directories. To increase the copy speed for bigger
files, distributed copies (
distcp
) are used.The default value is 32L * 1024 * 1024.
- hive.exec.parallel.thread.number
- Maximum number of Hive ACID table replication policies that can run in parallel. The maximum number of parallel policies is equal to the number of available cores in the source cluster. Set this property at session level.