Replication tuning properties

You need to understand the Cloudera Manager properties for tuning Hive replication.

Table 1. Property Description
Parameter Description
hive.repl.retry.initial.delay

First retry delay in seconds.

The default value is 60 seconds.

hive.repl.retry.backoff.coefficient

Multiplier for the exponential delay between retries. The next retry interval is (previous delay) * (backoff coefficient).

The default value is 1.2.

hive.repl.retry.jitter

A random jitter applied to each retry delay so that retries do not all happen at the same time.

The default value is 30 seconds.

hive.repl.retry.max.delay.between.retries

Maximum allowed delay between retries, in seconds, after exponential backoff is applied. Once this limit is reached, subsequent retries continue at this delay.

The default value is 60 minutes (3600 seconds).

hive.repl.retry.total.duration

Total allowed retry duration in seconds, inclusive of all retries. Once this duration is exhausted, the policy instance is marked as failed and requires manual intervention to restart.

The default value is 24 hours (86400 seconds).
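
The retry properties above combine into a delay schedule. The following Python sketch shows one plausible reading of them. The function name retry_delays is hypothetical, and the way jitter is applied (a uniform random addition between 0 and the jitter value, before the cap) is an assumption, not a description of Hive's implementation.

```python
import random


def retry_delays(initial_delay=60.0, backoff=1.2, jitter=0.0,
                 max_delay=3600.0, total_duration=86400.0, rng=None):
    """Return the waits (in seconds) before each retry.

    Defaults mirror the documented defaults, except jitter, which is 0
    here so that the schedule is deterministic unless jitter is requested.
    """
    rng = rng or random.Random()
    delays = []
    elapsed = 0.0
    base = initial_delay
    while True:
        # Assumption: jitter is a uniform random addition in [0, jitter].
        wait = min(base + rng.uniform(0.0, jitter), max_delay)
        if elapsed + wait > total_duration:
            break  # total retry budget exhausted; the policy would be marked failed
        delays.append(wait)
        elapsed += wait
        # Next interval = (previous delay) * (backoff coefficient), capped at max_delay.
        base = min(base * backoff, max_delay)
    return delays


schedule = retry_delays()
print(schedule[:3])
```

Under these assumptions the delay grows by 20% per retry until it reaches the one-hour cap, and retries stop once the accumulated wait would exceed the total duration.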

hive.repl.approx.max.load.tasks

Approximate maximum number of tasks to execute before dynamically generating the next set of tasks. The number is approximate because Hive might stop at a slightly higher number: some events can cause a task increment that crosses the specified limit.

The default value is 10000.

hive.repl.partitions.dump.parallelism

Number of threads used to dump partition data information during REPL DUMP.

The default value is 100.

hive.repl.run.data.copy.tasks.on.target

Indicates whether replication should run data copy tasks during the REPL LOAD operation.

The default value is true.

hive.repl.file.list.cache.size

Threshold for the maximum number of data copy locations to keep in memory. This parameter is ignored when the hive.repl.run.data.copy.tasks.on.target parameter is set to true.

The default value is 10000.

hive.repl.load.partitions.batch.size

Maximum number of partitions of a table that are batched together during replication load. All the partitions in a batch are updated with a single metastore call. The data for these partitions is copied before the metadata batch is copied.

The default value is 10000.
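
As a rough illustration of the batching described above, the following Python sketch estimates how many metastore calls a table load needs. It assumes exactly one metastore call per batch, as the description states. The function name metastore_batches is hypothetical.

```python
def metastore_batches(partition_count, batch_size=10000):
    """Estimate the metastore calls needed to load a table's partitions.

    Assumes one metastore call per batch of at most batch_size partitions.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if partition_count < 0:
        raise ValueError("partition_count must not be negative")
    # Integer ceiling division: a partial final batch still costs one call.
    return (partition_count + batch_size - 1) // batch_size


print(metastore_batches(25000))
```

A smaller batch size means more metastore calls but less metadata per call, so the value is a trade-off to tune against the partition counts in your tables.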

hive.exec.copyfile.maxnumfiles

Maximum number of files that Hive copies sequentially between HDFS directories. For larger numbers of files, distributed copy (distcp) is used instead so that copies complete faster.

The default value is 1.

hive.exec.copyfile.maxsize

Maximum file size, in bytes, that Hive copies with a single HDFS copy between directories. For larger files, distributed copy (distcp) is used instead so that copies complete faster.

The default value is 32 MB (32 * 1024 * 1024 bytes).
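
The two copy thresholds can be checked as in the Python sketch below. The constant and function names are hypothetical. Each threshold is evaluated on its own because the descriptions above do not say how Hive combines the two checks when choosing between a sequential copy and distcp.

```python
# Documented defaults for the two thresholds.
DEFAULT_MAX_NUM_FILES = 1
DEFAULT_MAX_SIZE_BYTES = 32 * 1024 * 1024  # 32 MB expressed in bytes


def exceeds_file_count(num_files, max_num_files=DEFAULT_MAX_NUM_FILES):
    """True when the file count is above hive.exec.copyfile.maxnumfiles."""
    return num_files > max_num_files


def exceeds_size(size_bytes, max_size_bytes=DEFAULT_MAX_SIZE_BYTES):
    """True when the size in bytes is above hive.exec.copyfile.maxsize."""
    return size_bytes > max_size_bytes


print(DEFAULT_MAX_SIZE_BYTES)
print(exceeds_file_count(2), exceeds_size(10 * 1024 * 1024))
```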

hive.exec.parallel.thread.number

Maximum number of Hive replication policies that can run in parallel. The maximum number of parallel policies is equal to the number of available cores in the source cluster. Set this property at the session level.

Before you set this value, set the hive.exec.parallel parameter to true in the WITH clause of the REPL LOAD command.
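
The following Python sketch renders a WITH clause that carries both settings. The helper name with_clause is hypothetical and the thread count of 8 is an illustrative value only. Only the WITH clause is produced because the rest of the REPL LOAD statement depends on your Hive version and database names.

```python
def with_clause(properties):
    """Render a mapping of Hive properties as a REPL WITH clause.

    Keys and values are emitted as single-quoted strings. No escaping is
    done, so this sketch assumes they contain no quote characters.
    """
    pairs = ", ".join(f"'{key}'='{value}'" for key, value in properties.items())
    return f"WITH ({pairs})"


clause = with_clause({
    "hive.exec.parallel": "true",
    "hive.exec.parallel.thread.number": "8",  # illustrative value
})
print(clause)
```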