Statistics replication
Basic statistics such as the number of rows of a table or partition and the column statistics such as histograms (min, max, count) of a particular interesting column are important in many ways.
One of the key use cases of statistics is query optimization.
Hive supports Cost Based Optimizer (CBO) which primarily depends on column
statistics to optimize the query execution plan. For example, if the min and max values
of an integer type column “c” in a partition “p” is min=10 and max=50,
then a query with predicate such as c < 10 or c > 50 or anything that
does not fall under the range, shall avoid scanning the partition “p”. For
transactional tables, column statistics are supported only if
hive.txn.stats.enabled=true
.
Currently, there are two modes to compute statistics:
- ANALYZE command:
analyze table t [partition p] compute statistics for [columns c,...];
- Hive automatically computes the statistics
when
hive.stats.autogather=true
for basic statistics andhive.stats.column.autogather=true
for column statistics.
As statistics are key for query optimization, when table/partition data is replicated to target cluster, it is important to replicate the statistics as well which would speed up the queries running on target clusters. One of the key requirement for statistics replication to work accurately is it’s consistency with current dataset in a table or partition. Hive replication can achieve it with point-in time consistent incremental replication model. If any database/table is bootstrapped, then corresponding basic and columns statistics (if present) would also be bootstrapped.
As Hive replication takes care of replicating statistics to target cluster, it
is recommended to disable
hive.stats.autogather
andhive.stats.column.autogather
at the target cluster. If it is enabled, it would cause additional computation cost in
the target cluster.
Important | |
---|---|
For existing policies, any statistics gathered after deploying DLM
1.5.0 will be replicated to the target cluster. Any statistics gathered before DLM 1.5.0
deployment will not be replicated to the target cluster. For any new policies created
using DLM 1.5.0 release onwards, statistics is replicated to the target cluster.
|