StatsRulesProcFactory.GroupByStatsRule (Hive Query Language 0.13.0.2.1.2.0-402 API)

org.apache.hadoop.hive.ql.optimizer.stats.annotation
Class StatsRulesProcFactory.GroupByStatsRule

java.lang.Object
  org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory.DefaultStatsRule
      org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory.GroupByStatsRule

All Implemented Interfaces: NodeProcessor

Enclosing class: StatsRulesProcFactory

public static class StatsRulesProcFactory.GroupByStatsRule
extends StatsRulesProcFactory.DefaultStatsRule
implements NodeProcessor

The GROUPBY operator changes the number of rows. Depending on the aggregation, the number of rows emitted by the GBY operator is at least 1 and at most T(R) (the number of rows in relation R). A better estimate can be found if column statistics are available for the grouping columns.

Suppose we are grouping by attributes A, B, C and statistics for columns A, B, C are available. A better estimate is then the smaller of V(R,[A,B,C]) (the product of the distinct cardinalities of A, B, C) and T(R)/2.

T(R) = min(T(R)/2, V(R,[A,B,C])) ---> [1]
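Rule [1] can be illustrated with a minimal sketch. This is not the Hive implementation: the class name `GroupByRowEstimate`, the method `estimate`, the overflow guard, and the clamp to a minimum of one row are assumptions made for illustration. Passing no distinct-value counts models the case where column statistics are unavailable, in which the estimate falls back to T(R)/2.

```java
// Hypothetical helper illustrating rule [1]; not part of the Hive API.
final class GroupByRowEstimate {

    /**
     * Estimates GROUP BY output rows as min(T(R)/2, V(R,[cols])).
     *
     * @param numRows T(R), the number of input rows
     * @param ndvs    distinct-value counts of the grouping columns;
     *                empty means no column statistics are available
     */
    static long estimate(long numRows, long... ndvs) {
        long half = numRows / 2;
        if (ndvs.length == 0) {
            // No column statistics: fall back to the T(R)/2 heuristic.
            return Math.max(1L, half);
        }
        long product = 1L;
        for (long ndv : ndvs) {
            long v = Math.max(1L, ndv);
            // Assumption: saturate instead of overflowing on very large products.
            if (product > Long.MAX_VALUE / v) {
                product = Long.MAX_VALUE;
                break;
            }
            product *= v;
        }
        // Assumption: a GROUP BY is treated as emitting at least one row.
        return Math.max(1L, Math.min(half, product));
    }

    public static void main(String[] args) {
        // T(R) = 1000, V(A) = 10, V(B) = 5, V(C) = 4: the product 200 is below T(R)/2 = 500.
        System.out.println(estimate(1000L, 10L, 5L, 4L));
        // Without column statistics the estimate is T(R)/2.
        System.out.println(estimate(1000L));
    }
}
```

With T(R) = 1000 and distinct counts 10, 5 and 4, the product 200 is smaller than T(R)/2 = 500, so the estimate is 200. With distinct counts 100 and 50 the product is 5000, and the estimate is capped at 500.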

In the presence of grouping sets, the map-side GBY emits more rows, depending on the size of the grouping set (input rows * size of grouping set). Map-side hash aggregation then reduces this number. Hash aggregation is an optimization in Hive that reduces the number of rows shuffled between the map and reduce stages. It is disabled if the memory used for hash aggregation exceeds 90% of the maximum memory available for hash aggregation. The number of rows emitted from the map side therefore depends on whether hash aggregation stays enabled throughout execution or is disabled. In the presence of grouping sets, the following rules are applied:

If hash aggregation is enabled, then for the query SELECT * FROM table GROUP BY (A,B) WITH CUBE:

T(R) = min(T(R)/2, T(R, GBY(A,B)) + T(R, GBY(A)) + T(R, GBY(B)) + 1)

where GBY(A,B), GBY(A) and GBY(B) are each estimated with the GBY rule [1] above.

If hash aggregation is disabled, apply the GBY rule [1] and then multiply the result by the number of elements in the grouping set: T(R) = T(R) * length_of_grouping_set. Since it is not known at compile time whether hash aggregation will be enabled or disabled, the worst case is assumed, i.e., hash aggregation is disabled.
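The two grouping-set cases can be compared with a small sketch. This is not the Hive implementation: the class `GroupingSetsRowEstimate` and its methods are hypothetical names. It assumes that CUBE over (A,B) expands to the four grouping sets (A,B), (A), (B) and (), and that the trailing "+ 1" in the formula corresponds to the empty grouping set, which produces a single grand-total row.

```java
// Hypothetical helper comparing the two grouping-set estimates; not part of the Hive API.
final class GroupingSetsRowEstimate {

    /** Rule [1]: min(T(R)/2, product of distinct-value counts), at least 1. */
    static long ruleOne(long numRows, long... ndvs) {
        long product = 1L;
        for (long ndv : ndvs) {
            product *= Math.max(1L, ndv);
        }
        return Math.max(1L, Math.min(numRows / 2, product));
    }

    /**
     * Hash aggregation enabled: sum rule [1] over every grouping set and cap
     * the total at T(R)/2. An empty grouping set contributes exactly one row.
     */
    static long withHashAggregation(long numRows, long[][] groupingSets) {
        long sum = 0L;
        for (long[] set : groupingSets) {
            sum += (set.length == 0) ? 1L : ruleOne(numRows, set);
        }
        return Math.min(numRows / 2, sum);
    }

    /**
     * Hash aggregation disabled (the assumed worst case): rule [1] over all
     * grouping columns, multiplied by the number of grouping sets.
     */
    static long withoutHashAggregation(long numRows, long[] allNdvs, int numGroupingSets) {
        return ruleOne(numRows, allNdvs) * numGroupingSets;
    }

    public static void main(String[] args) {
        long numRows = 1000L;            // T(R)
        long ndvA = 10L, ndvB = 5L;      // V(R,A), V(R,B)
        // Grouping sets produced by CUBE over (A,B): (A,B), (A), (B), ().
        long[][] cube = { {ndvA, ndvB}, {ndvA}, {ndvB}, {} };
        System.out.println(withHashAggregation(numRows, cube));
        System.out.println(withoutHashAggregation(numRows, new long[] {ndvA, ndvB}, cube.length));
    }
}
```

For T(R) = 1000, V(R,A) = 10 and V(R,B) = 5, the hash-aggregation estimate is min(500, 50 + 10 + 5 + 1) = 66, while the worst-case estimate is 50 * 4 = 200.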

NOTE: The number of rows from the map-side GBY operator depends on map-side parallelism, i.e., the number of mappers. The map-side parallelism is read from the Hive config "hive.stats.map.parallelism". If the config is not set, a default parallelism of 1 is assumed.

Worst case: If no column statistics are available, then T(R) = T(R)/2 is used as a heuristic.

For more information, refer to the 'Estimating The Cost Of Operations' chapter in "Database Systems: The Complete Book" by Garcia-Molina et al.

Constructor Summary
`StatsRulesProcFactory.GroupByStatsRule()`

Method Summary
`Object`	`process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx, Object... nodeOutputs)` Generic process for all ops that don't have specific implementations.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail