|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory.DefaultStatsRule org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory.GroupByStatsRule
public static class StatsRulesProcFactory.GroupByStatsRule
GROUPBY operator changes the number of rows. The number of rows emitted by GBY operator will be atleast 1 or utmost T(R) (number of rows in relation T) based on the aggregation. A better estimate can be found if we have column statistics on the columns that we are grouping on.
Suppose if we are grouping by attributes A,B,C and if statistics for columns A,B,C are available then a better estimate can be found by taking the smaller of product of V(R,[A,B,C]) (product of distinct cardinalities of A,B,C) and T(R)/2.
T(R) = min (T(R)/2 , V(R,[A,B,C]) ---> [1]
In the presence of grouping sets, map-side GBY will emit more rows depending on the size of grouping set (input rows * size of grouping set). These rows will get reduced because of map-side hash aggregation. Hash aggregation is an optimization in hive to reduce the number of rows shuffled between map and reduce stage. This optimization will be disabled if the memory used for hash aggregation exceeds 90% of max available memory for hash aggregation. The number of rows emitted from map-side will vary if hash aggregation is enabled throughout execution or disabled. In the presence of grouping sets, following rules will be applied
If hash-aggregation is enabled, for query SELECT * FROM table GROUP BY (A,B) WITH CUBE
T(R) = min(T(R)/2, T(R, GBY(A,B)) + T(R, GBY(A)) + T(R, GBY(B)) + 1))
where, GBY(A,B), GBY(B), GBY(B) are the GBY rules mentioned above [1]
If hash-aggregation is disabled, apply the GBY rule [1] and then multiply the result by number of elements in grouping set T(R) = T(R) * length_of_grouping_set. Since we do not know if hash-aggregation is enabled or disabled during compile time, we will assume worst-case i.e, hash-aggregation is disabled
NOTE: The number of rows from map-side GBY operator is dependent on map-side parallelism i.e, number of mappers. The map-side parallelism is expected from hive config "hive.stats.map.parallelism". If the config is not set then default parallelism of 1 will be assumed.
Worst case: If no column statistics are available, then T(R) = T(R)/2 will be used as heuristics.
For more information, refer 'Estimating The Cost Of Operations' chapter in "Database Systems: The Complete Book" by Garcia-Molina et. al.
Constructor Summary | |
---|---|
StatsRulesProcFactory.GroupByStatsRule()
|
Method Summary | |
---|---|
Object |
process(Node nd,
Stack<Node> stack,
NodeProcessorCtx procCtx,
Object... nodeOutputs)
Generic process for all ops that don't have specific implementations. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public StatsRulesProcFactory.GroupByStatsRule()
Method Detail |
---|
public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx, Object... nodeOutputs) throws SemanticException
NodeProcessor
process
in interface NodeProcessor
process
in class StatsRulesProcFactory.DefaultStatsRule
nd
- operator to processprocCtx
- operator processor contextnodeOutputs
- A variable argument list of outputs from other nodes in the walk
SemanticException
|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |