org.apache.hadoop.hive.ql.optimizer.stats.annotation
Class StatsRulesProcFactory.GroupByStatsRule

java.lang.Object
  extended by org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory.DefaultStatsRule
      extended by org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory.GroupByStatsRule
All Implemented Interfaces:
NodeProcessor
Enclosing class:
StatsRulesProcFactory

public static class StatsRulesProcFactory.GroupByStatsRule
extends StatsRulesProcFactory.DefaultStatsRule
implements NodeProcessor

The GROUPBY operator changes the number of rows. Depending on the aggregation, the number of rows emitted by the GBY operator is at least 1 and at most T(R) (the number of rows in relation R). A better estimate can be found if column statistics are available for the columns being grouped on.

If we group by attributes A, B, C and statistics for columns A, B, C are available, a better estimate is the smaller of V(R,[A,B,C]) (the product of the distinct cardinalities of A, B, C) and T(R)/2.

T(R) = min(T(R)/2, V(R,[A,B,C])) ---> [1]
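Rule [1] can be sketched in a few lines of Java. This is an illustrative sketch only, not the Hive implementation: the class GroupByRowEstimate and its estimate method are hypothetical names, and treating a distinct count below 1 as 1 is an assumption made here. An empty list stands in for missing column statistics, in which case only T(R)/2 is returned.

```java
import java.util.List;

/** Hypothetical helper illustrating rule [1]; not part of Hive. */
final class GroupByRowEstimate {

    /**
     * Returns min(T(R)/2, V(R,[cols])).
     *
     * @param numRows        T(R), the number of input rows
     * @param distinctCounts distinct-value counts of the grouping columns;
     *                       an empty list means no column statistics
     */
    static long estimate(long numRows, List<Long> distinctCounts) {
        long half = numRows / 2;                 // T(R)/2
        if (distinctCounts.isEmpty()) {
            return half;                         // no statistics: heuristic only
        }
        long product = 1L;
        for (long ndv : distinctCounts) {
            long d = Math.max(1L, ndv);          // assumption: clamp to at least 1
            if (product > half / d) {
                return half;                     // product would exceed T(R)/2
            }
            product *= d;                        // cannot overflow: result <= half
        }
        return product;                          // product <= T(R)/2 here
    }
}
```

For example, with T(R) = 1000 and distinct counts 10, 5 and 4, the product is 200, which is below T(R)/2 = 500, so the estimate is 200. With distinct counts 100 and 50 the product is 5000, so the estimate is capped at 500.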

In the presence of grouping sets, the map-side GBY emits more rows, depending on the size of the grouping set (input rows * size of grouping set). Map-side hash aggregation then reduces these rows. Hash aggregation is a Hive optimization that reduces the number of rows shuffled between the map and reduce stages. It is disabled if the memory used for hash aggregation exceeds 90% of the maximum memory available for hash aggregation. The number of rows emitted from the map side therefore depends on whether hash aggregation stays enabled throughout execution or is disabled. In the presence of grouping sets, the following rules are applied:

If hash-aggregation is enabled, then for the query SELECT * FROM table GROUP BY (A,B) WITH CUBE:

T(R) = min(T(R)/2, T(R, GBY(A,B)) + T(R, GBY(A)) + T(R, GBY(B)) + 1)

where GBY(A,B), GBY(A) and GBY(B) are estimated with the GBY rule [1] above.
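The CUBE formula above can be sketched as follows. This is a hedged illustration, not the Hive code: CubeRowEstimate, rule1 and cubeOfTwo are hypothetical names, and reading the trailing "+ 1" as the single row of the empty grouping set is an interpretation of the formula. The sketch also assumes row counts far below Long.MAX_VALUE, so the sum does not overflow.

```java
import java.util.List;

/** Hypothetical helper for the two-column CUBE case; not part of Hive. */
final class CubeRowEstimate {

    /** Rule [1]: min(T(R)/2, product of distinct counts); needs at least one column. */
    static long rule1(long numRows, List<Long> distinctCounts) {
        long half = numRows / 2;
        long product = 1L;
        for (long ndv : distinctCounts) {
            long d = Math.max(1L, ndv);          // assumption: clamp to at least 1
            if (product > half / d) {
                return half;                     // capped at T(R)/2
            }
            product *= d;
        }
        return product;
    }

    /** min(T(R)/2, GBY(A,B) + GBY(A) + GBY(B) + 1) for GROUP BY (A,B) WITH CUBE. */
    static long cubeOfTwo(long numRows, long ndvA, long ndvB) {
        long half = numRows / 2;
        long sum = rule1(numRows, List.of(ndvA, ndvB))
                + rule1(numRows, List.of(ndvA))
                + rule1(numRows, List.of(ndvB))
                + 1L;                            // the "+ 1" term of the formula
        return Math.min(half, sum);
    }
}
```

With T(R) = 10000, V(R,A) = 10 and V(R,B) = 20, the terms are 200, 10, 20 and 1, giving 231. With T(R) = 100 and the same distinct counts, the sum is 81 and the estimate is capped at T(R)/2 = 50.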

If hash-aggregation is disabled, apply the GBY rule [1] and then multiply the result by the number of elements in the grouping set: T(R) = T(R) * length_of_grouping_set. Since we do not know at compile time whether hash-aggregation will be enabled or disabled, we assume the worst case, i.e., hash-aggregation is disabled.
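The worst-case rule reduces to a single multiplication. The sketch below is illustrative only: GroupingSetWorstCase and its estimate method are hypothetical names, and the caller is assumed to have already computed the rule [1] estimate.

```java
/** Hypothetical helper for the hash-aggregation-disabled case; not part of Hive. */
final class GroupingSetWorstCase {

    /**
     * Returns rule1Rows * groupingSetLength.
     *
     * @param rule1Rows         row estimate already produced by rule [1]
     * @param groupingSetLength number of elements in the grouping set
     */
    static long estimate(long rule1Rows, int groupingSetLength) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        return Math.multiplyExact(rule1Rows, (long) groupingSetLength);
    }
}
```

For GROUP BY (A,B) WITH CUBE there are four grouping sets, (A,B), (A), (B) and (), so a rule [1] estimate of 200 rows becomes 800 rows.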

NOTE: The number of rows from the map-side GBY operator depends on map-side parallelism, i.e., the number of mappers. The map-side parallelism is read from the Hive config "hive.stats.map.parallelism". If the config is not set, a default parallelism of 1 is assumed.

Worst case: If no column statistics are available, T(R) = T(R)/2 is used as a heuristic.

For more information, refer to the 'Estimating The Cost Of Operations' chapter in "Database Systems: The Complete Book" by Garcia-Molina et al.


Constructor Summary
StatsRulesProcFactory.GroupByStatsRule()
           
 
Method Summary
 Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx, Object... nodeOutputs)
          Generic process for all ops that don't have specific implementations.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

StatsRulesProcFactory.GroupByStatsRule

public StatsRulesProcFactory.GroupByStatsRule()
Method Detail

process

public Object process(Node nd,
                      Stack<Node> stack,
                      NodeProcessorCtx procCtx,
                      Object... nodeOutputs)
               throws SemanticException
Description copied from interface: NodeProcessor
Generic process for all ops that don't have specific implementations.

Specified by:
process in interface NodeProcessor
Overrides:
process in class StatsRulesProcFactory.DefaultStatsRule
Parameters:
nd - operator to process
procCtx - operator processor context
nodeOutputs - A variable argument list of outputs from other nodes in the walk
Returns:
Object to be returned by the process call
Throws:
SemanticException


Copyright © 2014 The Apache Software Foundation. All rights reserved.