org.apache.hadoop.hive.ql.optimizer.stats.annotation
Class StatsRulesProcFactory.JoinStatsRule

java.lang.Object
  extended by org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory.DefaultStatsRule
      extended by org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory.JoinStatsRule
All Implemented Interfaces:
NodeProcessor
Enclosing class:
StatsRulesProcFactory

public static class StatsRulesProcFactory.JoinStatsRule
extends StatsRulesProcFactory.DefaultStatsRule
implements NodeProcessor

A JOIN operator can fall into any of the following three special cases; a fourth, general case applies when histograms are unavailable.

  • The values of the join keys are disjoint across the two relations, in which case T(RXS) = 0 (detecting this requires histograms).
  • The join key is a primary key in relation R and a foreign key in relation S, in which case every tuple in S has a matching tuple in R, so T(RXS) = T(S) (detecting this requires histograms).
  • Both relations R and S hold the same single value for the join key, for example a boolean column whose values are all true, in which case T(RXS) = T(R) * T(S) (detecting this requires histograms: countDistinct = 1 and the same value on both sides).
  • In the absence of histograms, the following general case can be used:

    Single attribute

    T(RXS) = (T(R)*T(S))/max(V(R,Y), V(S,Y)), where Y is the join attribute, T(R) is the number of tuples in R, and V(R,Y) is the number of distinct values of Y in R
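
    The single-attribute formula can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not part of Hive; the guard against a zero distinct-value count is also an assumption added for the example, not documented behavior of JoinStatsRule.

```java
// Hypothetical illustration of the single-attribute estimate; not Hive code.
class SingleAttributeJoinEstimate {

    /**
     * T(RXS) = T(R) * T(S) / max(V(R,Y), V(S,Y)).
     *
     * @param rowsR T(R), number of tuples in R
     * @param rowsS T(S), number of tuples in S
     * @param ndvR  V(R,Y), distinct values of the join attribute in R
     * @param ndvS  V(S,Y), distinct values of the join attribute in S
     */
    static long estimate(long rowsR, long rowsS, long ndvR, long ndvS) {
        // Assumption: treat a missing or zero distinct count as 1 to avoid dividing by zero.
        long denominator = Math.max(1L, Math.max(ndvR, ndvS));
        // Multiply in double precision so large row counts do not overflow a long.
        return (long) ((double) rowsR * rowsS / denominator);
    }

    public static void main(String[] args) {
        // 1000 * 500 / max(100, 50) = 5000
        System.out.println(estimate(1000, 500, 100, 50));
    }
}
```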

    Multiple attributes

    T(RXS) = (T(R)*T(S))/(max(V(R,y1), V(S,y1)) * max(V(R,y2), V(S,y2))), where y1 and y2 are the join attributes
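
    A minimal sketch of the multiple-attribute estimate follows, generalized to any number of join attributes. The class and method names are hypothetical, the zero guard is an assumption, and the real rule may apply further adjustments that this sketch omits. Each join attribute contributes one max(...) factor to the denominator.

```java
// Hypothetical illustration of the multi-attribute estimate; not Hive code.
class MultiAttributeJoinEstimate {

    /**
     * T(RXS) = T(R) * T(S) / product over i of max(V(R,yi), V(S,yi)).
     *
     * @param ndvR distinct-value counts V(R,yi), one entry per join attribute
     * @param ndvS distinct-value counts V(S,yi), in the same attribute order
     */
    static long estimate(long rowsR, long rowsS, long[] ndvR, long[] ndvS) {
        if (ndvR.length != ndvS.length) {
            throw new IllegalArgumentException("need one distinct count per join attribute on each side");
        }
        double denominator = 1.0;
        for (int i = 0; i < ndvR.length; i++) {
            // Assumption: clamp each factor to at least 1 to avoid dividing by zero.
            denominator *= Math.max(1L, Math.max(ndvR[i], ndvS[i]));
        }
        return (long) ((double) rowsR * rowsS / denominator);
    }

    public static void main(String[] args) {
        // 1000 * 500 / (max(100, 50) * max(10, 20)) = 500000 / 2000 = 250
        System.out.println(estimate(1000, 500, new long[] {100, 10}, new long[] {50, 20}));
    }
}
```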

    Worst case: if no column statistics are available, T(RXS) = joinFactor * max(T(R), T(S)) * (numParents - 1) is used as a heuristic. joinFactor is read from the hive.stats.join.factor Hive configuration property. Because nothing is known about the join keys in this case (and hence which of the three cases applies), it is left to the user to provide the join factor.
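
    The worst-case heuristic can be sketched as below. The class and method names are hypothetical, and extending max(T(R), T(S)) to the largest row count among all parents of an n-way join is an assumption made for the example. The join factor of 2.0 is purely illustrative.

```java
// Hypothetical illustration of the no-column-statistics fallback; not Hive code.
class FallbackJoinEstimate {

    /**
     * T(RXS) = joinFactor * max(T(R), T(S)) * (numParents - 1).
     *
     * @param joinFactor      value the user supplies through hive.stats.join.factor
     * @param parentRowCounts row count of each parent (input) of the join
     */
    static long estimate(double joinFactor, long[] parentRowCounts) {
        if (parentRowCounts.length < 2) {
            throw new IllegalArgumentException("a join needs at least two parents");
        }
        long maxRows = 0L;
        for (long rows : parentRowCounts) {
            // Assumption: for more than two parents, take the largest row count overall.
            maxRows = Math.max(maxRows, rows);
        }
        int numParents = parentRowCounts.length;
        return (long) (joinFactor * maxRows * (numParents - 1));
    }

    public static void main(String[] args) {
        // Two parents: 2.0 * max(1000, 500) * (2 - 1) = 2000
        System.out.println(estimate(2.0, new long[] {1000, 500}));
        // Three parents: 2.0 * 1000 * (3 - 1) = 4000
        System.out.println(estimate(2.0, new long[] {1000, 500, 200}));
    }
}
```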

    For more information, refer to the 'Estimating the Cost of Operations' chapter in "Database Systems: The Complete Book" by Garcia-Molina et al.


    Constructor Summary
    StatsRulesProcFactory.JoinStatsRule()
               
     
    Method Summary
     Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx, Object... nodeOutputs)
              Generic process for all ops that don't have specific implementations.
     
    Methods inherited from class java.lang.Object
    equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
     

    Constructor Detail

    StatsRulesProcFactory.JoinStatsRule

    public StatsRulesProcFactory.JoinStatsRule()
    Method Detail

    process

    public Object process(Node nd,
                          Stack<Node> stack,
                          NodeProcessorCtx procCtx,
                          Object... nodeOutputs)
                   throws SemanticException
    Description copied from interface: NodeProcessor
    Generic process for all ops that don't have specific implementations.

    Specified by:
    process in interface NodeProcessor
    Overrides:
    process in class StatsRulesProcFactory.DefaultStatsRule
    Parameters:
    nd - operator to process
    procCtx - operator processor context
    nodeOutputs - A variable argument list of outputs from other nodes in the walk
    Returns:
    Object to be returned by the process call
    Throws:
    SemanticException


    Copyright © 2014 The Apache Software Foundation. All rights reserved.