Create and use a materialized view

You can create a materialized view of a query to calculate and store results of an expensive operation, such as join.

In this task, you create and populate example tables. You create a materialized view of a join of the tables. Subsequently, when you run a query to join the tables, the query plan takes advantage of the precomputed join to accelerate processing. This task is over-simplified and is intended to show the syntax and output of a materialized view, not to demonstrate accelerated processing that results in a real-world task, which would process a large amount of data.

  1. In Data Analytics Studio (DAS), the Hive shell, or other Hive UI, create two tables:
    CREATE TABLE emps (
       empid INT,
       deptno INT,
       name VARCHAR(256),
       salary FLOAT,
       hire_date TIMESTAMP);
                        
    CREATE TABLE depts (
       deptno INT,
       deptname VARCHAR(256),
       locationid INT);
  2. Insert some data into the tables for example purposes:
    INSERT INTO TABLE emps VALUES (10001,101,'jane doe',250000,'2018-01-10');
    INSERT INTO TABLE emps VALUES (10002,100,'somporn klailee',210000,'2017-12-25');
    INSERT INTO TABLE emps VALUES (10003,200,'jeiranan thongnopneua',175000,'2018-05-05');
                        
    INSERT INTO TABLE depts VALUES (100,'HR',10);
    INSERT INTO TABLE depts VALUES (101,'Eng',11);
    INSERT INTO TABLE depts VALUES (200,'Sup',20);               
  3. Create a materialized view to join the tables:
    CREATE MATERIALIZED VIEW mv1
      AS SELECT empid, deptname, hire_date
      FROM emps JOIN depts
      ON (emps.deptno = depts.deptno)
      WHERE hire_date >= '2017-01-01';
  4. Execute a query that takes advantage of the precomputation performed by the materialized view:
    SELECT empid, deptname
      FROM emps
      JOIN depts
      ON (emps.deptno = depts.deptno)
      WHERE hire_date >= '2017-01-01'
      AND hire_date <= '2019-01-01';
    Output is:
    +--------+-----------+
    | empid  | deptname  |
    +--------+-----------+
    | 10003  | Sup       |
    | 10002  | HR        |
    | 10001  | Eng       |
    +--------+-----------+              
  5. In Cloudera Manager, enable explain logging: Navigate to Clusters > HIVE_ON_TEZ-1 > Configuration, search for hive.log.explain.output, check HiveServer2 Default Group, and click Save Changes.
  6. Verify that the query rewrite used the materialized view by running an extended EXPLAIN statement:
    EXPLAIN EXTENDED SELECT empid, deptname
      FROM emps
      JOIN depts
      ON (emps.deptno = depts.deptno)
      WHERE hire_date >= '2017-01-01'
      AND hire_date <= '2019-01-01';
    The output shows the alias default.mv1 for the materialized view in the TableScan section of the plan.
    OPTIMIZED SQL: SELECT `empid`, `deptname`          
    FROM `default`.`mv1`                               
    WHERE TIMESTAMP '2019-01-01 00:00:00.000000000' >= `hire_date` 
    STAGE DEPENDENCIES:                                
      Stage-0 is a root stage                          
                                                                               
    STAGE PLANS:                                       
      Stage: Stage-0                                   
        Fetch Operator                                 
          limit: -1                                    
          Processor Tree:                              
            TableScan                                  
              alias: default.mv1                       
              filterExpr: (hire_date <= TIMESTAMP'2019-01-01 
                00:00:00') (type: boolean) |
              GatherStats: false                       
              Filter Operator                          
                isSamplingPred: false                  
                predicate: (hire_date <= TIMESTAMP'2019-01-01 
                  00:00:00') (type: boolean) 
                Select Operator                        
                  expressions: empid (type: int), deptname (type: varchar(256)) 
                  outputColumnNames: _col0, _col1      
                  ListSink