You can create a materialized view of a query to calculate and store results of an
expensive operation, such as join.
In this task, you create and populate example tables. You create a materialized view
of a join of the tables. Subsequently, when you run a query to join the tables, the
query plan takes advantage of the precomputed join to accelerate processing. This
task is over-simplified and is intended to show the syntax and output of a
materialized view, not to demonstrate accelerated processing that results in a
real-world task, which would process a large amount of data.
-
In the Hive shell or other Hive UI, create two tables:
CREATE TABLE emps (
empid INT,
deptno INT,
name VARCHAR(256),
salary FLOAT,
hire_date TIMESTAMP);
CREATE TABLE depts (
deptno INT,
deptname VARCHAR(256),
locationid INT);
-
Insert some data into the tables for example purposes:
INSERT INTO TABLE emps VALUES (10001,101,'jane doe',250000,'2018-01-10');
INSERT INTO TABLE emps VALUES (10002,100,'somporn klailee',210000,'2017-12-25');
INSERT INTO TABLE emps VALUES (10003,200,'jeiranan thongnopneua',175000,'2018-05-05');
INSERT INTO TABLE depts VALUES (100,'HR',10);
INSERT INTO TABLE depts VALUES (101,'Eng',11);
INSERT INTO TABLE depts VALUES (200,'Sup',20);
-
Create a materialized view to join the tables:
CREATE MATERIALIZED VIEW mv1
AS SELECT empid, deptname, hire_date
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2017-01-01';
-
Execute a query that takes advantage of the precomputation performed by the
materialized view:
SELECT empid, deptname
FROM emps
JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2017-01-01'
AND hire_date <= '2019-01-01';
Output is:
+--------+-----------+
| empid | deptname |
+--------+-----------+
| 10003 | Sup |
| 10002 | HR |
| 10001 | Eng |
+--------+-----------+
-
Verify that the query rewrite used the materialized view by running an extended
EXPLAIN statement:
EXPLAIN EXTENDED SELECT empid, deptname
FROM emps
JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2017-01-01'
AND hire_date <= '2019-01-01';
The output shows the alias
default.mv1
for the
materialized view in the TableScan section of the plan.
OPTIMIZED SQL: SELECT `empid`, `deptname`
FROM `default`.`mv1`
WHERE TIMESTAMP '2019-01-01 00:00:00.000000000' >= `hire_date`
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
TableScan
alias: default.mv1
filterExpr: (hire_date <= TIMESTAMP'2019-01-01
00:00:00') (type: boolean) |
GatherStats: false
Filter Operator
isSamplingPred: false
predicate: (hire_date <= TIMESTAMP'2019-01-01
00:00:00') (type: boolean)
Select Operator
expressions: empid (type: int), deptname (type: varchar(256))
outputColumnNames: _col0, _col1
ListSink