Vectorization enables Hive to process a batch of rows together instead of one row at a time. Each batch consists of column vectors, each of which is usually an array of a primitive type. Operations are performed on an entire column vector at a time, which makes better use of CPU instruction pipelines and caches. HIVE-4160 has the design document for vectorization and tracks the implementation of many subtasks.
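To make the batch-at-a-time idea concrete, here is a minimal Java sketch of a filter followed by a sum over one batch stored column-wise. The class ColumnBatchSketch, its methods, and the batch layout are hypothetical illustrations written for this page, not Hive's actual classes; the real design is described in HIVE-4160.

```java
/** Illustrative only: hypothetical names, not Hive's vectorization classes. */
public class ColumnBatchSketch {

    /**
     * Scans one column of the batch and records, in selected, the indices of
     * the rows whose value is greater than threshold. Returns the row count.
     */
    static int filterGreater(long[] filterCol, int size, long threshold, int[] selected) {
        int n = 0;
        // Tight loop over a primitive array: one operation applied to a whole column.
        for (int i = 0; i < size; i++) {
            if (filterCol[i] > threshold) {
                selected[n++] = i;
            }
        }
        return n;
    }

    /** Sums a second column over the rows that passed the filter. */
    static long sumSelected(long[] valueCol, int[] selected, int selectedCount) {
        long sum = 0;
        for (int j = 0; j < selectedCount; j++) {
            sum += valueCol[selected[j]];
        }
        return sum;
    }

    public static void main(String[] args) {
        // One batch of four rows with two columns, a and b, stored column-wise.
        long[] a = {5, 12, 7, 30};
        long[] b = {1, 2, 3, 4};
        int[] selected = new int[a.length];

        // Roughly equivalent to: select sum(b) from T where a > 6
        int count = filterGreater(a, a.length, 6, selected);
        System.out.println(sumSelected(b, selected, count)); // prints 9
    }
}
```

Each operator runs as a simple loop over a primitive array rather than being invoked once per row, which is the property the paragraph above credits for the better pipeline and cache behavior.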
Enable Vectorization in Hive
To enable vectorization, set this configuration parameter:
hive.vectorized.execution.enabled=true
When vectorization is enabled, Hive examines the query and the data to determine whether vectorization can be supported. If it cannot be supported, Hive will execute the query with vectorization turned off.
Log Information about Vectorized Execution of Queries
The Hive client logs whether a query execution is vectorized at the info level. More detailed logs are printed at the debug level. Client logs can also be configured to show up on the console.
Supported Functionality
The current implementation supports only single-table, read-only queries. DDL and DML queries are not supported.
The supported operators are selection, filter and group by.
Partitioned tables are supported. The following data types are supported:
tinyint
smallint
int
bigint
boolean
float
double
timestamp
string
char
varchar
binary
The following expression operators are supported:
Comparison: >, >=, <, <=, =, !=
Arithmetic: plus, minus, multiply, divide, modulo
Logical: AND, OR
Aggregates: sum, avg, count, min, max
Only the ORC file format is supported in the current implementation.
Unsupported Functionality
All data types, file formats, and functionality not listed in the previous section are currently unsupported.
Two unsupported features of particular interest are the logical expression NOT and the cast operator. For example, a query such as
select x,y from T where a = b
will not vectorize if a is an int and b is a double. Although both int and double are supported, casting one to the other is not.
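The rule behind this example can be sketched as a small check. This is a hypothetical Java illustration, not Hive's planner logic: the class VectorizationCheckSketch and its method are invented here, the supported-type set is copied from the list under Supported Functionality, and it assumes, as a simplification, that comparing any two different types requires a cast.

```java
import java.util.Set;

/** Illustrative only: a hypothetical check, not Hive's actual validation code. */
public class VectorizationCheckSketch {

    // Data types listed as supported on this page.
    static final Set<String> SUPPORTED_TYPES = Set.of(
        "tinyint", "smallint", "int", "bigint", "boolean", "float",
        "double", "timestamp", "string", "char", "varchar", "binary");

    /**
     * Returns true when both operand types are supported and identical.
     * Assumption: differing types would need a cast, which is unsupported.
     */
    static boolean canVectorizeComparison(String leftType, String rightType) {
        return SUPPORTED_TYPES.contains(leftType)
            && SUPPORTED_TYPES.contains(rightType)
            && leftType.equals(rightType);
    }

    public static void main(String[] args) {
        System.out.println(canVectorizeComparison("int", "int"));         // prints true
        System.out.println(canVectorizeComparison("int", "double"));      // prints false
        System.out.println(canVectorizeComparison("decimal", "decimal")); // prints false
    }
}
```

The second call mirrors the query above: both operand types are supported on their own, but the comparison would need a cast, so under this simplified rule the query falls back to non-vectorized execution.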