Vectorization allows Hive to process a batch of rows together instead of one row at a time. Each batch holds its data as column vectors, each of which is usually an array of a primitive type. Operations run over an entire column vector, which makes better use of CPU instruction pipelines and caches. HIVE-4160 has the design document for vectorization and tracks the implementation of many subtasks.
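The idea of operating on a whole column vector can be sketched in a few lines of Java. This is a simplified illustration only: the class names LongColumn and ColumnBatchSketch are hypothetical and are not Hive's classes. It shows an element-wise addition running as one tight loop over primitive arrays.

```java
// Simplified illustration only; these classes are hypothetical, not Hive's API.
final class LongColumn {
    final long[] values; // one primitive array holds the column's values for a batch

    LongColumn(int capacity) {
        values = new long[capacity];
    }
}

final class ColumnBatchSketch {
    // Adds two columns element-wise for the first `size` rows of a batch.
    // The loop touches only primitive arrays: no per-row objects or method dispatch.
    static void add(LongColumn left, LongColumn right, LongColumn out, int size) {
        long[] l = left.values;
        long[] r = right.values;
        long[] o = out.values;
        for (int i = 0; i < size; i++) {
            o[i] = l[i] + r[i];
        }
    }

    public static void main(String[] args) {
        int size = 4;
        LongColumn a = new LongColumn(size);
        LongColumn b = new LongColumn(size);
        LongColumn sum = new LongColumn(size);
        for (int i = 0; i < size; i++) {
            a.values[i] = i;
            b.values[i] = 10L * i;
        }
        add(a, b, sum, size);
        System.out.println(java.util.Arrays.toString(sum.values)); // [0, 11, 22, 33]
    }
}
```

A row-at-a-time engine would instead evaluate the same expression once per row, paying the interpretation overhead each time.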
To enable vectorization, set this configuration parameter:
hive.vectorized.execution.enabled=true
When vectorization is enabled, Hive examines the query and the data to determine whether vectorization can be supported. If it cannot be supported, Hive will execute the query with vectorization turned off.
The Hive client will log, at the info level, whether a query's execution is being vectorized. More detailed logs are printed at the debug level.
The client logs can also be configured to show up on the console.
The current implementation supports only single-table, read-only queries. DDL and DML queries are not supported.
The supported operators are selection, filter and group by.
Partitioned tables are supported.
These data types are supported:
tinyint
smallint
int
bigint
boolean
float
double
timestamp
string
These expressions are supported:
Comparison: >, >=, <, <=, =, !=
Arithmetic: plus, minus, multiply, divide, modulo
Logical: AND, OR
Aggregates: sum, avg, count, min, max
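As a rough illustration of how a comparison filter and a sum aggregate from the lists above could be applied to a column vector, consider the sketch below. The class FilterSumSketch and its methods are hypothetical names chosen for this example. Recording the surviving row indices in a separate array is one common technique and is assumed here, not taken from this page.

```java
// Hypothetical sketch, not Hive's implementation.
final class FilterSumSketch {
    // Filter: stores the indices of rows where col[i] > threshold in `selected`
    // and returns how many rows passed.
    static int filterGreaterThan(long[] col, int size, long threshold, int[] selected) {
        int count = 0;
        for (int i = 0; i < size; i++) {
            if (col[i] > threshold) {
                selected[count++] = i;
            }
        }
        return count;
    }

    // Aggregate: sums only the rows that survived the filter.
    static long sumSelected(long[] col, int[] selected, int selectedCount) {
        long total = 0;
        for (int j = 0; j < selectedCount; j++) {
            total += col[selected[j]];
        }
        return total;
    }

    public static void main(String[] args) {
        long[] col = {5, 12, 7, 30};
        int[] selected = new int[col.length];
        int n = filterGreaterThan(col, col.length, 6, selected); // rows 1, 2 and 3 pass
        System.out.println(sumSelected(col, selected, n));       // 49
    }
}
```

The filter never copies column data; later operators simply skip rows that were not selected.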
Only the ORC file format is supported in the current implementation.
All data types, file formats, and functionality not listed above are currently unsupported.
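The rule that anything not listed is unsupported can be pictured as a whitelist check. The following sketch is only an illustration of that rule: VectorizationSupportSketch and isSupported are hypothetical names, and Hive's actual validation logic is not shown on this page.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hive: mirrors the lists above as whitelists.
final class VectorizationSupportSketch {
    static final Set<String> SUPPORTED_TYPES = new HashSet<>(Arrays.asList(
            "tinyint", "smallint", "int", "bigint", "boolean",
            "float", "double", "timestamp", "string"));

    static final String SUPPORTED_FORMAT = "orc";

    // Returns true only if the file format and every column type are on the lists.
    static boolean isSupported(String fileFormat, List<String> columnTypes) {
        if (!SUPPORTED_FORMAT.equalsIgnoreCase(fileFormat)) {
            return false;
        }
        for (String type : columnTypes) {
            if (!SUPPORTED_TYPES.contains(type.toLowerCase())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSupported("ORC", Arrays.asList("int", "string")));  // true
        System.out.println(isSupported("ORC", Arrays.asList("int", "decimal"))); // false
        System.out.println(isSupported("TEXTFILE", Arrays.asList("int")));       // false
    }
}
```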
Two unsupported features of particular interest are the logical expression NOT and the cast operator.
For example, a query such as select x,y from T where a = b will not vectorize if a is integer and b is double.
Although both int and double are supported, comparing them requires casting one to the other, and that cast is not supported.
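The effect of the missing cast operator on a comparison such as a = b can be sketched as follows. ComparisonCastSketch is a hypothetical name, and the sketch assumes, for simplicity, that any difference in declared operand types would require a cast. That assumption may be stricter than Hive's actual behavior for closely related types.

```java
// Hypothetical sketch, not Hive's implementation.
final class ComparisonCastSketch {
    // Assumption: operands with different declared types would need a cast,
    // and casts are not vectorized, so only matching types pass.
    static boolean comparisonVectorizes(String leftType, String rightType) {
        return leftType.equalsIgnoreCase(rightType);
    }

    public static void main(String[] args) {
        System.out.println(comparisonVectorizes("int", "int"));    // true
        System.out.println(comparisonVectorizes("int", "double")); // false
    }
}
```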