6. Query Vectorization

Vectorization lets Hive process a batch of rows together instead of one row at a time. Each batch consists of column vectors, each of which is usually an array of a primitive type. Operations are performed on an entire column vector at once, which improves instruction pipelining and cache usage. HIVE-4160 has the design document for vectorization and tracks the implementation of many subtasks.
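The contrast between row-at-a-time and batch processing can be sketched as follows. This is only a conceptual illustration in Python, not Hive's implementation: the function names and the BATCH_SIZE value are assumptions chosen for the example.

```python
from typing import Dict, List

# Assumed batch size, chosen for illustration only.
BATCH_SIZE = 1024


def add_row_at_a_time(rows: List[Dict[str, int]]) -> List[int]:
    """Evaluate a + b one row at a time; each row is looked up and handled separately."""
    out = []
    for row in rows:
        out.append(row["a"] + row["b"])
    return out


def add_vectorized(col_a: List[int], col_b: List[int]) -> List[int]:
    """Evaluate a + b one batch at a time over two column vectors."""
    out = [0] * len(col_a)
    for start in range(0, len(col_a), BATCH_SIZE):
        end = min(start + BATCH_SIZE, len(col_a))
        # Tight inner loop over one batch of contiguous column values.
        for i in range(start, end):
            out[i] = col_a[i] + col_b[i]
    return out
```

Both functions return the same results. The batch version keeps the inner loop free of per-row lookups, which is the property that benefits pipelining and caching in a compiled engine.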

Enable Vectorization in Hive

To enable vectorization, set this configuration parameter:

hive.vectorized.execution.enabled=true

When vectorization is enabled, Hive examines the query and the data to determine whether vectorization can be supported. If it cannot be supported, Hive will execute the query with vectorization turned off.

Log Information about Vectorized Execution of Queries

The Hive client logs, at the info level, whether a query's execution is vectorized. More detailed logs are printed at the debug level. Client logs can also be configured to show up on the console.

Supported Functionality

The current implementation supports only single-table, read-only queries. DDL and DML queries are not supported.

The supported query operators are selection, filter, and group by.

Partitioned tables are supported. The following data types are supported:

tinyint
smallint
int
bigint
boolean
float
double
timestamp
string
char
varchar
binary

The following expression operators and aggregate functions are supported:

Comparison: >, >=, <, <=, =, !=
Arithmetic: plus, minus, multiply, divide, modulo
Logical: AND, OR
Aggregates: sum, avg, count, min, max

Only the ORC file format is supported in the current implementation.

Unsupported Functionality

All data types, file formats, and functionality not listed as supported above are currently unsupported.

Two unsupported features of particular interest are the logical expression NOT and the cast operator. For example, a query such as select x, y from T where a = b will not vectorize if a is an int and b is a double. Although both int and double are supported, casting one to the other is not.
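The rule in this example can be written down as a small check. This is a simplified reading of the text, not Hive's actual planner logic: the function name comparison_can_vectorize is hypothetical, and the check assumes that any difference in operand types requires a cast.

```python
# Data types listed as supported in this section.
SUPPORTED_TYPES = {
    "tinyint", "smallint", "int", "bigint", "boolean", "float",
    "double", "timestamp", "string", "char", "varchar", "binary",
}


def comparison_can_vectorize(left_type: str, right_type: str) -> bool:
    """Hypothetical check for a comparison such as a = b.

    Returns True only if both operand types are supported and no cast
    between them would be needed.
    """
    if left_type not in SUPPORTED_TYPES or right_type not in SUPPORTED_TYPES:
        return False
    # Assumption: differing types would need a cast, which is unsupported.
    return left_type == right_type
```

Under these assumptions, comparison_can_vectorize("int", "double") is False even though both types are supported individually, while comparison_can_vectorize("int", "int") is True.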
