2. New Feature: Vectorization

Vectorization allows Hive to process a batch of rows together instead of processing one row at a time. Each batch consists of column vectors, each of which is usually an array of a primitive type. Operations are performed on an entire column vector at a time, which improves instruction pipelining and cache usage. HIVE-4160 has the design document for vectorization and tracks the implementation of many subtasks.
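
The difference between the two processing styles can be illustrated with a small, self-contained Java sketch. It is not Hive's implementation: the class names Row and Batch, the column names quantity and price, and both methods are hypothetical and exist only to contrast a one-object-per-row layout with a layout that keeps one primitive array per column.

```java
public class BatchSketch {
    /** Hypothetical row-at-a-time layout: one object per row. */
    static final class Row {
        final long quantity;
        final double price;

        Row(long quantity, double price) {
            this.quantity = quantity;
            this.price = price;
        }
    }

    /** Hypothetical batch layout: one primitive array per column. */
    static final class Batch {
        final long[] quantity;
        final double[] price;
        final int size;

        Batch(long[] quantity, double[] price) {
            if (quantity.length != price.length) {
                throw new IllegalArgumentException("column lengths differ");
            }
            this.quantity = quantity;
            this.price = price;
            this.size = quantity.length;
        }
    }

    /** Row-at-a-time: each iteration dereferences a separate row object. */
    static double totalRowAtATime(Row[] rows) {
        double total = 0.0;
        for (Row r : rows) {
            total += r.quantity * r.price;
        }
        return total;
    }

    /** Batch: a tight loop over contiguous primitive arrays. */
    static double totalBatch(Batch b) {
        double total = 0.0;
        for (int i = 0; i < b.size; i++) {
            total += b.quantity[i] * b.price[i];
        }
        return total;
    }

    public static void main(String[] args) {
        Row[] rows = { new Row(2, 1.5), new Row(3, 2.0) };
        Batch batch = new Batch(new long[] {2, 3}, new double[] {1.5, 2.0});
        System.out.println(totalRowAtATime(rows)); // prints 9.0
        System.out.println(totalBatch(batch));     // prints 9.0
    }
}
```

Both methods compute the same result. The batch version reads contiguous arrays in a simple loop, which is the kind of access pattern that tends to favor instruction pipelining and cache usage.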

 2.1. Enable Vectorization in Hive

To enable vectorization, set this configuration parameter:

  • hive.vectorized.execution.enabled=true

When vectorization is enabled, Hive examines the query and the data to determine whether vectorization can be supported. If it cannot be supported, Hive will execute the query with vectorization turned off.

 2.2. Log Information about Vectorized Execution of Queries

The Hive client will log, at the info level, whether a query's execution is being vectorized. More detailed logs are printed at the debug level.

The client logs can also be configured to show up on the console.

 2.3. Supported Functionality

The current implementation supports only single-table, read-only queries. DDL and DML queries are not supported.

The supported operators are selection, filter and group by.

Partitioned tables are supported.

These data types are supported:

  • tinyint

  • smallint

  • int

  • bigint

  • boolean

  • float

  • double

  • timestamp

  • string

These expressions are supported:

  • Comparison: >, >=, <, <=, =, !=

  • Arithmetic: plus, minus, multiply, divide, modulo

  • Logical: AND, OR

  • Aggregates: sum, avg, count, min, max
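
As a rough illustration of how a comparison and an aggregate from the lists above might be evaluated over column vectors, the following Java sketch handles a query of the form select sum(b) from T where a > 6. It is a sketch under stated assumptions, not Hive's code: the method names and the use of an index array to record which rows passed the filter are choices made for this example.

```java
public class VectorizedFilterSum {
    /**
     * Evaluates "col > threshold" over the first n entries of a column.
     * Writes the indexes of the passing rows into selected and returns
     * how many rows passed. (Hypothetical helper for this sketch.)
     */
    static int filterGreaterThan(long[] col, int n, long threshold, int[] selected) {
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (col[i] > threshold) {
                selected[count++] = i;
            }
        }
        return count;
    }

    /** Sums only the entries of col whose indexes appear in selected. */
    static long sumSelected(long[] col, int[] selected, int selectedCount) {
        long sum = 0;
        for (int j = 0; j < selectedCount; j++) {
            sum += col[selected[j]];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] a = {5, 12, 7, 20};   // filter column
        long[] b = {1, 2, 3, 4};     // aggregated column
        int[] selected = new int[a.length];

        // Rows 1, 2 and 3 satisfy a > 6.
        int count = filterGreaterThan(a, a.length, 6, selected);

        // 2 + 3 + 4
        System.out.println(sumSelected(b, selected, count)); // prints 9
    }
}
```

The filter never copies row data. It only records which positions in the batch survived, and the aggregate then reads the second column at those positions.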

Only the ORC file format is supported in the current implementation.

 2.4. Unsupported Functionality

All data types, file formats, and functionality not listed in the previous section are currently unsupported.

Two unsupported features of particular interest are the logical expression NOT and the cast operator. For example, a query such as select x,y from T where a = b will not vectorize if a is an int and b is a double. Although both int and double are supported, casting one to the other is not supported.
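
The rules in sections 2.3 and 2.4 can be summarized as a small eligibility check. The Java sketch below is purely illustrative and is not how Hive validates a query: the class and method names are hypothetical, and it makes the conservative assumption that a comparison needs a cast whenever the two column types are not identical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class VectorizationCheck {
    /** Data types listed as supported in section 2.3. */
    static final Set<String> SUPPORTED_TYPES = new HashSet<>(Arrays.asList(
            "tinyint", "smallint", "int", "bigint", "boolean",
            "float", "double", "timestamp", "string"));

    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    static boolean isSupportedType(String type) {
        return SUPPORTED_TYPES.contains(normalize(type));
    }

    /**
     * Assumption of this sketch: two columns can be compared without a
     * cast only when their types are identical. Casts are unsupported,
     * so any other pairing is rejected.
     */
    static boolean canVectorizeComparison(String leftType, String rightType) {
        return isSupportedType(leftType)
                && isSupportedType(rightType)
                && normalize(leftType).equals(normalize(rightType));
    }

    /** Only the ORC file format is supported. */
    static boolean canVectorize(String fileFormat, String leftType, String rightType) {
        return "orc".equals(normalize(fileFormat))
                && canVectorizeComparison(leftType, rightType);
    }

    public static void main(String[] args) {
        // where a = b, a is int, b is double: would need a cast
        System.out.println(canVectorize("ORC", "int", "double"));    // prints false
        // same types, ORC storage
        System.out.println(canVectorize("ORC", "int", "int"));       // prints true
        // supported types, unsupported file format
        System.out.println(canVectorize("TEXTFILE", "int", "int"));  // prints false
    }
}
```

When a check like this fails, the behavior described in section 2.1 applies: the query still runs, but with vectorization turned off.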