Managing Apache Hive

Query vectorization

You can use vectorization to improve instruction pipelines and cache use. Vectorization enables certain queries to process batches of primitive-type values from an entire column at once, rather than one row at a time.

Default vectorized query execution

CDP enables vectorization by default: the value of hive.vectorized.execution.enabled is set to true. Vectorized query execution processes Hive data in batches, channeling a large number of rows of data into columns and forgoing intermediate results. This technique is more efficient than the MapReduce execution process, which stores temporary files.
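As a minimal sketch of how this property can be inspected or overridden, the following statements assume you are connected to a Hive session (for example through Beeline) and that your deployment allows the property to be changed at session level:

```sql
-- Print the current value of the property (true by default in CDP).
SET hive.vectorized.execution.enabled;

-- Turn vectorization off for this session only, for example to compare query plans.
SET hive.vectorized.execution.enabled=false;

-- Restore the default behavior.
SET hive.vectorized.execution.enabled=true;
```

A change made this way applies only to the current session; a cluster-wide change would be made in the Hive service configuration instead.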

Unsupported functionality on vectorized data

Some functionality is not supported on vectorized data:

DDL queries
DML queries other than single table, read-only queries
Formats other than Optimized Row Columnar (ORC)
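One way to check whether a particular query runs into these limits is to ask Hive for the vectorization section of the query plan. The sketch below assumes a hypothetical ORC table named sales with region and amount columns; the table and column names are illustrative only. EXPLAIN VECTORIZATION is available in Apache Hive 2.3.0 and later.

```sql
-- Hypothetical table: sales(region STRING, amount DOUBLE), stored as ORC.
-- SUMMARY reports whether vectorization is enabled and which parts of the plan are vectorized.
EXPLAIN VECTORIZATION SUMMARY
SELECT region, sum(amount)
FROM sales
GROUP BY region;
```

If part of the plan is not vectorized, replacing SUMMARY with DETAIL generally gives more information about the operators and expressions involved.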

Supported functionality on vectorized data

The following functionality is supported on vectorized data:

Single table, read-only queries (selecting, filtering, and grouping data is supported)
Partitioned tables
The following expressions:
- Comparison: >, >=, <, <=, =, !=
- Arithmetic: plus, minus, multiply, divide, and modulo
- Logical: AND and OR
- Aggregates: sum, avg, count, min, and max
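The supported functionality listed above can be illustrated with a single-table, read-only query. This is a sketch only: the sales table and its columns are hypothetical, and the table is assumed to be stored as ORC.

```sql
-- Hypothetical table: sales(region STRING, quantity INT, unit_price DOUBLE, discount DOUBLE), stored as ORC.
SELECT region,
       count(*)                   AS order_count,    -- aggregate: count
       sum(quantity * unit_price) AS gross,          -- aggregate over an arithmetic expression
       avg(unit_price - discount) AS avg_net_price,
       min(quantity)              AS min_qty,
       max(quantity)              AS max_qty
FROM sales
WHERE quantity >= 1                                  -- comparison
  AND (unit_price > 10.0 OR discount != 0.0)         -- logical AND / OR
  AND quantity % 2 = 0                               -- modulo
GROUP BY region;                                     -- grouping
```

The query reads from one table, writes nothing, and uses only comparison, arithmetic, logical, and aggregate expressions from the list.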

Supported data types

You can query data of the following types using vectorized queries:

tinyint
smallint
int
bigint
date
boolean
float
double
timestamp
string
char
varchar
binary
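For experimenting with these types, a table definition along the following lines could be used. The table name, column names, and the CHAR and VARCHAR lengths are assumptions chosen for illustration; the table is stored as ORC because other formats are not supported for vectorization.

```sql
-- Hypothetical table with one column per supported type.
CREATE TABLE vectorization_demo (
  c_tinyint   TINYINT,
  c_smallint  SMALLINT,
  c_int       INT,
  c_bigint    BIGINT,
  c_date      DATE,
  c_boolean   BOOLEAN,
  c_float     FLOAT,
  c_double    DOUBLE,
  c_timestamp TIMESTAMP,
  c_string    STRING,
  c_char      CHAR(10),
  c_varchar   VARCHAR(50),
  c_binary    BINARY
)
STORED AS ORC;
```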
