Query Vectorization for Apache Hive in CDH

By default, the Hive query execution engine processes one row of a table at a time. The single row of data goes through all the operators in the query before the next row is processed, resulting in very inefficient CPU usage. In vectorized query execution, data rows are batched together and represented as a set of column vectors. The basic idea of vectorized query execution is to process a batch of rows as an array of column vectors:

How Hive Query Vectorization Works

When query vectorization is enabled, the query engine processes vectors of columns, which greatly improves CPU utilization for typical query operations like scans, filters, aggregates, and joins.

Enabling Hive Query Vectorization

Hive query vectorization is enabled by default in CDH 6 and CDH 5. However, in CDH 5 vectorized query execution in Hive is only possible on ORC-formatted tables, which Cloudera recommends you do not use for overall compatibility with the CDH platform. Instead, Cloudera recommends that you use tables in the Parquet format because all CDH components support this format and it can be consumed by all CDH components.

Hive query vectorization is enabled or disabled for all file formats by setting the hive.vectorized.execution.enabled property to true or false and making sure that no value is set for the hive.vectorized.input.format.excludes property. To ensure that query vectorization is used for the Parquet file format, you must make sure that the hive.vectorized.input.format.excludes property is not set to org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.

Using Cloudera Manager to Enable or Disable Query Vectorization for Parquet Files on a Server-wide Basis

For managed clusters, open the Cloudera Manager Admin Console and perform the following steps:

  1. Select the Hive service.
  2. Click the Configuration tab.
  3. Search for enable vectorization.

    To view all the available vectorization properties for Hive, search for hiveserver2_vectorized. All the vectorization properties are in the Performance category.

  4. Select the Enable Vectorization Optimization option to enable query vectorization. To disable query vectorization, uncheck the box that is adjacent to HiveServer2 Default Group.



  5. To enable or disable Hive query vectorization for the Parquet file format, set the Exclude Vectorized Input Formats property in Cloudera Manager as follows:
    • To disable vectorization for Parquet files only, set this property to org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
    • To enable vectorization for all file formats including Parquet, set this property to Custom and leave the setting blank.
  6. Click Save Changes.
  7. Click the Instances tab, and then click the Restart the service (or the instance) for the changes to take effect:



Manually Enabling or Disabling Query Vectorization for Parquet Files on a Server-Wide Basis

To enable query vectorization for Parquet files on unmanaged clusters on a server-wide basis:

  • Set the hive.vectorized.execution.enabled property to true in the hive-site.xml file:
    <property>
         <name>hive.vectorized.execution.enabled</name>
         <value>true</value>
         <description>Enables query vectorization.</description>
    </property>
          
  • Ensure there is no value set for the hive.vectorized.input.format.excludes property in the hive-site.xml file:
    <property>
         <name>hive.vectorized.input.format.excludes</name>
         <value/>
         <description>Does not exclude query vectorization on any file format including Parquet.</description>
    </property>
          

To disable query vectorization for Parquet files only on unmanaged clusters on a server-wide basis:

  • Set the hive.vectorized.execution.enabled property to true in the hive-site.xml file:
    <property>
         <name>hive.vectorized.execution.enabled</name>
         <value>true</value>
         <description>Enables query vectorization.</description>
    </property>
          
  • Set the hive.vectorized.input.format.excludes property to org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat in the hive-site.xml file:
    <property>
         <name>hive.vectorized.input.format.excludes</name>
         <value>org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat</value>
         <description>Disables query vectorization on Parquet file formats only.</description>
    </property>
          

To enable query vectorization on all file formats:

  • Set the hive.vectorized.execution.enabled property to true in the hive-site.xml file:
    <property>
         <name>hive.vectorized.execution.enabled</name>
         <value>true</value>
         <description>Enables query vectorization.</description>
    </property>
          
  • Ensure there is no value set for the hive.vectorized.input.format.excludes property in the hive-site.xml file:
    <property>
         <name>hive.vectorized.input.format.excludes</name>
         <value/>
         <description>Does not exclude query vectorization on any file format.</description>
    </property>
          

To disable query vectorization on all file formats:

Set the hive.vectorized.execution.enabled property to false in the hive-site.xml file:
<property>
     <name>hive.vectorized.execution.enabled</name>
     <value>false</value>
     <description>Disables query vectorization on all file formats.</description>
</property>
      

Enabling or Disabling Hive Query Vectorization for Parquet Files on a Session Basis

Use the Hive SET command to enable or disable query vectorization on an individual session. Enabling or disabling query vectorization on a session basis is useful to test the effects of vectorization on the execution of specific sets of queries.

To enable query vectorization for all file formats including Parquet on an individual session only:
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.input.format.excludes= ;
        
Setting hive.vectorized.input.format.excludes to a blank value ensures that this property is unset and that no file formats are excluded from query vectorization.
To disable query vectorization for Parquet files only on an individual session only:
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.input.format.excludes=org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
        
To enable query vectorization for all file formats on an individual session only:
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.input.format.excludes= ;
        
Setting hive.vectorized.input.format.excludes to a blank value ensures that this property is unset and that no file formats are excluded from query vectorization.
To disable query vectorization for all file formats on an individual session only:
SET hive.vectorized.execution.enabled=false;
          

Tuning Hive Query Vectorization

When query vectorization is enabled, there are additional properties you can set to tune how your queries are vectorized. These properties can be set in Cloudera Manager, can be set manually in the hive-site.xml file, or can be set on a per-query basis using the Hive SET command. Use the same general steps listed in the previous section to configure these properties in Cloudera Manager or manually.

hive.vectorized.adaptor.usage.mode
Description: Specifies the extent to which the vectorization engine tries to vectorize UDFs that do not have native vectorized versions available. Selecting the none option specifies that only queries using native vectorized UDFs are vectorized. Selecting the chosen option specifies that Hive chooses to vectorize a subset of the UDFs based on performance benefits using the Vectorized Adaptor. Selecting the all option specifies that the Vectorized Adaptor be used for all UDFs even when native vectorized versions are not available.
Recommendations: For optimum stability and correctness of query output, set this option to chosen.
Default Setting: chosen
hive.vectorized.execution.reduce.enabled
Description: Turns on or off vectorization for the reduce-side of query execution. Applies only when the execution engine is set to Spark.
Recommendations: Enable this property by setting it to true if you are using Hive on Spark. Otherwise, do not enable this property.
Default Setting: true
hive.vectorized.groupby.checkinterval
Description: For vectorized GROUP BY operations, specifies the number of row entries added to the hash table before rechecking the average variable size when estimating memory usage.
Recommendations: Current testing indicates that the default setting is applicable in most cases.
Default Setting: 4096
hive.vectorized.groupby.flush.percent
Description: Sets the percentage between 0 and 100 percent of entries in the vectorized GROUP BY aggregation hash that is flushed when the memory threshold is exceeded. To set no flushing, set this property to 0.0. To set flushing at 100 percent, set this property to 1.0.
Recommendations: This sets the amount of data that is held in memory. To increase performance, increase the setting. However, increase the setting conservatively to prevent out-of-memory issues.
Default Setting: 0.1, which sets the flush percentage to 10%
hive.vectorized.input.format.excludes
Description: Specifies input formats to exclude from vectorized query execution. You can select org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, or Custom from a drop-down list. When you select Custom, you can add another input format for exclusion, but currently no other format is supported. If you select Custom and leave the field blank, query vectorization is applied to all file formats.
Recommendations: Use this property to automatically disable certain file formats from vectorized execution. Cloudera recommends that you test your workloads on development clusters using vectorization and enable it in production if you receive significant performance advantages. As an example, if you want to exclude vectorization only on the ORC file format while keeping vectorization for all other file formats including the Parquet file format, set this property to org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.
Default Setting: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat which disables query vectorization for the Parquet file format only.
hive.vectorized.use.checked.expressions
Description: To enhance performance, vectorized expressions operate using wide data types like long and double. When wide data types are used, numeric overflows can occur during expression evaluation in a different manner for vectorized expressions than they do for non-vectorized expressions. Consequently, different query results can be returned for vectorized expressions compared to results returned for non-vectorized expressions. When enabled, Hive uses vectorized expressions that handle numeric overflows in the same way as non-vectorized expressions are handled.
Recommendations: Keep this property set to true, if you want results across vectorized and non-vectorized queries to be consistent.
Default Setting: true
hive.vectorized.use.vectorized.input.format
Description: Enables Hive to take advantage of input formats that support vectorization when they are available.
Recommendations: Enable this property by setting it to true if you have Parquet or ORC workloads that you want to be vectorized.
Default Setting: true
hive.vectorized.use.vector.serde.deserialize
Description: Enables Hive to use built-in vector SerDes to process text and SequenceFile tables for vectorized query execution. In addition, this configuration also helps vectorization of intermediate tasks in multi-stage query execution.
Recommendations: Keep this set to false. Setting this property to true might help multi-stage workloads, but when set to true, it enables text vectorization, which Cloudera does not support.
Default Setting: false

Supported/Unsupported Data Types and Functions

Most common data types and functions are supported by Hive query vectorization on Parquet tables in CDH. The following subsections provide more details about data type and function support.

Supported/Unsupported Data Types

Currently, some complex data types, such as map, list, and union are not supported for Hive query vectorization on Parquet tables in CDH. Even though the struct data type is supported, it is vectorized only when all of the fields defined within the struct are primitives. The following data types are supported.

Supported Data Types for Hive Query Vectorization on Parquet Tables
int smallint tinyint
bigint integer long
short timestamp interval_year_month
boolean binary string
byte float double
void struct

Supported/Unsupported Functions

Common arithmetic, boolean (for example AND, OR), comparison, mathematical (for example SIN, COS, LOG), date, and type-cast functions are supported. Also common aggregate functions such as MIN, MAX, COUNT, AVG, and SUM are also supported. If a function is not supported, the vectorizer attempts to vectorize the function based on the configuration value specified for hive.vectorized.adaptor.usage.mode. You can set this property to none or chosen. To set this property in Cloudera Manager, search for the hive.vectorized.adaptor.usage.mode property on the Configuration page for the Hive service, and set it to none or chosen as appropriate. For unmanaged clusters, set it manually in the hive-site.xml file for server-wide scope. To set it on a session basis, use the Hive SET command as described above.

Verifying a Query is Vectorized

To verify that a query is vectorized, use the EXPLAIN VECTORIZATION statement. This statement returns a query plan that shows how the Hive query execution engine processes your query and whether vectorization is being triggered.

Example of Verifying that Query Vectorization is Triggered for Your Query

This example uses the Hive table p_clients, which uses the Parquet format and contains the following columns and data types:

DESCRIBE p_clients;
⋮
+------------------+------------+----------+
|     col_name     | data_type  | comment  |
+------------------+------------+----------+
| name             | string     |          |
| symbol           | string     |          |
| lastsale         | double     |          |
| marketlabel      | string     |          |
| marketamount     | bigint     |          |
| ipoyear          | int        |          |
| segment          | string     |          |
| business         | string     |          |
| quote            | string     |          |
+------------------+------------+----------+
        

To get the query execution plan for a query, enter the following commands in a Beeline session:

EXPLAIN VECTORIZATION SELECT COUNT(*) FROM p_clients WHERE ipoyear = 2009;
        

This command returns the following query execution plan:

EXPLAIN VECTORIZATION Query Execution Plan for Hive Table Using the Parquet Format

Vectorization is explained in several parts of this query plan:

  1. The PLAN VECTORIZATION section shows a high-level view of the vectorization status for the query. The enabled flag set to true means that vectorization is turned on and the enabledConditionsMet flag shows that it is enabled because the hive.vectorized.execution.enabled property is set to true. If vectorization is not enabled, the enabledConditionsNotMet flag shows why.
  2. Then in the STAGE PLANS section, the output shows the vectorization status for each task of query execution. For example, there might be multiple map and reduce tasks for a query and it is possible that only a subset of these tasks are vectorized. In the above example, the Stage-1 sub-section shows there is only one map task and one reduce task. The Execution mode sub-section of the map task shows whether the task is vectorized. In this case, vectorized displays, which means that the vectorizer was able to successfully validate and vectorize all of the operators for this map task.
  3. The Map Vectorization sub-section shows more details of map task vectorization. Specifically, the configurations that affect the map side vectorization are shown along with whether these configurations are enabled. If the configurations are enabled, they are listed for enabledConditionsMet. If the configurations are not enabled, they are listed for enabledConditionsNotMet as explained in the above PLAN VECTORIZATION section. In this example, it shows that the map side of query execution is enabled because the hive.vectorized.use.vectorized.input.format property is set to true. This section also contains details about input file format and adaptor settings used in the map side of query execution.
  4. The Reduce Vectorization sub-section shows that the reduce side of query execution was not vectorized because the hive.vectorized.execution.reduce.enabled property is set to false. This sub-section also shows that the execution engine is not set to Tez or Spark, which are needed for reduce side vectorization. In this particular example, to enable reduce side vectorization, the execution engine should be set to Spark and the hive.vectorized.execution.reduce.enabled property should be set to true.

By using the EXPLAIN VECTORIZATION statement with your queries, you can find out before you deploy them whether vectorization will be triggered and what properties you must set to enable it.