Apache Hive Components New Features in CDH 6.0

The following features have been added to Hive in CDH 6.0:

Query Vectorization Support for Parquet Files

By default, the Hive query execution engine processes one row of a table at a time. The single row of data goes through all the operators in the query before the next row is processed, resulting in very inefficient CPU usage. In vectorized query execution, data rows are batched together and represented as a set of column vectors. The query engine then processes these vectors of columns, which greatly reduces CPU usage for typical query operations like scans, filters, aggregates, and joins.

Hive query vectorization is enabled by setting the hive.vectorized.execution.enabled property to true. In both CDH 5 and CDH 6, this property is set to true by default. But in CDH 5, vectorized query execution in Hive is only possible on ORC-formatted tables, which Cloudera recommends that you do not use for overall compatibility with the CDH platform. Instead, Cloudera recommends that you use tables in the Parquet format because all CDH components support this format and can be consumed by all CDH components. In CDH 6, query vectorization is supported for Parquet tables in Hive.

For more information, see Query Vectorization for Apache Hive in CDH and Apache Parquet Tables with Hive in CDH.

Support for UNION DISTINCT

Support has been added for the UNION DISTINCT clause in HiveQL. See HIVE-9039 and the Apache wiki for more details. This feature introduces the following incompatible changes to HiveQL:

  • Behavior in CDH 5:

    • SORT BY, CLUSTER BY, ORDER BY, LIMIT, and DISTRIBUTE BY can be specified without delineating parentheses either before a UNION ALL clause or at the end of the query, resulting in the following behaviors:

      • When specified before, these clauses are applied to the query before UNION ALL is applied.
      • When specified at the end of the query, these clauses are applied to the query after UNION ALL is applied.
    • The UNION clause is equivalent to UNION ALL, in which no duplicates are removed.
  • Behavior in CDH 6:

    • SORT BY, CLUSTER BY, ORDER BY, LIMIT, and DISTRIBUTE BY can be specified without delineating parentheses only at the end of the query, resulting in the following behaviors:

      • These clauses are applied to the entire query.
      • Specifying these clauses before the UNION ALL clause results in a parsing error.
    • The UNION clause is equivalent to UNION DISTINCT, in which all duplicates are removed.

Support for NULLS FIRST/NULLS LAST

Support has been added for NULLS FIRST and NULLS LAST options. These options can be used to determine whether null values appear before or after non-null data values when the ORDER BY clause is used. Hive follows the SQL:2003 standard for this feature, but the SQL standard does not specify the behavior by default. By default in Hive, null values are sorted as if lower than non-null values. This means that NULLS FIRST is the default behavior for ASC order, and NULLS LAST is the default behavior for DESC order. See Syntax of Order By on the Apache Hive wiki and HIVE-12994 for further details.

Here are some usage examples:

SELECT x.* FROM table1 x ORDER BY a ASC NULLS FIRST;
SELECT x.* FROM table1 x ORDER BY a ASC NULLS LAST;
        

Added Support for Windowing and Analytics Functions

Support for the following has been added to CDH 6.0:

  • Using DISTINCT with windowing functions. See HIVE-9534 for details.
  • Support for ORDER BY and a windowing clause when DISTINCT is used in a partitioning clause. See HIVE-13453 for details.
  • Support to reference aggregate functions within the OVER clause. See HIVE-13475 for details.

For further details, see the Apache Language Manual on Windowing and Analytics.

Table or Partition Statistics Editing

Support has been added for editing the statistics information that is stored for a table or a partition. For example, you can run the following statement to set the number of rows for a table to 1000:

ALTER TABLE table1 UPDATE STATISTICS SET ('numRows'='1000');
        

For more information, see HIVE-12730 and the Apache wiki.

SHOW CREATE DATABASE Support

Support has been added for the SHOW CREATE DATABASE command, which prints out the DDL statement that was used to create a database:

SHOW CREATE DATABASE database1;
        

For more information, see HIVE-11706

Support for Multiple-Column IN Clause

Support has been added so that the IN clause in queries operates over multiple column references. The new syntax is boldfaced in the following example:

CREATE TABLE test (col1 int, col2 int);
INSERT INTO TABLE test VALUES (1, 1), (1, 2), (2, 1), (2, 2);
SELECT * FROM test;
+------------+------------+
| test.col1  | test.col2  |
+------------+------------+
| 1          | 1          |
| 1          | 2          |
| 2          | 1          |
| 2          | 2          |
+------------+------------+
SELECT * FROM test WHERE (col1, col2) IN ((1, 1));
+------------+------------+
| test.col1  | test.col2  |
+------------+------------+
| 1          | 1          |
+------------+------------+
        

For more information, see HIVE-11600

Support for More Hive Functions

Support has been added for the following Hive UDFs:

bround chr factorial
floor_day floor_hours floor_minute
floor_month floor_quarter floor_second
floor_week floor_year grouping
mask mask_first_n mask_hash
mask_last_n mask_show_first_n mask_show_last_n
quarter replace sha1
sha shiftleft shiftright
shiftrightunsigned substring_index

All built-in Hive functions can be listed with the command SHOW FUNCTIONS; and a short description that explains what a function does is returned with the command DESCRIBE <function_name>; For more information about Hive functions, see the Apache wiki and Managing UDFs in the Cloudera enterprise documentation.