Apache Hive Components Incompatible Changes in CDH 6.0

UNION ALL Statements Involving Data Types from Different Type Groups No Longer Use Implicit Type Casting

Prior to this change, Hive performed implicit casts when data types from different type groups were specified in queries that use UNION ALL. For example, before CDH 6.0, if you had the two following tables:

Table "one"

+------------+------------+------------+--+
| one.col_1  | one.col_2  | one.col_3  |
+------------+------------+------------+--+
| 21         | hello_all  | b          |
+------------+------------+------------+--+
        

Where col_1 datatype is int, col_2 datatype is string, and col_3 datatype is char(1).

Table "two"

+------------+------------+------------+--+
| two.col_4  | two.col_5  | two.col_6  |
+------------+------------+------------+--+
| 75.0       | abcde      | 45         |
+------------+------------+------------+--+
        

Where col_4 datatype is double, col_5 datatype is varchar(5), and col_6 datatype is int.

And you ran the following UNION ALL query against these two tables:

SELECT * FROM one UNION ALL SELECT col_4 AS col_1, col_5 AS col_2, col_6 AS
col_3 FROM two;
        

You received the following result set:

+------------+------------+------------+--+
| _u1.col_1  | _u1.col_2  | _u1.col_3  |
+------------+------------+------------+--+
| 75.0       | abcde      | 4          |
| 21.0       | hello      | b          |
+------------+------------+------------+--+
        

Note that this statement implicitly casts the values from table one with the following errors resulting in data loss:

  • one.col_1 is cast to a double datatype
  • one.col_2 is cast to a varchar(5) datatype, which truncates the original value from hello_all to hello
  • one.col_3 is cast to a char(1) datatype, which truncates the original value from 45 to 4

In CDH 6.0, no implicit cast is performed across different type groups. For example, STRING, CHAR, and VARCHAR are in one type group, and INT, BIGINT, and DECIMAL are in another type group, and so on. So, in CDH 6.0 and later, the above query that uses UNION ALL, returns an exception for the columns that contain datatypes that are not part of a type group. In CDH 6.0 and later, Hive performs the implicit cast only within type groups and not across different type groups. For more information, see HIVE-14251.

OFFLINE and NO_DROP Options Removed from Table and Partition DDL

Support for Hive table and partition protection options have been removed in CDH 6.0, which includes removal of the following functionality:

  • Support has been removed for:

    • ENABLE | DISABLE NO_DROP [CASCADE]
    • ENABLE | DISABLE OFFLINE
    • ALTER TABLE … IGNORE PROTECTION
  • The following support has also been removed from the HiveMetastoreClient class:

    The ignoreProtection parameter has been removed from the dropPartitions methods in the IMetaStoreClient interface.

For more information, see HIVE-11145.

Cloudera recommends that you use Apache Sentry to replace most of this functionality. Although Sentry governs permissions on ALTER TABLE, it does not include permissions that are specific to a partition. See Authorization Privilege Model for Hive and Impala and Configuring the Sentry Service.

DESCRIBE Query Syntax Change

In CDH 6.0 syntax has changed for DESCRIBE queries as follows:

  • DESCRIBE queries where the column name is separated by the table name using a period is no longer supported:

    DESCRIBE testTable.testColumn;
                

    Instead, the table name and column name must be separated with a space:

    DESCRIBE testTable testColumn;
                
  • The partition_spec must appear after the table name, but before the optional column name:

    DESCRIBE default.testTable PARTITION (part_col = 100) testColumn;
                

For more details, see the Apache wiki and HIVE-12184.

CREATE TABLE Change: Periods and Colons No Longer Allowed in Column Names

In CDH 6.0, CREATE TABLE statements fail if any of the specified column names contain a period or a colon. For more information, see HIVE-10120 and the Apache wiki.

Reserved and Non-Reserved Keyword Changes in HiveQL

Hive reserved and non-reserved keywords have changed in CDH 6.0. Reserved keywords cannot be used as table or column names unless they are enclosed with back ticks (for example, `data`). Non-reserved keywords can be used as table or column names without enclosing them with back ticks. Non-reserved keywords have proscribed meanings in HiveQL, but can still be used as table or column names. For more information about the changes to reserved and non-reserved words listed below, see HIVE-6617 and HIVE-14872.

In CDH 6.0, the following changes have occurred with Hive reserved and non-reserved keywords:

Hive New Reserved Keywords Added in CDH 6.0

The following table contains new reserved keywords that have been added:

COMMIT CONSTRAINT DEC EXCEPT
FOREIGN INTERVAL MERGE NUMERIC
ONLY PRIMARY REFERENCES ROLLBACK
START

Hive Non-Reserved Keywords Converted to Reserved Keywords in CDH 6.0

The following table contains non-reserved keywords that have been converted to be reserved keywords:

ALL ALTER ARRAY AS
AUTHORIZATION BETWEEN BIGINT BINARY
BOOLEAN BOTH BY CREATE
CUBE CURSOR DATE DECIMAL
DOUBLE DELETE DESCRIBE DROP
EXISTS EXTERNAL FALSE FETCH
FLOAT FOR FULL GRANT
GROUP GROUPING IMPORT IN
INT INNER INSERT INTERSECT
INTO IS LATERAL LEFT
LIKE LOCAL NONE NULL
OF ORDER OUT OUTER
PARTITION PERCENT PROCEDURE RANGE
READS REGEXP REVOKE RIGHT
RLIKE ROLLUP ROW ROWS
SET SMALLINT TABLE TIMESTAMP
TO TRIGGER TRUNCATE UNION
UPDATE USER USING VALUES
WITH TRUE

Hive Reserved Keywords Converted to Non-Reserved Keywords in CDH 6.0

The following table contains reserved keywords that have been converted to be non-reserved keywords:

CURRENT_DATE CURRENT_TIMESTAMP HOLD_DDLTIME IGNORE
NO_DROP OFFLINE PROTECTION READONLY

Hive New Non-Reserved Keywords Added in CDH 6.0

The following table contains new non-reserved keywords that have been added:

ABORT AUTOCOMMIT CACHE DAY
DAYOFWEEK DAYS DETAIL DUMP
EXPRESSION HOUR HOURS ISOLATION
KEY LAST LEVEL MATCHED
MINUTE MINUTES MONTH MONTHS
NORELY NOVALIDATE NULLS OFFSET
OPERATOR RELY SECOND SECONDS
SNAPSHOT STATUS SUMMARY TRANSACTION
VALIDATE VECTORIZATION VIEWS WAIT
WORK WRITE YEAR YEARS

Hive Non-Reserved Keyword Removed in CDH 6.0

The following non-reserved keyword has been removed:

DEFAULT

Apache Hive API Changes in CDH 6.0

AddPartitionMessage.getPartitions() Can Return NULL

The getPartitions() method has been removed from the AddPartitionEvent class in the org.apache.hadoop.hive.metastore.events interface. It was removed to prevent out-of-memory errors when the list of partitions is too large.

Instead use the getPartitionIterator() method. For more information, see HIVE-9609 and the AddPartitionEvent documentation.

DropPartitionEvent and PreDropPartitionEvent Class Changes

The getPartitions() method has been removed and replaced by the getPartitionIterator() method in the DropPartitionEvent class and the PreDropPartitionEvent class.

In addition, the (Partition partition, boolean deleteData, HiveMetastore.HMSHandler handler) constructors have been deleted from the PreDropPartitionEvent class. For more information, see HIVE-9674 and the PreDropPartitionEvent documentation.

GenericUDF.getTimestampValue Method Now Returns Timestamp Instead of Date

The getTimestampValue method in the GenericUDF class now returns a TIMESTAMP value instead of a DATE value. For more information, see HIVE-10275 and the GenericUDF documentation.

GenericUDF.getConstantLongValue Has Been Removed

The getConstantLongValue method has been removed from the GenericUDF class. It has been noted by the community that this method is not used in Hive. For more information, see HIVE-10710 and the GenericUDF documentation.

Increased Width of Hive Metastore Configuration Columns

The columns used for configuration values in the Hive metastore have been increased in width, resulting in the following incompatible changes in the org.apache.hadoop.hive.metastore.api interface.

This change introduced an incompatible change to the get_table_names_by_filter method of the ThriftHiveMetastore class. Before this change, this method accepts a string filter, which allows clients to filter a table by its TABLEPROPERTIES value. For example:

org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.HIVE_FILTER_FIELD_
       PARAMS + "test_param_1 <> \"yellow\"";

org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.HIVE_FILTER_FIELD_
       PARAMS + "test_param_1 = \"yellow\"";
              

After this change, the TABLE_PARAMS.PARAM_VALUE column is now a CLOB data type. Depending on the type of database that you use (for example, MySQL, Oracle, or PostgresSQL), the semantics may have changed and operators like "=", "<>", and "!=" might not be supported. Refer to the documentation for your database for more information. You must use operators that are compatible with CLOB data types. There is no equivalent "<>" operator that is compatible with CLOB. So there is no equivalent operator for the above example that uses the "<>" inequality operator. The equivalent for "=" is the LIKE operator so you would rewrite the second example above as:

org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.HIVE_FILTER_FIELD_
        PARAMS + "test_param_1 LIKE \"yellow"";
          

For more information, see HIVE-12274.

Apache Hive Configuration Changes in CDH 6.0

Bucketing and Sorting Enforced by Default When Inserting Data into Hive Tables

The configuration properties hive.enforce.sorting and hive.enforce.bucketing have been removed. When set to false, these configurations disabled enforcement of sorted and bucketed tables when data was inserted into a table. Removing these configuration properties effectively sets these properties to true. In CDH 6.0, bucketing and sorting are enforced on Hive tables during insertions and cannot be turned off. For more information, see the Apache wiki topic on hive.enforce.bucketing and the topic on hive.enforce.sorting.

Hive Throws an Exception When Processing HDFS Directories Containing Unsupported Characters

Directories in HDFS can contain unprintable or unsupported characters that are not visible even when you run the hadoop fs -ls command on the directories. When external tables are created with the MSCK REPAIR TABLE command, the partitions using these HDFS directories that contain unsupported characters are unusable for Hive. To avoid this, the configuration parameter hive.msck.path.validation has been added. This configuration property controls the behavior of the MSCK REPAIR TABLE command, enabling you to set whether validation checks are run on the HDFS directories when MSCK REPAIR TABLE is run.

The property hive.msck.path.validation can be set to one of the following values:

Value Name Description
throw Causes Hive to throw an exception when it tries to process an HDFS directory that contains unsupported characters with the MSCK REPAIR TABLE command. This is the default setting for hive.msck.path.validation.
skip Causes Hive to skip the skip the directories that contain unsupported characters, but still repairs the others.
ignore Causes Hive to completely skip any validation of HDFS directories when the MSCK REPAIR TABLE command is run. This setting can cause bugs because unusable partitions are created.

By default, the hive.msck.path.validation property is set to throw, which causes Hive to throw an exception when MSCK REPAIR TABLE is run and HDFS directories containing unsupported characters are encountered. To work around this, set this property to skip until you can repair the HDFS directories that contain unsupported characters.

To set this property in Cloudera Manager:

  1. In the Admin Console, select the Hive service.
  2. Click the Configuration tab.
  3. Search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting.
  4. In the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting, add the Name of the property, the Value (throw, skip, or ignore), and a Description of the setting.
  5. Click Save Changes and restart the service.

For more information, see HIVE-10722.

Hive Strict Checks Have Been Re-factored To Be More Granular

Originally, the configuration property hive.mapred.mode was added to restrict certain types of queries from running. Now it has been broken down into more fine-grained configurations, one for each type of restricted query pattern. The configuration property hive.mapred.mode has been removed and replaced with the following configuration properties, which provide more granular control of Hive strict checks:

Configuration Property Description Default Value
hive.strict.checks.bucketing When set to true, running LOAD DATA queries against bucketed tables is not allowed. true. This is a backwards incompatible change.
hive.strict.checks.type.safety When set to true, comparing bigint to string data types or bigint to double data types is not allowed. true. This is a backwards incompatible change.
hive.strict.checks.orderby.no.limit When set to true, prevents queries from being run that contain an ORDER BY clause with no LIMIT clause. false
hive.strict.checks.no.partition.filter When set to true, prevents queries from being run that scan a partitioned table but do not filter on the partition column. false
hive.strict.checks.cartesian.product When set to true, prevents queries from being run that contain a Cartesian product (also known as a cross join). false

All of these properties can be set with Cloudera Manager in the following configuration settings for the Hive service:

  • Restrict LOAD Queries Against Bucketed Tables (hive.strict.checks.bucketing)
  • Restrict Unsafe Data Type Comparisons (hive.strict.checks.type.safety)
  • Restrict Queries with ORDER BY but no LIMIT clause (hive.strict.checks.orderby.no.limit)
  • Restrict Partitioned Table Scans with no Partitioned Column Filter (hive.strict.checks.no.partition.filter)
  • Restrict Cross Joins (Cartesian Products) (hive.strict.checks.cartesian.product)

For more information about these configuration properties, see HIVE-12727, HIVE-15148, HIVE-18251, and HIVE-18552.

Java XML Serialization Has Been Removed

The configuration property hive.plan.serialization.format has been removed. Previously, this configuration property could be set to either javaXML or kryo. Now the default is kryo serialization, which cannot be changed. For more information, see HIVE-12609 and the Apache wiki.

Configuration Property Enabling Column Position Usage with GROUP BY and ORDER BY Separated into Two Properties

The configuration property hive.groupby.orderby.position.alias, which enabled using column position with the GROUP BY and the ORDER BY clauses has been removed and replaced with the following two configuration properties. These configuration properties enable using column position with GROUP BY and ORDER BY separately:

Configuration Property Name Description/Default Setting Possible Values
hive.groupby.position.alias When set to true, specifies that columns can be referenced with their position when using GROUP BY clauses in queries. Default Setting: false. This behavior is turned off by default. true | false
hive.orderby.position.alias When set to true, specifies that columns can be referenced with their position when using ORDER BY clauses in queries. Default Setting: true. This behavior is turned on by default. true | false

For more information, see HIVE-15797 and the Apache wiki entries for configuration properties, GROUP BY syntax, and ORDER BY syntax.

HiveServer2 Impersonation Property (hive.server2.enable.impersonation) Removed

In earlier versions of CDH, the following two configuration properties could be used to set impersonation for HiveServer2:

  • hive.server2.enable.impersonation
  • hive.server2.enable.doAs

In CDH 6.0, hive.server2.enable.impersonation is removed. To configure impersonation for HiveServer2, use the configuration property hive.server2.enable.doAs. To set this property in Cloudera Manager, select the Hive service and click on the Configuration tab. Then search for the HiveServer2 Enable Impersonation setting and select the checkbox to enable HiveServer2 impersonation. This property is enabled by default in CDH 6.

For more information about this property, see the Apache wiki documentation for HiveServer2 configuration properties.

Changed Default File Format for Storing Intermediate Query Results

The configuration property hive.query.result.fileformat controls the file format in which a query's intermediate results are stored. In CDH 6, the default setting for this property has been changed from TextFile to SequenceFile.

To change this configuration property in Cloudera Manager:

  1. In the Admin Console, select the Hive service and click on the Configuration tab.
  2. Then search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting and add the following information:

    • Name: hive.query.result.fileformat
    • Value: Valid values are TextFile, SequenceFile (default), or RCfile
    • Description: Sets the file format in which a query's intermediate results are stored.
  3. After you add this information, click Save Changes and restart the Hive service.

For more information about this parameter, see the Apache wiki.

HiveServer2 Thrift API Code Repackaged Resulting in Class File Location Changes

HiveServer2 Thrift API code has been repackaged in CDH 6.0, resulting in the following changes:

  • All files generated by the Thrift API for HiveServer2 have moved from the following old namespace:

    org.apache.hive.service.cli.thrift

    To the following new namespace:

    org.apache.hive.service.rpc.thrift

  • All files generated by the Thrift API for HiveServer2 have moved into a separate jar file called service-rpc.

As a result of these changes, all Java classes such as TCLIService.java, TOpenSessionReq.java, TSessionHandle.java, and TGetSchemasReq.java have changed locations. For more information, see HIVE-12442.

Values Returned for Decimal Numbers Are Now Padded with Trailing Zeroes to the Scale of the Specified Column

Decimal values that are returned in query results are now padded with trailing zeroes to match the specified scale of the corresponding column. For example, before this change, when Hive read a decimal column with a specified scale of 5, the value returned for zero was returned as 0. Now, the value returned for zero is 0.00000. For more information, see HIVE-12063.

Hive Logging Framework Switched to SLF4J/Log4j 2

The logging framework for Hive has switched to SLF4J (Simple Logging Facade for Java) and now uses Log4j 2 by default. Use of Log4j 1.x, Apache Commons Logging, and java.util.logging have been removed. To accommodate this change, write all Log4j configuration files to be compatible with Log4j 2.

For more information, see HIVE-12237, HIVE-11304, and the Apache wiki.

Deprecated Parquet Java Classes Removed from Hive

The deprecated parquet classes, parquet.hive.DeprecatedParquetInputFormat and parquet.hive.DeprecatedParquetOutputFormat have been removed from Hive because they resided outside of the org.apache namespace. Any existing tables that use these classes are automatically migrated to the new SerDe classes when the metastore is upgraded.

Use one of the following options for specifying the Parquet SerDe for new Hive tables:

  • Specify in the CREATE TABLE statement that you want it stored as Parquet. For example:

    CREATE TABLE <parquet_table_name> (col1 INT, col2 STRING) STORED AS PARQUET;
                
  • Set the INPUTFORMAT to org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat and set the OUTPUTFORMAT to org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat. For example:

    CREATE TABLE <parquet_table_name> (col1 INT, col2 STRING)
    STORED AS
         INPUTFORMAT "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
         OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat";
                

For more information, see HIVE-6757 and the Apache wiki.

Removed JDBC, Counter-based, and HBase-based Statistics Collection Mechanisms

Support for JDBC, counter-based, and HBase-based statistics collection mechanisms has been removed from Hive. The following configuration properties are no longer supported:

  • hive.stats.dbclass
  • hive.stats.retries.wait
  • hive.stats.retries.max
  • hive.stats.jdbc.timeout
  • hive.stats.dbconnectionstring
  • hive.stats.jdbcdrive
  • hive.stats.key.prefix.reserve.length

This change also removed the cleanUp(String keyPrefix) method from the StatsAggregator interface.

Now all Hive statistics are collected on the default file system. For more information, see HIVE-12164, HIVE-12411, HIVE-12005, and the Apache wiki.

S3N Connector Is Removed from CDH 6.0

The S3N connector, which is used to connect to the Amazon S3 file system from Hive has been removed from CDH 6.0. To connect to the S3 file system from Hive in CDH 6.0, you must now use the S3A connector. There are a number of differences between the S3N and the S3A connectors, including configuration differences. See the Apache wiki page on integrating with Amazon Web Services for details.

Migration involves making the following changes:

  • Changing all metastore data containing URIs that start with s3n:// to s3a://. This change is performed automatically when you upgrade the Hive metastore.
  • Changing all scripts containing URIs that start with s3n:// to s3a://. You must perform this change manually.

Columns Added to TRowSet Returned by the Thrift TCLIService#GetTables Request

Six additional columns have been added to the TRowSet that is returned by the TCLIService#GetTables request. These columns were added to comply with the official JDBC API. For more information, see the documentation for java.sql.DatabaseMetaData.

The columns added are:

Column Name Description
REMARKS Explanatory comment on the table.
TYPE_CAT Types catalog.
TYPE_SCHEMA Types schema.
TYPE_NAME Types name.
SELF_REFERENCING_COL_NAME Name of the designed identifier column of a typed table.
REF_GENERATION Specifies how values in the SELF_REFERENCING_COL_NAME column are created.

For more information, see HIVE-7575.

Support Added for Escaping Carriage Returns and New Line Characters for Text Files (LazySimpleSerDe)

Support has been added for escaping carriage returns and new line characters in text files by modifying the LazySimpleSerDe class. Without this change, carriage returns and new line characters are interpreted as delimiters, which causes incorrect query results.

This feature is controlled by the SerDe property serialization.escape.crlf. It is enabled (set to true) by default. If serialization.escape.crlf is enabled, 'r' or 'n' cannot be used as separators or field delimiters.

This change only affects text files and removes the getNullString method from the LazySerDeParameters class. For more information, see HIVE-11785.