SQL differences between Impala and Hive

Impala's SQL syntax follows the SQL-92 standard, and includes extensions, such as built-in functions. Because Impala and Hive share the same Metastore database and their tables are often used interchangeably, this topic covers differences between Impala and Hive in detail.

The current release of Impala does not support the following SQL features that available in Hive SQL:

  • Extensibility mechanisms such as TRANSFORM, custom file formats, or custom SerDes.

  • XML functions.

  • BINARY data type.

  • Certain aggregate functions from Hive SQL: covar_pop, covar_samp, corr, percentile, percentile_approx, histogram_numeric, collect_set. Impala supports the set of aggregate functions and analytic functions.

  • Lateral views. In Impala 2.3 and higher, Impala supports queries on complex types (STRUCT, ARRAY, or MAP), using join notation rather than the EXPLODE() keyword.

User-defined functions (UDFs) are supported from Impala 1.2 onward.
  • Impala supports high-performance UDFs written in C++, as well as reusing some Java-based Hive UDFs.

  • Impala supports scalar UDFs and user-defined aggregate functions (UDAFs). Impala does not currently support user-defined table generating functions (UDTFs).

  • Only Impala-supported column types are supported in Java-based UDFs.

  • The Hive current_user() function cannot be called from a Java UDF through Impala.

Impala does not currently support these Hive SQL statements:

  • ANALYZE TABLE (the Impala equivalent is COMPUTE STATS)
  • DESCRIBE COLUMN
  • DESCRIBE DATABASE
  • EXPORT TABLE
  • IMPORT TABLE
  • SHOW TABLE EXTENDED
  • SHOW TBLPROPERTIES
  • SHOW INDEXES
  • SHOW COLUMNS
  • INSERT OVERWRITE DIRECTORY; use INSERT OVERWRITE table_name or CREATE TABLE AS SELECT to materialize query results into the HDFS directory associated with an Impala table.

Impala respects the serialization.null.format table property only for TEXT tables and ignores the property for Parquet and other formats. Hive respects the serialization.null.format property for Parquet and other formats and converts matching values to NULL during the scan. See the Data files for text tables section for using the table property in Impala.

Semantic differences between Impala and Hive SQL features

This section covers instances where Impala and Hive have similar functionality, sometimes including the same syntax, but there are differences in the runtime semantics of those features.

SQL statements and clauses:

  • Impala uses different syntax and names for query hints, [SHUFFLE] and [NOSHUFFLE] rather than MapJoin or StreamJoin.
  • Impala does not expose MapReduce specific features of SORT BY, DISTRIBUTE BY, or CLUSTER BY.
  • Impala does not require queries to include a FROM clause.

Data types:

  • Impala supports a limited set of implicit casts. This can help avoid undesired results from unexpected casting behavior.
    • Impala does not implicitly cast between string and numeric or Boolean types. Always use CAST() for these conversions.
    • Impala does perform implicit casts among the numeric types, when going from a smaller or less precise type to a larger or more precise one. For example, Impala will implicitly convert a SMALLINT to a BIGINT or FLOAT, but to convert from DOUBLE to FLOAT or INT to TINYINT requires a call to CAST() in the query.
    • Impala does perform implicit casts from STRING to TIMESTAMP. Impala has a restricted set of literal formats for the TIMESTAMP data type and the FROM_UNIXTIME() format string.

    See Impala SQL data types for full details on implicit and explicit casting for all types, and Impala type conversion functions for details about the CAST() function.

  • Impala does not store or interpret TIMESTAMP values using the local timezone, to avoid undesired results from unexpected time zone issues. Timestamps are stored and interpreted relative to UTC. This difference can produce different results for some calls to similarly named date/time functions between Impala and Hive. See TIMESTAMP data type for a discussion of how Impala handles time zones, and configuration options you can use to make Impala match the Hive behavior more closely when dealing with Parquet-encoded TIMESTAMP data or when converting between the local time zone and UTC.

  • The Impala TIMESTAMP type can represent dates ranging from 1400-01-01 to 9999-12-31. This is different from the Hive date range, which is 0000-01-01 to 9999-12-31.

  • Impala does not return column overflows as NULL, so that customers can distinguish between NULL data and overflow conditions similar to how they do so with traditional database systems. Impala returns the largest or smallest value in the range for the type. For example, valid values for a tinyint range from -128 to 127. In Impala, a tinyint with a value of -200 returns -128 rather than NULL. A tinyint with a value of 200 returns 127.

Miscellaneous features:

  • Impala does not provide virtual columns.
  • Impala does not expose locking.
  • Impala does not expose some configuration properties.