This is the documentation for CDH 5.0.x. Documentation for other versions is available at Cloudera Documentation.

SQL Differences Between Impala and Hive

The current release of Impala does not support the following SQL features that you might be familiar with from HiveQL:

  • Non-scalar data types such as maps, arrays, structs.
  • Extensibility mechanisms such as TRANSFORM, custom file formats, or custom SerDes.
  • XML and JSON functions.
  • Certain aggregate functions from HiveQL: variance, var_pop, var_samp, stddev_pop, stddev_samp, covar_pop, covar_samp, corr, percentile, percentile_approx, histogram_numeric, collect_set; Impala supports the set of aggregate functions listed in Aggregate Functions
  • Sampling.
  • Lateral views.
  • Multiple DISTINCT clauses per query.
      Note:

    Impala only allows a single COUNT(DISTINCT columns) expression in each query. To produce the same result as multiple COUNT(DISTINCT) expressions, you can use the following technique for queries involving a single table:

    select v1.c1 result1, v2.c1 result2 from
      (select count(distinct col1) as c1 from t1) v1
        cross join
      (select count(distinct col2) as c1 from t1) v2;
    

    If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by specifying COUNT(NDV(column)); a query can contain multiple instances of COUNT(NDV(column)).

User-defined functions (UDFs) are supported starting in Impala 1.2. See User-Defined Functions (UDFs) for full details on Impala UDFs.
  • Impala supports high-performance UDFs written in C++, as well as reusing some Java-based Hive UDFs.
  • Impala supports scalar UDFs and user-defined aggregate functions (UDAFs). Impala does not currently support user-defined table generating functions (UDTFs).
  • Only Impala-supported column types are supported in Java-based UDFs.

Authorization features such as roles are implemented differently in Impala. Impala utilizes the Apache Sentry (incubating) authorization framework, which provides fine-grained role-based access control to protect data against unauthorized access or tampering. See Enabling Sentry Authorization for Impala for the details of authorization in Impala. In constrast, the built-in Hive authorization feature is primarily to prevent accidental deletion of data, rather than a security mechanism to protect against malicious users, because users can grant themselves additional permissions.

Impala does not currently support these HiveQL statements:

  • ANALYZE TABLE (the Impala equivalent is COMPUTE STATS)
  • DESCRIBE COLUMN
  • DESCRIBE DATABASE
  • EXPORT TABLE
  • IMPORT TABLE
  • SHOW PARTITIONS
  • SHOW TABLE EXTENDED
  • SHOW INDEXES
  • SHOW COLUMNS

The semantics of Impala SQL statements varies from HiveQL in some cases where they use similar SQL statement and clause names:

  • Impala uses different syntax and names for query hints. See Joins for the Impala details.
  • Impala does not expose MapReduce specific features of SORT BY, DISTRIBUTE BY, or CLUSTER BY.
  • Impala does not require queries to include a FROM clause.
  • Impala supports a limited set of implicit casts. This can help avoid undesired results from unexpected casting behavior.
    • Impala does not implicitly cast between string and numeric or Boolean types. Always use CAST() for these conversions.
    • Impala does perform implicit casts among the numeric types, when going from a smaller or less precise type to a larger or more precise one. For example, Impala will implicitly convert a SMALLINT to a BIGINT or FLOAT, but to convert from DOUBLE to FLOAT or INT to TINYINT requires a call to CAST() in the query.
    • Impala does perform implicit casts from string to timestamp.

    See Data Types for full details on implicit and explicit casting for all types, and Type Conversion Functions for details about the CAST() function.

  • Impala has a restricted set of literal formats for the TIMESTAMP data type and the from_unixtime() format string; see TIMESTAMP Data Type for details.
  • Impala does not store timestamps using the local timezone, to avoid undesired results from unexpected time zone issues. Timestamps are stored relative to GMT.
  • Impala does not return column overflows as NULL, so that customers can distinguish between NULL data and overflow conditions similar to how they do so with traditional database systems. Impala returns the largest or smallest value in the range for the type. For example, valid values for a tinyint range from -128 to 127. In Impala, a tinyint with a value of -200 returns -128 rather than NULL. A tinyint with a value of 200 returns 127.
  • Impala does not provide virtual columns.
  • Impala does not expose locking.
  • Impala does not expose some configuration properties.
Page generated September 3, 2015.