UTF-8 Support

Impala has traditionally treated STRING data as a single-byte binary character set, with the character data encoded as ASCII. This release adds a query option that makes the Impala STRING type UTF-8 aware, so that Impala behaves consistently with Hive on UTF-8 strings.

Turning ON the UTF-8 behavior

Before this release, some Impala functions were incompatible with their Hive counterparts when applied to non-ASCII strings. For example, the length() function in Impala returns the number of bytes in the string, while the length() function in Hive returns the number of UTF-8 characters in the string. UTF-8 characters (code points) are encoded as variable-length sequences of 1 to 4 bytes, so the results differ when the string contains non-ASCII characters.
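The difference between the two ways of counting can be illustrated outside Impala with a short Python sketch. It is only an analogy for the semantics described above, not Impala or Hive code, and the helper names byte_length and char_length are hypothetical.

```python
def byte_length(s: str) -> int:
    """Count the bytes in the UTF-8 encoding of s (byte-oriented length)."""
    return len(s.encode("utf-8"))


def char_length(s: str) -> int:
    """Count the Unicode code points in s (character-oriented length)."""
    return len(s)


# ASCII-only text: both counts agree.
print(byte_length("hello"), char_length("hello"))  # prints: 5 5

# Each of these two characters takes 3 bytes in UTF-8, so the counts differ.
print(byte_length("你好"), char_length("你好"))  # prints: 6 2
```

For ASCII-only strings the two counts are identical, which is why the incompatibility only shows up with non-ASCII data.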

UTF-8 support enables you to read and write UTF-8 from standard formats like Parquet and ORC, thus improving interoperability with other engines supporting those standard formats.

You can use the new query option, UTF8_MODE, to enable or disable the UTF-8 aware behavior. You can set it globally or at the session level. Only queries that run with UTF8_MODE=true use the UTF-8 aware behavior.

List of STRING functions

The new query option enables the UTF-8 aware behavior of the following string functions:

  • LENGTH(STRING a)
    • returns the number of UTF-8 characters instead of bytes
  • SUBSTR(STRING a, INT start [, INT len])
  • SUBSTRING(STRING a, INT start [, INT len])
    • the substring start position and length are counted as UTF-8 characters instead of bytes
  • REVERSE(STRING a)
    • the unit of the operation is a UTF-8 character. That is, it does not reverse bytes inside a UTF-8 character.
  • INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
  • LOCATE(STRING substr, STRING str[, INT pos])
    • These functions have an optional position argument. The return values are also positions in the string. In UTF-8 mode, these positions are counted as UTF-8 characters instead of bytes.
  • mask functions
    • The unit of the operation is a UTF-8 character. That is, they do not mask the string byte-to-byte.
  • upper/lower/initcap
    • These functions recognize non-ASCII characters and transform them based on the current locale used by the Impala process.

Limitations

  • UTF8_MODE is an experimental feature and its performance is not optimized yet, so use the option only when needed.
  • UTF-8 support for CHAR and VARCHAR types is not implemented yet. As a result, VARCHAR(N) returns N bytes instead of N UTF-8 characters.
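Why a byte-based limit matters for non-ASCII data can be shown with a small Python sketch. It only demonstrates the general effect of cutting a UTF-8 string at N bytes versus N characters. The helper names are hypothetical, and how Impala itself treats a partially cut character is not stated here.

```python
def truncate_bytes(s: str, n: int) -> bytes:
    """Keep at most n bytes of the UTF-8 encoding of s."""
    return s.encode("utf-8")[:n]


def truncate_chars(s: str, n: int) -> str:
    """Keep at most n code points of s."""
    return s[:n]


text = "你好"  # two characters, three UTF-8 bytes each

# A 4-byte limit keeps the first character plus one stray byte of the second.
cut = truncate_bytes(text, 4)
print(len(cut))  # prints: 4

# A 4-character limit keeps the whole string, which is only 2 characters long.
print(len(truncate_chars(text, 4)))  # prints: 2
```

Decoding the byte-truncated value in this sketch would require replacing or dropping the incomplete trailing character, so a limit expressed in bytes can hold fewer characters than the same limit expressed in characters.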