PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (CDH 5.8 or higher only)

Causes Impala INSERT and CREATE TABLE AS SELECT statements to write Parquet files that use the UTF-8 annotation for STRING columns.

Usage notes:

By default, Impala represents a STRING column in Parquet as an unannotated binary field, that is, a BYTE_ARRAY with no annotation saying that the bytes are text.

Impala always uses the UTF-8 annotation when writing CHAR and VARCHAR columns to Parquet files. An alternative to using the query option is to cast STRING values to VARCHAR.
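The two approaches described above can be sketched as follows. This is a minimal illustration, not taken from the original page: the table names (names_text, names_parquet, names_parquet_varchar), the column names, and the VARCHAR length of 100 are all hypothetical and should be adapted to your schema.

```sql
-- Hypothetical source table with a STRING column.
CREATE TABLE names_text (id BIGINT, name STRING);

-- Approach 1: enable the query option for the session, then write Parquet.
-- STRING columns written by the statements below carry the UTF-8 annotation.
SET PARQUET_ANNOTATE_STRINGS_UTF8=true;

CREATE TABLE names_parquet STORED AS PARQUET
  AS SELECT id, name FROM names_text;

INSERT INTO names_parquet SELECT id, name FROM names_text;

-- Restore the default so later writes in this session are unannotated.
SET PARQUET_ANNOTATE_STRINGS_UTF8=false;

-- Approach 2: leave the option off and cast STRING to VARCHAR instead.
-- The length 100 is an assumed value; pick one that fits your data.
CREATE TABLE names_parquet_varchar STORED AS PARQUET
  AS SELECT id, CAST(name AS VARCHAR(100)) AS name FROM names_text;
```

The cast changes the column type of the resulting table to VARCHAR and can truncate values longer than the declared length, so the query option is likely the simpler choice when the column should remain a STRING.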

This option helps make Impala-written data more interoperable with other data processing engines. Impala itself currently does not support all operations on UTF-8 data. Although data processed by Impala is typically represented in ASCII, it is valid to designate the data as UTF-8 when storing it on disk, because ASCII is a subset of UTF-8.

Type: Boolean; recognized values are 1 and 0, or true and false; any other value is interpreted as false

Default: false (shown as 0 in the output of the SET statement)
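As a brief sketch of how the setting might be changed and inspected in an impala-shell session (the exact layout of the SET output varies by release, so none is shown here):

```sql
-- Enable the option using the numeric form; true is equivalent.
SET PARQUET_ANNOTATE_STRINGS_UTF8=1;

-- With no arguments, SET lists the query options and their current values.
SET;

-- Return to the default.
SET PARQUET_ANNOTATE_STRINGS_UTF8=0;
```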

Added in: CDH 5.8.0 / Impala 2.6.0

Related information:

Using the Parquet File Format with Impala Tables