This is the documentation for CDH 5.1.x. Documentation for other versions is available at Cloudera Documentation.

PARQUET_FILE_SIZE

Specifies the maximum size of each Parquet data file produced by Impala INSERT statements. For small or partitioned tables where the default Parquet block size of 1 GB is much larger than needed for each data file, you can increase parallelism by specifying a smaller size, resulting in more HDFS blocks that can be processed by different nodes. Reducing the file size also reduces the memory required to buffer each block before writing it to disk.

Specify the size in bytes, for example:

set PARQUET_FILE_SIZE=128000000;
INSERT INTO parquet_table SELECT * FROM text_table;

Default: 0 (produces files with a maximum size of 1 gigabyte)
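The option applies to each data file the INSERT writes, so it is especially useful for partitioned tables, where every partition gets its own files. The following sketch (the table names, partition column, and values are hypothetical, for illustration only) shows writing 128 MB files into one partition:

```sql
-- Hypothetical example: write smaller Parquet files for a partitioned table.
-- 134217728 bytes = 128 MB, so each partition's data is split into more,
-- smaller files whose HDFS blocks can be processed by different nodes.
set PARQUET_FILE_SIZE=134217728;
INSERT INTO sales_parquet PARTITION (year=2015)
  SELECT id, amount FROM sales_text WHERE year = 2015;

-- Setting the option back to 0 restores the default 1 GB maximum.
set PARQUET_FILE_SIZE=0;
```

Because the setting is a session-level query option, it stays in effect for subsequent INSERT statements in the same impala-shell session until you change it again.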

For information about the Parquet file format, and how the number and size of data files affect query performance, see Using the Parquet File Format with Impala Tables.

Page generated September 3, 2015.