Apache Parquet Known Issues

Parquet file writes run out of memory if (number of partitions) times (block size) exceeds available memory

The Parquet output writer allocates one block of memory for each table partition it is writing and writes all partitions in parallel. The MapReduce or YARN task runs out of memory if (number of partitions) times (Parquet block size) exceeds the memory available to the task.

Cloudera Bug: CDH-20157, CDH-20253

Workaround: None. If necessary, reduce the number of partitions in the table so that fewer partition writers are open in each task.
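
As a rough illustration of the estimate above, the following self-contained Java sketch compares (number of partitions) times (Parquet block size) against the memory available to a task. All values are hypothetical examples, not recommended settings; 128 MB is the parquet-mr default block size.

    // Minimal sketch of the memory arithmetic described above.
    // All numbers are hypothetical examples, not recommended settings.
    public class ParquetWriteMemoryEstimate {
        public static void main(String[] args) {
            long partitions = 200;                          // partitions written by a single task (hypothetical)
            long blockSizeBytes = 128L * 1024 * 1024;       // Parquet block size; 128 MB is the parquet-mr default
            long taskMemoryBytes = 4L * 1024 * 1024 * 1024; // memory available to the MapReduce/YARN task (hypothetical)

            // One block is buffered for every partition writer that is open at once.
            long requiredBytes = partitions * blockSizeBytes;
            System.out.printf("Writers need ~%d MB; task has %d MB%n",
                    requiredBytes >> 20, taskMemoryBytes >> 20);
            if (requiredBytes > taskMemoryBytes) {
                System.out.println("Exceeds available memory: reduce the number of partitions.");
            }
        }
    }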

parquet-thrift cannot read Parquet data written by Hive

parquet-thrift cannot read Parquet data written by Hive, and parquet-avro exposes an additional record level, named array_element, inside lists.

Bug: PARQUET-113

Cloudera Bug: CDH-22189, CDH-22220

Workaround: None; arrays written by parquet-avro or parquet-thrift cannot currently be read by parquet-hive.
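
To make the incompatibility concrete, the following sketch parses the kind of Parquet schema Hive produced for an ARRAY<INT> column, using the parquet-mr schema API (org.apache.parquet namespace). The table name, column name, and the inner group name "bag" are illustrative assumptions; only the array_element field name comes from the issue text.

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class HiveParquetListSchema {
        public static void main(String[] args) {
            // Hypothetical schema for a Hive column declared as ARRAY<INT>.
            // The inner repeated group name ("bag") is an assumption based on the
            // list layout Hive's Parquet writer used; the element field name
            // "array_element" is the one mentioned in the issue above.
            MessageType hiveWrittenList = MessageTypeParser.parseMessageType(
                    "message hive_table {\n"
                  + "  optional group scores (LIST) {\n"
                  + "    repeated group bag {\n"
                  + "      optional int32 array_element;\n"
                  + "    }\n"
                  + "  }\n"
                  + "}");
            System.out.println(hiveWrittenList);
            // parquet-thrift expects a different list layout and cannot read this data,
            // while parquet-avro maps the repeated group to an extra record level, so
            // each list element appears as a record with a single array_element field.
        }
    }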