File Formats and Compression

CDH supports all standard Hadoop file formats. For information about the file formats, see the File-Based Data Structures section of the Hadoop I/O chapter in Hadoop: The Definitive Guide.

The file format has a significant impact on performance. Use Avro if your use case typically scans or retrieves all of the fields in a row in each query. Parquet is a better choice if your dataset has many columns, and your use case typically involves working with a subset of those columns instead of entire records. For more information, see this Parquet versus Avro benchmark study.
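The difference between the two access patterns can be illustrated with a toy model. The following is a minimal sketch in plain Python, not real Avro or Parquet I/O: the dataset, the field names (`c0` through `c19`), and the helper functions are all hypothetical, and JSON stands in for the actual encodings. It only shows why a layout that stores each column separately reads fewer bytes when a query touches a subset of columns.

```python
import json

# Hypothetical toy dataset: 1,000 records with 20 fields each.
NUM_ROWS, NUM_COLS = 1000, 20
records = [{f"c{j}": i * j for j in range(NUM_COLS)} for i in range(NUM_ROWS)]

# Row-oriented layout (Avro-like in spirit): one serialized blob per record.
row_store = [json.dumps(r).encode() for r in records]

# Column-oriented layout (Parquet-like in spirit): one serialized blob per field.
column_store = {
    name: json.dumps([r[name] for r in records]).encode()
    for name in records[0]
}


def bytes_scanned_row(wanted):
    # The wanted fields do not matter here: every record is read in full
    # before the requested fields can be extracted from it.
    return sum(len(blob) for blob in row_store)


def bytes_scanned_columnar(wanted):
    # Only the blobs belonging to the requested fields are read.
    return sum(len(column_store[name]) for name in wanted)


subset = ["c1", "c2"]
all_cols = list(records[0])

print("row layout, 2 columns:     ", bytes_scanned_row(subset))
print("columnar layout, 2 columns:", bytes_scanned_columnar(subset))
print("columnar layout, all cols: ", bytes_scanned_columnar(all_cols))
```

In this model the row layout scans the same number of bytes regardless of how many fields are requested, while the columnar layout scales with the number of columns selected. Real formats add encoding, metadata, and compression on top, so actual ratios will differ.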

All file formats include support for compression. Compression reduces the size of data on disk and, consequently, the amount of I/O needed to read and write it, at the cost of additional CPU time to compress and decompress the data.
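The size versus CPU trade-off can be observed with any general-purpose codec. The sketch below uses `zlib` from the Python standard library purely as an illustration; it is not one of the Hadoop codecs, and the sample payload and the `compress_stats` helper are hypothetical. It reports the compressed size and the time spent for a few compression levels.

```python
import time
import zlib

# Hypothetical sample payload: CSV-like text, which compresses well
# because of its repetitive structure.
data = b"id,name,value\n" + b"".join(
    b"%d,user%d,%d\n" % (i, i % 50, i * 7) for i in range(50000)
)


def compress_stats(level):
    """Return (compressed size in bytes, seconds spent) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must restore the original bytes.
    assert zlib.decompress(compressed) == data
    return len(compressed), elapsed


print("original size:", len(data), "bytes")
for level in (1, 6, 9):
    size, seconds = compress_stats(level)
    print(f"level {level}: {size} bytes in {seconds:.4f} s")
```

Higher levels generally spend more CPU time in exchange for smaller output, although the exact numbers depend on the data and the machine. The same kind of measurement on a representative sample is a reasonable way to compare codecs before settling on one.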

Continue reading:

Parquet
Avro
Data Compression
