Avoiding small files

To reduce the amount of memory that the Impala catalog server uses for metadata, avoid creating many small files in HDFS.

By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.

Small files in HDFS typically result from overly granular partitioning or from ingesting data too frequently. Cloudera recommends that you compact small files regularly. In Hive, you can compact the small files in a table with the following SQL commands:

-- Merge small files produced by map-only and map-reduce jobs
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
-- Target size of each merged file (about 256 MB) and the average file size
-- below which Hive triggers the merge step (128 MB)
SET hive.merge.size.per.task = 256000000;
SET hive.merge.smallfiles.avgsize = 134217728;
-- Compress the output; Snappy is the usual choice for Parquet
SET hive.exec.compress.output = true;
SET parquet.compression = snappy;
-- Rewrite the table onto itself so the data lands in fewer, larger files
INSERT OVERWRITE TABLE db_name.table_name SELECT * FROM db_name.table_name;
Run a REFRESH statement in Impala after the Hive job finishes so that Impala picks up the new file layout.
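For example, reusing the table name from the Hive commands above:

-- Impala: reload the table metadata so Impala sees the compacted files
REFRESH db_name.table_name;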

For tables with many partitions, change your partitioning strategy to partition less granularly. For example, partition by year/month instead of by year/month/day. If you are doing inserts with Impala, use the /* +SHUFFLE */ optimizer hint, which adds an exchange node before the data is written. With this hint, only one node writes to each partition at a time, which reduces the number of files written. See Optimizer Hints in Impala for more information.
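As a sketch of both recommendations, assuming hypothetical db_name.sales_by_month and db_name.sales_staging tables (not part of the commands above), a table partitioned by month and an Impala insert with the shuffle hint might look like this:

-- Hypothetical table partitioned by year/month rather than year/month/day
CREATE TABLE db_name.sales_by_month (id BIGINT, amount DOUBLE)
  PARTITIONED BY (year INT, month INT)
  STORED AS PARQUET;

-- The /* +SHUFFLE */ hint adds an exchange node so that only one node
-- writes to each partition at a time, producing fewer, larger files
INSERT INTO db_name.sales_by_month PARTITION (year, month) /* +SHUFFLE */
SELECT id, amount, year, month FROM db_name.sales_staging;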