Avoiding small files
To reduce the amount of memory used by the Catalog for metadata, avoid creating many small files in HDFS.
By: Manish Maheshwari, Data Architect and Data Scientist at Cloudera, Inc.
Small files in HDFS can be caused either by partitions that are too granular or by ingesting data too frequently. Cloudera recommends that you regularly compact small files. In Hive, you can compact the files in a table with the following settings and INSERT OVERWRITE statement:
SET hive.merge.mapfiles = true;                 -- merge small files at the end of map-only jobs
SET hive.merge.mapredfiles = true;              -- merge small files at the end of map-reduce jobs
SET hive.merge.size.per.task = 256000000;       -- target size, in bytes, of the merged files
SET hive.merge.smallfiles.avgsize = 134217728;  -- merge when the average output file is smaller than this size
SET hive.exec.compress.output = true;           -- compress the job output
SET parquet.compression = snappy;               -- use Snappy compression for Parquet output
INSERT OVERWRITE TABLE db_name.table_name SELECT * FROM db_name.table_name;
Run REFRESH db_name.table_name in Impala after the Hive job finishes so that Impala picks up the new data files.
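For example, assuming the same hypothetical db_name.table_name used above, the follow-up in impala-shell might look like this; the SHOW FILES statement is optional and only verifies that the table now contains fewer, larger files:
-- Reload file and block metadata for the compacted table
REFRESH db_name.table_name;
-- Optionally confirm the result of the compaction
SHOW FILES IN db_name.table_name;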
For tables with many partitions, change your partitioning strategy to partition in a less granular way. For example, partition by year/month instead of by year/month/day.
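As a sketch, assuming a hypothetical db_name.sales table, the coarser layout could be declared as follows, with the day kept as an ordinary column rather than a partition key:
-- Partition by year and month only; day remains a regular column
CREATE TABLE db_name.sales (
  id BIGINT,
  amount DOUBLE,
  sale_day INT
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;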
If you are doing inserts with Impala, use the /* +SHUFFLE */ optimizer hint, which adds an exchange node before the data is written. With this hint, only one node writes to a partition at a time, reducing the number of files written. See Optimizer Hints in Impala for more information.
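A minimal sketch of such an insert, assuming the hypothetical db_name.sales table above is loaded from a staging table named db_name.sales_staging:
-- The /* +SHUFFLE */ hint adds an exchange node so that each partition
-- is written by a single node, producing fewer output files
INSERT INTO db_name.sales PARTITION (year, month) /* +SHUFFLE */
SELECT id, amount, sale_day, year, month
FROM db_name.sales_staging;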