Identifying File Size Storage Issues

Data stored in many small files or small partitions can degrade performance. The file size reporting feature helps you identify tables whose data is stored inefficiently in small files or partitions.

A table's data may be stored in a large number of files, perhaps millions. The first time you run an Impala query against such a table, Impala must load the metadata for each file, which can cause processing delays. In addition, every time you change a query, refresh the metadata, or add a new file or partition, Impala reloads the metadata. This puts pressure on the NameNode, which stores each file's metadata. For more information about the problems caused by small files and how to fix them, see Handling Small Files on Hadoop with Hive and Impala on the Cloudera Engineering Blog.
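
One common remediation for this problem is to compact the small files by rewriting the table's data so that each partition is backed by a few large files instead of many small ones. The Impala SQL below is a minimal sketch of that approach, not a prescribed procedure; the table and column names (sales, sales_compacted, id, amount, customer_id, sale_date) are hypothetical stand-ins for your own schema.

    -- Create an empty table with the same schema and partitioning as the original.
    CREATE TABLE sales_compacted LIKE sales;

    -- Rewrite the data into the new table. The rewrite typically produces far
    -- fewer, larger files per partition than the original layout.
    INSERT OVERWRITE sales_compacted PARTITION (sale_date)
    SELECT id, amount, customer_id, sale_date
    FROM sales;

    -- Recompute statistics so the planner has accurate information about the new table.
    COMPUTE STATS sales_compacted;

After verifying the compacted copy, you can drop the original table and rename the copy to take its place.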

Workload XM file size reporting enables you to identify tables that have a large number of files or partitions. For example, if a query runs slowly or your Impala cluster crashes, you can view the metadata of the tables the query uses to determine whether a large number of files or partitions is causing the problem.
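
If you want to cross-check what the report shows from impala-shell, Impala's own SHOW statements list per-partition file counts and the individual data files backing a table. The table name and partition columns used here (web_logs, year, month) are hypothetical.

    -- Show per-partition metadata, including the #Files and Size columns,
    -- which reveal partitions made up of many small files.
    SHOW TABLE STATS web_logs;

    -- List every data file behind the table, or behind a single partition,
    -- along with each file's size and path.
    SHOW FILES IN web_logs;
    SHOW FILES IN web_logs PARTITION (year=2021, month=6);

Partitions with a high #Files value and a small average file size are the usual candidates for compaction.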