File Size Reporting

File size reporting helps you identify data that is stored inefficiently in small files or partitions. If a query runs slowly or your Impala cluster crashes, you can look at the file size metadata in Workload Manager to determine whether a large number of files or partitions is causing the problem.

For example, you might be unaware that a table has a large number of files, perhaps even millions of files. The first time you run an Impala query, Impala must load the metadata for every file, which can cause the query to take a long time. In addition, every time you change the query, refresh the metadata, or add a new file or partition, Impala must reload the metadata. This also puts pressure on the NameNode, because the NameNode stores the file metadata as well. For more information about the problems caused by small files and what you can do to fix them, see Handling Small Files on Hadoop with Hive and Impala on the Cloudera Engineering Blog.
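As a rough illustration of the kind of check that file size reporting automates, the sketch below walks a table's directory tree and counts files per partition directory, flagging files smaller than a threshold. The directory layout, function name, and 128 MB threshold are illustrative assumptions, not part of Workload Manager; on a real cluster the data lives in HDFS rather than on a local filesystem.

```python
import os

def small_file_report(table_dir, small_bytes=128 * 1024 * 1024):
    """Count files per partition directory and flag small ones.

    small_bytes is an illustrative threshold (128 MB, a common HDFS
    block size); tune it to match your cluster's block size.
    """
    report = {}
    for root, _dirs, files in os.walk(table_dir):
        if not files:
            continue  # skip directories that only contain subdirectories
        sizes = [os.path.getsize(os.path.join(root, f)) for f in files]
        report[root] = {
            "files": len(sizes),
            "total_bytes": sum(sizes),
            "small_files": sum(1 for s in sizes if s < small_bytes),
        }
    return report
```

A partition where most files fall under the block size is a candidate for compaction, for example by rewriting the data into fewer, larger files.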

You can use Workload Manager file size reporting to identify tables that have a large number of files or partitions. If a particular query is running slowly, you can look at the table metadata for the HDFS tables that were scanned in the query to see if they have a large number of files or partitions.

Before you can view file size metadata in Workload Manager, you must enable file size reporting in Cloudera Manager. After you've enabled file size reporting, Navigator collects file size metadata and saves it in HDFS. Telemetry Publisher gets the metadata from HDFS and sends it to Workload Manager.

Viewing File Size Metadata

To view file size metadata, open the File Size Report page under the Data Warehouse heading on the left side of the Workload Manager UI.

On the File Size Report page, you can search for a table, or you can sort the tables by the number of files, the number of partitions, or the table size. You can click a table name to open a pop-up window that contains file size distribution information for the table.
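To make the idea of a file size distribution concrete, here is a minimal sketch that buckets a list of file sizes into ranges, similar in spirit to what the pop-up displays. The bucket boundaries are assumptions for illustration, not Workload Manager's actual ranges.

```python
def size_distribution(sizes_bytes):
    """Bucket file sizes into illustrative ranges and count each."""
    buckets = [
        ("< 1 MB", 1 << 20),
        ("1 MB - 128 MB", 128 << 20),
        ("128 MB - 1 GB", 1 << 30),
        (">= 1 GB", float("inf")),
    ]
    counts = {label: 0 for label, _ in buckets}
    for size in sizes_bytes:
        # place each file in the first bucket whose upper bound exceeds it
        for label, upper in buckets:
            if size < upper:
                counts[label] += 1
                break
    return counts
```

A distribution skewed toward the smallest bucket is the telltale sign of the small-files problem described above.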

Troubleshooting Slow Queries

If you are aware of a specific query that is running slowly, you can open the query in the Summary page under the Data Warehouse heading and go to the HDFS Tables Scanned tab.

For example, in the image below, you can see that the query took over six hours to run. The HDFS Tables Scanned tab shows the metadata for the tables that were scanned. Note that the file count shown is not the number of files the query accessed, but the total number of files in the table as of the last HDFS snapshot taken before the query ran. You can click a table name to view file size distribution information for the table.