File Size Reporting

File size reporting helps you identify data that is stored inefficiently in many small files or partitions. If a query is running slowly or your Impala cluster has crashed, you can look at file size metadata in WXM to determine whether a large number of files or partitions is causing the problem.

For example, you might be unaware that a table has a large number of files, perhaps even millions. The first time you run an Impala query against the table, Impala must load the metadata for every file, which can make the query take a long time. In addition, every time you change the query, refresh the metadata, or add a new file or partition, Impala must load the metadata again. This also puts pressure on the NameNode, because the NameNode stores the metadata as well. For more information about the problems caused by small files and how to fix them, see Handling Small Files on Hadoop with Hive and Impala on the Cloudera Engineering Blog.

You can use WXM file size reporting to identify tables that have a large number of files or partitions. If a particular query is running slowly, you can look at the table metadata for the HDFS tables that were scanned in the query to see if they have a large number of files or partitions.
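Outside of WXM, a rough cross-check of the same idea can be sketched in Python. The sketch below assumes you have saved the output of the standard hdfs dfs -count command (columns: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME) for each table directory. The function names, the thresholds, and the sample paths and numbers are hypothetical and are not part of WXM; treat this as an illustration rather than a description of how WXM computes its report.

```python
from typing import Iterable, List, NamedTuple


class TableStats(NamedTuple):
    path: str
    file_count: int
    total_bytes: int

    @property
    def avg_file_bytes(self) -> float:
        # Guard against empty directories to avoid division by zero.
        return self.total_bytes / self.file_count if self.file_count else 0.0


def parse_count_line(line: str) -> TableStats:
    # Default `hdfs dfs -count` columns:
    # DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
    _dir_count, file_count, content_size, path = line.split(None, 3)
    return TableStats(path.strip(), int(file_count), int(content_size))


def flag_small_file_tables(
    lines: Iterable[str],
    min_files: int = 100_000,                 # hypothetical threshold
    max_avg_bytes: int = 16 * 1024 * 1024,    # hypothetical threshold (16 MiB)
) -> List[TableStats]:
    """Return tables with many files and a small average file size,
    sorted by file count in descending order."""
    stats = [parse_count_line(line) for line in lines if line.strip()]
    flagged = [
        s for s in stats
        if s.file_count >= min_files and s.avg_file_bytes <= max_avg_bytes
    ]
    return sorted(flagged, key=lambda s: s.file_count, reverse=True)


# Made-up sample output for two hypothetical table directories.
sample = [
    "          12      7000000      35000000000 /warehouse/events",
    "           1           42     700000000000 /warehouse/dim_users",
]

for table in flag_small_file_tables(sample):
    print(f"{table.path}: {table.file_count} files, "
          f"avg {table.avg_file_bytes:.0f} bytes")
```

The thresholds are placeholders: what counts as "too many files" or "too small" depends on your cluster and workload, so adjust them before relying on the result.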

Before you can view file size metadata in WXM, you must enable file size reporting in Cloudera Manager. See Enabling File Size Reporting for information about how to enable file size reporting. After you've enabled file size reporting, Navigator collects file size metadata and saves it in HDFS. Telemetry Publisher gets the metadata from HDFS and sends it to WXM.

Viewing File Size Metadata

To view file size metadata, open the File Size Report page under the Data Warehouse heading on the left side of the WXM UI.

In the File Size Report page, you can search for a table, or you can sort the tables by the number of files, the number of partitions, or the table size. In the image below, the Animantarx table has 7 million files and 913 partitions.

You can click a table name to open a pop-up window that contains file size distribution information about the table. In the image below, the Aerosteon table has 42 files that range in size from 10 to 24.5 GiB. The graph also shows the first quartile (Q1) and third quartile (Q3) of the file size distribution.
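To make the Q1 and Q3 terms concrete, the following minimal Python sketch summarizes a list of file sizes with its minimum, quartiles, and maximum. It assumes you already have the file sizes in bytes; the function name summarize_file_sizes is hypothetical, and the quartile method used here (the "inclusive" method of Python's statistics.quantiles) is an assumption, since the source does not say how WXM computes its quartiles.

```python
import statistics
from typing import Dict, Sequence


def summarize_file_sizes(sizes_bytes: Sequence[float]) -> Dict[str, float]:
    """Return count, min, Q1, median, Q3, and max for a list of file sizes."""
    if len(sizes_bytes) < 2:
        raise ValueError("need at least two file sizes to compute quartiles")
    # n=4 yields three cut points: Q1, the median, and Q3.
    q1, median, q3 = statistics.quantiles(sizes_bytes, n=4, method="inclusive")
    return {
        "count": len(sizes_bytes),
        "min": min(sizes_bytes),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(sizes_bytes),
    }


# Made-up sizes, in bytes, for illustration only.
summary = summarize_file_sizes([1, 2, 3, 4, 5])
print(summary)
```

A narrow gap between Q1 and Q3 means most files are of similar size, while a low Q1 relative to the maximum suggests that many small files sit alongside a few large ones.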

Troubleshooting Slow Queries

If you know that a specific query is running slowly, you can open the query in the Summary page under the Data Warehouse heading and go to the HDFS Tables Scanned tab.

For example, in the image below, you can see that the query took over six hours to run. The HDFS Tables Scanned tab shows the metadata for the tables that were scanned. Note that the file count shown is not the number of files that the query accessed, but the total number of files in the table when the last HDFS snapshot was taken before the query ran. You can click the table name to view file size distribution information for the table.