User Guide

HDFS Dashboard

The HDFS Dashboard helps operators better understand how HDFS is being used and which users and jobs are consuming the most resources within the file system.

This dashboard includes the following paragraphs:

  • File Size Distribution

  • Top N Users with Small Files

  • Top N Largest HDFS Users

  • Average File Size

  • HDFS File Size Distribution Trend

  • HDFS Utilization Trend

  • HDFS File Size Distribution Trend by User

  • Jobs With High Number of HDFS Operations

  • HDP 2.5: Jobs Creating Many HDFS Files

  • Jobs With Large Amount of Data Written

Most of these paragraphs have titles that are self-explanatory. A few of them are described below to provide more context:

File Size Distribution

For any large multi-tenant cluster, it’s important to identify and curb the proliferation of small files. This paragraph displays a pie chart showing the relative distribution of files by size, categorized as Tiny (0-10K), Mini (10K-1M), Small (1M-30M), Medium (30M-128M), and Large (128M+).

The goal is to show how dominant specific file size categories are within HDFS. If there are many small files, you can easily identify (in the next paragraph) who is contributing to those small files.
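The bucketing behind this chart can be sketched in a few lines. The boundaries below follow the category list above; note that the 1M-30M range between Mini and Medium is assumed here to be a "Small" bucket, since the original list leaves a gap there:

```python
import bisect

# Size-category boundaries (bytes). The "Small" (1M-30M) bucket is an
# assumption filling the gap between the Mini and Medium categories.
BOUNDS = [10 * 1024, 1024 * 1024, 30 * 1024 * 1024, 128 * 1024 * 1024]
LABELS = ["Tiny", "Mini", "Small", "Medium", "Large"]

def size_category(size_bytes: int) -> str:
    """Map a file size to its dashboard bucket."""
    return LABELS[bisect.bisect_right(BOUNDS, size_bytes)]

def distribution(sizes):
    """Count files per bucket, as the pie chart would display them."""
    counts = {label: 0 for label in LABELS}
    for s in sizes:
        counts[size_category(s)] += 1
    return counts
```

For example, `distribution([5_000, 20_000_000, 500 * 1024 * 1024])` would place one file each in the Tiny, Small, and Large buckets.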

Top N Users with Small Files

Knowing which file sizes are prevalent is helpful, but the next step is understanding who is creating those files. The goal of this paragraph is to show who is responsible for creating the majority of small files within HDFS.
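This view presumably aggregates per-owner counts from file metadata. A minimal sketch of that aggregation, assuming hypothetical (owner, size_bytes) records and a hypothetical 1 MB "small file" cutoff:

```python
from collections import Counter

SMALL_FILE_THRESHOLD = 1024 * 1024  # hypothetical cutoff: files under 1 MB

def top_small_file_users(files, n=3):
    """files: iterable of (owner, size_bytes) pairs.
    Returns the n users owning the most small files, largest count first."""
    counts = Counter(owner for owner, size in files
                     if size < SMALL_FILE_THRESHOLD)
    return counts.most_common(n)
```

With sample data such as three sub-1 MB files for `alice` and one for `bob`, `top_small_file_users(files, 2)` returns `alice` first with a count of 3.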

Top N Largest HDFS Users

This paragraph helps you understand where all of the HDFS capacity is being consumed, and who is consuming it. The goal is to help you quickly understand which user or users are storing the most data in HDFS.
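The same kind of aggregation, summing bytes rather than counting files, yields this largest-users view. A sketch under the same assumption of (owner, size_bytes) records:

```python
import heapq
from collections import defaultdict

def top_hdfs_users(files, n=3):
    """files: iterable of (owner, size_bytes) pairs.
    Returns the n users consuming the most total capacity, largest first."""
    usage = defaultdict(int)
    for owner, size in files:
        usage[owner] += size
    return heapq.nlargest(n, usage.items(), key=lambda kv: kv[1])
```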

HDFS File Size Distribution Trend by User

Each “by User” paragraph allows you to see how an individual user’s file sizes are trending.

This paragraph helps answer questions related to points in time where large or small files start becoming more or less prevalent for specific users, and can help measure the success of coaching users on Hadoop best practices.
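The per-user trend amounts to grouping the same kind of metadata by snapshot date. A sketch assuming hypothetical (snapshot_date, owner, category) rows, one per file per snapshot:

```python
from collections import defaultdict

def user_size_trend(records, user):
    """records: iterable of (snapshot_date, owner, category) rows.
    Returns {snapshot_date: {category: count}} for one user, showing how
    that user's file-size mix shifts over time."""
    trend = defaultdict(lambda: defaultdict(int))
    for date, owner, category in records:
        if owner == user:
            trend[date][category] += 1
    return {date: dict(counts) for date, counts in trend.items()}
```

A rising count in the small-file buckets across snapshots would flag a user worth coaching on best practices.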

HDP 2.5: Jobs Creating Many HDFS Files

When troubleshooting HDFS NameNode performance issues, it’s helpful to understand which jobs are creating the most files and therefore potentially placing the most load on the NameNode.

In HDP 2.5, new counters were added to track how many files each YARN application creates. This is helpful for troubleshooting misbehaving jobs that unintentionally create hundreds of thousands, or even millions, of files within HDFS.