Chapter 12. HDFS DataNode Monitoring

The Hadoop Distributed File System (HDFS) is designed to detect and recover from complete failure of DataNodes:

There is no single point of failure.
Automatic NameNode failover takes only a few seconds.
Because data replication can be massively parallelized in large clusters, recovery from DataNode loss occurs within minutes.
Most jobs are not affected by DataNode failures.

However, partial failures can negatively affect the performance of running DataNodes:

Slow network connection due to a failing or misconfigured adapter.
Bad OS or JVM settings that affect service performance.
Slow hard disk.
Bad disk controller.

Slow DataNodes can have a significant impact on cluster performance. Because a slow DataNode may continue to send heartbeats successfully, the NameNode keeps directing clients to it. HDFS DataNode Monitoring detects and reports slow DataNodes that negatively affect cluster performance.

Disk IO Statistics

Disk IO statistics are disabled by default. To enable disk IO statistics, set the file IO sampling fraction to a non-zero value in the hdfs-site.xml file:

 <property>
  <name>dfs.datanode.fileio.profiling.sampling.fraction</name>
  <value>1</value>
 </property>

Setting this value to 1.0 samples 100% of disk IO; a value of 0.5 samples 50% of disk IO, and so on. Sampling disk IO may have a small impact on cluster performance.
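
As a quick sanity check, the configured sampling fraction can be read back from the configuration file. The following is a minimal Python sketch, assuming the usual <configuration>/<property> layout of hdfs-site.xml; the sample XML text and the sampling_fraction helper are hypothetical and serve only as an illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical hdfs-site.xml content, used here only for illustration.
SAMPLE_HDFS_SITE = """<configuration>
 <property>
  <name>dfs.datanode.fileio.profiling.sampling.fraction</name>
  <value>0.5</value>
 </property>
</configuration>"""

PROPERTY = "dfs.datanode.fileio.profiling.sampling.fraction"


def sampling_fraction(xml_text):
    """Return the configured sampling fraction, or 0.0 if the property is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == PROPERTY:
            return float(prop.findtext("value", "0").strip())
    return 0.0  # disk IO statistics are disabled by default


fraction = sampling_fraction(SAMPLE_HDFS_SITE)
print(f"Disk IO sampling: {fraction:.0%}")
```

To check a real cluster, pass the contents of the actual hdfs-site.xml file to the helper instead of the sample string.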

You can access the disk IO statistics via the DataNode JMX page at http://<datanode_host>:50075/jmx, because the DataNodeVolume beans are published by the DataNode service. In the following JMX output example, the time unit is milliseconds, and the disk is healthy because the average IO latencies are well below 1 ms:

    "name" : "Hadoop:service=DataNode,name=DataNodeVolume-/data/disk2/dfs/data/",
    "modelerType" : "DataNodeVolume-/data/disk2/dfs/data/",
    "tag.Context" : "dfs",
    "tag.Hostname" : "n001.hdfs.example.com",
    "TotalMetadataOperations" : 67,
    "MetadataOperationRateAvgTime" : 0.08955223880597014,
  ...
    "WriteIoRateNumOps" : 7321,
    "WriteIoRateAvgTime" : 0.050812730501297636
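
The per-volume output above can also be checked programmatically. The following Python sketch scans a JMX response for DataNodeVolume beans and reports any average latency above a threshold. It assumes the response wraps its MBeans in a top-level "beans" array; the slow_volume_metrics helper and the 10 ms threshold are hypothetical, not HDFS defaults:

```python
import json

# Trimmed sample of a /jmx response, reusing the values shown above.
# Assumption: the servlet wraps its MBeans in a top-level "beans" array.
SAMPLE_JMX = """
{"beans": [{
  "name": "Hadoop:service=DataNode,name=DataNodeVolume-/data/disk2/dfs/data/",
  "MetadataOperationRateAvgTime": 0.08955223880597014,
  "WriteIoRateNumOps": 7321,
  "WriteIoRateAvgTime": 0.050812730501297636
}]}
"""

VOLUME_MARKER = "name=DataNodeVolume-"


def slow_volume_metrics(jmx_text, threshold_ms):
    """Return (volume, metric, latency) tuples whose average exceeds threshold_ms."""
    slow = []
    for bean in json.loads(jmx_text).get("beans", []):
        name = bean.get("name", "")
        if VOLUME_MARKER not in name:
            continue  # not a per-volume bean
        volume = name.split(VOLUME_MARKER, 1)[1]
        for metric, value in bean.items():
            is_latency = metric.endswith("AvgTime") and isinstance(value, (int, float))
            if is_latency and value > threshold_ms:
                slow.append((volume, metric, value))
    return slow


# 10 ms is a hypothetical alert threshold, not an HDFS default.
print(slow_volume_metrics(SAMPLE_JMX, threshold_ms=10.0))
```

With the healthy sample values, no metric exceeds the threshold, so the list is empty.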

Slow DataNode Detection

When slow DataNode detection is enabled, DataNodes collect latency statistics on their peers during write pipelines and periodically run outlier detection to identify slow peers. The NameNode aggregates the reports from all DataNodes and flags potentially slow nodes. Slow DataNode detection is disabled by default. To enable slow DataNode detection, set the following property in the hdfs-site.xml file:

 <property>
  <name>dfs.datanode.peer.stats.enabled</name>
  <value>true</value>
 </property>
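
To illustrate the general idea of outlier detection on peer latencies, the following Python sketch flags peers whose latency lies far above the median of all peers. This is a generic median-based technique and is not presented as the algorithm HDFS itself uses; the find_slow_peers helper, the peer names, and both thresholds are hypothetical:

```python
import statistics


def find_slow_peers(latencies_ms, deviations=3.0, min_latency_ms=5.0):
    """Flag peers whose latency is far above the median of all peers.

    latencies_ms maps a peer name to its average latency in milliseconds.
    A peer is flagged only if it exceeds both the absolute floor
    (min_latency_ms) and the median plus `deviations` times the median
    absolute deviation (MAD). Both thresholds are illustrative values.
    """
    if len(latencies_ms) < 3:
        return []  # too few peers for a meaningful comparison
    values = list(latencies_ms.values())
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    limit = max(min_latency_ms, median + deviations * mad)
    return sorted(peer for peer, v in latencies_ms.items() if v > limit)


# Hypothetical peer latencies: three healthy peers and one slow peer.
peers = {"dn1": 1.4, "dn2": 1.6, "dn3": 1.5, "dn4": 42.0}
print(find_slow_peers(peers))
```

The absolute floor keeps the check from flagging peers in a cluster where every latency is low and the spread between peers is tiny.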

You can access the slow DataNode statistics via the NameNode JMX page at http://<namenode_host>:50070/jmx. In the following JMX output example, the time unit is milliseconds, and the peer DataNodes are healthy because their rolling average latencies are only about 1.4 to 1.6 ms:

    "name" : "Hadoop:service=DataNode,name=DataNodeInfo",
    "modelerType" : "org.apache.hadoop.hdfs.server.datanode.DataNode",
    "SendPacketDownstreamAvgInfo" : "{
        \"[192.168.7.202:50075]RollingAvgTime\" : 1.4476967370441458,
        \"[192.168.7.201:50075]RollingAvgTime\" : 1.5569170444798432
    }"

You can also access slow DataNode statistics via the DataNode JMX page at http://<datanode_host>:50075/jmx.
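
Note that the SendPacketDownstreamAvgInfo value in the example above is itself a JSON document encoded as a string, so it has to be decoded a second time. The following Python sketch extracts the per-peer rolling averages. It assumes the response wraps its MBeans in a top-level "beans" array and that the embedded string arrives on a single line; the peer_latencies helper is hypothetical:

```python
import json
import re

# Trimmed sample response reusing the values shown above. The embedded
# string is written on one line so that the sample is valid JSON.
SAMPLE_JMX = r'''{"beans": [{
  "name": "Hadoop:service=DataNode,name=DataNodeInfo",
  "SendPacketDownstreamAvgInfo": "{\"[192.168.7.202:50075]RollingAvgTime\":1.4476967370441458,\"[192.168.7.201:50075]RollingAvgTime\":1.5569170444798432}"
}]}'''

# Keys look like "[host:port]RollingAvgTime"; capture the host:port part.
KEY_PATTERN = re.compile(r"^\[(?P<peer>[^\]]+)\]RollingAvgTime$")


def peer_latencies(jmx_text):
    """Return a dict mapping each peer address to its rolling average in ms."""
    result = {}
    for bean in json.loads(jmx_text).get("beans", []):
        raw = bean.get("SendPacketDownstreamAvgInfo")
        if not raw:
            continue  # bean carries no peer statistics
        # Second decode: the attribute value is JSON stored inside a string.
        for key, value in json.loads(raw).items():
            match = KEY_PATTERN.match(key)
            if match:
                result[match.group("peer")] = value
    return result


for peer, latency in sorted(peer_latencies(SAMPLE_JMX).items()):
    print(f"{peer}: {latency:.2f} ms")
```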