Detecting slow DataNodes

Slow DataNodes in a CDP Private Cloud Base cluster can degrade the performance of the whole cluster. HDFS therefore provides a mechanism to detect and report them.

HDFS is designed to detect and recover from complete failure of DataNodes:

  • There is no single point of failure.

  • Automatic NameNode failover takes only a few seconds.

  • Because data replication can be massively parallelized in large clusters, recovery from DataNode loss occurs within minutes.

  • Most jobs are not affected by DataNode failures.

However, partial failures can degrade the performance of DataNodes that are still running. Typical causes include:

  • Slow network connection due to a failing or misconfigured adapter.

  • Bad OS or JVM settings that affect service performance.

  • Slow hard disk.

  • Bad disk controller.

The impact of a slow DataNode on cluster performance can be significant because the node may continue sending heartbeats successfully, so the NameNode keeps directing clients to it. HDFS DataNode monitoring detects and reports the slow DataNodes that degrade cluster performance.
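
As a rough illustration of how a node that still sends heartbeats could be singled out, the following sketch compares each DataNode's average latency with that of its peers and flags statistical outliers. It is only a sketch under stated assumptions, not the HDFS implementation: the function name, the node names, the sample latencies, and both thresholds are hypothetical, and the median-based outlier rule is one common choice among several.

```python
from statistics import median


def find_slow_nodes(latencies_ms, deviation_multiplier=3.0, min_latency_ms=5.0):
    """Return the nodes whose latency is an outlier relative to their peers.

    latencies_ms maps a node name to its average latency in milliseconds.
    All parameter names and default values are hypothetical.
    """
    # With fewer than three nodes there is no meaningful peer group.
    if len(latencies_ms) < 3:
        return []

    values = list(latencies_ms.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that a single
    # very slow node cannot inflate the way it inflates the mean.
    mad = median(abs(v - med) for v in values)

    # Flag a node only if it is far above its peers AND above an
    # absolute floor, so a uniformly fast cluster reports nothing.
    threshold = max(med + deviation_multiplier * mad, min_latency_ms)
    return sorted(node for node, v in latencies_ms.items() if v > threshold)


# Hypothetical sample: dn5 still responds, but far more slowly than its peers.
sample = {"dn1": 2.1, "dn2": 1.9, "dn3": 2.4, "dn4": 2.0, "dn5": 48.0}
print(find_slow_nodes(sample))
```

Comparing nodes against the median rather than a fixed limit means the check adapts to the normal speed of a given cluster, while the absolute floor avoids reporting small differences between nodes that are all fast.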