Tuning and Troubleshooting Host Decommissioning

Decommissioning a host decommissions and stops all roles on the host without requiring you to individually decommission the roles on each service. The decommissioning process can take a long time and uses a great deal of cluster resources, including network bandwidth. You can tune the decommissioning process to improve performance and mitigate the performance impact on the cluster.

You can use the Decommission and Recommission features to perform minor maintenance on cluster hosts using Cloudera Manager to manage the process. See Performing Maintenance on a Cluster Host.

Tuning HDFS Prior to Decommissioning DataNodes

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

When a DataNode is decommissioned, the NameNode ensures that every block from the DataNode will still be available across the cluster as dictated by the replication factor. This procedure involves copying blocks from the DataNode in small batches. If a DataNode has thousands of blocks, decommissioning can take several hours. Before decommissioning hosts with DataNodes, you should first tune HDFS:

  1. Run the following command to identify any problems in the HDFS file system:
    hdfs fsck / -list-corruptfileblocks -openforwrite -files -blocks -locations 2>&1 > /tmp/hdfs-fsck.txt  
  2. Fix any issues reported by the fsck command. If the command output lists corrupted files, use the fsck command to move them to the lost+found directory or delete them:
    hdfs fsck file_name -move
    or
    hdfs fsck file_name -delete
  3. Raise the heap size of the DataNodes. DataNodes should be configured with at least 4 GB heap size to allow for the increase in iterations and max streams.
    1. Go to the HDFS service page.
    2. Click the Configuration tab.
    3. Select Scope > DataNode.
    4. Select Category > Resource Management.
    5. Set the Java Heap Size of DataNode in Bytes property as recommended.

      To apply this configuration property to other role groups as needed, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.

    6. Click Save Changes to commit the changes.
  4. Increase the replication work multiplier per iteration to a larger number (the default is 2, however 10 is recommended):
    1. Select Scope > NameNode.
    2. Expand the Category > Advanced category.
    3. Configure the Replication Work Multiplier Per Iteration property to a value such as 10.

      To apply this configuration property to other role groups as needed, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.

    4. Click Save Changes to commit the changes.
  5. Increase the replication maximum threads and maximum replication thread hard limits:
    1. Select Scope > NameNode.
    2. Expand the Category > Advanced category.
    3. Configure the Maximum number of replication threads on a DataNode and Hard limit on the number of replication threads on a DataNode properties to 50 and 100 respectively. You can decrease the number of threads (or use the default values) to minimize the impact of decommissioning on the cluster, but the trade off is that decommissioning will take longer.

      To apply this configuration property to other role groups as needed, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.

    4. Click Save Changes to commit the changes.
  6. Restart the HDFS service.

For additional tuning recommendations, see Performance Considerations.

Tuning HBase Prior to Decommissioning DataNodes

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

To increase the speed of a rolling restart of the HBase service, set the Region Mover Threads property to a higher value. This increases the number of regions that can be moved in parallel, but places additional strain on the HMaster. In most cases, Region Mover Threads should be set to 5 or lower.

Performance Considerations

Decommissioning a DataNode does not happen instantly because the process requires replication of a potentially large number of blocks. During decommissioning, the performance of your cluster may be impacted. This section describes the decommissioning process and suggests solutions for several common performance issues.

Decommissioning occurs in two steps:
  1. The Commission State of the DataNode is marked as Decommissioning and the data is replicated from this node to other available nodes. Until all blocks are replicated, the node remains in a Decommissioning state. You can view this state from the NameNode Web UI. (Go to the HDFS service and select Web UI > NameNode Web UI.)
  2. When all data blocks are replicated to other nodes, the node is marked as Decommissioned.

Decommissioning can impact performance in the following ways:

  • There must be enough disk space on the other active DataNodes for the data to be replicated. After decommissioning, the remaining active DataNodes have more blocks and therefore decommissioning these DataNodes in the future may take more time.
  • There will be increased network traffic and disk I/O while the data blocks are replicated.
  • Data balance and data locality can be affected, which can lead to a decrease in performance of any running or submitted jobs.
  • Decommissioning a large numbers of DataNodes at the same time can decrease performance.
  • If you are decommissioning a minority of the DataNodes, the speed of data reads from these nodes limits the performance of decommissioning because decommissioning maxes out network bandwidth when reading data blocks from the DataNode and spreads the bandwidth used to replicate the blocks among other DataNodes in the cluster. To avoid performance impacts in the cluster, Cloudera recommends that you only decommission a minority of the DataNodes at the same time.
  • You can decrease the number of replication threads to decrease the performance impact of the replications, but this will cause the decommissioning process to take longer to complete. See Tuning HDFS Prior to Decommissioning DataNodes.

Cloudera recommends that you add DataNodes and decommission DataNodes in parallel, in smaller groups. For example, if the replication factor is 3, then you should add two DataNodes and decommission two DataNodes at the same time.

Troubleshooting Performance of Decommissioning

The following conditions can also impact performance when decommissioning DataNodes:
Open Files
Write operations on the DataNode do not involve the NameNode. If there are blocks associated with open files located on a DataNode, they are not relocated until the file is closed. This commonly occurs with:
  • Clusters using HBase
  • Open Flume files
  • Long running tasks
To find and close open files:
  1. Log in to the NameNode host and switch to the log directory.
    The location of this directory is configurable using the NameNode Log Directory property. By default, this directory is located at:
    /var/log/hadoop-hdfs/
  2. Run the following command to verify that the logs provide the needed information:
    grep "Is current datanode" *NAME* | head
    The sixth column of the log file shows the block id, and the message should be relevant to the DataNode decommissioning. Run the following command to view the relevant log entries:
    grep "Is current datanode" *NAME* | awk '{print $6}' | sort -u > blocks.open
  3. Run the following command to return a list of open files, their blocks, and the locations of those blocks:
    hadoop fsck / -files -blocks -locations -openforwrite 2>&1 > openfiles.out
  4. Look in the openfiles.out file created by the command for the blocks showing blocks.open. Also verify that the DataNode IP address is correct.
  5. Using the list of open file(s), perform the appropriate action to restart process to close the file.

    For example, major compaction closes all files in a region for HBase.

A block cannot be relocated because there are not enough DataNodes to satisfy the block placement policy.
For example, for a 10 node cluster, if the mapred.submit.replication is set to the default of 10 while attempting to decommission one  DataNode, there will be difficulties relocating blocks that are associated with map/reduce jobs.    This condition will lead to errors in the NameNode logs similar to the following:   
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault: Not able to place enough replicas, still in need of 3 to reach 3   
Use the following steps to find the number of files where the block replication policy is equal to or above your current cluster size:
  1. Provide a listing of open files, their blocks, the locations of those blocks by running the following command:
    hadoop fsck / -files -blocks -locations -openforwrite 2>&1 > openfiles.out
  2. Run the following command to return a list of how many files have a given replication factor:
    grep repl= openfiles.out | awk '{print $NF}' | sort | uniq -c
    For example, when the replication factor is 10 , and decommissioning one:
    egrep -B4 "repl=10" openfiles.out | grep -v '<dir>' | awk '/^\//{print $1}'
  3. Examine the paths, and decide whether to reduce the replication factor of the files, or remove them from the cluster.