Scaling Namespaces and Optimizing Data Storage
Also available as:
PDF
loading table of contents...

Properties for configuring the Balancer

Depending on your requirements, you can configure various properties for the HDFS Balancer.

dfs.datanode.balance.max.concurrent.moves

Limits the maximum number of concurrent block moves that a Datanode is allowed for balancing the cluster. If you set this configuration in a Datanode, the Datanode throws an exception when the limit is exceeded. If you set this configuration in the HDFS Balancer, the HDFS Balancer schedules concurrent block movements within the specified limit. The Datanode setting and the HDFS Balancer setting can be different. As both settings impose a restriction, an effective setting is the minimum of them.

It is recommended that you set this to the highest possible value in Datanodes and adjust the runtime value in the HDFS Balancer to gain the flexibility. The default value is 5.

You can reconfigure without Datanode restart. Follow these steps to reconfigure a Datanode:

  1. Change the value of dfs.datanode.balance.max.concurrent.moves in the configuration xml file on the Datanode machine.

  2. Start a reconfiguration task. Use the hdfs dfsadmin -reconfig datanode <dn_addr>:<ipc_port> start command.

For example, suppose a Datanode has 12 disks. You can set the configuration to 24, a small multiple of the number of disks, in the Datanodes. Setting it to a higher value might not be useful, and only increases disk contention. If the HDFS Balancer is running in a maintenance window, the setting in the HDFS Balancer can be the same, that is, 24, to use all the bandwidth. However, if the HDFS Balancer is running at same time as other jobs, you set it to a smaller value, for example, 5, in the HDFS Balancer so that there is bandwidth available for the other jobs.

dfs.datanode.balance.bandwidthPerSec

Limits the bandwidth in each Datanode using for balancing the cluster. Changing this configuration does not require restarting Datanodes. Use the dfsadmin -setBalancerBandwidth command.

The default is 1048576 (=1MB/s).

dfs.balancer.moverThreads

Limits the number of total concurrent moves for balancing in the entire cluster. Set this property to the number of threads in the HDFS Balancer for moving blocks. Each block move requires a thread.

The default is 1000.

dfs.balancer.max-size-to-move

With each iteration, the HDFS Balancer chooses DataNodes in pairs and moves data between the DataNode pairs. Limits the maximum size of data that the HDFS Balancer moves between a chosen DataNode pair. If you increase this configuration when the network and disk are not saturated, increases the data transfer between the DataNode pair in each iteration while the duration of an iteration remains about the same.

The default is 10737418240 (10GB).

dfs.balancer.getBlocks.size

Specifies the total data size of the block list returned by a getBlocks(..).

When the HDFS Balancer moves a certain amount of data between source and destination DataNodes, it repeatedly invokes the getBlocks(..) rpc to the Namenode to get lists of blocks from the source DataNode until the required amount of data is scheduled.

The default is 2147483648 (2GB).

dfs.balancer.getBlocks.min-block-size

Specifies the minimum block size for the blocks used to balance the cluster.

The default is 10485760 (10MB)

dfs.datanode.block-pinning.enabled

Specifies if block-pinning is enabled. When you create a file, a user application can specify a list of favorable DataNodes by way of the file creation API in DistributedFileSystem. The NameNode uses its best effort, allocating blocks to the favorable DataNodes. If dfs.datanode.block-pinning.enabled is set to true, if a block replica is written to a favorable DataNode, it is “pinned” to that DataNode. The pinned replicas are not moved for cluster balancing to keep them stored in the specified favorable DataNodes. This feature is useful for block distribution aware user applications such as HBase.

The default is false.