Using the HDFS Balancer
The HDFS balancer rebalances data across the DataNodes, moving blocks from over-utilized to under-utilized nodes. As the system administrator, you can run the balancer from the command line as necessary -- for example, after adding new DataNodes to the cluster.
Points to note:
- The balancer requires the capabilities of an HDFS superuser (for example, the hdfs user) to run.
- The balancer does not balance between individual volumes on a single DataNode.
- You can run the balancer without parameters, as follows:
sudo -u hdfs hdfs balancer
This runs the balancer with a default threshold of 10%, meaning that the balancer ensures that disk usage on each DataNode differs from the overall usage in the cluster by no more than 10 percentage points. For example, if overall usage across all the DataNodes in the cluster is 40% of the cluster's total disk-storage capacity, the balancer ensures that each DataNode's disk usage is between 30% and 50% of that DataNode's disk-storage capacity.
- You can run the balancer with a different threshold; for example:
sudo -u hdfs hdfs balancer -threshold 5
This specifies that each DataNode's disk usage must be (or will be adjusted to be) within 5% of the cluster's overall usage.
- You can adjust the network bandwidth used by the balancer by running the dfsadmin -setBalancerBandwidth command before you run the balancer; for example:
sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth newbandwidth
where newbandwidth is the maximum amount of network bandwidth, in bytes per second, that each DataNode can use during the balancing operation. For more information about the bandwidth command, see the hdfs dfsadmin documentation.
- The balancer can take a long time to run, especially if you are running it for the first time or do not run it regularly.
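The threshold arithmetic and the bytes-per-second unit described above can be sketched with two small shell helpers. This is only an illustration: the function names balancer_range and mb_to_bytes_per_sec are hypothetical and not part of the HDFS tooling, the clamping to 0-100% is an assumption added for tidy output, and the conversion assumes 1 MB = 1024 * 1024 bytes. The sketch does not contact a cluster or reproduce the balancer's internal logic.

```shell
#!/bin/sh
# Hypothetical helpers for planning a balancer run; they only do arithmetic
# and never call hdfs.

# Print the lower and upper per-DataNode usage bounds (whole percentages)
# implied by a cluster-wide usage and a balancer threshold.
balancer_range() {
    usage=$1
    threshold=$2
    low=$((usage - threshold))
    high=$((usage + threshold))
    # Assumption: clamp to 0-100 so the printed bounds stay valid percentages.
    if [ "$low" -lt 0 ]; then low=0; fi
    if [ "$high" -gt 100 ]; then high=100; fi
    echo "$low $high"
}

# Convert megabytes per second to the bytes-per-second value passed to
# -setBalancerBandwidth (assuming 1 MB = 1024 * 1024 bytes).
mb_to_bytes_per_sec() {
    echo $(( $1 * 1024 * 1024 ))
}

balancer_range 40 10      # prints "30 50" (default threshold)
balancer_range 40 5       # prints "35 45" (-threshold 5)
mb_to_bytes_per_sec 10    # prints "10485760"
```

For example, the value printed by mb_to_bytes_per_sec 10 could be supplied as newbandwidth to cap each DataNode at roughly 10 MB/s during balancing.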