HDFS Balancers
HDFS data might not always be distributed uniformly across DataNodes. One common reason is the addition of new DataNodes to an existing cluster. HDFS provides a balancer utility that analyzes block placement and balances data across the DataNodes. The balancer moves blocks until the cluster is deemed to be balanced, which means that the utilization of every DataNode (ratio of used space on the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold percentage. The balancer does not balance between individual volumes on a single DataNode.
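The balanced condition described above can be sketched as a small shell check. This is only an illustration of the arithmetic, not the balancer's actual implementation; the function name is_balanced and its argument order are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of HDFS): illustrates the "balanced" test.
# Usage: is_balanced NODE_USED NODE_CAPACITY CLUSTER_USED CLUSTER_CAPACITY THRESHOLD
# All sizes must use the same unit; THRESHOLD is in percentage points.
# Returns 0 if the node's utilization is within THRESHOLD of the cluster's.
is_balanced() {
    awk -v nu="$1" -v nc="$2" -v cu="$3" -v cc="$4" -v t="$5" 'BEGIN {
        node = 100 * nu / nc        # DataNode utilization, in percent
        cluster = 100 * cu / cc     # cluster utilization, in percent
        diff = node - cluster
        if (diff < 0) diff = -diff  # absolute difference
        if (diff <= t) exit 0
        exit 1
    }'
}
```

For example, with the cluster at 40% utilization and a threshold of 10, `is_balanced 30 100 400 1000 10` succeeds (the node is at 30%), while `is_balanced 51 100 400 1000 10` fails (the node is at 51%).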
Configuring and Running the HDFS Balancer Using Cloudera Manager
Minimum Required Role: Cluster Administrator (also provided by Full Administrator)
In Cloudera Manager, the HDFS balancer utility is implemented by the Balancer role. The Balancer role usually shows a health of None on the HDFS Instances tab because it does not run continuously.
The Balancer role is added by default when the HDFS service is installed. If it has not been added, you must add a Balancer role to rebalance HDFS and to see the Rebalance action.
Configuring the Balancer Threshold
The Balancer has a default threshold of 10%, which ensures that disk usage on each DataNode differs from the overall usage in the cluster by no more than 10%. For example, if overall usage across all the DataNodes in the cluster is 40% of the cluster's total disk-storage capacity, the balancer ensures that each DataNode's disk usage is between 30% and 50% of that DataNode's disk-storage capacity. To change the threshold:
- Go to the HDFS service.
- Click the Configuration tab.
- Select .
- Select .
- Set the Rebalancing Threshold property.
If more than one role group applies to this configuration, edit the value for the appropriate role group. See Modifying Configuration Properties Using Cloudera Manager.
- Click Save Changes to commit the changes.
Configuring and Running the HDFS Balancer Using the Command Line
The HDFS balancer rebalances data across the DataNodes, moving blocks from overutilized to underutilized nodes. As the system administrator, you can run the balancer from the command line as necessary, for example, after adding new DataNodes to the cluster.
- The balancer requires the capabilities of an HDFS superuser (for example, the hdfs user) to run.
- The balancer does not balance between individual volumes on a single DataNode.
- You can run the balancer without parameters, as follows:
sudo -u hdfs hdfs balancer
This runs the balancer with a default threshold of 10%, meaning that the script will ensure that disk usage on each DataNode differs from the overall usage in the cluster by no more than 10%. For example, if overall usage across all the DataNodes in the cluster is 40% of the cluster's total disk-storage capacity, the script ensures that each DataNode's disk usage is between 30% and 50% of that DataNode's disk-storage capacity.
- You can run the script with a different threshold; for example:
sudo -u hdfs hdfs balancer -threshold 5
This specifies that each DataNode's disk usage must be (or will be adjusted to be) within 5% of the cluster's overall usage.
- You can adjust the network bandwidth used by the balancer by running the dfsadmin -setBalancerBandwidth command before you run the balancer; for example:
sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth newbandwidth
where newbandwidth is the maximum amount of network bandwidth, in bytes per second, that each DataNode can use during the balancing operation. For more information about the bandwidth command, see BalancerBandwidthCommand.
- The balancer can take a long time to run, especially if you are running it for the first time, or do not run it regularly.
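Because newbandwidth is given in bytes per second, it can help to compute the value rather than type it by hand. The following is a minimal sketch, assuming you think of the limit in mebibytes per second (1 MiB = 1024 * 1024 bytes). The function names mb_per_sec_to_bytes and balancer_commands are hypothetical, and the second function only prints the two commands so that you can review them before running them yourself.

```shell
#!/bin/sh
# Hypothetical helper: convert a limit in MiB per second to bytes per second,
# the unit that -setBalancerBandwidth expects.
mb_per_sec_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Hypothetical helper: print (do not run) the bandwidth command followed by
# the balancer command.
# Usage: balancer_commands BANDWIDTH_MIB_PER_SEC THRESHOLD_PERCENT
balancer_commands() {
    bandwidth_bytes=$(mb_per_sec_to_bytes "$1")
    echo "sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth $bandwidth_bytes"
    echo "sudo -u hdfs hdfs balancer -threshold $2"
}
```

For example, `balancer_commands 10 5` prints a command that limits each DataNode to 10485760 bytes per second, followed by a command that runs the balancer with a threshold of 5%.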