Disk Balancer commands

In addition to planning for data movement across disks and executing the plan, you can use hdfs diskbalancer sub-commands to query the status of the plan, cancel the plan, identify at a cluster level the DataNodes that require balancing, or generate a detailed report on a specific DataNode that can benefit from running the Disk Balancer.

Planning the data movement for a DataNode

Command:hdfs diskbalancer -plan <datanode>
Argument Description
<datanode> Fully qualified name of the DataNode for which you want to generate the plan.
hdfs diskbalancer -plan node1.mycluster.com
The following table lists the additional options that you can use with the hdfs diskbalancer -plan command:
Option Description
-out Specify the location within the HDFS namespace where you want to save the output JSON documents that contain the generated plans.
-bandwidth Specify the maximum bandwidth to use for running the Disk Balancer. This option helps in minimizing the amount of data moved by the Disk Balancer on an operational DataNode.

Disk Balancer uses the default bandwidth of 10 MB/s if you do not specify this value.

-thresholdPercentage

The ideal storage value for a set of disks in a DataNode indicates the amount of data each disk should have for achieving perfect data distribution across those disks. The threshold percentage defines the value at which disks start participating in data redistribution or balancing operations. Minor imbalances are ignored because normal operations automatically correct some of these imbalances.

The default value of -thresholdPercentage for a disk is 10%; indicating that a disk is used in balancing operations only if the disk contains 10% more or less data than the ideal storage value.

-maxerror Specify the number of errors to ignore for a move operation between two disks before abandoning the move.

Disk Balancer uses the default if you do not specify this value.

-v Verbose mode. Specify this option for Disk Balancer to display a summary of the plan as output.
-fs Specify the NameNode to use.

Disk Balancer uses the default NameNode from the configuration if you do not specify this value.

Executing the plan

Command:hdfs diskbalancer -execute <JSON file path>
Argument Description
<JSON file path> Path to the JSON document that contains the generated plan (nodename.plan.json).
hdfs diskbalancer -execute /system/diskbalancer/nodename.plan.json

Querying the current status of execution

Command:hdfs diskbalancer -query <datanode>
Argument Description
<datanode> Fully qualified name of the DataNode for which the plan is running.
hdfs diskbalancer -query nodename.mycluster.com

Cancelling a running plan

Commands:
hdfs diskbalancer -cancel <JSON file path>
Argument Description
<JSON file path> Path to the JSON document that contains the generated plan (nodename.plan.json).
hdfs diskbalancer -cancel /system/diskbalancer/nodename.plan.json

OR

hdfs diskbalancer -cancel <planID> -node <nodename>
Argument Description
planID ID of the plan to cancel.

You can get this value from the output of the hdfs diskbalancer -query command.

nodename The fully qualified name of the DataNode on which the plan is running.

Viewing detailed report of DataNodes that require Disk Balancer

Commands:
hdfs diskbalancer -fs http://namenode.uri -report -node <file://>
Argument Description
<file://> Hosts file listing the DataNodes for which you want to generate the reports.

OR

hdfs diskbalancer -fs http://namenode.uri -report -node [<DataNodeID|IP|Hostname>,...]
Argument Description
[<DataNodeID|IP|Hostname>,...] Specify the DataNode ID, IP address, and the host name of the DataNode for which you want to generate the report. For multiple DataNodes, provide the details using comma-separated values.

Viewing details of the top DataNodes in a cluster that require Disk Balancer

Command:hdfs diskbalancer -fs http://namenode.uri -report-node -top <topnum>
Argument Description
<topnum> The number of the top DataNodes that require Disk Balancer to be run.