Using the HBCK2 Tool to Remediate HBase Clusters

The HBCK2 tool is a repair tool to remediate Apache HBase clusters in CDH. The HBCK2 tool is the next version of the Apache HBase hbck tool.

To identify a list of inconsistencies or blockages in a running HBase cluster, you can view or search the logs using the log search feature in Cloudera Manager. Once you have identified the issue, you can use the HBCK2 tool to fix the defect or to skip over a bad state. The HBCK2 tool uses an interactive fix-it process: it asks the Master to make the fixes rather than carrying out the repair locally.

The HBCK2 tool performs a single, discrete task each time it is run. It does not analyze everything in a running cluster and repair all the problems. Instead, you use the HBCK2 tool iteratively, with interactive commands, to find and fix one issue at a time.

Supported Versions

You can use the HBCK2 tool with these versions of CDH:

  • CDH 6.1.x

  • CDH 6.2.x

  • CDH 6.3.x and later

Running the HBCK2 Tool

The HBCK2 tool is part of the hbase-operator-tools binary. Once you get the hbase-operator-tools binary tarball from Cloudera Support, upload the tarball to the target cluster and extract it. The HBCK2 JAR file is located in the extracted tarball at hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar.

You can run the HBCK2 tool by specifying the JAR path with the “-j” option as shown here:

$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar

When you run the command, the HBCK2 tool command-line menu appears.
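
Because every HBCK2 invocation repeats the same long JAR path, a small shell wrapper can save typing. The following is a minimal sketch and not part of the tool: the function names find_hbck2_jar and hbck2 and the HBCK2_JAR variable are hypothetical, and the lookup assumes that the tarball was extracted directly under $HOME.

```shell
# Hypothetical helper: print the path of the HBCK2 JAR found under the given
# directory (default: $HOME). If several versions are extracted, the one that
# sorts last is printed. Prints nothing if no JAR is found.
find_hbck2_jar() {
  ls "${1:-$HOME}"/hbase-operator-tools-*/hbase-hbck2/hbase-hbck2-*.jar 2>/dev/null \
    | sort | tail -n 1
}

# Hypothetical wrapper: run an HBCK2 command without retyping the JAR path.
# Set HBCK2_JAR to the JAR path to skip the lookup.
hbck2() {
  hbck2_jar="${HBCK2_JAR:-$(find_hbck2_jar)}"
  if [ -z "$hbck2_jar" ]; then
    echo "HBCK2 JAR not found; set HBCK2_JAR to its path" >&2
    return 1
  fi
  hbase hbck -j "$hbck2_jar" "$@"
}
```

With the wrapper defined in your shell session, running hbck2 with no arguments is equivalent to the full hbase hbck -j command shown above, and any HBCK2 command and its arguments can be appended after hbck2.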

If you are Cloudera Support or Professional Services personnel using this tool to remediate an HBase cluster, gather useful information by running these commands as an HBase superuser (typically, hbase), or as an HBase principal if Kerberos is enabled:
$ hdfs dfs -ls -R /hbase/ 2>&1 | tee /tmp/hdfs-ls.txt
$ hbase hbck -details 2>&1 | tee /tmp/hbase-hbck.txt
$ echo "scan 'hbase:meta'" | hbase shell 2>&1 | tee /tmp/hbase-meta.txt

Finding Issues

The HBCK2 tool enables you to use interactive commands to fix one issue at a time. If you have multiple issues, you may have to run the tool iteratively to find and resolve all the issues. Use the following utilities and commands to find the issues.

Find issues using diagnostic tools

Master logs

The Apache HBase Master runs all the cluster start and stop operations, RegionServer assignment, and server crash handling. Everything that the Master does is a procedure on a state machine engine, and each procedure has a unique procedure ID (PID). You can trace the lifecycle of a procedure by tracking its PID through the entries in the Master log. Some procedures may spawn sub-procedures and wait for the sub-procedures to complete.

You can trace the sub-procedure by tracking its PID and the parent PID (PPID).
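
Tracing a procedure through the Master log can be done with a plain text search. The following is a minimal sketch under two assumptions: that Master log lines carry the procedure ID as a pid=<number> token and the parent ID as a ppid=<number> token, and that you pass the path of your own Master log file. The function name trace_pid is hypothetical.

```shell
# Hypothetical helper: print every Master log line that mentions the given
# procedure ID, either as the procedure itself (pid=) or as a parent (ppid=).
# Usage: trace_pid <PID> <master-log-file>
trace_pid() {
  # "pid=<PID>" also matches inside "ppid=<PID>". The trailing group keeps
  # pid=12 from matching pid=123.
  grep -E "pid=$1([^0-9]|\$)" "$2"
}
```

Lines where the number appears after ppid= belong to sub-procedures spawned by the procedure you are tracing.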

If there is a problem with RegionServer assignment, the Master prints a STUCK log entry similar to the following:
2018-09-12 15:29:06,558 WARN
org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK
Region-In-Transition rit=OPENING, location=va1001.example.org,00001,1000173230599, 
table=IntegrationTestBigLinkedList_20180626110336, 
region=dbdb56242f17610c46ea044f7a42895b
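
To collect the regions that the Master reports as stuck, you can filter the Master log for these entries. The following is a minimal sketch that assumes each STUCK entry is written on a single line in the log file (the example above is wrapped for display) and that the fields appear in the order shown. The function name stuck_regions is hypothetical.

```shell
# Hypothetical helper: list each STUCK region once as
# "<encoded region name> <state> <table>".
# Usage: stuck_regions <master-log-file>
stuck_regions() {
  sed -n 's/.*STUCK Region-In-Transition rit=\([A-Z_]*\),.*table=\([^,]*\), *region=\([0-9a-f]*\).*/\3 \1 \2/p' "$1" \
    | sort -u
}
```

The first column is the encoded region name that the HBCK2 assigns, unassigns, and setRegionState commands expect.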

Master user interface

Status tables

You can find issues in your HBase tables by looking at the status tables section on the Master user interface home page. Look through the list of tables to identify whether a table is ENABLED, ENABLING, DISABLED, or DISABLING. You can also check the states of the regions in transition, such as OPEN or CLOSED. For example, there may be an issue if a table is ENABLED, some regions are not in the OPEN state, and the Master log entries do not show any ongoing assignments.

Procedures and locks

When an Apache HBase cluster is started, the Procedures & Locks page in the Master user interface is populated with information about the procedures, locks, and the count of WAL files. After the cluster settles, a WAL file count that does not decrease can indicate blocked procedures. You can identify those procedures and locks on this page.

You can also get a list of locks and procedures by piping these commands to the HBase shell:
$ echo "list_locks" | hbase shell &> /tmp/locks.txt
$ echo "list_procedures" | hbase shell &> /tmp/procedures.txt

Apache HBase canary tool

Use the HBase canary tool to verify the state of the assigns in your cluster. You can run this tool to focus on just one table or the entire cluster. You can check the cluster assign using this command:
$ hbase canary -f false -t 6000000 &>/tmp/canary.log

Setting the -f parameter to false keeps the tool running after the first failed region fetch instead of stopping at the first error. The -t parameter sets the timeout for the check in milliseconds.

Fixing Issues

Keep the following in mind when fixing issues using HBCK2. Ensure that:

  • A region is not in the CLOSING state during “assign”, and not in the OPENING state during “unassign”. You can change the state using the setRegionState command. See the HBCK2 Tool Command Reference section for more information.
  • You fix only one table at a time.

Fix assign and unassign issues

You can fix assign and unassign issues by monitoring the current list of outstanding locks. An assignment takes an exclusive lock on the region, so an assign against a locked region waits until the lock is released.

Fix master startup cannot progress error

If you see a “Master startup cannot progress in holding-pattern until region onlined” error in the Master log, it means that the Master is unable to start because there is no procedure to assign hbase:meta. You will see an error message similar to this:
2020-04-01 22:07:42,792 WARN org.apache.hadoop.hbase.master.HMaster:
 hbase:meta,,1.1588230740 is NOT online; state={1588230740 state=CLOSING, 
ts=1538456302300, server=ve1017.example.org,22101,1234567891012}; 
ServerCrashProcedures=true. Master startup cannot progress in holding-pattern until region onlined.
To fix this issue, run the following command:
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar \
  assigns 1588230740
The same issue can occur with the hbase:namespace system table. To fix this issue, run the following command:
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar \
  assigns <hbase:namespace encoded region id>
You can find the namespace encoded region id using this command:
$ echo "scan 'hbase:meta',{COLUMNS=>'info:regioninfo', 
FILTER=>\"PrefixFilter('hbase:namespace')\"}" | hbase shell

The namespace encoded region id is the value under the "ENCODED" field in the results.
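
Rather than reading the ENCODED value by eye, you can filter it out of the scan output. The following is a minimal sketch that assumes the HBase shell prints each info:regioninfo cell on a single line containing the text "ENCODED => <id>," and that the id is a lowercase hexadecimal string. The function name encoded_region_ids is hypothetical.

```shell
# Hypothetical helper: read HBase shell scan output on stdin and print the
# value of each ENCODED field, one per line.
encoded_region_ids() {
  sed -n 's/.*ENCODED => \([0-9a-f]*\),.*/\1/p'
}

# Usage (requires a running cluster), piping the scan shown above:
#   echo "scan 'hbase:meta',{COLUMNS=>'info:regioninfo',FILTER=>\"PrefixFilter('hbase:namespace')\"}" \
#     | hbase shell | encoded_region_ids
```

Check that exactly one id is printed before passing it to the assigns command.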

Fix missing regions in hbase:meta region/table

If you encounter an issue where table regions have been removed from the hbase:meta table, you can use the addFsRegionsMissingInMeta command to resolve this issue. Ensure that the Master is online. This command is not as disruptive as the hbase:meta rebuild command.

To fix this issue, run this command:
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar \
  addFsRegionsMissingInMeta <NAMESPACE|NAMESPACE:TABLENAME>

The command returns an HBCK2 “assigns” command that lists all the re-inserted regions. You must restart the Master, and then run the HBCK2 “assigns” command returned by the addFsRegionsMissingInMeta command to complete your fix.

Example output:
Regions re-added into Meta: 2
WARNING:
2 regions were added to META, but these are not yet on Masters cache.
You need to restart Masters, then run hbck2 'assigns' command below:
assigns 7be03127c5e0e2acfc7cae7ddfa9e29e e50b8c1adc38c942e226a8b2976f0c8c
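
If you save the addFsRegionsMissingInMeta output to a file, you can pull out the generated “assigns” line for use after the Master restart. The following is a minimal sketch that assumes the output has the same shape as the example above, with the assigns command alone on its own line. The function name extract_assigns and the file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: read addFsRegionsMissingInMeta output on stdin and
# print only the generated "assigns <region>..." line.
extract_assigns() {
  grep -E '^assigns( [0-9a-f]+)+ *$'
}

# Usage, after the Master has been restarted (/tmp/add-regions.txt is a
# hypothetical file holding the saved command output):
#   hbase hbck -j <path to the HBCK2 JAR> $(extract_assigns < /tmp/add-regions.txt)
```

The command substitution is left unquoted on purpose so that the shell splits the line into the assigns keyword and one argument per region.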

Fix extra regions in hbase:meta region/table

If there are extra regions in hbase:meta, it may be because of problems in splitting, deleting/moving the region directory manually, or in rare cases because of the loss of metadata.

To fix this issue, run this command:
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar \
  extraRegionsInMeta --fix <NAMESPACE|NAMESPACE:TABLENAME>...

Rebuild hbase:meta

If hbase:meta is offline because it is corrupted, you can bring it back online if the corruption is not too critical. If the namespace region is among the missing regions, scan hbase:meta during initialization to check if hbase:meta is online.

To check if hbase:meta is online, run this command in the Apache HBase shell:
$ echo "scan 'hbase:meta', {COLUMN=>'info:regioninfo'}" | hbase shell
If this scan does not throw any errors, then you can run the following command to validate that the tables are present:
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar \
  addFsRegionsMissingInMeta <NAMESPACE|NAMESPACE:TABLENAME>

The addFsRegionsMissingInMeta command adds regions back to the hbase:meta table if the regioninfo file is present in the storage but the regions were deleted because of an issue.

Fix dropped references and corrupted HFiles

To fix hanging references and corrupt HFiles, run the following command:
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar \
  filesystem --fix [<TABLENAME>...]

HBCK2 Tool Command Reference

  • addFsRegionsMissingInMeta <NAMESPACE|NAMESPACE:TABLENAME>...

    Options: -d,--force_disable Use this option to abort fix for table if disable fails.

    Supported in CDH 6.1.0 and later.

  • assigns [OPTIONS] <ENCODED_REGIONNAME>...

    Options: -o,--override Use this option to override ownership by another procedure.

    Supported in CDH 6.1.0 and later.

  • bypass [OPTIONS] <PID>...

    Options:

    -o,--override Use this option to override if procedure is running/stuck.

    -r,--recursive Use this option to bypass parent and its children.

    -w,--lockWait Use this option to wait (in milliseconds) before giving up; default=1.

    Supported in CDH 6.1.0 and later.

  • extraRegionsInMeta <NAMESPACE|NAMESPACE:TABLENAME>...

    Options: -f,--fix Use this option to fix meta by removing all extra regions found.

    Supported in CDH 6.1.0 and later.

  • filesystem [OPTIONS] [<TABLENAME>...]

    Options: -f,--fix Use this option to sideline corrupt HFiles, bad links, and references.

    Supported in CDH 6.1.0 and later.

  • replication [OPTIONS] [<TABLENAME>...]

    Options: -f,--fix Use this option to fix replication issues.

    Supported in CDH 6.1.0 and later.

  • reportMissingRegionsInMeta <NAMESPACE|NAMESPACE:TABLENAME>...

    Use this command when regions are missing from hbase:meta but their directories are still present in HDFS.

    Supported in CDH 6.1.0 and later.

  • setRegionState <ENCODED_REGIONNAME> <STATE>

    Possible region states: OFFLINE, OPENING, OPEN, CLOSING, CLOSED, SPLITTING, SPLIT, FAILED_OPEN, FAILED_CLOSE, MERGING, MERGED, SPLITTING_NEW, MERGING_NEW, ABNORMALLY_CLOSED.

    CAUTION:
    Use this command only as a last resort. Example scenarios include unassigns or assigns that do not happen because the region is in an inconsistent state in hbase:meta.

    Supported in CDH 6.1.0 and later.

  • setTableState <TABLENAME> <STATE>

    Possible table states and representations in hbase:meta table: ENABLED (\x08\x00), DISABLED (\x08\x01), DISABLING (\x08\x02), ENABLING (\x08\x03).

    Supported in CDH 6.1.0 and later.

  • scheduleRecoveries <SERVERNAME>...

    Use this command to schedule a ServerCrashProcedure (SCP) for a list of RegionServers. Format each server name as '<HOSTNAME>,<PORT>,<STARTCODE>'.

    Supported in CDH 6.2.0 and later.

  • unassigns <ENCODED_REGIONNAME>...

    Options: -o,--override Use this option to override ownership by another procedure.

    Supported in CDH 6.1.0 and later.
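
When you inspect hbase:meta before running setTableState, the table state appears as an escaped byte string rather than a name. The following is a minimal sketch that maps the representations listed for setTableState above to their state names. The function name table_state_name is hypothetical, and the sketch assumes that you pass the value exactly as it is written in the list above.

```shell
# Hypothetical helper: map the escaped byte string shown in hbase:meta to the
# table state name accepted by setTableState.
# Usage: table_state_name '\x08\x01'
table_state_name() {
  case "$1" in
    '\x08\x00') echo ENABLED ;;
    '\x08\x01') echo DISABLED ;;
    '\x08\x02') echo DISABLING ;;
    '\x08\x03') echo ENABLING ;;
    *) printf 'unknown table state: %s\n' "$1" >&2; return 1 ;;
  esac
}
```

Pass the value in single quotes so that the shell does not interpret the backslashes.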