Using the HBCK2 Tool to Remediate HBase Clusters
The HBCK2 tool is a repair tool for remediating Apache HBase clusters in CDH. It is the successor to the Apache HBase hbck tool.
To identify inconsistencies or blockages in a running HBase cluster, you can view or search the logs using the log search feature in Cloudera Manager. Once you have identified an issue, you can use the HBCK2 tool to fix the defect or to skip over a bad state. The HBCK2 tool uses an interactive fix-it process: it asks the Master to make the fixes rather than carrying out the repair locally.
The HBCK2 tool performs a single, discrete task each time it is run. It does not analyze everything in a running cluster and repair all the problems. Instead, you use interactive commands to find and fix issues iteratively, one issue at a time.
Supported Versions
You can use the HBCK2 tool with these versions of CDH:
- CDH 6.1.x
- CDH 6.2.x
- CDH 6.3.x and later
Running the HBCK2 Tool
The HBCK2 tool is a part of the hbase-operator-tools binary. Once you get the hbase-operator-tools binary from Cloudera, upload the binary tarball to the target cluster and extract the tarball. The HBCK2 JAR file is contained in the operator tools tarball provided by Cloudera Support at hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar.
You can run the HBCK2 tool by specifying the JAR path with the “-j” option as shown here:
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar
When you run the command, the HBCK2 tool command-line menu appears.
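Because every HBCK2 invocation repeats the same JAR path, a small shell wrapper can cut down on typing errors. The following is a minimal sketch, not part of the tool: the function name hbck2 and the HBCK2_JAR variable are assumptions introduced here for illustration, and you would set HBCK2_JAR to wherever you extracted the tarball.

```shell
# Hypothetical convenience wrapper around "hbase hbck -j <jar>".
# HBCK2_JAR is an assumed variable holding the full path to hbase-hbck2-<version>.jar.
hbck2() {
  if [ -z "${HBCK2_JAR:-}" ]; then
    echo "Set HBCK2_JAR to the path of the HBCK2 JAR file" >&2
    return 1
  fi
  # Pass every argument through unchanged, for example: hbck2 assigns 1588230740
  hbase hbck -j "$HBCK2_JAR" "$@"
}
```

With the wrapper defined, running hbck2 with no arguments should behave like the command shown above and display the command-line menu.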
You can gather useful diagnostic information about the cluster with these commands:
$ hdfs dfs -ls -R /hbase/ 2>&1 | tee /tmp/hdfs-ls.txt
$ hbase hbck -details 2>&1 | tee /tmp/hbase-hbck.txt
$ echo "scan 'hbase:meta'" | hbase shell 2>&1 | tee /tmp/hbase-meta.txt
Finding Issues
The HBCK2 tool enables you to use interactive commands to fix one issue at a time. If you have multiple issues, you may have to run the tool iteratively to find and resolve all the issues. Use the following utilities and commands to find the issues.
Find issues using diagnostic tools
Master logs
The Apache HBase Master runs all the cluster start and stop operations, RegionServer assignment, and server crash handling. Everything that the Master does is a procedure on a state machine engine, and each procedure has a unique procedure ID (PID). You can trace the lifecycle of a procedure by tracking its PID through the entries in the Master log. Some procedures may spawn sub-procedures and wait for them to complete.
You can trace the sub-procedure by tracking its PID and the parent PID (PPID).
2018-09-12 15:29:06,558 WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=va1001.example.org,00001,1000173230599, table=IntegrationTestBigLinkedList_20180626110336, region=dbdb56242f17610c46ea044f7a42895b
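When a Master log is long, filtering it can make warnings like the one above, and the entries for a single procedure, easier to follow. The sketch below reads log text on standard input. The function names are hypothetical, and trace_pid assumes that procedure entries are written in the pid=<n> and ppid=<n> notation, which you should confirm against your own logs.

```shell
# Print the unique encoded region names found in
# "STUCK Region-In-Transition" warnings read from stdin.
stuck_regions() {
  sed -n 's/.*STUCK Region-In-Transition.*region=\([0-9a-f]\{32\}\).*/\1/p' | sort -u
}

# Print the log lines that mention procedure $1 either as the
# procedure itself (pid=$1) or as a parent (ppid=$1).
# Assumes the pid=<n> / ppid=<n> notation in Master log entries.
trace_pid() {
  grep -E "(^|[^a-z0-9])p?pid=$1([^0-9]|\$)"
}

# Example usage (replace <master-log-file> with your Master log path):
#   stuck_regions < <master-log-file>
#   trace_pid 1234 < <master-log-file>
```

The encoded region names printed by stuck_regions are in the form that the assigns and unassigns commands expect as arguments.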
Master user interface
Status tables
You can find issues in your HBase tables by looking at the status tables section in the Master user interface home page. Look through the list of tables to identify if a table is ENABLED, ENABLING, DISABLED, or DISABLING. You can also take a look at the regions in transition states: OPEN, CLOSED. For example, there may be an issue if a table is ENABLED, some regions are not in the OPEN state, and the Master log entries do not have any ongoing assignments.
Procedures and locks
When an Apache HBase cluster is started, the Procedures & Locks page in the Master user interface is populated with information about the procedures, locks, and the count of WAL files. If the WAL file count does not go down after the cluster settles, procedures are likely blocked. You can identify those procedures and locks on this page.
You can also list the current locks and procedures from the HBase shell:
$ echo "list_locks"| hbase shell &> /tmp/locks.txt
$ echo "list_procedures"| hbase shell &> /tmp/procedures.txt
Apache HBase canary tool
$ hbase canary -f false -t 6000000 &>/tmp/canary.log
Use the -f parameter to look for failed region fetches, and set the -t parameter to run for a specified time.
Fixing Issues
Keep the following in mind when fixing issues using HBCK2. Ensure that:
- A region is not in the CLOSING state when you run “assign”, and not in the OPENING state when you run “unassign”. You can change the state using the setRegionState command. See the HBCK2 Tool Command Reference section for more information.
- You fix only one table at a time.
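The region-state precondition above can be expressed as a small guard to run before issuing a command. This is only an illustrative sketch: state_allows is a hypothetical helper, and you would supply the region's current state yourself, for example from the Master user interface.

```shell
# Return 0 if the given region state does not block the requested
# operation, and 1 if it does.
# Usage: state_allows <assigns|unassigns> <REGION_STATE>
state_allows() {
  case "$1:$2" in
    assigns:CLOSING)   return 1 ;;  # do not assign a region that is CLOSING
    unassigns:OPENING) return 1 ;;  # do not unassign a region that is OPENING
    *)                 return 0 ;;
  esac
}

# Example usage:
#   state_allows assigns CLOSING || echo "Change the region state first"
```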
Fix assign and unassign issues
You can fix assign and unassign issues by monitoring the current list of outstanding locks. An assignment takes an exclusive lock on the region, so an assign against a locked region waits until the lock is released.
Fix master startup cannot progress error
2020-04-01 22:07:42,792 WARN org.apache.hadoop.hbase.master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740 state=CLOSING, ts=1538456302300, server=ve1017.example.org,22101,1234567891012}; ServerCrashProcedures=true. Master startup cannot progress in holding-pattern until region onlined.
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar assigns 1588230740
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar assigns <hbase:namespace encoded region id>
$ echo "scan 'hbase:meta',{COLUMNS=>'info:regioninfo', FILTER=>\"PrefixFilter('hbase:namespace')\"}" | hbase shell
The namespace encoded region id is the value under the "ENCODED" field in the results.
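To avoid copying the value by hand, you can filter the scan output for the ENCODED field. The sketch below assumes that each result row prints the region info in the form {ENCODED => <id>, NAME => ...}; the function name encoded_region_ids is hypothetical.

```shell
# Print the value of every "ENCODED => <id>" field found on stdin.
# Assumes the scan output prints region info as {ENCODED => <id>, NAME => ...}.
encoded_region_ids() {
  sed -n 's/.*ENCODED => \([0-9a-f]\{1,\}\).*/\1/p'
}

# Example usage, reusing the scan command shown above:
#   echo "scan 'hbase:meta',{COLUMNS=>'info:regioninfo', FILTER=>\"PrefixFilter('hbase:namespace')\"}" \
#     | hbase shell | encoded_region_ids
```

If the filter prints nothing, inspect the raw scan output, because the format may differ from the one assumed here.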
Fix missing regions in hbase:meta region/table
If table regions have been removed from the hbase:meta table, you can use the addFsRegionsMissingInMeta command to resolve the issue. Ensure that the Master is online before you run it. This command is not as disruptive as the hbase:meta rebuild command.
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar addFsRegionsMissingInMeta <NAMESPACE|NAMESPACE:TABLENAME>
The command returns an HBCK2 “assigns” command that lists all the re-inserted regions. To complete the fix, you must restart the Master and then run the HBCK2 'assigns' command returned by the addFsRegionsMissingInMeta command.
Regions re-added into Meta: 2
WARNING:
2 regions were added to META, but these are not yet on Masters cache.
You need to restart Masters, then run hbck2 'assigns' command below:
assigns 7be03127c5e0e2acfc7cae7ddfa9e29e e50b8c1adc38c942e226a8b2976f0c8c
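If you save the addFsRegionsMissingInMeta output to a file, you can pull out the suggested assigns command instead of retyping the region names. This is a rough sketch that assumes the region names are 32-character hexadecimal strings on the same line as the word assigns, as in the sample output above; extract_assigns is a hypothetical helper name.

```shell
# Print the "assigns <region> <region> ..." command found on stdin.
# Matches the word "assigns" followed by one or more 32-character
# hexadecimal encoded region names.
extract_assigns() {
  grep -oE 'assigns( [0-9a-f]{32})+'
}

# Example usage (the file name is an assumption):
#   extract_assigns < /tmp/add-missing-regions.txt
```

Review the extracted command before running it, and run it only after the Master has been restarted.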
Fix extra regions in hbase:meta region/table
If there are extra regions in hbase:meta, it may be because of problems in splitting, deleting/moving the region directory manually, or in rare cases because of the loss of metadata.
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar extraRegionsInMeta --fix <NAMESPACE|NAMESPACE:TABLENAME>...
Rebuild hbase:meta
If hbase:meta is offline because it is corrupted, you can bring it back online if the corruption is not too critical. If the namespace region is among the missing regions, scan hbase:meta during initialization to check whether hbase:meta is online.
$ echo "scan 'hbase:meta', {COLUMN=>'info:regioninfo'}" | hbase shell
$ hbase hbck -j $HOME/hbase-operator-tools-<version>/hbase-hbck2/hbase-hbck2-<version>.jar addFsRegionsMissingInMeta <NAMESPACE|NAMESPACE:TABLENAME>
The addFsRegionsMissingInMeta command adds regions back to the hbase:meta table if the regioninfo file is present in the storage but the regions were deleted because of an issue.
HBCK2 Tool Command Reference
- addFsRegionsMissingInMeta <NAMESPACE|NAMESPACE:TABLENAME>...
  Options:
  -d,--force_disable Use this option to abort fix for table if disable fails.
  Supported in CDH 6.1.0 and later.
- assigns [OPTIONS] <ENCODED_REGIONNAME>...
  Options:
  -o,--override Use this option to override ownership by another procedure.
  Supported in CDH 6.1.0 and later.
- bypass [OPTIONS] <PID>...
  Options:
  -o,--override Use this option to override if procedure is running/stuck.
  -r,--recursive Use this option to bypass parent and its children.
  -w,--lockWait Use this option to wait (in milliseconds) before giving up; default=1.
  Supported in CDH 6.1.0 and later.
- extraRegionsInMeta <NAMESPACE|NAMESPACE:TABLENAME>...
  Options:
  -f,--fix Use this option to fix meta by removing all extra regions found.
  Supported in CDH 6.1.0 and later.
- filesystem [OPTIONS] [<TABLENAME>...]
  Options:
  -f,--fix Use this option to sideline corrupt HFiles, bad links, and references.
  Supported in CDH 6.1.0 and later.
- replication [OPTIONS] [<TABLENAME>...]
  Options:
  -f,--fix Use this option to fix replication issues.
  Supported in CDH 6.1.0 and later.
- reportMissingRegionsInMeta <NAMESPACE|NAMESPACE:TABLENAME>...
  Use this command when regions are missing from hbase:meta but their directories are still present in HDFS.
  Supported in CDH 6.1.0 and later.
- setRegionState <ENCODED_REGIONNAME> <STATE>
  Possible region states: OFFLINE, OPENING, OPEN, CLOSING, CLOSED, SPLITTING, SPLIT, FAILED_OPEN, FAILED_CLOSE, MERGING, MERGED, SPLITTING_NEW, MERGING_NEW, ABNORMALLY_CLOSED.
  CAUTION: Use this command only as a last resort. Example scenarios include unassigns or assigns that do not happen because the region is in an inconsistent state in hbase:meta.
  Supported in CDH 6.1.0 and later.
- setTableState <TABLENAME> <STATE>
  Possible table states and representations in hbase:meta table: ENABLED (\x08\x00), DISABLED (\x08\x01), DISABLING (\x08\x02), ENABLING (\x08\x03).
  Supported in CDH 6.1.0 and later.
- scheduleRecoveries <SERVERNAME>...
  Schedules a ServerCrashProcedure (SCP) for a list of RegionServers. Format each server name as '<HOSTNAME>,<PORT>,<STARTCODE>'.
  Supported in CDH 6.2.0 and later.
- unassigns <ENCODED_REGIONNAME>...
  Options:
  -o,--override Use this option to override ownership by another procedure.
  Supported in CDH 6.1.0 and later.
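Because scheduleRecoveries expects each server name in the '<HOSTNAME>,<PORT>,<STARTCODE>' form, a quick format check before you run it can catch copy-and-paste mistakes. The following sketch uses a hypothetical helper, is_server_name, and assumes that the port and start code are purely numeric and that host names contain only letters, digits, dots, hyphens, and underscores.

```shell
# Return 0 if $1 looks like '<HOSTNAME>,<PORT>,<STARTCODE>', 1 otherwise.
# Assumes numeric port and start code, and a host name made of
# letters, digits, dots, hyphens, and underscores.
is_server_name() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9._-]+,[0-9]+,[0-9]+$'
}

# Example usage:
#   is_server_name 've1017.example.org,22101,1234567891012' && echo "format looks valid"
```

The check validates only the shape of the name, not whether the RegionServer exists.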