You can run the SyncTable command with the --dryrun parameter to verify whether
tables are in sync between your source and destination clusters. With --dryrun,
SyncTable runs in read-only mode: it reports differences but does not change any data.
HashTable and SyncTable together form a tool implemented as two MapReduce jobs
that must be executed as individual steps. It is similar to the CopyTable tool,
which can copy either part of a table or an entire table. Unlike CopyTable, it
copies only the data that diverges between the two clusters, saving both network
and computing resources during the copy procedure.
-
Run the HashTable MapReduce job. This must be run on the cluster whose data is
copied to the remote peer, usually the source cluster.
hbase org.apache.hadoop.hbase.mapreduce.HashTable --families=cf test-tbl /hashes/test-tbl
…
20/04/28 05:05:48 INFO mapreduce.Job: map 100% reduce 100%
20/04/28 05:05:49 INFO mapreduce.Job: Job job_1587986840019_0001 completed successfully
20/04/28 05:05:49 INFO mapreduce.Job: Counters: 68
…
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=6811788
Once the HashTable job started by the above command completes, output files are
generated in the /hashes/test-tbl directory on the source cluster's HDFS. These
files are required as input for the SyncTable execution.
hdfs dfs -ls -R /hashes/test-tbl
drwxr-xr-x - root supergroup 0 2020-04-28 05:05 /hashes/test-tbl/hashes
-rw-r--r-- 2 root supergroup 0 2020-04-28 05:05 /hashes/test-tbl/hashes/_SUCCESS
drwxr-xr-x - root supergroup 0 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000
-rw-r--r-- 2 root supergroup 6790909 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000/data
-rw-r--r-- 2 root supergroup 20879 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000/index
-rw-r--r-- 2 root supergroup 99 2020-04-28 05:04 /hashes/test-tbl/manifest
-rw-r--r-- 2 root supergroup 153 2020-04-28 05:04 /hashes/test-tbl/partitions
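Before moving on, it can help to confirm that the HashTable output is complete. The following is a minimal sketch and not part of the HBase tooling: the function name check_hashtable_output is hypothetical, and it assumes, based only on the listing above, that a finished run leaves manifest, partitions, and hashes/_SUCCESS under the output directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads an `hdfs dfs -ls -R` listing on stdin and checks
# that the entries seen in the listing above exist under the given directory.
check_hashtable_output() {
  local dir="$1" listing required
  listing="$(cat)"
  for required in "hashes/_SUCCESS" "manifest" "partitions"; do
    # The path is the last whitespace-separated field of each listing line.
    if ! printf '%s\n' "$listing" |
        awk -v p="$dir/$required" '$NF == p { found = 1 } END { exit !found }'; then
      echo "missing: $dir/$required" >&2
      return 1
    fi
  done
  echo "HashTable output looks complete: $dir"
}

# Example usage (assumes the output directory used in this procedure):
#   hdfs dfs -ls -R /hashes/test-tbl | check_hashtable_output /hashes/test-tbl
```

The helper only inspects file names, so it cannot detect a truncated or corrupted hash file; it is meant as a quick sanity check before launching SyncTable.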
-
Launch SyncTable on the target peer. The following command runs SyncTable with
the --dryrun parameter against the HashTable output from the previous step.
hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://source-cluster-active-nn/hashes/test-tbl test-tbl test-tbl
…
org.apache.hadoop.hbase.mapreduce.SyncTable$SyncMapper$Counter
BATCHES=97148
HASHES_MATCHED=97146
HASHES_NOT_MATCHED=2
MATCHINGCELLS=17
MATCHINGROWS=2
RANGESNOTMATCHED=2
ROWSWITHDIFFS=2
SOURCEMISSINGCELLS=1
TARGETMISSINGCELLS=1
In the previous output, SyncTable reports two rows that diverge between source
and target (ROWSWITHDIFFS=2): one row has a cell value in the source that is not
present in the target (TARGETMISSINGCELLS=1), and the other row has a cell value
in the target that is not present in the source (SOURCEMISSINGCELLS=1).
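To turn the dry-run counters into a pass/fail check, the console output of the job can be saved to a file and scanned. The following is a minimal sketch, not an HBase feature. It assumes that the counters appear as NAME=value lines as in the output above, and that a run with no divergence either reports HASHES_NOT_MATCHED and ROWSWITHDIFFS as 0 or omits them. The function names and the log file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one counter from saved job output,
# or 0 if the counter does not appear in the file.
synctable_counter() {
  local name="$1" log="$2"
  awk -F= -v n="$name" '
    { gsub(/^[ \t]+|[ \t\r]+$/, "", $1) }   # trim whitespace around the name
    $1 == n { v = $2 }                      # exact match on the counter name
    END { print (v == "" ? 0 : v + 0) }
  ' "$log"
}

# Succeeds (exit status 0) only when no hash ranges and no rows differ.
tables_in_sync() {
  local log="$1"
  [ "$(synctable_counter HASHES_NOT_MATCHED "$log")" -eq 0 ] &&
    [ "$(synctable_counter ROWSWITHDIFFS "$log")" -eq 0 ]
}

# Example usage (assumes the dry-run console output was saved to this file):
#   tables_in_sync synctable-dryrun.log && echo "tables are in sync"
```

For the sample output shown above, the check would fail because HASHES_NOT_MATCHED and ROWSWITHDIFFS are both 2.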