Verifying and validating that your data is migrated

You can use the SyncTable command with the --dryrun parameter to verify whether the tables are in sync between your source and destination clusters. The --dryrun option makes the SyncTable run read-only: it reports differences without changing any data in the target table.

HashTable and SyncTable together form a tool implemented as two MapReduce jobs that must be executed as separate steps. It is similar to the CopyTable tool, which can copy either part of a table or the entire table. Unlike CopyTable, it copies only the data that diverges between the source and target clusters, saving both network and computing resources during the copy procedure.

  1. Run the HashTable MapReduce job. This must be run on the cluster whose data is copied to the remote peer, usually the source cluster.
    hbase org.apache.hadoop.hbase.mapreduce.HashTable --families=cf test-tbl /hashes/test-tbl
    …
    20/04/28 05:05:48 INFO mapreduce.Job:  map 100% reduce 100%
    20/04/28 05:05:49 INFO mapreduce.Job: Job job_1587986840019_0001 completed successfully
    20/04/28 05:05:49 INFO mapreduce.Job: Counters: 68
    …
    File Input Format Counters 
    Bytes Read=0
    File Output Format Counters 
    Bytes Written=6811788
    Once the HashTable job started by the above command completes, it generates output files in the /hashes/test-tbl directory on the source cluster's HDFS. These files are the input for the SyncTable run.
    hdfs dfs -ls -R /hashes/test-tbl
    drwxr-xr-x   - root supergroup          0 2020-04-28 05:05 /hashes/test-tbl/hashes
    -rw-r--r--   2 root supergroup          0 2020-04-28 05:05 /hashes/test-tbl/hashes/_SUCCESS
    drwxr-xr-x   - root supergroup          0 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000
    -rw-r--r--   2 root supergroup    6790909 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000/data
    -rw-r--r--   2 root supergroup      20879 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000/index
    -rw-r--r--   2 root supergroup         99 2020-04-28 05:04 /hashes/test-tbl/manifest
    -rw-r--r--   2 root supergroup        153 2020-04-28 05:04 /hashes/test-tbl/partitions
  2. Launch SyncTable on the target peer. The following command runs SyncTable against the HashTable output from the previous step. It uses the --dryrun parameter, so no data is modified.
    hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://source-cluster-active-nn/hashes/test-tbl test-tbl test-tbl
    …
    org.apache.hadoop.hbase.mapreduce.SyncTable$SyncMapper$Counter
    BATCHES=97148
    HASHES_MATCHED=97146
    HASHES_NOT_MATCHED=2
    MATCHINGCELLS=17
    MATCHINGROWS=2
    RANGESNOTMATCHED=2
    ROWSWITHDIFFS=2
    SOURCEMISSINGCELLS=1
    TARGETMISSINGCELLS=1

    In the previous output, SyncTable reports two diverging rows (ROWSWITHDIFFS=2): one row has a cell in the source that is not present in the target (TARGETMISSINGCELLS=1), and the other row has a cell in the target that is not present in the source (SOURCEMISSINGCELLS=1).
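
If you run the dry run regularly, reading the counters by eye becomes error-prone. The following is a minimal sketch of a helper that scans saved SyncTable --dryrun job output and reports whether any difference counter is non-zero. The function names (synctable_diff_report, synctable_in_sync) and the DIFF_COUNTERS variable are hypothetical, not part of HBase. The counter list is limited to the names that appear in the sample output above, and treating a counter that is absent from the log as zero is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with HBase: inspects SyncTable --dryrun
# job output (read from stdin) for counters that indicate divergence.

# Counters that signal a difference when greater than zero. Only names that
# appear in the sample output above are listed; extend the list as needed.
DIFF_COUNTERS="HASHES_NOT_MATCHED RANGESNOTMATCHED ROWSWITHDIFFS SOURCEMISSINGCELLS TARGETMISSINGCELLS"

synctable_diff_report() {
  # Print NAME=VALUE for every listed counter whose value is above zero.
  # Assumption: a counter that is missing from the log counts as zero.
  awk -F= -v names="$DIFF_COUNTERS" '
    BEGIN {
      n = split(names, want, " ")
      for (i = 1; i <= n; i++) wanted[want[i]] = 1
    }
    {
      key = $1
      gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim indentation around the name
      val = $2
      gsub(/[ \t\r]+$/, "", val)         # trim trailing whitespace
      if ((key in wanted) && val ~ /^[0-9]+$/ && val + 0 > 0)
        print key "=" val
    }'
}

synctable_in_sync() {
  # Exit status 0 when no listed counter is non-zero, 1 otherwise.
  [ -z "$(synctable_diff_report)" ]
}
```

For example, assuming the job output was captured in a file named synctable-dryrun.log (a hypothetical name), `synctable_in_sync < synctable-dryrun.log` returns success only when the tables appear to be in sync, and `synctable_diff_report < synctable-dryrun.log` lists the counters that need attention.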