Verifying and validating if your data is migrated

You can run the SyncTable command with the --dryrun parameter to verify whether the tables on your source and destination clusters are in sync. With --dryrun, SyncTable runs in read-only mode: it reports differences but does not write any data.

HashTable and SyncTable together form a tool that is implemented as two MapReduce jobs, which must be executed as separate steps. The tool is similar to CopyTable, which can copy either part of a table or an entire table. Unlike CopyTable, it copies only the data that diverges between the two clusters, which saves network and computing resources during the copy.

Run the HashTable MapReduce job. This must be run on the cluster whose data is copied to the remote peer, usually the source cluster.

hbase org.apache.hadoop.hbase.mapreduce.HashTable --families=cf test-tbl /hashes/test-tbl
…
20/04/28 05:05:48 INFO mapreduce.Job:  map 100% reduce 100%
20/04/28 05:05:49 INFO mapreduce.Job: Job job_1587986840019_0001 completed successfully
20/04/28 05:05:49 INFO mapreduce.Job: Counters: 68
…
File Input Format Counters 
Bytes Read=0
File Output Format Counters 
Bytes Written=6811788

When the HashTable job completes, it generates output files in the /hashes/test-tbl directory on the source cluster's HDFS. SyncTable needs these files as input.

hdfs dfs -ls -R /hashes/test-tbl
drwxr-xr-x   - root supergroup          0 2020-04-28 05:05 /hashes/test-tbl/hashes
-rw-r--r--   2 root supergroup          0 2020-04-28 05:05 /hashes/test-tbl/hashes/_SUCCESS
drwxr-xr-x   - root supergroup          0 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000
-rw-r--r--   2 root supergroup    6790909 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000/data
-rw-r--r--   2 root supergroup      20879 2020-04-28 05:05 /hashes/test-tbl/hashes/part-r-00000/index
-rw-r--r--   2 root supergroup         99 2020-04-28 05:04 /hashes/test-tbl/manifest
-rw-r--r--   2 root supergroup        153 2020-04-28 05:04 /hashes/test-tbl/partitions
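
Before starting SyncTable, it can help to confirm that the listing contains the expected entries. The following is a minimal sketch and not part of the HBase tooling: the function name `check_hash_output` is hypothetical, and it assumes that the output of `hdfs dfs -ls -R` is piped in on standard input and that a complete run leaves the `hashes/_SUCCESS`, `manifest`, and `partitions` entries shown in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: reads `hdfs dfs -ls -R <hash-dir>` output on stdin and
# checks that the entries seen in the example listing are present.
check_hash_output() {
  listing=$(cat)
  for name in hashes/_SUCCESS manifest partitions; do
    # Each listing line ends with the full path, so match the path suffix.
    if ! printf '%s\n' "$listing" | grep -q "/${name}\$"; then
      echo "missing: ${name}" >&2
      return 1
    fi
  done
  echo "HashTable output looks complete"
}

# Example usage (assumes the hdfs client is on the PATH):
#   hdfs dfs -ls -R /hashes/test-tbl | check_hash_output
```

The function returns a non-zero status when an entry is missing, so a wrapper script can stop before launching SyncTable.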

Launch SyncTable on the target peer. The following command runs SyncTable with the --dryrun parameter against the HashTable output from the previous step.
```
hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://source-cluster-active-nn/hashes/test-tbl test-tbl test-tbl
…
org.apache.hadoop.hbase.mapreduce.SyncTable$SyncMapper$Counter
BATCHES=97148
HASHES_MATCHED=97146
HASHES_NOT_MATCHED=2
MATCHINGCELLS=17
MATCHINGROWS=2
RANGESNOTMATCHED=2
ROWSWITHDIFFS=2
SOURCEMISSINGCELLS=1
TARGETMISSINGCELLS=1
```
In this output, SyncTable reports two diverging rows (ROWSWITHDIFFS=2): one row has a cell in the source that is not present in the target (TARGETMISSINGCELLS=1), and the other row has a cell in the target that is not present in the source (SOURCEMISSINGCELLS=1).
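
A dry run can also be checked from a script by reading the counters in the job output. The sketch below is an illustration only: the function name `synctable_in_sync` is hypothetical, and it assumes that the counters appear as `NAME=value` lines, as in the output above. It treats the tables as in sync when `HASHES_NOT_MATCHED` is zero or absent (assumption: a counter that was never incremented might not be printed).

```shell
#!/bin/sh
# Hypothetical helper: reads SyncTable job output on stdin and reports whether
# the dry run found hash batches that differ between source and target.
synctable_in_sync() {
  # Extract the value of HASHES_NOT_MATCHED; the anchored pattern does not
  # match the HASHES_MATCHED line.
  not_matched=$(sed -n 's/^[[:space:]]*HASHES_NOT_MATCHED=\([0-9][0-9]*\).*$/\1/p' | tail -n 1)
  # Assumption: a missing counter means no mismatches were recorded.
  not_matched=${not_matched:-0}
  if [ "$not_matched" -eq 0 ]; then
    echo "in sync"
  else
    echo "out of sync: ${not_matched} hash batches differ"
    return 1
  fi
}

# Example usage (assumes the job log is written to the console, so stderr is
# redirected as well; the SyncTable arguments are elided here):
#   hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun ... 2>&1 | synctable_in_sync
```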
Note: Replace the parameters in the examples above with the values for your environment.

The HashTable and SyncTable jobs are designed to operate on individual tables. If multiple tables need to be migrated, you must execute these jobs separately for each table.
If data in a table is modified on the source or the destination, through either ingestion or deletion, the job reports mismatches. To narrow the scope of the data being checked, use the --starttime and --endtime options. For more information, see the HashTable reference guide section.
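
The two points above can be combined in a small script that prepares one HashTable command per table, each limited to a time window. This is a minimal sketch under stated assumptions: the function name `hashtable_window_cmds`, the table names `table-a` and `table-b`, and the `/hashes/<table>` output layout are hypothetical; the column family `cf` comes from the earlier example; and the time options are assumed to take milliseconds since the epoch, which is the default unit of HBase cell timestamps. The script only prints the commands so that they can be reviewed before they are run.

```shell
#!/bin/sh
# Hypothetical helper: prints one HashTable command per table, restricted to
# a time window. Arguments: start and end as epoch seconds, then table names.
hashtable_window_cmds() {
  # Convert seconds to milliseconds (assumed unit for --starttime/--endtime).
  start_ms=$(( $1 * 1000 ))
  end_ms=$(( $2 * 1000 ))
  shift 2
  for table in "$@"; do
    echo "hbase org.apache.hadoop.hbase.mapreduce.HashTable" \
      "--starttime=${start_ms} --endtime=${end_ms}" \
      "--families=cf ${table} /hashes/${table}"
  done
}

# Print the commands for two hypothetical tables and a one-day window.
hashtable_window_cmds 1588000000 1588086400 table-a table-b
```

Each printed command still needs its own SyncTable run on the target peer, because the jobs operate on one table at a time.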
