Synchronize table data using HashTable/SyncTable tool

The HashTable/SyncTable tool can be used for partial or entire table data synchronization, under the same or remote cluster.

  • Ensure that all RegionServers/DataNodes on the source cluster is accessible by the NodeManagers on the target cluster where SyncTable job tasks will be running.
  • In the case of secured clusters, the user on the target cluster who executes the SyncTable job must be able to do the following on the HDFS and HBase services of the source cluster:
    • Authenticate: for example, using centralized authentication or cross-realm setup.
    • Be authorized: having at least read permission.
  1. Run HashTable on the source table cluster: HashTable [options] <tablename> <outputpath>
    The following is an example to hash the TesTable in 32kB batches for a 1 hour window into 50 files:
    $ hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTableA /hashes/testTable
  2. Run SyncTable on the target cluster: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>

    The following is an example for a dry run SyncTable of tableA from a remote source cluster to a local tableB on the target cluster:

    $ hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:8020/hashes/testTable testTableA testTableB