Use snapshots

A snapshot captures the state of a table at the time the snapshot was taken.

Cloudera recommends snapshots instead of CopyTable where possible. Because no data is copied when a snapshot is taken, the process is very quick. As long as the snapshot exists, cells in the snapshot are never deleted from HBase, even if they are explicitly deleted by the API. Instead, they are archived so that the snapshot can restore the table to its state at the time of the snapshot.

After taking a snapshot, use the clone_snapshot command to copy the data to a new (immediately enabled) table in the same cluster, or use the Export utility to create a new table based on the snapshot in the same cluster. Cloning is a copy-on-write operation. The new table shares HFiles with the original table until data is written to the new table but not the original, or until a compaction or split occurs in either table. This can improve performance in the short term compared to CopyTable.

To export the snapshot to a new cluster, use the ExportSnapshot utility, which uses MapReduce to copy the snapshot to the new cluster. Run the ExportSnapshot utility on the source cluster, as a user with HBase and HDFS write permission on the destination cluster and HDFS read permission on the source cluster. This creates the expected amount of IO load on the destination cluster. Optionally, you can limit bandwidth consumption, which affects IO on the destination cluster. After the ExportSnapshot operation completes, you can see the snapshot in the new cluster using the list_snapshots command, and you can use the clone_snapshot command to create the table in the new cluster from the snapshot.
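
As a rough illustration of the export step, the following shell sketch assembles an ExportSnapshot command line for a remote destination and prints it rather than running it. The helper name build_export_cmd and the destination URI hdfs://dest.example.com:8020/hbase are hypothetical placeholders, not part of HBase; the flags are the ones used in the example later in this section.

```shell
# Sketch: assemble (but do not run) an ExportSnapshot command line.
# build_export_cmd is a hypothetical helper, not part of HBase.
build_export_cmd() {
  snapshot="$1"          # existing snapshot on the source cluster
  dest="$2"              # destination URI passed to -copy-to
  mappers="$3"           # number of map tasks passed to -mappers
  bandwidth_mb="${4:-}"  # optional limit passed to -bandwidth

  cmd="hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot"
  cmd="$cmd -snapshot $snapshot -copy-to $dest -mappers $mappers"
  # Add the bandwidth limit only when one was requested.
  if [ -n "$bandwidth_mb" ]; then
    cmd="$cmd -bandwidth $bandwidth_mb"
  fi
  printf '%s\n' "$cmd"
}

# Placeholder destination; replace it with the destination cluster's URI.
build_export_cmd TestTableSnapshot hdfs://dest.example.com:8020/hbase 16 200
```

Omitting the fourth argument leaves out -bandwidth, so the printed command carries no bandwidth limit. Review the printed command before running it on the source cluster.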

For full instructions for the snapshot and clone_snapshot HBase Shell commands, run the HBase Shell and type help 'snapshot'. The following example takes a snapshot of a table, uses it to clone the table to a new table in the same cluster, and then uses the ExportSnapshot utility to copy the snapshot to another location (file:///tmp/hbase in this example; to copy to a different cluster, specify a URI on that cluster instead), with 16 mappers and bandwidth limited to 200 MB/sec.

$ bin/hbase shell
   hbase(main):005:0> snapshot 'TestTable', 'TestTableSnapshot'
   0 row(s) in 2.3290 seconds
   
   hbase(main):006:0> clone_snapshot 'TestTableSnapshot', 'NewTestTable'
   0 row(s) in 1.3270 seconds
   
   hbase(main):007:0> describe 'NewTestTable'
   DESCRIPTION                                          ENABLED
   'NewTestTable', {NAME => 'cf1', DATA_BLOCK_ENCODING  true
   => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE
   => '0', VERSIONS => '1', COMPRESSION => 'NONE', MI
   N_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_C
   ELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY =>
   'false', BLOCKCACHE => 'true'}, {NAME => 'cf2', DA
   TA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',
   REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESS
   ION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER
   ', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '655
   36', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
   1 row(s) in 0.1280 seconds
   hbase(main):008:0> quit
   
   $ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot TestTableSnapshot -copy-to file:///tmp/hbase -mappers 16 -bandwidth 200
   14/10/28 21:48:16 INFO snapshot.ExportSnapshot: Copy Snapshot Manifest
   14/10/28 21:48:17 INFO client.RMProxy: Connecting to ResourceManager at a1221.example.com/192.0.2.121:8032
   14/10/28 21:48:19 INFO snapshot.ExportSnapshot: Loading Snapshot 'TestTableSnapshot' hfile list
   14/10/28 21:48:19 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
   14/10/28 21:48:19 INFO util.FSVisitor: No logs under directory:hdfs://a1221.example.com:8020/hbase/.hbase-snapshot/TestTableSnapshot/WALs
   14/10/28 21:48:20 INFO mapreduce.JobSubmitter: number of splits:0
   14/10/28 21:48:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1414556809048_0001
   14/10/28 21:48:20 INFO impl.YarnClientImpl: Submitted application application_1414556809048_0001
   14/10/28 21:48:20 INFO mapreduce.Job: The url to track the job: http://a1221.example.com:8088/proxy/application_1414556809048_0001/
   14/10/28 21:48:20 INFO mapreduce.Job: Running job: job_1414556809048_0001
   14/10/28 21:48:36 INFO mapreduce.Job: Job job_1414556809048_0001 running in uber mode : false
   14/10/28 21:48:36 INFO mapreduce.Job:  map 0% reduce 0%
   14/10/28 21:48:37 INFO mapreduce.Job: Job job_1414556809048_0001 completed successfully
   14/10/28 21:48:37 INFO mapreduce.Job: Counters: 2
   Job Counters
   Total time spent by all maps in occupied slots (ms)=0
   Total time spent by all reduces in occupied slots (ms)=0
   14/10/28 21:48:37 INFO snapshot.ExportSnapshot: Finalize the Snapshot Export
   14/10/28 21:48:37 INFO snapshot.ExportSnapshot: Verify snapshot integrity
   14/10/28 21:48:37 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
   14/10/28 21:48:37 INFO snapshot.ExportSnapshot: Export Completed: TestTableSnapshot

In the output, the line beginning with "The url to track the job:" contains the URL from which you can track the ExportSnapshot job. When the job finishes, a new set of HFiles, comprising all of the HFiles that were part of the table when the snapshot was taken, is created at the destination you specified with -copy-to.

You can use the SnapshotInfo command-line utility included with HBase to verify or debug snapshots.
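
As a hedged illustration, SnapshotInfo can be started through the hbase launcher in the same way as ExportSnapshot. The shell sketch below only prints a candidate command line. The helper name snapshot_info_cmd is hypothetical, and the -files and -stats options are assumed to be available in your HBase version; check the utility's usage output to confirm.

```shell
# Sketch: print (but do not run) a SnapshotInfo command line.
# snapshot_info_cmd is a hypothetical helper, not part of HBase.
snapshot_info_cmd() {
  snapshot="$1"  # name of the snapshot to inspect
  # -files and -stats are assumed options; confirm them for your version.
  printf 'hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo -snapshot %s -files -stats\n' "$snapshot"
}

snapshot_info_cmd TestTableSnapshot
```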