Configure bulk load replication

Bulk loading is the process of preparing HFiles and loading them directly into HBase RegionServers, bypassing the write path. If you frequently bulk load data into HBase and want that data replicated, you must configure bulk load replication.

Because bulk loading bypasses the write path and does not generate WALs, bulk loaded data is not replicated to the backup cluster by default. Bypassing the write path is also what makes bulk loading attractive, because it prevents many issues such as:
  • MemStores getting full
  • WALs getting bigger
  • Compaction and flush queues becoming long
  • The garbage collector getting out of control because the inserts range in the megabytes
  • Latency increasing when importing data

Standard HBase replication uses a source-push methodology. When the active cluster (source) receives an edit to a column family with replication enabled, that edit is propagated to all destination clusters using the WAL for that column family on the RegionServer managing the relevant region. However, in the case of a bulk load, only the event (the fact that a bulk load is happening) is captured in the WAL, with a reference to the HFile.

The data being loaded is not recorded in the WAL. When bulk load replication is enabled, the active HBase RegionServer also sends these WAL entries to the peer cluster. The peer cluster reads these WAL entries and copies the HFiles from the active source cluster into the peer cluster staging directory. From there, it is a standard bulk load.
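
The flow described above can be illustrated with a small toy model. This is only a sketch in plain Python and does not use any HBase API: directories stand in for the clusters, a list stands in for the WAL, and the names BulkLoadMarker, bulk_load, replicate, and demo are hypothetical and chosen for illustration. It shows that the WAL holds only a marker that references the HFile, and that the peer copies the referenced file into its staging directory before performing its own bulk load.

```python
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BulkLoadMarker:
    """Toy WAL entry: records that a bulk load happened, not the data itself."""
    table: str
    family: str
    hfile: Path  # reference to the loaded HFile on the source side


def bulk_load(store_dir, hfile, table, family, wal):
    """Place an HFile directly in a store directory and log only a marker."""
    store_dir.mkdir(parents=True, exist_ok=True)
    target = store_dir / hfile.name
    shutil.copy2(hfile, target)
    wal.append(BulkLoadMarker(table, family, target))
    return target


def replicate(source_wal, peer_staging, peer_root):
    """Peer side: copy each referenced HFile into staging, then bulk load it."""
    peer_wal = []  # the peer keeps its own toy WAL
    loaded = []
    for entry in source_wal:
        peer_staging.mkdir(parents=True, exist_ok=True)
        staged = peer_staging / entry.hfile.name
        shutil.copy2(entry.hfile, staged)  # fetch the HFile from the source
        store_dir = peer_root / entry.table / entry.family
        loaded.append(
            bulk_load(store_dir, staged, entry.table, entry.family, peer_wal)
        )
    return loaded


def demo():
    """Bulk load one file on the source and replicate it to the peer."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        hfile = root / "prepared" / "hfile-0001"
        hfile.parent.mkdir(parents=True)
        hfile.write_bytes(b"row1:value1")

        source_wal = []
        bulk_load(root / "source" / "t1" / "cf", hfile, "t1", "cf", source_wal)
        loaded = replicate(
            source_wal, root / "peer" / "staging", root / "peer" / "data"
        )
        return len(source_wal), [p.read_bytes() for p in loaded]
```

Calling demo() bulk loads one file on the toy source and replays the marker on the toy peer. In a real deployment the copy step reads from the source cluster's file system rather than a local directory.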