Timeline Consistency
With timeline consistency, HBase introduces a Consistency definition that can be provided per read operation (get or scan):
public enum Consistency { STRONG, TIMELINE }
Consistency.STRONG is the default consistency model provided by HBase. If a table has region replication = 1, or has region replicas but the read is performed with Consistency.STRONG, the read is always served by the primary region. This preserves the previous behavior; the client receives the latest data.
If a read is performed with Consistency.TIMELINE, the read RPC is sent to the primary RegionServer first. After a short interval (hbase.client.primaryCallTimeout.get, 10 ms by default), parallel RPCs to the secondary region replicas are sent if the primary has not responded. HBase returns the result from whichever RPC finishes first. If the response comes from the primary region replica, the data is current. You can use the Result.isStale() API to determine the source of the resulting data:
If the result is from a primary region, Result.isStale() is set to false.
If the result is from a secondary region, Result.isStale() is set to true.
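For illustration, a timeline-consistent get with a staleness check might look like the following sketch against the HBase client API (the connection settings, the table name t1, and the row key are placeholders; this assumes a live cluster with region replication enabled):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineGetExample {
  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("t1"))) { // placeholder table
      Get get = new Get(Bytes.toBytes("row1"));                  // placeholder row key
      get.setConsistency(Consistency.TIMELINE);                  // per-operation consistency
      Result result = table.get(get);
      if (result.isStale()) {
        // Served by a secondary replica; the data may lag the primary.
      }
    }
  }
}
```

A Scan can be marked the same way via Scan.setConsistency(Consistency.TIMELINE).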
TIMELINE consistency as implemented by HBase differs from pure eventual consistency in the following respects:
Single-homed and ordered updates: Whether or not region replication is enabled, on the write side there is still only one defined replica (the primary) that can accept writes. This replica is responsible for ordering the edits and preventing conflicts, which guarantees that two different writes are never committed at the same time by different replicas, resulting in divergent data. With this approach, there is no need for read-repair or last-timestamp-wins types of conflict resolution.
The secondary replicas also apply edits in the order that the primary committed them, thus the secondaries contain a snapshot of the primary's data at any point in time. This is similar to RDBMS replications and HBase’s own multi-datacenter replication, but in a single cluster.
On the read side, the client can detect whether the read is coming from up-to-date data or is stale data. Also, the client can issue reads with different consistency requirements on a per-operation basis to ensure its own semantic guarantees.
The client might still read stale data if it receives data from one secondary replica first, followed by reads from another secondary replica. There is no stickiness to region replicas, nor is there a transaction ID-based guarantee. If required, this can be implemented later.
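The primary-first, hedged dispatch described earlier (send the read RPC to the primary, then fan out to secondaries once hbase.client.primaryCallTimeout.get elapses) can be sketched with plain JDK concurrency utilities. This is a simplified simulation for illustration, not HBase's implementation; the callables stand in for the RPCs:

```java
import java.util.concurrent.*;

// Plain-JDK simulation of the hedged-read dispatch: issue the primary call
// first; if it does not finish within primaryCallTimeoutMs, race it against a
// parallel call to a secondary replica and take whichever completes first.
public class HedgedReadSketch {

  static final class ReadResult {
    final String value;
    final boolean stale; // true when served by a secondary replica
    ReadResult(String value, boolean stale) { this.value = value; this.stale = stale; }
  }

  static ReadResult hedgedGet(ExecutorService pool,
                              Callable<String> primary,
                              Callable<String> secondary,
                              long primaryCallTimeoutMs) throws Exception {
    Future<String> primaryCall = pool.submit(primary);
    try {
      // Give the primary a head start (hbase.client.primaryCallTimeout.get).
      return new ReadResult(primaryCall.get(primaryCallTimeoutMs, TimeUnit.MILLISECONDS), false);
    } catch (TimeoutException slowPrimary) {
      // Primary is slow: hedge with a parallel call to a secondary replica.
      CompletionService<ReadResult> race = new ExecutorCompletionService<>(pool);
      race.submit(() -> new ReadResult(primaryCall.get(), false));
      race.submit(() -> new ReadResult(secondary.call(), true));
      return race.take().get(); // first RPC to finish wins
    }
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newCachedThreadPool();
    // Simulated slow primary (50 ms) vs. fast secondary (5 ms), 10 ms hedge delay.
    ReadResult r = hedgedGet(pool,
        () -> { Thread.sleep(50); return "fresh"; },
        () -> { Thread.sleep(5);  return "possibly-stale"; },
        10);
    System.out.println(r.value + " stale=" + r.stale);
    pool.shutdownNow();
  }
}
```

Note that, as in HBase, the primary call is not cancelled when the hedge fires; a late primary response can still win the race and yield non-stale data.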
Memory Accounting
Secondary region replicas refer to data files in the primary region replica, but they have their own MemStores (in HA Phase 2) and use block cache as well. However, one distinction is that secondary region replicas cannot flush data when there is memory pressure for their MemStores. They can only free up MemStore memory when the primary region does a flush and the flush is replicated to the secondary.
Because a RegionServer can host primary replicas for some regions and secondaries for others, secondary replicas might generate extra flushes in primary regions on the same host. In extreme situations, there might be no memory left to accept new writes coming from the primary via WAL replication.
To resolve this situation, the secondary replica is allowed to do a “store file refresh”: a file system list operation picks up new files from the primary, possibly allowing the secondary to drop its MemStore. This refresh is performed only if the MemStore size of the biggest secondary region replica is at least hbase.region.replica.storefile.refresh.memstore.multiplier times bigger than the biggest MemStore of a primary replica. (The default value for hbase.region.replica.storefile.refresh.memstore.multiplier is 4.)
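For example, the threshold could be raised in hbase-site.xml so that refreshes happen less often (a configuration sketch; the value 8 is an illustrative choice, the default being 4):

```xml
<property>
  <name>hbase.region.replica.storefile.refresh.memstore.multiplier</name>
  <value>8</value>
</property>
```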
Note: If this operation is performed, the secondary replica might obtain partial row updates across column families (because column families are flushed independently). We recommend that you configure HBase not to perform this operation frequently. You can disable this feature by setting the value to a large number, but this might cause replication to be blocked without resolution.
Secondary Replica Failover
When a secondary region replica first comes online, or after a secondary region fails over, it may contain stale edits in its MemStore. The secondary replica must ensure that it does not access stale data (data that has been overwritten) before serving requests after assignment. Therefore, the secondary waits until it detects a full flush cycle (start flush, commit flush) or a “region open event” replicated from the primary.
Until the flush cycle occurs, the secondary region replica rejects all read requests via an IOException with the following message:
The region's reads are disabled
Other replicas are probably still available to read, so RPCs issued with TIMELINE consistency are not affected.
To facilitate faster recovery, the secondary region triggers a flush request from the primary when it is opened. The configuration property hbase.region.replica.wait.for.primary.flush (enabled by default) can be used to disable this feature if needed.
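As a sketch, the feature could be disabled in hbase-site.xml (the property name comes from the text above; setting it to false turns off the wait):

```xml
<property>
  <name>hbase.region.replica.wait.for.primary.flush</name>
  <value>false</value>
</property>
```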