ACID tables replication
Hive managed tables supporting insert, update, delete operations with ACID semantics are called ACID tables. ACID tables support only ORC file format.
-
Hive managed tables supporting Insert-only operations with ACID semantics are called MM (Micro-Managed) OR Insert-Only ACID tables. Supports all file formats.
- All transactions run at Snapshot Isolation level which allows concurrent read/write queries to execute without any locks
- Supports SQL Merge
- Supports Insert-only Streaming Ingest API
- Supports compaction to merge smaller data files and vacuum obsolete data without blocking concurrent queries
Replicating ACID tables in DLM
ACID table replication involves full ACID and micro-managed table replication. Full-scale ACID table replication is supported ONLY on HDP 3.x clusters. In HDP 3.x, by default, all the Managed tables are defined as ACID tables. As part of replicating ACID table, Hive replicates both data and metadata and the transactional state of the table. In other words, the ACID semantics are applicable on replicated ACID table as well. ACID tables replication supports both bootstrap and incremental modes of replication.
Caution | |
---|---|
For streaming ingestion using HiveStreamingConnection API,
only transaction batch size of 1 is supported. If user performs ingestion with
transactions batch size > 1, replication might fail. |
Bootstrap replication
Bootstrap replication takes a snapshot of all ACID tables data and replicates it.
Replicating a snapshot of data makes sure only committed data gets replicated across
clusters and reduce bandwidth utilization. Bootstrap dump gets blocked if there are
any transactions that was opened earlier and which are still in running state. If
any long running transactions are present, user should abort it or else Hive would
automatically abort those transactions after timeout value configured in
hive.repl.bootstrap.dump.open.txn.timeout
.
Incremental replication
Incremental replication replicates the transactional state of tables as it is and ensures Point-in-time consistency at all times. The replicated table is readable at any point of time during and after replication.
- HDP 2.6.5 does not support ACID tables replication.
- Hive compaction should be auto-triggered in the replica cluster as replication does not support replicating the compacted data files from primary to replica. If compaction is not triggered at replica cluster, it may cause overload on HMS-backend database and queries may slow down or could fail.
- The transactions opened by Hive replication process in the replica cluster shall not be timed-out. If replication process gets stalled or fails for reasons such as network delay, user should not manually abort it. Instead, wait for DLM to retry after failure.
- If Hive replication fails with non-recoverable failures, it needs to re-bootstrap the whole database by manually dropping bootstrap from scratch.
HDP version upgrade
Existing DLM replication policies on HDP 2.6.5 release, when upgraded to the HDP 3.x release, shall be treated as normal and will continue to be used. By default, ACID tables are not replicated by replication policies running on HDP 2.6.5 clusters. Once upgraded to HDP 3.x, they will be automatically enabled and gets replicated from the next replication schedule.