Hive tables - Managed and External
Managed tables are Hive-owned tables where the entire lifecycle of the tables' data is managed and controlled by Hive. External tables are tables where Hive has loose coupling with the data.
All write operations to Managed tables are performed using Hive SQL commands. If a Managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. Transactional (ACID) semantics are also supported only on Managed tables.
Writes to External tables can be performed using Hive SQL commands, but the data files can also be accessed and managed by processes outside of Hive. If an External table or partition is dropped, only the metadata associated with the table or partition is deleted; the underlying data files stay intact. A typical use case for External tables is running analytical queries via Hive on data owned by HBase or Druid, where the data files are written by HBase or Druid and Hive reads them for analytics.
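The distinction can be illustrated with Hive DDL. The following is a minimal sketch; the table names, columns, and location path are illustrative, not taken from this document:

```sql
-- Managed (ACID) table: Hive owns both data and metadata;
-- DROP TABLE removes both. Requires ORC storage for full ACID.
CREATE TABLE sales_managed (
  id INT,
  amount DECIMAL(10,2)
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- External table: Hive tracks only the metadata;
-- DROP TABLE leaves the files at LOCATION intact.
CREATE EXTERNAL TABLE sales_external (
  id INT,
  amount DECIMAL(10,2)
)
STORED AS ORC
LOCATION '/ext/sales_data';
```

Dropping `sales_external` removes only its entry from the metastore, so an outside process that wrote the files under /ext/sales_data can continue to use them.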
Hive supports replication of External tables, with their data, to a target cluster, and it retains all the properties of External tables.
Data file permissions and ownership are preserved so that the relevant external processes can continue to write to them even after failover.
To handle conflicts in External tables' data location caused by replication from multiple source clusters to the same target cluster, Replication Manager assigns a unique base directory to each source cluster, under which the External table data from the corresponding source cluster is copied. For example, if the External table location on a source cluster is /ext/hbase_data, then after replication the location on the target cluster is <base_dir>/ext/hbase_data. Users can track the new location of External tables using the DESCRIBE TABLE command.
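For example, assuming a replicated external table named hbase_events (a hypothetical name), the relocated path appears in the Location field of the extended table description; the namenode address shown in the comment is illustrative:

```sql
-- Inspect the replicated table's metadata on the target cluster.
DESCRIBE FORMATTED hbase_events;
-- In the output, check the Location field, e.g.:
--   Location: hdfs://target-nn:8020/<base_dir>/ext/hbase_data
```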
Handling replication conflicts between HDFS and Hive External Table locations:
When you run a Hive replication policy on an external table, the data is stored in a specific location in the target directory. When you then run an HDFS replication policy that tries to copy data to the same external table location, DLM Engine ensures that the Hive data is not overwritten by HDFS. For example, running the Hive policy on an external table creates a target directory named /tmp/db1/ext1. When the HDFS policy is executed, it must not overwrite that data by replicating into the /tmp/db1/ext1 directory.
How to avoid conflicts during the External Tables replication process
When two Hive replication policies on DB1 and DB2 (whether from the same source cluster or from different clusters) have external tables pointing to the same data location (for example, /abc), and both are replicated to the same target cluster, you must set different paths in the external table base directory configuration for the two policies (for example, /db1 for DB1 and /db2 for DB2). This arrangement ensures that the target external table data locations differ for the two databases (/db1/abc and /db2/abc, respectively).
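At the Hive command level, the base directory can be set per policy with the hive.repl.replica.external.table.base.dir property. The following is a sketch only: the REPL LOAD syntax varies across Hive versions, the staging directory placeholders are hypothetical, and Replication Manager normally exposes this as a policy setting rather than requiring manual commands:

```sql
-- Policy for DB1: external table data for /abc lands under /db1/abc
REPL LOAD db1 FROM '<db1_staging_dir>'
WITH ('hive.repl.replica.external.table.base.dir'='/db1');

-- Policy for DB2: the same source location /abc maps to /db2/abc instead
REPL LOAD db2 FROM '<db2_staging_dir>'
WITH ('hive.repl.replica.external.table.base.dir'='/db2');
```

Because each policy resolves the shared source path against its own base directory, the two replicated copies never collide on the target cluster.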