Hive tables - Managed and External
Managed tables are Hive owned tables where the entire lifecycle of the tables’ data are managed and controlled by Hive. External tables are tables where Hive has loose coupling with the data.
All the write operations to the Managed tables are performed using Hive SQL commands. If a Managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. The transactional semantics (ACID) are also supported only on Managed tables.
The writes on External tables can be performed using Hive SQL commands but data files can also be accessed and managed by processes outside of Hive. If an External table or partition is dropped, only the metadata associated with the table or partition is deleted but the underlying data files stay intact. A typical example for External table is to run analytical queries on HBase or Druid owned data via Hive, where data files are written by HBase or Druid and Hive reads them for analytics.
Hive supports replication of External tables with data to target cluster and it retains all the properties of External tables.
The data files permission and ownership are preserved so that the relevant external processes can continue to write in it even after failover.
Hive Materialized Views replication is not supported. However, DLM does not skip it from getting replicated, but it may not work as expected in the target cluster.
For handling conflicts in External tables’ data location due to replication from multiple
source clusters to same target cluster, DLM assigns a unique base directory for each
source cluster under which, External tables data from corresponding source cluster would
be copied. For example, if External table location at a source cluster is
/ext/hbase_data and after replication, the location in target cluster would
be <base_dir>/ext/hbase_data. Users can track the new location of External
DESCRIBE TABLE command.
DLM upgrade use-cases: In a normal scenario, if you had External tables that were replicated as Managed tables, after the upgrade process, you must drop those tables from target and set the base directory. In the next instance they get replicated as External tables.
When you are running Hive policies, and after upgrading from DLM 1.4.0 to 1.5.0, the running policies, if they enter the suspended state, you must reactivate them.
When you are replicating from HDP 2.6.5 to 3.1 cluster, if the source table is managed by Hive but the table location is not owned by Hive, at the target cluster, the table is created as Managed table. Later, if you upgrade 2.6.5 to 3.1 using the upgrade tool, the Managed table is automatically converted to External table. But the same rule is not followed during the replication process. It may happen that, the table type is External at the source cluster but is Managed at the target cluster. You must make sure that before you upgrade, the Hive user has the ownership of the table location in source cluster.
Handle replication conflicts between HDFS and Hive External Table location:
When you run the Hive replication policy on an external table, the data is stored on the target directory at a specific location. Next, when you run the HDFS replication policy which tries to copy data at the same external table location, DLM Engine ensures that the Hive data is not overridden by HDFS. For example: Running the Hive policy on an external table creates a target directory named: /tmp/db1/ext1. When HDFS policy is executed, the HDFS should not override data by replicating on /tmp/db1/ext1 directory.
How to avoid conflicts during External Tables replication process
When two Hive replication policies on DB1 and DB2 (either from same source cluster or different clusters) have external tables pointing to the same data location (example: /abc), and if they are replicated to the same target cluster, it must be noted that we need to set different paths for external table base directory configuration for both the policies (example: /db1 for DB1 and /db2 for DB2). This arrangement ensures that the target external table data location would be different for both DBs (/db1/abcd and /db2/abcd respectively).
Replication conflicts is NOT supported from On-Premise to Cloud scenario.
Non-support of replication of Hive-Managed tables written by Spark applications.
DLM Hive replication for Managed tables relies on replication events being published by Hive in Hive Metastore for every change that is made by Hive.
Applications other than Hive do not always publish events for new data file addition to Managed tables. The list of such applications includes Spark. This can result in data loss if these applications write to a Managed table in HDP 2.6.5. External tables should be used for data written by such applications. While replication for External table has some overheads, it will capture files that have been added without any event generation as well.
With Spark, the use of