Managed tables are Hive-owned tables where the entire lifecycle of the table data is managed and controlled by Hive. External tables are tables where Hive is only loosely coupled with the data. Replication Manager replicates external tables to a target cluster. Managed tables are converted to external tables after replication.
Hive supports replication of external tables, along with their data, to the target cluster and retains all the properties of the external tables. The permissions and ownership of the data files are preserved so that the relevant external processes can continue to write to them even after failover.
Writes to external tables are performed using Hive SQL commands, and the data files can also be accessed and managed by processes outside of Hive. If an external table or partition is dropped, only the metadata associated with the table or partition is deleted; the underlying data files stay intact. A typical use of an external table is to run analytical queries on HBase-owned or Druid-owned data using Hive, where the data files are written by HBase or Druid and Hive reads them for analytics.
When you create a schedule for a Hive replication policy, set the frequency so that changes are replicated often enough to avoid overly large copies.
You might come across the following use cases during Hive replication:
- Replication Manager upgrade use case
- If external tables were replicated as managed tables before the upgrade, you must drop those tables from the target cluster and set the base directory after the upgrade process. In the next replication run, they are replicated as external tables.
- Conflicts in external table data locations when multiple source clusters replicate to the same target cluster
- To handle these conflicts, Replication Manager assigns a unique base directory to each source cluster. The external table data from each source cluster is copied under its own base directory.
- For example, if the external table location in a source cluster is /ext/hbase_data, then the location in the target cluster after replication is <base_dir>/ext/hbase_data. You can use the DESCRIBE TABLE command to track the new location of external tables.
- Replication conflicts between HDFS and Hive external table location
- When you run a Hive replication policy on an external table, the data is stored at a specific location in the target directory. If you then run an HDFS replication policy that tries to copy data to the same external table location, Replication Manager ensures that the Hive data is not overwritten by the HDFS replication.
- For example, when you run a Hive replication policy on an external table, the policy creates a target directory /tmp/db1/ext1. When you run an HDFS replication policy, the policy should not overwrite that data by replicating to the /tmp/db1/ext1 directory.
- Conflicts during the external table replication process
- Conflicts appear when two Hive replication policies on DB1 and DB2 (either from the same source cluster or from different source clusters) have external tables that point to the same data location (for example, /abc) and are replicated to the same target cluster. To avoid such conflicts, you must set a different path for the external table base directory configuration in each policy.
- For example, set /db1 for DB1 and /db2 for DB2. This ensures that the target external table data location is different for the two databases: /db1/abc and /db2/abc.
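
The base directory behavior described in the use cases above can be illustrated with a short Python sketch. This is not Replication Manager code; the function names target_location and find_conflicts, and the shape of the policies mapping, are hypothetical and exist only to show how prefixing the source location with a per-policy base directory keeps target locations apart.

```python
import posixpath
from collections import defaultdict


def target_location(base_dir, source_location):
    """Illustrative only: nest the source table path under the base directory."""
    # Drop the leading slash so the source path is appended, not treated as absolute.
    return posixpath.join(base_dir, source_location.lstrip("/"))


def find_conflicts(policies):
    """Return target paths that more than one policy would write to.

    `policies` is a hypothetical mapping:
    policy name -> (base directory, list of source external table locations).
    """
    owners = defaultdict(list)
    for name, (base_dir, locations) in policies.items():
        for location in locations:
            owners[target_location(base_dir, location)].append(name)
    return {path: names for path, names in owners.items() if len(names) > 1}


# Same base directory for both policies: the target locations collide.
shared = {"DB1": ("/shared", ["/abc"]), "DB2": ("/shared", ["/abc"])}
print(find_conflicts(shared))    # {'/shared/abc': ['DB1', 'DB2']}

# Distinct base directories, as recommended above: no collision.
separate = {"DB1": ("/db1", ["/abc"]), "DB2": ("/db2", ["/abc"])}
print(find_conflicts(separate))  # {}
```

Under these assumptions, a source location of /ext/hbase_data with a base directory of /base maps to /base/ext/hbase_data, which matches the <base_dir>/ext/hbase_data pattern described earlier.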