Non-support of replication of Hive-Managed tables written by Spark applications.
DLM Hive replication for Managed tables relies on Hive publishing a replication event to the Hive Metastore for every change that Hive makes.
For External tables, DLM replication does not rely on published events. Instead, it checks every table and partition directory for newly added files.
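The difference between the two detection strategies can be illustrated with a toy model. This is only a sketch under stated assumptions, not DLM's implementation: `TableDir`, `event_based_sync`, and `directory_scan_sync` are hypothetical names, and a Python set stands in for both the table directory and the replicated copy.

```python
from dataclasses import dataclass, field


@dataclass
class TableDir:
    """Toy stand-in for a table directory plus its metastore event log."""
    files: set = field(default_factory=set)
    events: list = field(default_factory=list)

    def write(self, name, publish_event):
        # Every writer adds a file; only some writers also publish an event.
        self.files.add(name)
        if publish_event:
            self.events.append(name)


def event_based_sync(source, replicated):
    """Managed-table style: copy only the files announced by an event."""
    return replicated | set(source.events)


def directory_scan_sync(source, replicated):
    """External-table style: copy every file found in the directory."""
    return replicated | source.files


table = TableDir()
table.write("part-0000.orc", publish_event=True)   # writer that publishes an event
table.write("part-0001.orc", publish_event=False)  # writer that publishes no event

# Files that the directory scan picks up but the event-based pass never sees.
missed = directory_scan_sync(table, set()) - event_based_sync(table, set())
print(sorted(missed))
```

In this model, the second file never reaches the replica under event-based replication, because nothing announced it. The directory scan finds it at the cost of listing the whole directory on every pass.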
Important | |
---|---|
Applications other than Hive, including Spark, do not always publish events when
they add new data files to Managed tables. In HDP 2.6.5, this can result in data
loss if such applications write to a Managed table. Use External tables for data
written by these applications. Although External table replication carries some
overhead, it also captures files that were added without any event being
generated. |
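As a minimal sketch of the recommendation above, the following helper builds a HiveQL `CREATE EXTERNAL TABLE` statement for a table that a non-Hive application will write to. The function name `external_table_ddl`, the database and table names, the columns, and the location are all hypothetical and chosen only for illustration.

```python
def external_table_ddl(database, table, columns, location, file_format="ORC"):
    """Build a HiveQL CREATE EXTERNAL TABLE statement.

    `columns` is a list of (name, hive_type) pairs. All values passed in
    the example below are hypothetical.
    """
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} "
        f"({column_list}) "
        f"STORED AS {file_format} "
        f"LOCATION '{location}'"
    )


ddl = external_table_ddl(
    "sales",
    "clicks",
    [("user_id", "BIGINT"), ("url", "STRING")],
    "/data/external/sales/clicks",
)
print(ddl)
```

The resulting statement can be run through Hive, or passed to `spark.sql(ddl)` on a `SparkSession` created with Hive support enabled. Whether this layout suits a given cluster depends on its storage and permission setup.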
In HDP 3.0 and later, Managed table permissions are granted only to the Hive system
user, which helps ensure that other applications do not write to a Managed table.
Important | |
---|---|
The use of hive.metastore.dml.events with Spark is not supported in HDP.
Treat Spark as an application that does not reliably publish events for
its changes. |