Metadata

Metadata refers to the schema and data required for correctly running Hadoop SQL workloads on top of Hive, Impala, or SparkSQL. Additionally, it also refers to business metadata created within Atlas to assign additional context to datasets. Hive, Impala, and SparkSQL share a metadata catalog for databases, tables, and schemas. When replicating or backing up data managed by these services, you need to consider both the data and catalog metadata together.

Hive Metastore🔗

The Hive Metastore (HMS) manages the metadata related to databases and table schema. Additionally, it tracks where the underlying table data resides on HDFS or Ozone.

HMS should be deployed in a High Available (HA) mode by configuring an additional HMS instance. This provides failover capabilities within a cluster to a secondary Hive Metastore if your primary instance goes down.

For more information about the Hive Metastore, see Configuring the HMS for High Availability.

Metadata within the HMS is generally co-replicated with the underlying Hive and Impala data when using Cloudera Replication Manager. This activity is configurable using the Cloudera Replication Manager replication policy associated with the Hive data set.

Atlas🔗

Atlas should be deployed in High Availability mode by configuring additional Atlas service instances. This provides failover capabilities within a cluster if the primary instance goes down.

Atlas depends upon HBase, Kafka, and the Infrastructure Solr. When enabling Atlas HA, these services must also be enabled for high availability support.

For more information about Atlas High Availability, see About Atlas High Availability.

Business metadata generated and stored within Atlas is used for data tagging and lineage. This information is not currently synchronized between disaster recovery environments. Should this be required, the Atlas metadata can be exported from one environment and imported to the other using Atlas REST API calls.