Cloudera Navigator Metadata Architecture

Cloudera Navigator metadata features provide data discovery and data lineage functions. The Cloudera Navigator metadata architecture is illustrated below.

The Navigator Metadata Server performs the following functions:
  • Obtains connection information about CDH services from the Cloudera Manager Server
  • Extracts metadata for the entities managed by those services at periodic intervals
  • Manages and applies metadata extraction policies during metadata extraction
  • Indexes and stores entity metadata
  • Manages authorization data for Navigator users
  • Manages audit report metadata
  • Implements the Navigator UI and API

The Navigator Metadata database stores entity metadata, policies, user authorization data, and audit report metadata.

The Cloudera Navigator Metadata Server manages metadata about the entities in a CDH cluster and relations between the entities. The metadata schema defines the types of metadata that are available for each entity type it supports.

The types of metadata defined by the Navigator Metadata component include the name of an entity; the service that manages or uses the entity; its type; the path to the entity; the date and time of creation, access, and modification; size; owner; purpose; and relations between entities (parent-child, data flow, and instance of). For example, the following shows the property sheet of a file entity:

There are two classes of metadata:
  • technical metadata - metadata defined when entities are extracted. You cannot modify technical metadata.
  • custom metadata - metadata added to extracted entities. You can add and modify custom metadata before or after entities are extracted.
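The distinction between the two classes can be illustrated with a small model. This is a minimal sketch, not the Navigator schema or API: the `Entity` class, its field names, and the `set_custom` method are hypothetical and exist only to show technical metadata as read-only and custom metadata as editable once an entity has been extracted.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping


@dataclass
class Entity:
    """Hypothetical in-memory model of an extracted entity (not the real schema)."""

    technical: Mapping[str, object]               # set at extraction time
    custom: dict = field(default_factory=dict)    # added by users

    def __post_init__(self):
        # Wrap a copy in a read-only view so technical metadata cannot be modified.
        self.technical = MappingProxyType(dict(self.technical))

    def set_custom(self, key, value):
        # Custom metadata can be added or changed at any time.
        self.custom[key] = value


entity = Entity(technical={"name": "sales.csv", "owner": "alice", "size": 1024})
entity.set_custom("purpose", "quarterly report input")

try:
    entity.technical["owner"] = "bob"
except TypeError:
    print("technical metadata is read-only")
```

The read-only view is one way to express the rule; a real client would rely on the server to reject changes to technical metadata.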

Metadata Extraction

The Navigator Metadata Server extracts metadata for the following resource types from the listed servers:
  • HDFS - Extracts HDFS metadata at the next scheduled extraction run after an HDFS checkpoint. However, if you have high availability enabled, metadata is extracted as soon as it is written to the JournalNodes.
  • Hive - Extracts database and table metadata from the Hive Metastore Server. See Enabling Hive Metadata Extraction in a Secure Cluster.
  • Impala - Extracts database and table metadata from the Hive Metastore Server. Extracts query metadata from the Impala Daemon lineage logs.
  • MapReduce - Extracts job metadata from the JobTracker. The default setting in Cloudera Manager retains a maximum of five jobs, which means that if you run more than five jobs between Navigator extractions, the Navigator Metadata Server extracts only the five most recent jobs.
  • Oozie - Extracts Oozie workflows from the Oozie Server.
  • Pig - Extracts Pig script runs from the JobTracker or Job History Server.
  • Spark - Extracts Spark job metadata from the YARN logs. Unsupported and disabled by default. To enable, see Enabling Spark Metadata Extraction.
  • Sqoop 1 - Extracts database and table metadata from the Hive Metastore Server. Extracts job runs from the JobTracker or Job History Server.
  • YARN - Extracts job metadata from the ResourceManager.

If an entity is created in the system at time t0, it is extracted and linked in Navigator after the extraction poll period (default 10 minutes) plus a service-specific interval, as follows:
  • HDFS: t0 + extraction poll period + HDFS checkpoint interval (default 1 hour)
  • HDFS + HA: t0 + extraction poll period
  • Hive: t0 + extraction poll period + Hive maximum wait time (default 60 minutes)
  • Impala: t0 + extraction poll period
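The formulas above can be worked through with concrete times. The following is a minimal sketch that uses only the default values stated in this section; the function name `expected_visible_at` and the `SERVICE_DELAY` table are illustrative names, not part of Navigator, and the result is an estimate based on those defaults rather than a guarantee.

```python
from datetime import datetime, timedelta

# Defaults taken from the text above.
POLL_PERIOD = timedelta(minutes=10)          # extraction poll period
SERVICE_DELAY = {
    "HDFS": timedelta(hours=1),              # HDFS checkpoint interval
    "HDFS + HA": timedelta(0),               # no additional interval
    "Hive": timedelta(minutes=60),           # Hive maximum wait time
    "Impala": timedelta(0),                  # no additional interval
}


def expected_visible_at(t0, service, poll_period=POLL_PERIOD, service_delay=None):
    """Return t0 + extraction poll period + service-specific interval."""
    delay = SERVICE_DELAY[service] if service_delay is None else service_delay
    return t0 + poll_period + delay


t0 = datetime(2024, 1, 1, 9, 0)  # arbitrary example creation time
for service in SERVICE_DELAY:
    print(f"{service}: {expected_visible_at(t0, service):%H:%M}")
# HDFS: 10:10
# HDFS + HA: 09:10
# Hive: 10:10
# Impala: 09:10
```

If the poll period or a service interval has been changed from its default, pass the configured value through `poll_period` or `service_delay`.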

Metadata Indexing

After metadata is extracted, it is indexed and made available for searching by an embedded Solr engine. The Solr schema indexes two types of metadata: entity properties and relationships between entities.

You can search entity metadata using the Navigator UI. Relationship metadata is implicitly visible in lineage diagrams and explicitly available in a lineage file.
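To make the two kinds of indexed metadata concrete, the sketch below keeps one structure for entity properties (used for search) and one for relationships (used for lineage). It is a plain in-memory illustration and does not use Solr or the Navigator API; the `MetadataIndex` class, its methods, and the sample entity IDs are all hypothetical.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy index: property terms for search, directed relations for lineage."""

    def __init__(self):
        self._by_term = defaultdict(set)     # lowercased term -> entity ids
        self._downstream = defaultdict(set)  # source id -> target ids

    def add_entity(self, entity_id, properties):
        # Index every whitespace-separated term of every property value.
        for value in properties.values():
            for term in str(value).lower().split():
                self._by_term[term].add(entity_id)

    def add_relation(self, source_id, target_id):
        # Record a data-flow style relation from source to target.
        self._downstream[source_id].add(target_id)

    def search(self, term):
        return sorted(self._by_term.get(term.lower(), set()))

    def lineage(self, entity_id):
        """Return every entity reachable downstream of entity_id."""
        seen, stack = set(), [entity_id]
        while stack:
            for nxt in self._downstream.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return sorted(seen)


index = MetadataIndex()
index.add_entity("file-1", {"name": "sales.csv", "owner": "alice"})
index.add_entity("table-1", {"name": "sales table", "owner": "bob"})
index.add_relation("file-1", "table-1")
index.add_relation("table-1", "query-1")

print(index.search("alice"))    # ['file-1']
print(index.lineage("file-1"))  # ['query-1', 'table-1']
```

Search reads only the property index, while lineage reads only the relations, which mirrors how entity metadata is searched in the UI and relationship metadata appears in lineage diagrams.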