Metadata Extraction and Indexing

Metadata Extraction

The Navigator Metadata Server extracts metadata for the following resource types.

Resource Metadata Extraction
Resource Type	Metadata Extracted
HDFS	HDFS metadata at the next scheduled extraction run after an HDFS checkpoint. If high availability is enabled, metadata is extracted as soon as it is written to the JournalNodes.
Hive	Database, table, and query metadata from Hive lineage logs. See Managing Hive and Impala Lineage Properties.
Impala	Database, table, and query metadata from the Impala Daemon lineage logs. See Managing Hive and Impala Lineage Properties.
MapReduce	Job metadata from the JobTracker. The default setting in Cloudera Manager retains a maximum of five jobs; if you run more than five jobs between Navigator extractions, the Navigator Metadata Server extracts the five most recent jobs.
Oozie	Oozie workflows from the Oozie Server.
Pig	Pig script runs from the JobTracker or Job History Server.
S3	Bucket and object metadata.
Spark	Spark job metadata from YARN logs. (Unsupported and disabled by default. To enable, see Enabling Spark Metadata Extraction.)
Sqoop 1	Database and table metadata from Hive lineage logs; job runs from the JobTracker or Job History Server.
YARN	Job metadata from the ResourceManager.
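
The retention limit in the MapReduce row means that jobs can go unrecorded if more than five run between two extractions. The following is a minimal sketch of that behavior, not Navigator code: the function name and job IDs are hypothetical, and a bounded queue stands in for the JobTracker's retained-job list.

```python
from collections import deque

# Default retention quoted in the MapReduce row above (configurable in Cloudera Manager).
MAX_RETAINED_JOBS = 5


def jobs_seen_by_extraction(job_ids, max_retained=MAX_RETAINED_JOBS):
    """Return the jobs still retained when the next extraction runs.

    job_ids is the ordered list of jobs (oldest first) that ran since the
    previous extraction. Only the most recent max_retained jobs survive.
    """
    retained = deque(maxlen=max_retained)  # oldest entries fall off automatically
    for job_id in job_ids:
        retained.append(job_id)
    return list(retained)


# Hypothetical example: eight jobs run between two extractions.
jobs = [f"job_{n:04d}" for n in range(1, 9)]
print(jobs_seen_by_extraction(jobs))
```

With eight jobs in the window, only the last five are available to extract. The first three never appear in Navigator unless the retention setting is raised or extractions run more often.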

If an entity is created at time t0 in the system, that entity is extracted and linked in Navigator after the extraction poll period (10 minutes by default) plus a service-specific interval, as follows:

HDFS: t0 + (extraction poll period) + (HDFS checkpoint interval (1 hour by default))
HDFS + HA: t0 + (extraction poll period)
Hive: t0 + (extraction poll period) + (Hive maximum wait time (60 minutes by default))
Impala: t0 + (extraction poll period)
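
The formulas above can be worked through as simple arithmetic. The sketch below uses only the default values quoted in this section. The function and dictionary names are hypothetical, and an actual deployment may use different intervals.

```python
from datetime import datetime, timedelta

# Default extraction poll period quoted above (configurable).
EXTRACTION_POLL_PERIOD = timedelta(minutes=10)

# Service-specific interval added on top of the poll period, using the
# defaults quoted above. Keys are illustrative labels, not Navigator names.
SERVICE_INTERVAL = {
    "hdfs": timedelta(hours=1),     # HDFS checkpoint interval
    "hdfs_ha": timedelta(0),        # HDFS with high availability: no extra wait
    "hive": timedelta(minutes=60),  # Hive maximum wait time
    "impala": timedelta(0),         # no extra wait
}


def estimated_extraction_time(created_at, service,
                              poll_period=EXTRACTION_POLL_PERIOD):
    """Apply t0 + (extraction poll period) + (service-specific interval)."""
    return created_at + poll_period + SERVICE_INTERVAL[service]


# Hypothetical entity created at 09:00.
t0 = datetime(2024, 1, 1, 9, 0)
for service in SERVICE_INTERVAL:
    print(service, estimated_extraction_time(t0, service).strftime("%H:%M"))
```

With the defaults, an entity created at 09:00 is estimated at 10:10 for HDFS and Hive, and at 09:10 for HDFS with HA and for Impala.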

Metadata Indexing

After metadata is extracted, it is indexed and made available for searching by an embedded Solr engine. The Solr schema indexes two types of metadata: entity properties and relationships between entities.

You can search entity metadata using the Navigator UI and API. Relationship metadata is implicitly visible in lineage diagrams and explicitly available by downloading the lineage using the Cloudera Navigator Data Management API.
