Metadata Extraction and Indexing

Metadata Extraction

The Navigator Metadata Server extracts metadata for the following resource types.
Resource Metadata Extraction

Resource Type   Metadata Extracted
HDFS            HDFS metadata at the next scheduled extraction run after an HDFS checkpoint. If high availability is enabled, metadata is extracted as soon as it is written to the JournalNodes.
Hive            Database, table, and query metadata from Hive lineage logs. See Managing Hive and Impala Lineage Properties.
Impala          Database, table, and query metadata from the Impala Daemon lineage logs. See Managing Hive and Impala Lineage Properties.
MapReduce       Job metadata from the JobTracker. The default setting in Cloudera Manager retains a maximum of five jobs; if you run more than five jobs between Navigator extractions, the Navigator Metadata Server extracts the five most recent jobs.
Oozie           Oozie workflows from the Oozie Server.
Pig             Pig script runs from the JobTracker or Job History Server.
S3              Bucket and object metadata.
Spark           Spark job metadata from YARN logs. (Unsupported and disabled by default. To enable, see Enabling Spark Metadata Extraction.)
Sqoop 1         Database and table metadata from Hive lineage logs; job runs from the JobTracker or Job History Server.
YARN            Job metadata from the ResourceManager.

If an entity is created at time t0 in the system, that entity is extracted and linked in Navigator after the extraction poll period (10 minutes by default) plus a service-specific interval, as follows:

  • HDFS: t0 + (extraction poll period) + (HDFS checkpoint interval (1 hour by default))
  • HDFS + HA: t0 + (extraction poll period)
  • Hive: t0 + (extraction poll period) + (Hive maximum wait time (60 minutes by default))
  • Impala: t0 + (extraction poll period)
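
The timing rules above can be sketched as a small calculation. This is an illustrative estimate only, not part of Navigator: the function name expected_availability and the service keys are hypothetical, and the intervals are the defaults quoted in the list above, which a deployment may override.

```python
from datetime import datetime, timedelta

# Defaults quoted in the list above; a real deployment may use other values.
EXTRACTION_POLL_PERIOD = timedelta(minutes=10)

# Service-specific interval added on top of the poll period.
# The keys are hypothetical labels chosen for this sketch.
SERVICE_INTERVALS = {
    "hdfs": timedelta(hours=1),     # HDFS checkpoint interval
    "hdfs_ha": timedelta(0),        # HDFS with high availability: no checkpoint wait
    "hive": timedelta(minutes=60),  # Hive maximum wait time
    "impala": timedelta(0),         # no additional interval
}


def expected_availability(created_at, service,
                          poll_period=EXTRACTION_POLL_PERIOD,
                          intervals=SERVICE_INTERVALS):
    """Return t0 + (extraction poll period) + (service-specific interval)."""
    key = service.lower()
    if key not in intervals:
        raise ValueError(f"no interval defined for service: {service!r}")
    return created_at + poll_period + intervals[key]


t0 = datetime(2024, 1, 1, 9, 0)
for name in ("hdfs", "hdfs_ha", "hive", "impala"):
    print(name, expected_availability(t0, name))
```

With the defaults, an HDFS entity created at 09:00 is expected in Navigator by 10:10, while the same entity on an HA cluster is expected by 09:10.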

Metadata Indexing

After metadata is extracted, it is indexed and made available for searching by an embedded Solr engine. The Solr schema indexes two types of metadata: entity properties and relationships between entities.

You can search entity metadata using the Navigator UI and API. Relationship metadata is implicitly visible in lineage diagrams and explicitly available by downloading the lineage using the Cloudera Navigator Data Management API.
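
As a rough illustration of searching entity metadata through the API, the sketch below only builds a search URL and sends no request. Everything specific in it is an assumption to check against the API documentation for your Navigator version: the host and port, the API version segment, the entities path, the query, limit, and offset parameters, and the example query string. The helper build_entity_search_url is hypothetical.

```python
from urllib.parse import urlencode


def build_entity_search_url(base_url, query, limit=10, offset=0, api_version="v9"):
    """Build a URL for an entity search (assumed endpoint layout).

    base_url    -- Navigator Metadata Server address (hypothetical here)
    query       -- search expression over entity properties (assumed syntax)
    api_version -- version segment of the path; an assumption that varies by release
    """
    params = urlencode({"query": query, "limit": limit, "offset": offset})
    return f"{base_url.rstrip('/')}/api/{api_version}/entities/?{params}"


# Hypothetical host and query, for illustration only.
url = build_entity_search_url("http://navigator.example.com:7187", "type:table")
print(url)
```

Keeping URL construction separate from the HTTP call makes it easy to review the query that will be sent before adding authentication and a real request.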