Metadata Extraction and Indexing
Metadata Extraction
The Navigator Metadata Server extracts metadata for the following resource types.
Resource Type | Metadata Extracted |
---|---|
HDFS | HDFS metadata at the next scheduled extraction run after an HDFS checkpoint. If high availability is enabled, metadata is extracted as soon as it is written to the JournalNodes. |
Hive | Database, table, and query metadata from Hive lineage logs. See Managing Hive and Impala Lineage Properties. |
Impala | Database, table, and query metadata from the Impala Daemon lineage logs. See Managing Hive and Impala Lineage Properties. |
MapReduce | Job metadata from the JobTracker. The default setting in Cloudera Manager retains a maximum of five jobs; if you run more than five jobs between Navigator extractions, the Navigator Metadata Server extracts the five most recent jobs. |
Oozie | Oozie workflows from the Oozie Server. |
Pig | Pig script runs from the JobTracker or Job History Server. |
S3 | Bucket and object metadata. |
Spark | Spark job metadata from YARN logs. (Unsupported and disabled by default. To enable, see Enabling Spark Metadata Extraction.) |
Sqoop 1 | Database and table metadata from Hive lineage logs; job runs from the JobTracker or Job History Server. |
YARN | Job metadata from the ResourceManager. |
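The MapReduce retention limit in the table can be illustrated with a bounded buffer. This is a minimal sketch only: `MAX_RETAINED_JOBS` and `jobs_seen_by_extraction` are hypothetical names, and the code models the described behavior, not how the JobTracker or the Navigator Metadata Server is actually implemented.

```python
from collections import deque

# Assumption: mirrors the default of five retained jobs described in the table.
MAX_RETAINED_JOBS = 5


def jobs_seen_by_extraction(job_ids, max_retained=MAX_RETAINED_JOBS):
    """Return the jobs still retained when the next extraction runs.

    job_ids is the ordered list of jobs completed since the last extraction.
    A deque with maxlen silently discards the oldest entry once it is full,
    so only the most recent max_retained jobs survive.
    """
    retained = deque(maxlen=max_retained)
    for job_id in job_ids:
        retained.append(job_id)
    return list(retained)


if __name__ == "__main__":
    # Eight jobs run between two extractions; only the last five remain.
    completed = [f"job_{n}" for n in range(1, 9)]
    print(jobs_seen_by_extraction(completed))
```

Under this model, jobs 1 through 3 are never extracted, which is why a shorter extraction interval or a larger retention setting matters on busy clusters.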
If an entity is created at time t0 in the system, that entity is extracted and linked in Navigator after the extraction poll period (10 minutes by default) plus a service-specific interval, as follows:
- HDFS: t0 + (extraction poll period) + (HDFS checkpoint interval (1 hour by default))
- HDFS + HA: t0 + (extraction poll period)
- Hive: t0 + (extraction poll period) + (Hive maximum wait time (60 minutes by default))
- Impala: t0 + (extraction poll period)
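The formulas above can be sketched as a small calculation. This is an illustrative estimate using the default values stated in this section; the names `EXTRACTION_POLL_PERIOD`, `SERVICE_INTERVALS`, and `estimated_availability` are hypothetical and are not part of any Navigator API. Adjust the values if your deployment overrides the defaults.

```python
from datetime import datetime, timedelta

# Defaults taken from this section; real deployments may configure other values.
EXTRACTION_POLL_PERIOD = timedelta(minutes=10)

# Service-specific interval added on top of the poll period.
SERVICE_INTERVALS = {
    "hdfs": timedelta(hours=1),      # HDFS checkpoint interval
    "hdfs_ha": timedelta(0),         # HDFS with high availability
    "hive": timedelta(minutes=60),   # Hive maximum wait time
    "impala": timedelta(0),
}


def estimated_availability(t0, service,
                           poll_period=EXTRACTION_POLL_PERIOD,
                           intervals=SERVICE_INTERVALS):
    """Estimate when an entity created at t0 is extracted and linked."""
    return t0 + poll_period + intervals[service]


if __name__ == "__main__":
    created = datetime(2024, 1, 1, 12, 0)
    for name in SERVICE_INTERVALS:
        print(name, estimated_availability(created, name))
```

With the defaults, an entity created at 12:00 would be expected in Navigator by about 13:10 for HDFS without high availability and for Hive, and by about 12:10 for HDFS with high availability and for Impala.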
Metadata Indexing
After metadata is extracted, it is indexed and made available for searching by an embedded Solr engine. The Solr schema indexes two types of metadata: entity properties and relationships between entities.
You can search entity metadata using the Navigator UI and API. Relationship metadata is implicitly visible in lineage diagrams and explicitly available by downloading the lineage using the Cloudera Navigator Data Management API.