Cloudera Navigator Lineage Diagrams

Cloudera Navigator automatically collects upstream and downstream data lineage and visualizes it so that you can verify the reliability of your data. For each data source, it shows, down to the column level, the precise upstream data sources, the transforms performed to produce it, and the impact that data has on downstream artifacts.

A lineage diagram is a directed graph that depicts an extracted entity and its relations with other entities. A lineage diagram is limited to 3000 entities.
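
As an illustration of the directed-graph model, the following minimal Python sketch stores entities as nodes and relations as directed edges, then walks downstream from a chosen entity. The class name `LineageGraph`, its methods, and the way the 3000-entity limit is enforced are assumptions made for this example; they are not part of Navigator.

```python
from collections import defaultdict, deque

MAX_ENTITIES = 3000  # diagram size limit stated above


class LineageGraph:
    """Hypothetical in-memory model of a lineage diagram (not a Navigator API)."""

    def __init__(self):
        self.entities = set()
        self.downstream = defaultdict(set)  # source entity -> set of target entities

    def add_entity(self, name):
        # Refuse new entities once the diagram limit is reached.
        if name not in self.entities and len(self.entities) >= MAX_ENTITIES:
            raise ValueError("a lineage diagram is limited to 3000 entities")
        self.entities.add(name)

    def add_relation(self, source, target):
        # A relation is a directed edge from source to target.
        self.add_entity(source)
        self.add_entity(target)
        self.downstream[source].add(target)

    def reachable_from(self, start):
        """Return every entity downstream of start (breadth-first walk)."""
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self.downstream[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


graph = LineageGraph()
graph.add_relation("data", "script.pig")      # input folder read by the Pig script
graph.add_relation("script.pig", "upperout")  # Pig script writes the output folder
print(sorted(graph.reachable_from("data")))   # ['script.pig', 'upperout']
```

Walking the edges in the opposite direction would give the upstream view of the same entity.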

There are two types of lineage diagrams:
  • Template - represents a diagram that is a model for other diagrams
  • Instance - represents an instance or execution of a template

Entities

In a lineage diagram, entity types are represented by icons:
HDFS
  • File
  • Directory
Pig
  • Table
  • Pig script
  • Pig script execution
Hive and Impala
  • Table
  • Query template
  • Query execution
Spark (Unsupported and disabled by default. To enable, see Enabling Spark Metadata Extraction.)
  • Job template
  • Job execution
MapReduce and YARN
  • Job template
  • Job execution
Sqoop
  • Job template
  • Job execution
Oozie
  • Job template
  • Job execution
   
In the following circumstances, a placeholder icon appears in place of the entity type icon:
  • The entity has not yet been extracted. The placeholder is replaced with the correct entity icon after the entity is extracted and linked in Navigator. For information on how long it takes for newly created entities to be extracted, see Metadata Extraction.
  • A Hive entity has been deleted from the system before it could be extracted.

Parent entities are represented by a blue box enclosing other entities. The following lineage diagram illustrates the relations among the YARN job script.pig and the Pig script script.pig (invoked by the parent Oozie workflow pig-app-hue-script), the source file in the data folder, and the destination folder upperout:


Relations

Relations between entities are represented graphically by gray lines, with arrows indicating the direction of the data flow. The relation types are:

  • Data flow - Describes a relation between data and a processing activity. For example, between a file and a MapReduce job or vice versa.
  • Alias - Describes an alias relation. For example, from a table to a synonym.
  • Parent-child - Describes a parent-child relation. For example, between a directory and a file.
  • Logical-physical - Describes the relation between a logical entity and its physical entity. For example, between a Hive query and a MapReduce job.
  • Conjoint - Describes a non-directional relation. For example, between a table and an index.
  • Instance of - Describes the relation between a template and its instance. For example, an operation execution is an instance of an operation.
  • Control flow - Describes a relation where the source entity controls the data flow of the target entity. For example, between the columns used in an insert clause and the where clause of a Hive query.

Lineage diagrams display the following line types:
  • A solid line represents a "data flow" relationship, indicating that the columns will appear (possibly transformed) in the output. For example, a solid line will appear between the columns used in a select clause.
  • A dashed line represents a "control flow" relationship, indicating that the columns determine which rows will flow to the output. For example, a dashed line will appear between the columns used in an insert or select clause and the where clause of a Hive query.

The following query:

    SELECT sample_07.description, sample_07.salary FROM sample_07
    WHERE (sample_07.salary > 100000)
    ORDER BY sample_07.salary DESC LIMIT 1000

has solid lines between the columns in the select clause and a dashed line between the columns in the where clause.
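
To make the distinction between the two line types concrete, the following minimal Python sketch lists the columns of the sample_07 query by clause and maps each one to the line style described above. The names `RelationType`, `LINE_STYLE`, and `column_relations` are hypothetical, and the column lists are written out by hand rather than parsed from the query; Navigator's own extraction logic is not shown here.

```python
from enum import Enum


class RelationType(Enum):
    DATA_FLOW = "data flow"
    CONTROL_FLOW = "control flow"


# Line style drawn in the diagram for each relation type.
LINE_STYLE = {
    RelationType.DATA_FLOW: "solid",
    RelationType.CONTROL_FLOW: "dashed",
}

# Columns of the sample_07 query, grouped by the clause they appear in
# (entered by hand for this example, not parsed from the SQL text).
select_columns = ["sample_07.description", "sample_07.salary"]
where_columns = ["sample_07.salary"]


def column_relations(select_cols, where_cols):
    """Return (column, relation type) pairs for a simple single-table query."""
    relations = [(col, RelationType.DATA_FLOW) for col in select_cols]
    relations += [(col, RelationType.CONTROL_FLOW) for col in where_cols]
    return relations


for column, relation in column_relations(select_columns, where_columns):
    print(f"{column}: {LINE_STYLE[relation]} line ({relation.value})")
```

The salary column appears twice in the result because it both flows into the output (select clause) and filters the rows (where clause).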

Manipulating Lineage Diagrams

You can click the icon in a parent entity to display its child entities. For example, you can click the Pig script to display its child tables:


  • To improve the layout of a lineage diagram, you can drag and drop entities located outside a parent box (in this case, data and upperout).
  • You can use the mouse scroll wheel or the zoom control to zoom the lineage diagram in and out.
  • You can move the entire lineage diagram in the lineage pane by pressing the mouse button and dragging it.

Displaying a Template Lineage Diagram

A template lineage diagram contains template entities, such as jobs and queries, that can be instantiated, and the input and output entities to which they are related.

To display a template lineage diagram:
  1. Perform a metadata search.
  2. In the list of results, click an Operation or Query result entry.
  3. Click the Lineage tab. For example, when you click the sample_09 result entry:

    the Search screen is replaced with a Details page that displays the entity property sheet:

    After you click the Lineage tab, the lineage diagram displays:

The selected entity sample_09 appears with a white box as a background.
This example lineage diagram illustrates the relations between a Hive query execution entity and its source and destination tables:

When you click the icon, columns and lines connecting the source and destination columns display:

Displaying an Instance Lineage Diagram

An instance lineage diagram displays instance entities, such as job and query executions, and the input and output entities to which they are related.

To display an instance lineage diagram:

  1. Perform a search and click a link of type Operation.
  2. Click a link in the Instances box.
  3. Click the Lineage tab:

Displaying the Template Lineage Diagram for an Instance Lineage Diagram

You can navigate from an instance diagram to its template.

  1. Display an instance lineage diagram.
  2. Click the Details tab.
  3. Click the value of the Template property to go to the instance's template.
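
The relation between a template and its instances, which the steps above follow in the user interface, can be sketched in a few lines of Python. The `Entity` class, its attributes, and the entity names are hypothetical and only illustrate the "instance of" relation; they do not reflect how Navigator stores entities.

```python
class Entity:
    """Hypothetical lineage entity; `template` is None for template entities."""

    def __init__(self, name, template=None):
        self.name = name
        self.template = template  # target of the "instance of" relation, if any
        self.instances = []       # executions created from this template

    def instantiate(self, name):
        """Record one execution of this template and return it."""
        instance = Entity(name, template=self)
        self.instances.append(instance)
        return instance


job_template = Entity("example job")  # hypothetical template entity
first_run = job_template.instantiate("example job, run 1")
second_run = job_template.instantiate("example job, run 2")

# Going from an instance to its template mirrors clicking the Template property.
print(first_run.template.name)
# Going from a template to its executions mirrors the Instances box.
print([run.name for run in job_template.instances])
```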