Cloudera Navigator Lineage Diagrams

Minimum Required Role: Lineage Administrator (also provided by Metadata Administrator, Full Administrator)

Cloudera Navigator automatically collects upstream and downstream data lineage and visualizes it so that you can verify the reliability of your data. For each data source, it shows, down to the column level within that data source, what the precise upstream data sources were, the transforms performed to produce it, and the impact that data has on downstream artifacts.

A lineage diagram is a directed graph that depicts an extracted entity and its relations with other entities. A lineage diagram is limited to 400 entities. Once that limit is reached, certain entities display as a "hidden" icon.
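To make the graph model concrete, the following is a minimal Python sketch and not Navigator's implementation. It assumes that lineage is available as a list of (source, target) pairs and that entities are collected breadth-first from the selected entity until the 400-entity limit is reached. The text does not say how Navigator chooses which entities to hide, so the traversal order and the names `collect_entities` and `MAX_ENTITIES` are illustrative assumptions.

```python
from collections import deque

MAX_ENTITIES = 400  # diagram limit stated in the text


def collect_entities(edges, start, limit=MAX_ENTITIES):
    """Walk the lineage graph breadth-first in both directions from `start`.

    Returns (visible, hidden): `visible` lists the entities that fit within
    the limit, `hidden` holds neighbors that were cut off once the limit was
    reached. The breadth-first order is an assumption of this sketch.
    """
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    visible = [start]
    seen = {start}
    hidden = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbors.get(node, ())):
            if nxt in seen:
                continue
            if len(visible) >= limit:
                hidden.add(nxt)  # would be drawn with the "hidden" icon
                continue
            seen.add(nxt)
            visible.append(nxt)
            queue.append(nxt)
    return visible, hidden
```

For the chain a → b → c → d with a limit of 2, the sketch keeps a and b visible and marks c as hidden; d is never reached because traversal stops at the hidden entity.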

Entities

In a lineage diagram, entity types are represented by icons:
HDFS
  • File
  • Directory
Pig
  • Table
  • Pig field
  • Pig operation, operation execution
Hive and Impala
  • Table
  • Field
  • Operation, suboperation, execution
  • Impala operation, suboperation, execution
Spark (supported in CDH 5.11 and higher)
  • Operation, operation execution. (Spark RDDs and aggregation operations are not included in the diagrams.)
  Spark lineage information is produced only for data that is read or written and processed using the DataFrame and Spark SQL APIs. Lineage is not available for data that is read, written, or processed using Spark's RDD APIs. To turn metadata extraction off or on, see Enabling and Disabling Metadata Extraction.
MapReduce and YARN
  • MapReduce operation and operation execution
  • YARN operation and operation execution
Sqoop
  • Operation, suboperation, execution
Oozie
  • Operation, operation execution
S3
  • Directory
  • File
  • S3 Bucket
Hidden
  • See Viewing the Lineage of Hidden Entities.

In the following circumstances, a placeholder icon appears instead of the entity type icon:
  • The entity has not yet been extracted. In this case, the placeholder is eventually replaced with the correct entity icon after the entity is extracted and linked in Navigator. For information on how long it takes for newly created entities to be extracted, see Metadata Extraction.
  • A Hive entity has been deleted from the system before it could be extracted.

The following lineage diagram illustrates the relations between the YARN operation DefaultJobName, the Pig script DefaultJobName, the source file in the ord_us_gcb_crd_crs-fdr-sears folder, and the destination folder tmp137071676:


Relations

Relations between the entities are represented graphically by lines, with arrows indicating the direction of the data flow. Navigator supports the following types of relations:

  • Data flow - Describes a relation between data and a processing activity; for example, between a file and a MapReduce job or vice versa.
  • Parent-child - Describes a parent-child relation; for example, between a directory and a file.
  • Logical-physical - Describes the relation between a logical entity and its physical entity; for example, between a Hive query and a MapReduce job.
  • Instance of - Describes the relation between a template and its instance. For example, an operation execution is an instance of an operation. Instance of relations are never visualized in the lineage; however, you can navigate between template and instance lineage diagrams. See Displaying an Instance Lineage Diagram and Displaying the Template Lineage Diagram for an Instance Lineage Diagram.
  • Control flow - Describes a relation where the source entity controls the data flow of the target entity; for example, between the columns used in an insert clause and the where clause of a Hive query.

Lineage diagrams contain the following line types:
  • Solid - With an arrow, represents a "data flow" relationship, indicating that the columns appear (possibly transformed) in the output. Without an arrow, represents a "logical-physical" relationship. For example, a solid line appears between the columns used in a select clause.
  • Dashed - Represents a "control flow" relationship, indicating that the columns determine which rows flow to the output. For example, a dashed line appears between the columns used in an insert or select clause and the where clause of a Hive query. Control flow lines are hidden by default. See Filtering Lineage Diagrams.
  • Blue - Represents a selected link.
  • Green - Represents a summary link that contains operations. When you click the link, the link turns blue (for selected) and the nested operations display in the selected link summary.

The following query:
SELECT sample_07.description,sample_07.salary FROM sample_07
WHERE ( sample_07.salary > 100000)
ORDER BY sample_07.salary DESC LIMIT 1000
has solid, directed lines between the columns in the select clause and a dashed line between the columns in the where clause:
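As a rough illustration of how the relation types and line types described above fit together, the following Python sketch maps each relation to the way it is drawn and applies that mapping to the columns of the sample query. It is a conceptual model only, not Navigator code. The names `Relation`, `line_style`, and `column_lines` are hypothetical, the ORDER BY clause is left out because the text does not say how it is drawn, and treating parent-child relations as "no line" is an assumption, since the text gives no line style for them.

```python
from enum import Enum


class Relation(Enum):
    DATA_FLOW = "data flow"
    PARENT_CHILD = "parent-child"
    LOGICAL_PHYSICAL = "logical-physical"
    INSTANCE_OF = "instance of"
    CONTROL_FLOW = "control flow"


def line_style(relation, show_control_flow=False):
    """Return a description of the line drawn for a relation, or None."""
    if relation is Relation.DATA_FLOW:
        return "solid with arrow"
    if relation is Relation.LOGICAL_PHYSICAL:
        return "solid without arrow"
    if relation is Relation.CONTROL_FLOW:
        # Control flow lines are hidden by default.
        return "dashed" if show_control_flow else None
    # "Instance of" relations are never visualized. Parent-child has no
    # documented line style, so this sketch assumes no line (assumption).
    return None


# Clause-to-relation mapping taken from the line type descriptions.
CLAUSE_RELATION = {
    "select": Relation.DATA_FLOW,
    "where": Relation.CONTROL_FLOW,
}

# Columns of the sample_07 query, grouped by clause.
QUERY_COLUMNS = {
    "select": ["description", "salary"],
    "where": ["salary"],
}


def column_lines(columns_by_clause, show_control_flow=False):
    """List (column, line style) pairs for the lines that would be drawn."""
    lines = []
    for clause, columns in columns_by_clause.items():
        style = line_style(CLAUSE_RELATION[clause], show_control_flow)
        if style is not None:
            lines.extend((column, style) for column in columns)
    return lines
```

Calling `column_lines(QUERY_COLUMNS)` yields only the two solid lines for the select columns; passing `show_control_flow=True` adds the dashed line for the salary column in the where clause.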

Manipulating Lineage Diagrams

Expanding Entities

You can click the expand icon in a parent entity to display its child entities. For example, you can click an Oozie job to display its child Pig script and the Pig script to display its child tables:


Modifying Lineage Layout

  • To improve the layout of a lineage diagram, you can drag entities (like tmp137071676) located outside a parent box.
  • Use the mouse scroll wheel or the zoom control to zoom the lineage diagram in and out.
  • You can move an entire lineage diagram in the lineage pane by pressing the mouse button and dragging it.

Viewing the Lineage of Hidden Entities

Lineage that is not fully traversed (that is, you see only a subset of the actual lineage) is indicated by the hidden icon. This icon displays when the lineage diagram has more than 400 entities. For example:

To view the lineage of hidden entities, select the hidden entity and click the view the lineage link in the box on the right to display a new lineage diagram centered on that entity. After clicking the link, you would see the following:

Filtering Lineage Diagrams

To reduce the time and resources required to render large lineage diagrams, you can filter out classes of entities and links by selecting checkboxes in the Lineage Options box on the right of the diagram. The following are the default selections:

The Only Upstream/Downstream filter allows you to filter out entities and links that are input (upstream) to and output (downstream) from another entity.
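One way to picture this filter is as reachability in a directed graph. The sketch below is an illustration under assumptions and not Navigator code: it treats lineage as (source, target) pairs, and the function name `reachable` and the entity names in the usage note are hypothetical.

```python
def reachable(edges, start, direction):
    """Return the entities upstream or downstream of `start`.

    `direction` is "downstream" to follow the arrows (outputs) or
    "upstream" to follow them backwards (inputs).
    """
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")

    step = {}
    for src, dst in edges:
        a, b = (src, dst) if direction == "downstream" else (dst, src)
        step.setdefault(a, set()).add(b)

    found = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in step.get(node, ()):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found
```

For example, with the edges `[("input_a", "op"), ("input_b", "op"), ("op", "output")]`, the upstream set of `op` contains the two inputs and the downstream set contains only `output`, which corresponds to hiding the output table or the input tables respectively.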

Use the Latest Partition and Operation filter to reduce rendering time when you have similar partitions created and operations performed periodically. For example, if Hive partitions are created daily, the filter allows you to display only the latest partition.
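The effect of showing only the latest partition can be sketched as keeping one entry per table. This is a hedged illustration, not Navigator's logic: the text does not state how "latest" is determined, so the sketch assumes each partition carries a sortable key such as an ISO date string. The function name `latest_partitions` and the table name in the usage note are hypothetical.

```python
def latest_partitions(partitions):
    """Keep only the partition with the greatest key for each table.

    `partitions` is an iterable of (table, partition_key) pairs. Keys are
    assumed to sort chronologically, for example "2017-01-03" (assumption).
    """
    latest = {}
    for table, key in partitions:
        if table not in latest or key > latest[table]:
            latest[table] = key
    return latest
```

Given daily partitions `("sales", "2017-01-01")`, `("sales", "2017-01-03")`, and `("sales", "2017-01-02")`, the sketch keeps only `"2017-01-03"` for `sales`, so the diagram would need to render one partition instead of three.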

Filter Example

If you display the lineage of the sample_09 table with no filtering options selected (other than hiding deleted items), the lineage appears as follows.

Subsequent diagrams show the result of using each supported filter type:
  • Control Flow Relations - The operation is collapsed and control flow links are hidden.

  • Show Upstream and Show Downstream - The operation is collapsed and only upstream entities and links are shown. The output table is hidden.

    Here, the operation is collapsed and only downstream entities and links are shown. The input tables are hidden.

  • Operations - In the diagram, the operation is hidden.

    The green links indicate that one or more operations are collapsed into the links.
  • Deleted Entities - Here, the operation is hidden but deleted entities are displayed.

Searching a Diagram

You can search a lineage diagram for an entity by doing the following:
  1. In the Search box at the right of the diagram, type an entity name. A list of matching entities displays below the box.
  2. Click an entity in the list. A blue box is drawn around the entity and the entity details display in a box below the Search box.

  3. Click the Show link next to the entity. The selected entity moves to the center of the diagram.

  4. Optionally, click the View Lineage link in the entity details box to view the lineage of the selected entity.

Displaying a Template Lineage Diagram

A template lineage diagram contains template entities, such as jobs and queries, that can be instantiated, and the input and output entities to which they are related.

To display a template lineage diagram:
  1. Perform a metadata search.
  2. In the list of results, click an entity. The entity Details page displays. For example, when you click the sample_09 result entry:

    the Search screen is replaced with a Details page that displays the entity property sheet:

  3. Click the Lineage tab. For example, clicking the Lineage tab for the sample_09 table displays the following lineage diagram:

This example shows the relations between a Hive query execution entity and its source and destination tables:

When you click the expand icon, columns and lines connecting the source and destination columns display:

Displaying an Instance Lineage Diagram

An instance lineage diagram displays instance entities, such as job and query executions, and the input and output entities to which they are related. To display an instance lineage diagram:

  1. Perform a search and click a link of type Operation.
  2. Click a link in the Instances box.
  3. Click the Lineage tab.

Displaying the Template Lineage Diagram for an Instance Lineage Diagram

To browse from an instance diagram to its template:

  1. Display an instance lineage diagram.
  2. Click the Details tab.
  3. Click the value of the Template property to go to the instance's template.
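The navigation between template and instance diagrams follows the "instance of" relation in both directions. The following Python sketch models that relation as a conceptual aid only; it does not reflect Navigator's data model or API. The class names `Operation` and `OperationExecution`, the `run_id` field, and the `execute` method are hypothetical, while `template` and `instances` mirror the Template property and the Instances box mentioned in the steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Operation:
    """A template entity, such as a job or query."""
    name: str
    # Corresponds to the Instances box on the template's Details page.
    instances: List["OperationExecution"] = field(default_factory=list, repr=False)

    def execute(self, run_id: str) -> "OperationExecution":
        """Record a new execution that is an instance of this operation."""
        execution = OperationExecution(run_id=run_id, template=self)
        self.instances.append(execution)
        return execution


@dataclass(eq=False)
class OperationExecution:
    """An instance entity, such as a job or query execution."""
    run_id: str
    # Corresponds to the Template property on the instance's Details tab.
    template: Operation
```

In this model, going from a template to an instance diagram means picking an entry from `operation.instances`, and going back means following `execution.template`.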