Cloudera Navigator Provenance Use Case

A number of business decisions and transactions rely on the verifiability of the data used in those decisions and transactions. Data-verification questions might include:
  • How was this mortgage credit score computed?
  • How can I prove that this number on a sales report is correct?
  • What data sources were used in this calculation?
You can use Cloudera Navigator to answer these and other questions about your data. Using metadata and lineage, you can track the life of the data to verify its provenance, that is, to determine its origin.

How Can I Verify a Value in a Table?

Many business transactions require you to verify that information is correct and that it is derived from a reliable source. For example, if you work in a sales organization, you might verify that information in sales reports is accurate, that you can trust the contents, and that you can identify the origin of the information.

The following example shows how you can verify information in a field named s_neighbor by tracing it to its source. Replace the fields and other information in this example with the actual information that you want to verify.
  1. Log in to the Cloudera Navigator data management UI and click the Search tab.
  2. Type s_neighbor in the search box.

    You see four instances of the s_neighbor field.

  3. View details of the field in the top_10 table by clicking s_neighbor in the entry with the Parent Path /default/top10.

    You see that the parent table is top_10, and the input or upstream source of the data is the salesdata database.

    Where did salesdata come from originally? It was imported using Sqoop, with syntax similar to the following; actual arguments vary:
    > sqoop import-all-tables \
        -m {{cluster_data.worker_node_hostname.length}} \
        --connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db \
        --username=admin \
        --password=password \
        --compression-codec=snappy \
        --as-parquetfile \
        --warehouse-dir=/user/hive/warehouse \
        --hive-import
  4. To see a graphical representation of the relationships among the entities:
    1. Click the Lineage tab.
    2. In Lineage Options, select Operations and clear any other check boxes.

      You see that s_neighbor can be traced back to the original table salesdata.

  5. Click the operation entity in the center of the lineage diagram, and see details about it on the lower right side of the lineage window.

    Information about the selected entity indicates that the operation is an Impala query. Click the information icon on the Query Text line to see the entire query. This query was used to derive top_10 from the original table.
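
The trace performed in the steps above can be pictured as a walk upstream through a lineage graph. The following Python sketch is only an illustration under stated assumptions: it does not call Cloudera Navigator, the UPSTREAM mapping is a hand-written stand-in for the lineage diagram, and the node name impala_query and the helper trace_to_sources are hypothetical names chosen for this example.

```python
from collections import deque

# Hypothetical lineage edges: each entity maps to the entities it was derived from.
# Entity names mirror the example above; "impala_query" is a placeholder
# for the operation shown in the center of the lineage diagram.
UPSTREAM = {
    "top_10.s_neighbor": ["impala_query"],
    "impala_query": ["salesdata"],
    "salesdata": [],
}


def trace_to_sources(entity, upstream):
    """Walk upstream breadth-first from entity.

    Returns a tuple (visited, sources): every entity reached, in visit
    order, and the entities that have no upstream parents (the origins).
    """
    seen = {entity}
    queue = deque([entity])
    visited = []
    sources = []
    while queue:
        current = queue.popleft()
        visited.append(current)
        parents = upstream.get(current, [])
        if not parents:
            # No upstream inputs: this entity is an original source.
            sources.append(current)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return visited, sources


visited, sources = trace_to_sources("top_10.s_neighbor", UPSTREAM)
print(" <- ".join(visited))
print("sources:", sources)
```

In this toy graph the walk visits the field, then the operation, then salesdata, which matches what the lineage diagram shows: the value in top_10 is derived by a query from the imported source.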