Databricks Lineage in Cloudera Octopai Data Lineage

Cloudera Octopai Data Lineage provides automated data lineage and metadata visibility for Databricks environments, supporting Unity Catalog, Hive Metastore, and hybrid configurations. It extracts lineage from system metadata and notebook parsing, offering insights into data flows across pipelines, notebooks, and downstream systems.

Overview

Cloudera Octopai Data Lineage automates data lineage and metadata visibility for Databricks environments. Support for Databricks in Cloudera Octopai accommodates various catalog architectures, including Unity Catalog, Hive Metastore, and hybrid deployments that combine Unity Catalog and Hive Metastore.

Cloudera Octopai extracts lineage directly from Databricks system metadata and enhances it with advanced parsing capabilities for notebook-based workloads. This enables organizations to gain deeper insights into data flows across Databricks pipelines, notebooks, and downstream systems.

Supported Databricks catalog configurations

Cloudera Octopai supports lineage extraction from the following Databricks catalog setups:

Unity Catalog
Unity Catalog is Databricks' centralized governance layer. When Unity Catalog is enabled, Cloudera Octopai extracts lineage using Databricks system lineage tables and catalog metadata, as well as other parsing capabilities.

This configuration is most suitable for customers who use Unity Catalog as their primary metastore.

Hive Metastore
For Databricks workspaces still using the legacy Hive Metastore, Cloudera Octopai supports metadata and lineage extraction without relying on Unity Catalog system tables.

This configuration is most suitable for customers who have not yet migrated to Unity Catalog.

Hybrid: Unity Catalog and Hive Metastore
Some environments operate with both Unity Catalog and Hive Metastore simultaneously. Cloudera Octopai supports these hybrid deployments and can provide lineage across both catalogs, offering broader visibility compared to native Databricks capabilities.

Hybrid configurations are most suitable for customers transitioning from Hive Metastore to Unity Catalog.

Lineage coverage and supported workloads

Cloudera Octopai supports lineage extraction from Databricks environments using Unity Catalog system metadata, notebook parsing, or a combination of both. Lineage behavior is determined by the catalog configuration of the customer's workspace.

Supported notebook languages

Cloudera Octopai supports lineage for SQL, Python, and PySpark notebooks consistently across Databricks environments. Lineage for Scala and R notebooks is available only when Unity Catalog system metadata provides lineage records.

The following table summarizes notebook-language lineage support in each Databricks catalog configuration: Unity Catalog, Hive Metastore, and hybrid (Unity Catalog and Hive Metastore).

Catalog configuration | SQL | Python | PySpark | Scala | R
--------------------- | --- | ------ | ------- | ----- | -
Unity Catalog | Supported | Supported | Supported | Supported when Databricks provides lineage | Supported when Databricks provides lineage
Hive Metastore | Supported | Supported | Supported | Not supported | Not supported
Unity Catalog and Hive Metastore (hybrid) | Supported | Supported | Supported | Unity Catalog-only when Databricks provides lineage | Unity Catalog-only when Databricks provides lineage

Lineage behavior by catalog type

Lineage extraction from Databricks in Cloudera Octopai varies depending on whether the workspace uses Unity Catalog, Hive Metastore, or both.

Unity Catalog lineage (including hybrid Unity Catalog and Hive Metastore deployments)

In Unity Catalog environments, Cloudera Octopai integrates system lineage metadata with notebook parsing to deliver comprehensive coverage. It uses two complementary sources:

  • Databricks Unity Catalog system lineage tables
  • Notebook-level parsing for supported scripts, including SQL, Python, and PySpark

Combining these two sources provides the following benefits:

  • Authoritative lineage recorded natively by Databricks
  • Additional lineage relationships derived from notebook code analysis
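
The way the two sources complement each other can be pictured as a union of lineage edges, each tagged with where it was observed. The following Python sketch is illustrative only and does not represent Cloudera Octopai internals: the `Edge` type, the `merge_lineage` function, and all table names are hypothetical.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """A table-level lineage edge (hypothetical shape, for illustration)."""
    source: str
    target: str


def merge_lineage(system_edges, parsed_edges):
    """Union two edge collections and record which source reported each edge."""
    merged = {}
    for edge in system_edges:
        merged.setdefault(edge, set()).add("system")
    for edge in parsed_edges:
        merged.setdefault(edge, set()).add("parsed")
    return merged


# Hypothetical example data.
system_edges = [Edge("main.raw.orders", "main.curated.orders")]
parsed_edges = [
    Edge("main.raw.orders", "main.curated.orders"),  # also found by parsing
    Edge("main.raw.orders", "tmp_orders_view"),      # found by parsing only
]

merged = merge_lineage(system_edges, parsed_edges)
for edge, origins in sorted(merged.items()):
    print(edge.source, "->", edge.target, sorted(origins))
```

In this sketch, an edge reported by both sources appears once with both origins, while an edge that only the notebook code reveals is still retained.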

Native Unity Catalog lineage:

Unity Catalog system lineage captures persistent, governed operations, including reads and writes to managed tables.

Operations that do not produce persistent table or storage writes, such as intermediate DataFrame transformations, in-memory processing, or pandas-based manipulations, may not appear in Unity Catalog lineage metadata.
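
To make the shape of this metadata concrete, the sketch below assumes that Databricks lineage system tables are enabled and readable in the workspace, and that table-level lineage is exposed in `system.access.table_lineage`. It is not a description of how Cloudera Octopai queries Databricks. The `rows_to_edges` helper and the sample rows are hypothetical, and the rows are hard-coded so that the example runs outside Databricks.

```python
# Query against the Databricks table lineage system table.
# Only events that have both a source and a target table are kept.
TABLE_LINEAGE_QUERY = """
SELECT source_table_full_name, target_table_full_name, entity_type, event_time
FROM system.access.table_lineage
WHERE source_table_full_name IS NOT NULL
  AND target_table_full_name IS NOT NULL
"""


def rows_to_edges(rows):
    """Collapse lineage event rows into distinct (source, target) pairs."""
    edges = set()
    for row in rows:
        source = row.get("source_table_full_name")
        target = row.get("target_table_full_name")
        if source and target:
            edges.add((source, target))
    return sorted(edges)


# In a Databricks notebook, rows could be obtained with something like:
#   rows = [r.asDict() for r in spark.sql(TABLE_LINEAGE_QUERY).collect()]
# Hypothetical sample rows are used here instead.
sample_rows = [
    {"source_table_full_name": "main.raw.orders",
     "target_table_full_name": "main.curated.orders",
     "entity_type": "NOTEBOOK"},
    {"source_table_full_name": "main.raw.orders",      # repeated run, same edge
     "target_table_full_name": "main.curated.orders",
     "entity_type": "NOTEBOOK"},
    {"source_table_full_name": "main.curated.orders",  # read-only event
     "target_table_full_name": None,
     "entity_type": "NOTEBOOK"},
]

edges = rows_to_edges(sample_rows)
print(edges)
```

Only reads and writes that Databricks records as events show up in rows like these. A transformation that stays in memory produces no row at all, which is why it is absent from system lineage.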

Notebook parsing enhancement:

Cloudera Octopai also parses notebook scripts to enrich lineage coverage, including cases where native system lineage may be incomplete.

Temporary views or non-persistent transformations may not appear in Unity Catalog system metadata, but may still be partially reflected through parsing where possible.
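
As an illustration of how code analysis can surface a temporary view that system metadata omits, the following sketch extracts sources and targets from SQL text with deliberately naive regular expressions. It is an assumption-laden toy, not the Cloudera Octopai parser: the patterns, the `statement_edges` helper, and the table names are all hypothetical, and real SQL requires a proper parser.

```python
import re

# Naive patterns: adequate for this illustration, not for production SQL.
TARGET_RE = re.compile(
    r"\b(?:INSERT\s+(?:INTO|OVERWRITE)(?:\s+TABLE)?"
    r"|CREATE\s+(?:OR\s+REPLACE\s+)?(?:TEMP(?:ORARY)?\s+)?(?:TABLE|VIEW))"
    r"\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def statement_edges(sql):
    """Return (source, target) pairs found in a single SQL statement."""
    sources = SOURCE_RE.findall(sql)
    targets = TARGET_RE.findall(sql)
    return [(source, target) for target in targets for source in sources]


# Hypothetical notebook cells: a temporary view feeds a persisted table.
notebook_cells = [
    "CREATE OR REPLACE TEMP VIEW recent_orders AS "
    "SELECT * FROM sales.orders WHERE order_date >= '2024-01-01'",
    "INSERT INTO sales.orders_summary "
    "SELECT customer_id, COUNT(*) AS n FROM recent_orders GROUP BY customer_id",
]

edges = [edge for cell in notebook_cells for edge in statement_edges(cell)]
for source, target in edges:
    print(source, "->", target)
```

Here the temporary view `recent_orders` appears as an intermediate node between `sales.orders` and `sales.orders_summary`, even though it is never persisted.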

Hybrid (Unity Catalog and Hive Metastore) environments:

In hybrid environments, Cloudera Octopai uses the same Unity Catalog extraction approach while also including Hive Metastore assets. This approach offers broader visibility across both catalog types compared to the Databricks native lineage UI.

Hive Metastore lineage support (Hive Metastore only):

In Hive Metastore-only environments, Cloudera Octopai derives lineage primarily through notebook script parsing.

Cloudera Octopai supports lineage extraction from the following notebook types:

  • Python notebooks
  • PySpark notebooks
  • SQL notebooks

Lineage is determined based on the notebook code.

Script-based parsing limitations:

In Hive Metastore environments, lineage is inferred from notebook scripts, resulting in the following limitations:

  • Highly dynamic transformations can limit lineage resolution or prevent full identification of sources and targets. Examples include code-generated queries, function-driven logic, user-defined functions (UDFs), loops that write files programmatically, and indirect write operations.
  • Lineage resolution requires explicit table and column references in the code.
  • If a table is referenced without a fully qualified database or schema name, Cloudera Octopai might not be able to resolve the database context. For example, the query SELECT * FROM sales_table might appear in the lineage without its associated database.
  • Fully qualified references, such as db.schema.sales_table, provide the most complete lineage results.
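
The limitations above come down to what is visible in the script text before it runs. The following sketch classifies the table reference in a query the way a static reader might see it. It is a simplified illustration under stated assumptions, not Cloudera Octopai behavior: the `describe_reference` function, its labels, and the `{target_db}` and `{table_name}` placeholders are hypothetical.

```python
import re

# Matches the first table reference after FROM, including f-string placeholders.
SOURCE_RE = re.compile(r"\bFROM\s+([\w.{}]+)", re.IGNORECASE)


def describe_reference(sql_text):
    """Classify the first FROM reference as seen in the unexecuted script text."""
    match = SOURCE_RE.search(sql_text)
    if match is None:
        return (None, "no source found")
    reference = match.group(1)
    if "{" in reference:
        # The name is only known at run time, so it cannot be resolved statically.
        return (reference, "dynamic")
    if "." not in reference:
        # No database or schema qualifier: the database context is unknown.
        return (reference, "unqualified")
    return (reference, "qualified")


examples = [
    "SELECT * FROM sales_table",
    "SELECT * FROM db.schema.sales_table",
    "SELECT * FROM {target_db}.{table_name}",  # source text of an f-string
]

results = [describe_reference(sql) for sql in examples]
for reference, status in results:
    print(reference, "->", status)
```

Only the second query carries enough information in the text itself to place the table in a database. The first query lacks the qualifier, and the third depends on values that exist only during execution.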

In Hive Metastore environments, Cloudera Octopai does not currently support lineage for the following cases:

  • Scala notebooks
  • R notebooks
  • Databricks pipelines or jobs

Column-level lineage support (Unity Catalog and Hive Metastore):

Column-level lineage is supported when column mappings are available in the metadata source:

  • In Unity Catalog environments, column lineage is captured primarily from Databricks system lineage tables. It may also be supplemented by notebook parsing if column references are explicitly defined.
  • In Hive Metastore environments, columns are captured only when explicitly referenced in notebook scripts (for example, within SQL queries).
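
The dependence on explicit column references can be shown with a small sketch. It assumes a simple single-table SELECT and uses a naive comma split, so it is only a rough illustration and not the parsing logic of Cloudera Octopai. The `explicit_column_mappings` function and the sample queries are hypothetical.

```python
import re

# Captures the select list between SELECT and FROM (simple queries only).
SELECT_RE = re.compile(r"\bSELECT\s+(.*?)\s+FROM\b", re.IGNORECASE | re.DOTALL)


def explicit_column_mappings(sql):
    """Return (source_column, output_column) pairs that are named in the text."""
    match = SELECT_RE.search(sql)
    if match is None:
        return []
    mappings = []
    for item in match.group(1).split(","):
        item = item.strip()
        if item == "*" or item.endswith(".*"):
            continue  # wildcard: no column names are present in the script text
        parts = re.split(r"\s+AS\s+", item, flags=re.IGNORECASE)
        mappings.append((parts[0].strip(), parts[-1].strip()))
    return mappings


explicit = explicit_column_mappings(
    "SELECT order_id, amount AS total FROM sales.orders"
)
wildcard = explicit_column_mappings("SELECT * FROM sales.orders")
print(explicit)
print(wildcard)
```

A query that names its columns yields column mappings, whereas `SELECT *` leaves nothing in the script from which column-level lineage could be derived.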

Shared platform limitations

The following limitations apply across both Unity Catalog and Hive Metastore environments:

Pipelines and jobs not represented as entities

Databricks jobs and pipelines are not currently modeled as standalone entities in Cloudera Octopai lineage. In Unity Catalog environments, lineage generated by jobs or pipelines is captured. However, the lineage is reported through the underlying notebook activity instead of appearing as a separate job or pipeline asset.

Cloudera Octopai captures lineage at the notebook execution level and focuses on the underlying data transformations, rather than modeling orchestration constructs (such as jobs or pipelines) as standalone lineage entities.

Unity Catalog environments provide the most complete lineage coverage. Hive Metastore environments rely entirely on notebook parsing and require explicit table and column references.