Databricks Lineage in Cloudera Octopai Data Lineage

Overview

Cloudera Octopai Data Lineage automates data lineage and metadata visibility for Databricks environments. Databricks support covers three catalog architectures: Unity Catalog, Hive Metastore, and hybrid deployments that combine the two.

Cloudera Octopai extracts lineage directly from Databricks system metadata and enhances it with advanced parsing capabilities for notebook-based workloads. This enables organizations to gain deeper insights into data flows across Databricks pipelines, notebooks, and downstream systems.

Supported Databricks catalog configurations

Cloudera Octopai supports lineage extraction from the following Databricks catalog setups:

Unity Catalog: Unity Catalog is Databricks' centralized governance layer. When Unity Catalog is enabled, Cloudera Octopai extracts lineage from Databricks system lineage tables and catalog metadata, and supplements it with notebook parsing.
Unity Catalog is most suitable for customers using it as their primary metastore.
Hive Metastore: For Databricks workspaces still using the legacy Hive Metastore, Cloudera Octopai supports metadata and lineage extraction without relying on Unity Catalog system tables.
Hive Metastore is most suitable for customers not yet migrated to Unity Catalog.
Hybrid (Unity Catalog and Hive Metastore): Some environments operate with both Unity Catalog and Hive Metastore simultaneously. Cloudera Octopai supports these hybrid deployments and provides lineage across both catalogs, offering broader visibility than the native Databricks lineage capabilities.
Hybrid configurations are most suitable for customers transitioning from Hive Metastore to Unity Catalog.

Lineage coverage and supported workloads

Cloudera Octopai extracts lineage from Databricks environments using Unity Catalog system metadata, notebook parsing, or a combination of both. The catalog configuration of the workspace determines lineage behavior.

Supported notebook languages

Cloudera Octopai supports lineage for SQL, Python, and PySpark notebooks consistently across Databricks environments. Lineage for Scala and R notebooks is available only when Unity Catalog system metadata provides lineage records.

The following table summarizes the notebook lineage capabilities available in each Databricks catalog configuration: Unity Catalog, Hive Metastore, and hybrid (Unity Catalog and Hive Metastore).


Catalog Configuration	SQL	Python	PySpark	Scala	R
Unity Catalog	Supported	Supported	Supported	Supported when Databricks provides lineage	Supported when Databricks provides lineage
Hive Metastore	Supported	Supported	Supported	Not supported	Not supported
Unity Catalog and Hive Metastore (hybrid)	Supported	Supported	Supported	Unity Catalog-only when Databricks provides lineage	Unity Catalog-only when Databricks provides lineage
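
The matrix above can be encoded as a small lookup, for example to check a notebook inventory against the expected coverage. This is an illustrative sketch only: the function name and the configuration keys ("unity_catalog", "hive_metastore", "hybrid") are hypothetical and are not part of any Cloudera Octopai API.

```python
# Languages covered by notebook parsing in every catalog configuration.
PARSED_LANGUAGES = ("sql", "python", "pyspark")
# Languages that depend on lineage records provided by Unity Catalog.
NATIVE_ONLY_LANGUAGES = ("scala", "r")
# Hypothetical keys for the three catalog configurations in the table.
CATALOG_CONFIGS = ("unity_catalog", "hive_metastore", "hybrid")


def lineage_support(catalog_config: str, language: str) -> str:
    """Return the support level from the table for one configuration and language."""
    config = catalog_config.lower()
    lang = language.lower()
    if config not in CATALOG_CONFIGS:
        raise ValueError(f"unknown catalog configuration: {catalog_config}")
    if lang in PARSED_LANGUAGES:
        return "supported"
    if lang in NATIVE_ONLY_LANGUAGES:
        if config == "hive_metastore":
            return "not supported"
        if config == "unity_catalog":
            return "supported when Databricks provides lineage"
        return "Unity Catalog only, when Databricks provides lineage"
    raise ValueError(f"unknown notebook language: {language}")
```

For example, lineage_support("hive_metastore", "Scala") returns "not supported", matching the Hive Metastore row of the table.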

Lineage behavior by catalog type

Lineage extraction for Databricks varies depending on whether the workspace uses Unity Catalog, Hive Metastore, or both.

Unity Catalog lineage (including hybrid Unity Catalog and Hive Metastore deployments)

In Unity Catalog environments, Cloudera Octopai integrates system lineage metadata with notebook parsing to deliver comprehensive coverage. It uses two complementary sources:

Databricks Unity Catalog system lineage tables
Notebook-level parsing for supported scripts, including SQL, Python, and PySpark

Combining the two sources provides the following benefits:

Authoritative lineage recorded natively by Databricks
Additional lineage relationships derived from notebook code analysis

Native Unity Catalog lineage:

Unity Catalog system lineage captures persistent, governed operations, including reads and writes to managed tables.

Operations that do not produce persistent table or storage writes, such as intermediate DataFrame transformations, in-memory processing, or pandas-based manipulations, may not appear in Unity Catalog lineage metadata.
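
To see what Databricks itself has recorded, you can inspect the lineage system tables from a notebook. The sketch below builds a query against system.access.table_lineage. It assumes that this table is enabled in your workspace and that the column names shown match your Databricks release; it illustrates the kind of data native lineage contains and is not a description of the queries Cloudera Octopai runs. The function name is hypothetical.

```python
def table_lineage_query(lookback_days: int = 7) -> str:
    """Build a Spark SQL query listing recent table-to-table lineage events.

    Assumption: the table and column names follow the Databricks
    lineage system table schema; verify them in your workspace.
    """
    if lookback_days < 1:
        raise ValueError("lookback_days must be at least 1")
    return (
        "SELECT source_table_full_name, target_table_full_name, "
        "entity_type, event_time\n"
        "FROM system.access.table_lineage\n"
        f"WHERE event_time >= current_timestamp() - INTERVAL {int(lookback_days)} DAYS\n"
        "ORDER BY event_time DESC"
    )
```

In a Databricks notebook, where a spark session is predefined, the result can be passed to spark.sql(table_lineage_query(30)). Intermediate DataFrame steps that never write to a table would not be expected to appear in the output.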

Notebook parsing enhancement:

Cloudera Octopai also parses notebook scripts to enrich lineage coverage, including cases where native system lineage may be incomplete.

Temporary views and non-persistent transformations might not appear in Unity Catalog system metadata, but notebook parsing can still capture them in part.

Hybrid (Unity Catalog and Hive Metastore) environments:

In hybrid environments, Cloudera Octopai uses the same Unity Catalog extraction approach while also including Hive Metastore assets. This approach offers broader visibility across both catalog types compared to the Databricks native lineage UI.

Hive Metastore lineage support (Hive Metastore only):

In Hive Metastore-only environments, Cloudera Octopai derives lineage primarily through notebook script parsing.

Cloudera Octopai supports lineage extraction from the following notebook types:

Python notebooks
PySpark notebooks
SQL notebooks

Lineage is derived from the notebook code itself.

In Hive Metastore-only workspaces, a compute resource must be available to execute metadata queries. The Cluster ID specifies the compute context where these commands execute. The cluster serves as the execution engine that queries the Hive Metastore and returns the results.

If no cluster is running, Spark cannot execute metadata queries, and you cannot retrieve metadata. Therefore, Hive Metastore-only environments require a running cluster to access metadata programmatically.
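
Before an extraction run, it can help to confirm that the cluster behind the Cluster ID is actually running. The following is a minimal sketch that uses the Databricks Clusters REST API (GET /api/2.0/clusters/get) and its state field. The function names, and the way the host and token are supplied, are assumptions made for illustration; this is not how Cloudera Octopai itself connects.

```python
import json
import urllib.parse
import urllib.request


def is_cluster_running(cluster_info: dict) -> bool:
    """Return True only when the cluster reports the RUNNING state."""
    return cluster_info.get("state") == "RUNNING"


def fetch_cluster_info(host: str, token: str, cluster_id: str) -> dict:
    """Fetch cluster details from the Databricks Clusters API.

    host is the workspace URL, token is a personal access token,
    and cluster_id is the same Cluster ID configured for extraction.
    """
    query = urllib.parse.urlencode({"cluster_id": cluster_id})
    url = f"{host.rstrip('/')}/api/2.0/clusters/get?{query}"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A caller would combine the two, for example is_cluster_running(fetch_cluster_info(host, token, cluster_id)), and start the cluster or postpone extraction when the result is False.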

Script-based parsing limitations:

In Hive Metastore environments, lineage is inferred from notebook scripts, resulting in the following limitations:

Highly dynamic transformations can limit lineage resolution or prevent full identification of sources and targets. Examples include code-generated queries, function-driven logic, user-defined functions (UDFs), loops that write files programmatically, and indirect write operations.
Lineage resolution requires explicit table and column references in the code.
If a table is referenced without a fully qualified database or schema name, Cloudera Octopai might not be able to resolve the database context. For example, the query SELECT * FROM sales_table might appear in the lineage without its associated database.
Fully qualified references, such as db.schema.sales_table, provide the most complete lineage results.
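
Because resolution depends on explicit, qualified references, it can be useful to scan notebook SQL for unqualified table names before extraction. The helper below is a deliberately simple heuristic written for illustration; it is not the Cloudera Octopai parser, and the function name is hypothetical. It only inspects identifiers that directly follow FROM or JOIN, so it does not handle quoted identifiers and will also flag common table expression names.

```python
import re

# Captures the identifier that directly follows FROM or JOIN.
# Subqueries ("FROM (") are skipped because "(" is not an identifier character.
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def unqualified_tables(sql: str) -> list[str]:
    """Return table references that carry no database or schema prefix."""
    return [name for name in _TABLE_REF.findall(sql) if "." not in name]
```

For the query SELECT * FROM sales_table the helper returns ["sales_table"], while SELECT * FROM db.schema.sales_table returns an empty list.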

In Hive Metastore environments, Cloudera Octopai does not currently support lineage for the following cases:

Scala notebooks
R notebooks
Databricks pipelines or jobs

If you do not provide the Cluster ID of a running cluster, Cloudera Octopai can still generate lineage from notebook parsing. However, it cannot retrieve table-level metadata stored in the Hive Metastore, such as table definitions and detailed schema information. As a result, Cloudera Octopai displays lineage with limited table metadata.

Column-level lineage support (Unity Catalog and Hive Metastore):

Column-level lineage is supported when column mappings are available in the metadata source:

In Unity Catalog environments, column lineage is captured primarily from Databricks system lineage tables. Notebook parsing can supplement it when column references are explicitly defined in the code.
In Hive Metastore environments, columns are captured only when explicitly referenced in notebook scripts (for example, within SQL queries).

Shared platform limitations

The following limitations apply across both Unity Catalog and Hive Metastore environments:

Pipelines and jobs not represented as entities

Databricks jobs and pipelines are not currently modeled as standalone entities in Cloudera Octopai lineage. In Unity Catalog environments, lineage generated by jobs or pipelines is captured. However, the lineage is reported through the underlying notebook activity instead of appearing as a separate job or pipeline asset.

Cloudera Octopai captures lineage at the notebook execution level and focuses on the underlying data transformations, rather than modeling orchestration constructs (such as jobs or pipelines) as standalone lineage entities.

Unity Catalog environments provide the most complete lineage coverage. Hive Metastore environments rely entirely on notebook parsing and require explicit table and column references.