Multi-Cloud Agnostic Data Lineage Solution with Databricks Integration
Cloudera Octopai Data Lineage is a cloud-agnostic, automated data lineage platform that provides cross-system, table-level, and column-level lineage across hybrid, legacy, and modern environments.
Overview:
Cloudera Octopai integrates directly with Databricks Unity Catalog, using Databricks' system tables as the authoritative source of lineage metadata.
This guide explains how to configure Databricks Unity Catalog so that Cloudera Octopai can extract lineage for:
- Unity Catalog tables and views
- Delta Live Tables (DLT) pipelines
- SQL queries and transformations
- Notebooks and jobs interacting with Unity Catalog objects
All extraction runs through JDBC/ODBC queries executed on a SQL Warehouse, using the lineage metadata recorded in Unity Catalog's system.access schema.
Databricks Unity Catalog Lineage Model:
Unity Catalog stores lineage metadata in system tables located under:
system.access.table_lineage
system.access.column_lineage
These tables contain lineage events generated when:
- Notebooks read from or write to UC tables
- SQL statements run in SQL Warehouses
- Delta Live Tables pipelines process data
- Jobs execute transformations or orchestrate UC objects
Unity Catalog captures table-level and column-level lineage for SQL, Python, and Scala operations executed against UC-managed assets.
Cloudera Octopai reads this metadata directly from system tables.
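As an illustration of what Cloudera Octopai reads, a query like the following returns the core fields of a table-level lineage event. Column names reflect the Databricks system-table schema at the time of writing and may evolve:

```sql
-- Inspect recent table-level lineage events.
-- entity_type identifies the producer (e.g. NOTEBOOK, JOB, PIPELINE, DBSQL_QUERY).
SELECT entity_type,
       entity_id,
       source_table_full_name,
       target_table_full_name,
       event_date,
       event_time
FROM system.access.table_lineage
ORDER BY event_time DESC
LIMIT 20;
```

The system.access.column_lineage table has the same shape plus source_column_name and target_column_name, which is what enables column-to-column lineage.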
How Cloudera Octopai Integrates with Databricks:
Cloudera Octopai uses a JDBC/ODBC extraction method to query Databricks system tables through a SQL Warehouse. This method provides:
- Accurate lineage across notebooks, jobs, and DLT pipelines
- Incremental extraction using Databricks event timestamps
- A 13-month historical backfill on first run
- Consistent and predictable performance for large environments
Notebook metadata is retrieved via Databricks APIs and joined with system-table lineage during post-processing.
Required Permissions in Databricks:
To enable Cloudera Octopai lineage extraction, the Databricks service principal or user must have:
1. SELECT on system.access.table_lineage and system.access.column_lineage
These permissions authorize reading Unity Catalog lineage events.
Why it's required
Cloudera Octopai runs SQL queries such as:
```sql
SELECT * FROM system.access.table_lineage WHERE event_date >= <watermark>
```
Without SELECT, no lineage metadata can be retrieved.
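As a sketch of the grants involved, assuming a service principal named `octopai-extractor` (a hypothetical name; substitute your own principal), note that reading system tables also requires USE privileges on the system catalog and schema:

```sql
-- Hypothetical principal name; replace with your service principal or user.
GRANT USE CATALOG ON CATALOG system TO `octopai-extractor`;
GRANT USE SCHEMA ON SCHEMA system.access TO `octopai-extractor`;
GRANT SELECT ON TABLE system.access.table_lineage TO `octopai-extractor`;
GRANT SELECT ON TABLE system.access.column_lineage TO `octopai-extractor`;
```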
2. SQL Warehouse Permission: "Can Use" (Not "Can View")
Why "Can View" is not enough
"Can View" only allows the user to see the warehouse in the UI. It does not allow:
- connecting to the warehouse
- running SQL
- executing SELECT on system tables
- using JDBC/ODBC
Extraction will fail.
Why "Can Use" is required
"Can Use" grants the ability to:
- connect to the warehouse
- run SQL queries
- execute SELECT on system tables
- perform incremental extraction
Cloudera Octopai must run SQL against the system tables. This is why "Can Use" is mandatory.
3. Workspace Access + Databricks SQL Access (for service principals)
Required for the service principal to authenticate and operate within the workspace.
4. Valid HTTP Path
All lineage extraction runs through JDBC/ODBC, which requires a valid SQL Warehouse HTTP Path. If this value is missing or wrong, extraction cannot start.
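For reference, a JDBC connection string to a SQL Warehouse generally has the following shape (illustrative only; exact parameter names depend on the Databricks JDBC driver version, and every placeholder must be replaced with your workspace values). The HTTP Path is shown on the warehouse's Connection Details tab:

```
jdbc:databricks://<workspace-host>:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/<warehouse-id>;UID=token;PWD=<personal-access-token>
```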
New Extraction Method Summary:
System Metadata Extraction (Authoritative Source)
Cloudera Octopai now retrieves lineage from:
system.access.table_lineage
system.access.column_lineage
This provides:
- Accurate table + column lineage
- Better notebook and DLT correlation
- Reduced latency
- No dependency on ephemeral APIs
Delta Live Tables (DLT) Lineage Support
DLT lineage is captured when:
- The workspace is Unity Catalog–enabled
- The system tables contain the DLT pipeline events
- The SQL Warehouse HTTP Path is defined
- SELECT permissions are granted
Cloudera Octopai automatically:
- Extracts pipeline-level and table-level DLT lineage
- Resolves DLT notebooks and their transformations
- Backfills up to 13 months of historical DLT runs
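Assuming DLT pipeline events are recorded with entity_type = 'PIPELINE' (the value used by current Databricks system tables; verify against your workspace), a query along these lines surfaces the DLT lineage that Cloudera Octopai consumes:

```sql
-- DLT pipeline lineage over the 13-month backfill window.
SELECT entity_id AS pipeline_id,
       source_table_full_name,
       target_table_full_name,
       event_time
FROM system.access.table_lineage
WHERE entity_type = 'PIPELINE'
  AND event_date >= add_months(current_date(), -13);
```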
Incremental Extraction Using Composite Watermark
Cloudera Octopai processes lineage incrementally using:
- event_date
- event_time
This reduces load on Databricks and supports high-frequency refresh schedules.
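Conceptually, the composite watermark turns each refresh into a query of the following form, where the literal date and timestamp stand in for the values saved at the end of the previous run:

```sql
-- Fetch only events newer than the last processed (event_date, event_time) pair.
SELECT *
FROM system.access.table_lineage
WHERE event_date > DATE '2024-05-31'
   OR (event_date = DATE '2024-05-31'
       AND event_time > TIMESTAMP '2024-05-31 23:10:00')
ORDER BY event_date, event_time;
```

Filtering on event_date first lets Databricks prune by date before applying the finer-grained event_time comparison.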
Compatibility Requirements:
| Component | Requirement |
|---|---|
| Cloudera Octopai Client | Version OC-10.0.19 or higher |
| Databricks Workspace | Unity Catalog enabled |
| Authentication | PAT or Service Principal (M2M) |
| Required Permissions | SELECT on system.access.* tables |
| Connectivity | Valid SQL Warehouse HTTP Path |
Validation Steps:
Unity Catalog Lineage Validation
Verify lineage appears in Cloudera Octopai:
- table-to-table
- column-to-column
- notebook and job operations
- DLT pipelines
Notebook Validation
Confirm that read/write operations performed against UC objects appear in Cloudera Octopai.
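One way to cross-check this, assuming a hypothetical target table main.sales.orders, is to verify that the notebook's read/write events exist in the system tables themselves:

```sql
-- Recent notebook operations against a specific UC table (hypothetical name).
SELECT entity_id AS notebook_id,
       source_table_full_name,
       target_table_full_name,
       created_by,
       event_time
FROM system.access.table_lineage
WHERE entity_type = 'NOTEBOOK'
  AND (target_table_full_name = 'main.sales.orders'
       OR source_table_full_name = 'main.sales.orders')
ORDER BY event_time DESC
LIMIT 10;
```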
Error Handling
Check:
- SELECT permissions
- SQL Warehouse "Can Use"
- Valid HTTP Path
- PAT or Secret validity
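A quick smoke test that exercises the first three checks at once: run the query below through the same SQL Warehouse HTTP Path that Cloudera Octopai is configured with. If it completes (even with a zero count), connectivity, "Can Use", and SELECT on the system tables are all in place; a permission or connection error points at the failing item:

```sql
SELECT COUNT(*) AS events_today
FROM system.access.table_lineage
WHERE event_date = current_date();
```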
Security Considerations:
- Rotate PATs or client secrets regularly
- Use service principals for M2M authentication
- Limit permissions to only the UC objects required
- Store credentials securely
Performance and Scaling:
Unity Catalog and SQL Warehouses handle lineage extraction efficiently, but large environments should ensure:
- warehouses have sufficient throughput
- extraction windows run during low-traffic periods
- high-volume environments use incremental mode
Cloudera Octopai is designed to scale with multi-system metadata loads.
How Cloudera Octopai Extends Databricks Beyond Native Lineage:
Unity Catalog provides lineage for UC-managed objects. Cloudera Octopai extends visibility across:
- files used in pipelines
- notebooks not fully captured by UC
- cross-system lineage across ETL, BI, and cloud systems
- transformations and flows outside Databricks
- multi-cloud data flows (AWS, Azure, GCP)
Cloudera Octopai unifies this lineage into a single, cross-system view.
Summary:
With Unity Catalog enabled and the required permissions configured, Cloudera Octopai delivers comprehensive lineage across Databricks SQL, notebooks, jobs, and Delta Live Tables—enhancing Databricks' native lineage with broader cross-system visibility and higher resolution at the file, table, and column level.
