Multi-cloud agnostic data lineage solution with Databricks integration
Integrate Cloudera Octopai Data Lineage with Databricks Unity Catalog to deliver automated multi-cloud data lineage across AWS, Azure, and Google Cloud. Learn how to extract metadata from Databricks Unity Catalog so that Cloudera Octopai Data Lineage can construct lineage, and review the supported cloud platforms, the Unity Catalog object model, configuration options, and validation procedures.
Overview
Cloudera Octopai Data Lineage is a cloud-agnostic, automated data lineage platform that provides cross-system, column-level lineage across both legacy and modern environments. By integrating with Databricks Unity Catalog, Cloudera Octopai extends the native lineage experience with detailed file-level and column-level lineage for a centralized view of data flows across AWS, Azure, and Google Cloud environments.
Supported cloud platforms
- AWS
- Supports services such as S3, Redshift, Glue, and RDS
- Captures data lineage for ingestion, transformation, and storage processes in the AWS ecosystem
- Azure
- Fully integrated with services like Synapse Analytics, Azure Data Factory, and Azure SQL
- Tracks detailed lineage across hybrid environments, whether on-premises or cloud-based
- Google Cloud Platform (GCP)
- Supports BigQuery, Cloud Storage, Dataflow, and other GCP services
- Captures column-level and file-level lineage for end-to-end tracking of data flows within GCP environments
Databricks Unity Catalog structure
- Catalogs
  - Catalog Name – The catalog is a logical grouping of databases or schemas within Unity Catalog, representing a container for managing metadata and data lineage
- Schemas (Databases)
  - Schema Name – Logical grouping of tables within a catalog, allowing metadata organization at the schema level
  - Lineage Name – Tracks how data flows into and out of the schema objects
- Tables and views
  - Table Name – The name of the specific table or view
  - Table Type – Whether the table is a managed table, external table, or Delta table
  - Data Lineage – Shows how data in the table or view is derived or transformed from other source tables
- Files
  - File Metadata Name – Information about files, for example CSV or Parquet, stored in external sources, capturing schema details and data types
  - File Lineage – Captures how data from files is used and transformed within data pipelines
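The object model above can be browsed programmatically. The following sketch walks catalogs and schemas over the Unity Catalog REST API; the /api/2.1/unity-catalog routes and the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are assumptions for illustration, not details taken from this document.

```python
# Hypothetical sketch: walk the Unity Catalog object model
# (catalogs -> schemas) over the REST API using only the standard library.
# Endpoint paths assume the /api/2.1/unity-catalog routes; verify them
# against your workspace before relying on this.
import json
import os
import urllib.parse
import urllib.request

API_ROOT = "/api/2.1/unity-catalog"

def api_url(workspace_url: str, resource: str, **params) -> str:
    """Build a Unity Catalog REST endpoint URL, with optional query params."""
    url = workspace_url.rstrip("/") + API_ROOT + "/" + resource
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url

def uc_get(workspace_url: str, token: str, resource: str, **params) -> dict:
    """Issue an authenticated GET and decode the JSON response."""
    req = urllib.request.Request(
        api_url(workspace_url, resource, **params),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__" and "DATABRICKS_HOST" in os.environ:
    ws = os.environ["DATABRICKS_HOST"]      # e.g. https://adb-xxx.azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]
    for cat in uc_get(ws, token, "catalogs").get("catalogs", []):
        for sch in uc_get(ws, token, "schemas", catalog_name=cat["name"]).get("schemas", []):
            print(cat["name"], "/", sch["name"])
```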
Databricks ETL notebooks with Unity Catalog
- Notebook metadata with lineage – Unity Catalog automatically captures lineage for reads and writes against managed tables.
- Data sources and targets – Cloudera Octopai tracks the flow of data sources and destinations processed by ETL notebooks.
Databricks jobs with Unity Catalog
- Job metadata with lineage – Job tasks interacting with Unity Catalog assets contribute lineage metadata for tables and files.
- Job tasks with Unity Catalog integration – Input and output lineage is tracked per job task so you can analyze how Unity Catalog tables are used during executions.
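Job-task lineage recorded by Unity Catalog can also be inspected directly. Below is a minimal sketch that builds a query against the system.access.table_lineage system table; the table name and its entity_type, entity_id, and timestamp columns are assumptions based on the Databricks lineage system tables, not details from this document.

```python
# Hedged sketch: build a SQL query over the (assumed) table-lineage
# system table to list reads and writes recorded for one job's runs.
def job_lineage_query(job_id: str) -> str:
    """SQL listing source/target tables recorded for a given job."""
    return (
        "SELECT entity_run_id, source_table_full_name, "
        "target_table_full_name, event_time\n"
        "FROM system.access.table_lineage\n"
        f"WHERE entity_type = 'JOB' AND entity_id = '{job_id}'\n"
        "ORDER BY event_time DESC"
    )

# Illustrative job ID; run the resulting SQL in a SQL warehouse.
print(job_lineage_query("529"))
```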
Cloudera Octopai integration options
Option 1: Supporting lineage through Unity Catalog with Cloudera Octopai Client
Follow the Unity Catalog extraction path when you want Cloudera Octopai to harvest lineage from managed Unity Catalog assets. For more information, see Databricks Supporting Lineage through Unity Catalog.
- Verify the cluster type.
  Databricks Unity Catalog requires a Premium license and supports only certain cluster types, including Standard, High Concurrency, and Single-Node clusters. It also requires Databricks Runtime 9.1 LTS or higher for metadata extraction.
- Configure permissions in Databricks.
  - Configure permissions using API management.
    Use the Databricks API to manage access permissions. For example, to list metastores:

    ```bash
    curl --location 'https://adb-90442919623923.3.azuredatabricks.net/api/2.0/unity-catalog/metastores' \
    --header 'Authorization: Bearer <token>'
    ```

    To manage schema access:

    ```bash
    curl --location --request PUT 'https://adb-90442919623923.3.azuredatabricks.net/api/2.0/unity-catalog/metastores/<metastore-id>/systemschemas/access' \
    --header 'Authorization: Bearer <token>'
    ```
  - Assign permissions.
    - Go to the object's permissions settings in the Databricks UI.
    - Add users or groups that need access and assign permissions, such as Can View, Can Run, Can Edit, or Is Owner.
    - Ensure that the Cloudera Octopai service principal has access to the necessary Unity Catalog objects.
- Enable audit logs.
  Enable audit logs to track metadata access. Example query to monitor Octopai’s access:

  ```sql
  select * from system.access.audit where user_agent like '%octopai%'
  ```
- Review the supported features.
  - Supported
    - Multi-language support (Python, SQL, Scala)
    - Automatic notebook metadata capture
  - Not supported
    - File-level lineage, unless the files are mapped to a volume
    - Tables that are not part of hive_metastore
    - Delta Live Tables (DLT) streaming patterns
    - Overwrite mode for DataFrame write operations into Unity Catalog, which is supported only for Delta tables (not other file formats) and requires the CREATE or MODIFY privilege
- Leverage Cloudera Octopai advantages
Cloudera Octopai extends lineage visibility to include detailed file-level and column-level lineage, handling gaps in streaming live table lineage and other unsupported patterns in Unity Catalog. Cloudera Octopai can parse lineage where Unity Catalog does not, particularly with non-standard transformations and files.
Option 2: Building lineage for specific notebooks
- Identify notebooks.
  Determine which notebooks require lineage coverage.
- Configure permissions for notebooks.
  - Set up permissions for the relevant notebooks by using the Sharing Permissions dialog in Databricks.
  - Assign roles, such as Can View, Can Edit, or Can Manage.
- Validate permissions.
  Confirm permission changes in the Databricks UI or through the Databricks API to ensure extraction jobs can read the notebooks.
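The validation step can be scripted. The following is a hedged sketch that checks a notebook's access control list through the Databricks Permissions API; the GET /api/2.0/permissions/notebooks/<id> endpoint and the response field names are assumptions to verify against your workspace.

```python
# Hedged sketch: confirm that the principal used for extraction appears
# in a notebook's access control list. The endpoint path and response
# shape are assumed from the generic Databricks Permissions API.
import json
import urllib.request

def notebook_permissions_url(workspace_url: str, notebook_id: str) -> str:
    """Build the (assumed) Permissions API URL for one notebook."""
    return workspace_url.rstrip("/") + f"/api/2.0/permissions/notebooks/{notebook_id}"

def can_read(acl: dict, principal: str) -> bool:
    """True if `principal` appears in the ACL with any permission assigned."""
    for entry in acl.get("access_control_list", []):
        who = (entry.get("user_name")
               or entry.get("group_name")
               or entry.get("service_principal_name"))
        if who == principal and entry.get("all_permissions"):
            return True
    return False

def fetch_acl(workspace_url: str, token: str, notebook_id: str) -> dict:
    """Download the ACL for a notebook (requires a valid token)."""
    req = urllib.request.Request(
        notebook_permissions_url(workspace_url, notebook_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```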
Setting up Cloudera Octopai for Databricks metadata extraction
- Connection name – Assign a descriptive name for the Databricks connection in Cloudera Octopai.
- Databricks server URL – Specify the workspace URL, for example https://abc-1234.5.azuredatabricks.net.
- Token – Generate an OAuth2 personal access token (PAT) following Databricks guidance and provide it to Cloudera Octopai.
Testing and validation
- Unity Catalog validation – Run test extractions and confirm that column-level and file-level lineage appears in the Cloudera Octopai interface.
- Notebook lineage validation – Execute notebook workloads and verify that read and write operations surface within Cloudera Octopai lineage graphs.
- Error handling – If lineage is missing, review permissions and cluster configuration to ensure Unity Catalog support.
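The error-handling guidance can be condensed into a small checklist. The sketch below encodes constraints noted in this document (Premium license, Runtime 9.1 LTS or higher, files mapped to a volume); the function and its messages are illustrative, not an Octopai or Databricks API.

```python
# Illustrative sketch: turn the troubleshooting guidance into a checklist
# that returns likely causes when expected lineage is absent.
def missing_lineage_hints(premium_tier: bool,
                          runtime_version: tuple,
                          files_in_volume: bool) -> list:
    """Return likely causes of missing lineage as human-readable strings."""
    hints = []
    if not premium_tier:
        hints.append("Unity Catalog requires a Premium Databricks license.")
    if runtime_version < (9, 1):
        hints.append("Metadata extraction needs Databricks Runtime 9.1 LTS or higher.")
    if not files_in_volume:
        hints.append("File-level lineage is captured only for files mapped to a volume.")
    return hints
```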
Security considerations
- Token management – Rotate OAuth2 tokens regularly and scope privileges to the minimum required.
- IAM roles and managed identities – Use AWS IAM roles or Azure managed identities to avoid embedding static credentials.
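A simple age check can back the token-rotation policy. The 90-day default below is an assumption for illustration, not a Databricks or Cloudera requirement.

```python
# Sketch: flag personal access tokens older than a chosen maximum age.
from datetime import datetime, timedelta, timezone

def token_needs_rotation(created_at: datetime, max_age_days: int = 90) -> bool:
    """True if the token is older than max_age_days."""
    age = datetime.now(timezone.utc) - created_at
    return age > timedelta(days=max_age_days)
```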
Performance and scaling considerations
- Scaling Unity Catalog – Optimize clusters for high-volume metadata workloads, especially when handling large datasets.
- Scaling Cloudera Octopai – Allocate sufficient Cloudera Octopai processing resources to ingest and analyze extensive metadata sets without latency.
Key benefits of Cloudera Octopai with Databricks Unity Catalog
Cloudera Octopai delivers automated, cross-system lineage that augments Unity Catalog by filling gaps in file, streaming, and complex transformation coverage. Organizations gain complete visibility across files, tables, and streaming workloads to simplify governance for multi-cloud data ecosystems.
Databricks architecture overview
To ensure accurate data lineage tracking with Unity Catalog in Databricks, you must enable audit logs in the workspace. This involves querying the system tables to monitor activity associated with data access and operations.
To enable audit logs in Databricks, perform the following steps:
- Obtain the metastore ID.
  This ID is required to manage and access metadata within Unity Catalog.
- List all available schemas.
  Retrieve a list of schemas from the metastore to understand which schemas contain data that will be tracked.
- Enable schema access.
  Ensure that the appropriate permissions are set for schema access to allow metadata extraction.
- Create a SQL warehouse and run validation queries.
  Execute the following query to retrieve audit logs, filtering for activities initiated by Octopai:

  ```sql
  SELECT * FROM system.access.audit WHERE user_agent LIKE '%octopai%'
  ```
Managing permissions in Databricks using the API
To manage permissions and settings through API calls that enable access to Unity Catalog, perform the following steps:
- Set up the metastore.
  Make a request to the Databricks API to retrieve the metastore details:

  ```bash
  curl --location 'https://adb-xxx.azuredatabricks.net/api/2.0/unity-catalog/metastores' \
  --header 'Authorization: Bearer <token>'
  ```
- Enable schema access.
  Once the metastore is configured, enable schema access with the following API request:

  ```bash
  curl --location --request PUT 'https://adb-xxx.azuredatabricks.net/api/2.0/unity-catalog/metastores/<metastore-id>/systemschemas/access' \
  --header 'Authorization: Bearer <token>'
  ```
- Generate a service principal and secret.
  Create a service principal (or use an existing one) for access control, then generate a secret for authentication.
- Grant permissions to the service principal.
  Ensure that the service principal is assigned the necessary permissions for each catalog, ensuring seamless integration with Cloudera Octopai.
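The grant step can also be expressed as an API payload. Below is a hedged sketch of a request body for the Unity Catalog grants endpoint; the PATCH /api/2.1/unity-catalog/permissions/catalog/<name> path and the privilege names are assumptions to verify against your workspace's API version.

```python
# Hedged sketch: build the JSON body for adding catalog privileges to
# the Octopai service principal via the (assumed) Unity Catalog grants
# endpoint. Principal and privilege names below are illustrative.
import json

def grant_body(principal: str, privileges: list) -> str:
    """JSON request body adding the given privileges for one principal."""
    return json.dumps({
        "changes": [
            {"principal": principal, "add": privileges},
        ]
    })

# Illustrative usage: PATCH this body to
# <workspace>/api/2.1/unity-catalog/permissions/catalog/<catalog-name>
print(grant_body("octopai-sp", ["USE_CATALOG", "SELECT"]))
```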
Generate OAuth tokens for API access
To generate an OAuth token using client credentials for authenticating Unity Catalog API calls, run:

```bash
curl --location 'https://adb-xxx.azuredatabricks.net/oidc/v1/token' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'grant_type=client_credentials' \
--data-urlencode 'client_id=<client-id>' \
--data-urlencode 'client_secret=<client-secret>' \
--data-urlencode 'scope=all-apis'
```

Figure 3. OAuth token generation flow
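For automation, the same token request can be issued from Python with the standard library. This sketch mirrors the curl call above and introduces no new endpoints; the helper names are illustrative.

```python
# Sketch: request an OAuth token with client credentials, mirroring the
# curl example (same /oidc/v1/token path and form parameters).
import json
import urllib.parse
import urllib.request

def token_request(workspace_url: str, client_id: str, client_secret: str):
    """Build the form-encoded POST request for the token endpoint."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "all-apis",
    }).encode()
    return urllib.request.Request(
        workspace_url.rstrip("/") + "/oidc/v1/token",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )

def fetch_token(workspace_url: str, client_id: str, client_secret: str) -> str:
    """POST the request and return the access token from the JSON response."""
    with urllib.request.urlopen(token_request(workspace_url, client_id, client_secret)) as resp:
        return json.load(resp)["access_token"]
```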
Unity Catalog lineage features
- Supported features
  - Automatic notebook name retrieval – Unity Catalog captures notebook identifiers without additional scripts.
  - Multiple modes – Lineage works in shared and single-user modes.
  - Multi-language lineage – Unity Catalog can track lineage across different programming languages, such as SQL, Python, and Scala.
- Not supported features
  - File-level lineage – Files must be mapped to a volume for Unity Catalog to register lineage.
  - Non-Hive metastore tables – Tables outside the hive_metastore are not captured.
  - License requirement – Unity Catalog is available only to Premium tier Databricks customers.
  - DataFrame write limitations – Overwrite operations require Delta tables plus CREATE or MODIFY privileges on the target objects.
Delta Live Tables limitations
The following patterns are not supported:
- Lineage for streaming live tables
  The lineage between a streaming live table and the files it processes is not captured automatically. For example, the creation of a temporary table might not reflect lineage between the input files and the resulting table.
  In the following example, a Delta Live Table reads data from a streaming source, but the lineage of the streaming files does not automatically appear:

  ```python
  @dlt.table(
      name="raw_data_table_name",
      comment="Source data from SQL Server"
  )
  def load_sql_server_data():
      table_df = spark.readStream.format("cloudFiles") \
          .option("cloudFiles.format", "csv") \
          .load("/path/to/source")
      return table_df
  ```
Cloudera Octopai enhancements
Cloudera Octopai enhances lineage tracking for patterns that Unity Catalog does not natively support, particularly streaming live tables and file-level operations. The Cloudera Octopai engine can parse lineage data from complex patterns, offering more comprehensive data lineage coverage than Unity Catalog provides alone.
The following example shows that after refreshing streaming data processing, Cloudera Octopai can capture metadata and track lineage more thoroughly:

```sql
CREATE OR REFRESH STREAMING LIVE TABLE BrandName_ach_reject_bronze
LOCATION 'abfss://datalake/path'
AS SELECT *, _metadata.file_modification_time AS receipt_time
FROM cloud_files;
```
