Configure Databricks in Cloudera Octopai

Learn how to integrate Databricks with Cloudera Octopai Data Lineage based on your catalog type: Unity Catalog, Hive Metastore, or a hybrid (Unity Catalog and Hive Metastore) deployment.

Before configuring Databricks in Cloudera Octopai, review the prerequisites that apply to your catalog environment. The configuration requirements vary depending on whether you use Unity Catalog, Hive Metastore, or a hybrid deployment combining both.

Requirements for Unity Catalog

Unity Catalog environments require system table access and Databricks SQL connectivity. To extract lineage, Cloudera Octopai must authenticate with a service principal and query Unity Catalog lineage system tables.

Requirements for Hive Metastore

For environments using only Hive Metastore, ensure that the user or machine identity meets the following requirements:

- Permission to view and access the workspace folders containing the notebooks.
- Read access to the projects or directories selected for metadata extraction.
- Permission to view the relevant Hive Metastore objects referenced by the notebooks.

When working with Hive Metastore, ensure that a cluster is active and running during the extraction process. If the cluster is stopped or unavailable, Spark cannot execute the metadata queries, and metadata retrieval fails.

Requirements for Unity Catalog and Hive Metastore Hybrid Deployments

In hybrid environments, the following requirements must be met:

- Meet the Unity Catalog prerequisites, including SQL Warehouse access and permissions.
- Include the Hive Metastore assets in the extraction.

Cloudera Octopai combines both metadata sources to provide extended lineage coverage.

Perform the following steps to configure Databricks in Cloudera Octopai:

Create a service principal (required)
Unity Catalog lineage extraction requires a machine identity with access to governed metadata.
You must ensure the following:
- Create a Databricks-managed service principal.
- Enable Workspace access and Databricks SQL access.
Enable or create an SQL Warehouse (required)
Cloudera Octopai relies on querying Databricks system tables, which requires a running SQL Warehouse.
You must ensure the following:
- Create or enable a Databricks SQL Warehouse.
- Allow access to required system schemas.
Perform the following steps:
1. In Databricks, go to the SQL Warehouses tab.
2. If no SQL Warehouse exists, click Create SQL Warehouse and configure it as required.
3. Grant the service principal access to the warehouse by selecting the Can use permission.
4. Open the SQL Warehouse and go to Connection details.
5. Copy the HTTP path. You will need this path for the integration process.
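Before pasting the copied HTTP path into the integration, a quick sanity check can catch copy errors. The following is a minimal sketch, assuming the HTTP path has the form `/sql/1.0/warehouses/<warehouse-id>` (or the older `/sql/1.0/endpoints/<warehouse-id>`); confirm the exact shape against the value shown in Connection details. The function name `warehouse_id_from_http_path` is hypothetical and is not part of Cloudera Octopai or Databricks.

```python
import re

# Assumption: SQL Warehouse HTTP paths look like /sql/1.0/warehouses/<id>
# (older workspaces may show /sql/1.0/endpoints/<id>).
_HTTP_PATH_RE = re.compile(r"^/sql/1\.0/(?:warehouses|endpoints)/([0-9A-Za-z]+)$")


def warehouse_id_from_http_path(http_path: str) -> str:
    """Return the warehouse ID embedded in an HTTP path, or raise ValueError."""
    match = _HTTP_PATH_RE.match(http_path.strip())
    if match is None:
        raise ValueError(f"Unexpected SQL Warehouse HTTP path: {http_path!r}")
    return match.group(1)


if __name__ == "__main__":
    # Hypothetical example value, not a real warehouse.
    print(warehouse_id_from_http_path("/sql/1.0/warehouses/abc123def456"))
```

A `ValueError` here usually means the value was copied from a different field, such as the server hostname or a cluster HTTP path.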
Ensure Unity Catalog–Enabled Compute
- Unity Catalog must be enabled at the workspace/account level.
- A cluster that supports Unity Catalog access must be available.
Grant Unity Catalog Lineage Permissions (required)
The service principal must have SELECT access on the system lineage tables (system.access.table_lineage and system.access.column_lineage) and read access on relevant catalogs and schemas.
1. Open the Catalog in Databricks.
2. Search for:
  - Catalog: system
  - Schema: access
  - Tables: table_lineage and column_lineage
3. The tables are automatically created by Databricks.
  Note: You must have admin permissions to view and manage the tables.
4. For each table, perform the following steps:
  - Open the Permissions tab.
  - Click Grant.
  - Select the service principal created earlier.
  - Enable the SELECT permission.
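If you prefer to grant these permissions with SQL instead of the Catalog UI, the equivalent statements can be generated as sketched below. This is an illustration under assumptions, not the documented Cloudera Octopai procedure: it assumes standard Unity Catalog `GRANT` syntax, assumes the service principal is referenced by its application ID, and adds `USE CATALOG` and `USE SCHEMA` grants because Unity Catalog requires them on the parent objects before `SELECT` on a table takes effect. The function name `lineage_grant_statements` is hypothetical.

```python
from typing import List

# The two lineage system tables named in this section.
LINEAGE_TABLES = (
    "system.access.table_lineage",
    "system.access.column_lineage",
)


def lineage_grant_statements(principal: str) -> List[str]:
    """Build GRANT statements that give a principal read access to lineage tables.

    Assumption: `principal` is the service principal's application ID.
    """
    if "`" in principal:
        raise ValueError("Principal must not contain backticks")
    quoted = f"`{principal}`"
    statements = [
        # Assumed prerequisites for reading any table in system.access.
        f"GRANT USE CATALOG ON CATALOG system TO {quoted};",
        f"GRANT USE SCHEMA ON SCHEMA system.access TO {quoted};",
    ]
    statements += [
        f"GRANT SELECT ON TABLE {table} TO {quoted};" for table in LINEAGE_TABLES
    ]
    return statements


if __name__ == "__main__":
    # Placeholder application ID for illustration only.
    for statement in lineage_grant_statements("00000000-0000-0000-0000-000000000000"):
        print(statement)
```

Run the printed statements in a Databricks SQL editor as a user with admin permissions on the `system` catalog.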
Download the ODBC Driver
- Download and install the Simba ODBC Driver for Databricks from the official Databricks download page: https://www.databricks.com/spark/odbc-drivers-download
- Select the appropriate version for your operating system (Windows or Linux).
Collect the required workspace information:
- Workspace URL
- Workspace ID (required)
- Account ID (optional)
1. Find the Databricks Account ID
  The account ID is available in the Databricks Account Console.
  1. Open the account console: https://accounts.cloud.databricks.com/
  2. Log in using your organization's credentials (SSO may be required).
  3. In the top-right corner, select your username/email to open the dropdown menu.
  4. Databricks displays the Account ID as a UUID value, for example: 55eb1a01-48d5-4008-8dbd-03dd8447a595
  5. Copy this value.
2. Find the Databricks Workspace ID
  The workspace ID is embedded directly in your Databricks workspace URL.
  1. Open your Databricks workspace in the browser, for example:
```
https://adb-90442919623923.3.azuredatabricks.net/
```
    or
```
https://adb-90442919623923.3.azuredatabricks.net/?o=90442919623923
```
  2. Locate the parameter ?o= in the URL, for example:
    https://mycompany.cloud.databricks.com/?o=90442919623923 → Workspace ID = 90442919623923
  3. If you do not find the ?o= parameter, go to Sidebar > Data Science & Engineering.
    The URL will update to include the workspace ID:
```
https://mycompany.cloud.databricks.com/?o=90442919623923#workspace/
```
  This verifies the workspace ID value.
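The lookup described above can also be sketched in code. The example below reads the workspace ID from the `?o=` parameter and, as a fallback, from an Azure-style hostname of the form `adb-<workspace-id>.<n>.azuredatabricks.net`. The hostname fallback is an assumption inferred from the example URLs in this section, and the function name `workspace_id_from_url` is hypothetical.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: Azure workspace hostnames embed the workspace ID after "adb-".
_AZURE_HOST_RE = re.compile(r"^adb-(\d+)\.\d+\.azuredatabricks\.net$")


def workspace_id_from_url(url: str) -> Optional[str]:
    """Return the workspace ID found in a workspace URL, or None if absent."""
    parsed = urlparse(url)
    # Prefer the explicit ?o=<workspace-id> query parameter.
    values = parse_qs(parsed.query).get("o")
    if values and values[0].isdigit():
        return values[0]
    # Fall back to the Azure hostname pattern.
    match = _AZURE_HOST_RE.match(parsed.hostname or "")
    return match.group(1) if match else None


if __name__ == "__main__":
    print(workspace_id_from_url(
        "https://mycompany.cloud.databricks.com/?o=90442919623923#workspace/"
    ))
```

If the function returns `None`, navigate in the workspace as described in step 3 so that the URL includes the `?o=` parameter, then try again.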