Configure Databricks Metadata Source in Cloudera Octopai

Learn how to configure the Databricks Metadata Source in Cloudera Octopai using either user authentication with Personal Access Tokens or machine-to-machine authentication with service principals.

Cloudera Octopai Data Lineage supports two authentication methods for connecting to Databricks:

  • User authentication using a Personal Access Token
  • Machine-to-machine (M2M) authentication using a service principal

Option 1: User authentication token (Personal Access Token)

Figure 1. Databricks metadata source configuration for Hive Metastore using Personal Access Token

Configuration form showing Databricks metadata source settings for Hive Metastore only, including fields for connection name, server URL, Personal Access Token, and Cluster ID

Figure 2. Databricks metadata source configuration for Unity Catalog using Personal Access Token

Configuration form showing Databricks metadata source settings for Unity Catalog, including fields for connection name, server URL, Personal Access Token, HTTP path, workspace ID, and account ID

Configure the following settings when using the Personal Access Token authentication method:
  1. Unity Catalog Options
    • HMS only – when Databricks uses Hive Metastore without Unity Catalog.
    • Unity Catalog (can contain HMS) – when Databricks uses Unity Catalog. Hive Metastore can also be used but is not mandatory.
  2. Connection Name

    Assign a clear and meaningful name for the connection. This name will appear to users within the Cloudera Octopai platform.

  3. Databricks Server URL

    Enter the URL of your Databricks workspace.

    Example: https://abc-1234.5.azuredatabricks.net

  4. Token

    Enter the Personal Access Token generated under Settings > Developer > Access Tokens (Manage) in Databricks.

  5. HTTP Path (for Unity Catalog only)

    Paste the HTTP Path copied from the Databricks SQL Warehouse > Connection Details field.

    Example: /sql/1.0/warehouses/abc123xyz

  6. Workspace ID (for Unity Catalog only)
  7. Account ID (for Unity Catalog only, optional)
  8. Cluster ID (for Hive Metastore only, optional)

    The Cluster ID identifies the compute context where metadata queries run. Cloudera recommends that you provide a running cluster for full metadata extraction.

    If you do not supply the Cluster ID of a running cluster, Cloudera Octopai can still generate lineage, but it cannot retrieve table-level metadata stored in the Hive Metastore (such as table definitions and detailed schema information). As a result, Cloudera Octopai displays lineage with limited table metadata.

    To retrieve the Cluster ID:

    1. Navigate to Compute.
    2. Select the relevant cluster.
    3. Copy the Cluster ID from the URL in the browser address bar. It is the path segment that follows /clusters/.

    Example:

    https://abc-1234.5.azuredatabricks.net/compute/clusters/123-11568975-2zabcde?o=71234567896

    The Cluster ID in this example is 123-11568975-2zabcde.
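
The manual copy step above can also be scripted. The following is a minimal sketch, assuming the compute URL has the same shape as the example (the Cluster ID is the path segment right after "clusters"). The function name cluster_id_from_url is hypothetical and is not part of Cloudera Octopai or Databricks.

```python
from urllib.parse import urlparse


def cluster_id_from_url(url: str) -> str:
    """Return the path segment that follows 'clusters' in a compute URL.

    Assumption: the URL looks like the example in this topic,
    .../compute/clusters/<cluster-id>?o=<number>
    """
    # Split the path into non-empty segments; the query string (?o=...)
    # is parsed separately by urlparse and is not part of the path.
    segments = [s for s in urlparse(url).path.split("/") if s]
    try:
        return segments[segments.index("clusters") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"no cluster ID found in URL: {url!r}") from None


example = (
    "https://abc-1234.5.azuredatabricks.net"
    "/compute/clusters/123-11568975-2zabcde?o=71234567896"
)
print(cluster_id_from_url(example))  # prints 123-11568975-2zabcde
```

URLs in a different layout are rejected with a ValueError rather than guessed at, so a mistyped or truncated URL does not silently yield a wrong Cluster ID.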

Option 2: Machine-to-machine authentication (service principal)

Figure 3. Databricks metadata source configuration for Hive Metastore using service principal

Configuration form showing Databricks metadata source settings for Hive Metastore only using M2M authentication, including fields for connection name, server URL, client ID, client secret, and Cluster ID

Figure 4. Databricks metadata source configuration for Unity Catalog using service principal

Configuration form showing Databricks metadata source settings for Unity Catalog using M2M authentication, including fields for connection name, server URL, client ID, client secret, HTTP path, workspace ID, and account ID

Configure the following settings when using the service principal authentication method:
  1. Unity Catalog Options
    • HMS only – when Databricks uses Hive Metastore without Unity Catalog.
    • Unity Catalog (may include HMS) – when Databricks uses Unity Catalog. Hive Metastore can also be used but is not mandatory.
  2. Connection Name

    Assign a clear and meaningful name for the connection. This name will appear to users within the Cloudera Octopai platform.

  3. Databricks Server URL

    Enter the URL of your Databricks workspace.

    Example: https://abc-1234.5.azuredatabricks.net

  4. Client ID

    Enter the Client ID of the service principal created in Databricks.

  5. Client Secret

    Enter the secret generated for the service principal in Databricks.

  6. HTTP Path (for Unity Catalog only)

    Paste the HTTP Path copied from the Databricks SQL Warehouse > Connection Details field.

    Example: /sql/1.0/warehouses/abc123xyz

  7. Workspace ID (for Unity Catalog only)
  8. Account ID (for Unity Catalog only, optional)
  9. Cluster ID (for Hive Metastore only, optional)

    The Cluster ID identifies the compute context where metadata queries run. Cloudera recommends that you provide a running cluster for full metadata extraction.

    If you do not supply the Cluster ID of a running cluster, Cloudera Octopai can still generate lineage, but it cannot retrieve table-level metadata stored in the Hive Metastore (such as table definitions and detailed schema information). As a result, Cloudera Octopai displays lineage with limited table metadata.

    To retrieve the Cluster ID:

    1. Navigate to Compute.
    2. Select the relevant cluster.
    3. Copy the Cluster ID from the URL in the browser address bar. It is the path segment that follows /clusters/.

    Example:

    https://abc-1234.5.azuredatabricks.net/compute/clusters/123-11568975-2zabcde?o=71234567896

    The Cluster ID in this example is 123-11568975-2zabcde.
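
The settings for both authentication methods can be summarized as a checklist to run before filling in the form. The sketch below is illustrative only: the key names (connection_name, server_url, and so on) and the function missing_fields are hypothetical and do not correspond to a Cloudera Octopai API. It assumes that every setting not marked optional in this topic is required, so Account ID and Cluster ID are left out of the check.

```python
# Settings listed in this topic without an "optional" marker, per method.
REQUIRED = {
    "pat": ["connection_name", "server_url", "token"],
    "m2m": ["connection_name", "server_url", "client_id", "client_secret"],
}
# Additional settings listed as "for Unity Catalog only" and not optional.
UNITY_REQUIRED = ["http_path", "workspace_id"]


def missing_fields(settings: dict, auth: str, unity_catalog: bool) -> list:
    """Return the names of required settings that are absent or empty."""
    required = list(REQUIRED[auth])
    if unity_catalog:
        required += UNITY_REQUIRED
    return [name for name in required if not settings.get(name)]


# A service principal connection for Unity Catalog with two gaps.
draft = {
    "connection_name": "databricks-prod",
    "server_url": "https://abc-1234.5.azuredatabricks.net",
    "client_id": "placeholder-client-id",
    "http_path": "/sql/1.0/warehouses/abc123xyz",
}
print(missing_fields(draft, "m2m", unity_catalog=True))
# prints ['client_secret', 'workspace_id']
```

Keeping the per-method and per-catalog lists separate mirrors how the form changes: the authentication method decides the credential fields, and the Unity Catalog option decides whether HTTP Path and Workspace ID apply.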