Databricks - Supporting Lineage through Unity Catalog and Real-time Lineage for Specific Notebooks

This guide explains how Cloudera Octopai administrators set up metadata extraction from Databricks to build data lineage in Cloudera Octopai. Two options are available: enable lineage through Unity Catalog using the Cloudera Octopai Data Lineage Client extraction, or build lineage for specific notebooks only.

If you are enabling lineage through Unity Catalog using the Cloudera Octopai Client extraction, make sure your Databricks environment is using a cluster type that supports Unity Catalog. This is essential for extracting metadata using Unity Catalog.

For details, see Set up and manage Unity Catalog in the Databricks documentation.

In both cases of metadata extraction, ensure that permissions and configurations are correctly set to maintain accurate and comprehensive data lineage within Cloudera Octopai.

To set up the permissions, choose one of the following options and complete its steps. Then set up the Databricks metadata source as described in step 3, which applies to both options.

  1. Option 1: Supporting Lineage through Unity Catalog Using Cloudera Octopai Client Extraction
    1. Ensure you have the correct cluster type.

      Confirm that the cluster holding the metadata you want to extract supports Unity Catalog before you continue.

    2. Configure permissions in Databricks.

      Proper permissions are crucial for allowing Cloudera Octopai to access and extract metadata from your Databricks environment.

      Locate the workspace:

      • Open your Databricks workspace with admin privileges.
      • Locate the cluster that holds the metadata you want to extract.

      Manage permissions:

      • Navigate to the permissions settings in your Databricks workspace.
      • Add users or groups that require access to this metadata.
      • Open the permissions dialog and select Sharing permissions.

      Assign permissions:

      • Add individual users or groups to grant them notebook permissions.
      • Select Add user or Add group.
      • Choose the user or group from the dropdown list.
      • Assign the appropriate permission level, such as Can view, Can run, Can edit, Can manage, or Is owner.

      Consider adding a group rather than individual users and setting its permission to Can manage.

      Save and verify:

      • Save your changes and confirm that all permissions are correctly set.
      • Reopen the sharing permissions dialog to review the configured access.
  2. Option 2: Building Lineage for Specific Notebooks
    1. Identify the notebooks for lineage.

      Determine which specific notebooks within your Databricks environment should be included in the data lineage.

    2. Configure permissions for the selected notebooks.

      Access the notebook workspace:

      • Locate the notebook or notebooks you plan to include in the lineage.
      • Open the corresponding Databricks workspace with admin privileges.

      Manage permissions:

      • Navigate to the permissions settings.
      • Add the users or groups that require access.
      • Select Sharing permissions to open the permissions dialog.

      Assign permissions:

      • Add the required users or groups through the sharing dialog.
      • Choose the appropriate entity from the dropdown list.
      • Assign a permission level such as Can view, Can run, Can edit, or Is owner.

      Save and verify:

      • Save the permissions settings and double-check the configuration for accuracy.
  3. Set up the Databricks Metadata Source.
    1. Enter a meaningful name for the connection. This name is displayed to users on the Cloudera Octopai platform.
    2. Enter the URL of your Databricks server, for example, https://abc-1234.5.azuredatabricks.net.
    3. Provide the workspace token generated on the Databricks server. For instructions, see Databricks personal access token authentication.
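
Before entering the server URL and token in Cloudera Octopai, it can help to confirm that the two values work together. The following is a minimal Python sketch, not part of Cloudera Octopai. It assumes the token is a Databricks personal access token sent as a Bearer header, and it uses the Databricks Workspace API list endpoint as a read-only probe. The function names and the token value dapi-EXAMPLE are hypothetical. The script only builds the request; sending it is left as a commented line.

```python
import urllib.request
from urllib.parse import urlparse


def normalize_workspace_url(raw_url: str) -> str:
    """Reduce a Databricks server URL to https://host, rejecting non-HTTPS input."""
    parsed = urlparse(raw_url.strip())
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"expected an https:// server URL, got {raw_url!r}")
    return f"https://{parsed.netloc}"


def build_token_check_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that lists the workspace root folder."""
    base = normalize_workspace_url(workspace_url)
    token = token.strip()
    if not token:
        raise ValueError("token must not be empty")
    # Assumption: listing the workspace root is a cheap, read-only way to
    # check that the token is accepted by this server.
    url = f"{base}/api/2.0/workspace/list?path=/"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


if __name__ == "__main__":
    # Hypothetical values; replace with your own server URL and token.
    request = build_token_check_request(
        "https://abc-1234.5.azuredatabricks.net/", "dapi-EXAMPLE"
    )
    print(request.get_method(), request.full_url)
    # To send the request, uncomment the next line. An HTTP 200 response
    # suggests the URL and token are usable; 401 or 403 raises HTTPError.
    # urllib.request.urlopen(request, timeout=10)
```

If the request succeeds, the same URL and token should be suitable for steps 2 and 3 above. A 403 response may mean the token is valid but its owner lacks the permissions configured under Option 1 or Option 2.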