Accessing a Data Lake from CML

CML can access data tables stored in an AWS or Microsoft Azure Data Lake. As a CML Admin, follow this procedure to set up the necessary permissions.

The instructions apply to Data Lakes on both AWS and Microsoft Azure. Follow the instructions that apply to your environment.

  1. Cloud Provider Setup

    Make sure the prerequisites for AWS or Azure are satisfied (see the Related Topics for AWS environments and Azure environments). Then, create a CDP environment as follows.

    1. For environment logs, create an S3 bucket or ADLS Gen2 container.
    2. For environment storage, create an S3 bucket or ADLS Gen2 container.
    3. For AWS, create AWS policies for each S3 bucket, and create IAM roles (simple and instance profiles) for these policies.
    4. For Azure, create managed identities for each of the personas, and create roles to map the identities to the ADLS permissions.
    For detailed information on S3 or ADLS, see Related information.
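For the AWS case, the per-bucket policy mentioned in the steps above can be sketched as a small Python helper that builds an IAM policy document. This is only an illustration: the bucket name `example-cdp-logs` is a hypothetical placeholder, and the actions listed are a minimal read/write set chosen for the example, not the exact set that CDP requires. Refer to the AWS environment prerequisites for the authoritative policies.

```python
import json


def build_bucket_policy(bucket_name: str) -> dict:
    """Return an IAM policy document granting read/write access to one bucket.

    The actions below are an illustrative minimum, not the full CDP policy.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object reads and writes apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# Hypothetical bucket name, used only to show the generated document.
print(json.dumps(build_bucket_policy("example-cdp-logs"), indent=2))
```

The printed JSON could be pasted into the AWS console or passed to your provisioning tooling when creating the policy for each bucket.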
  2. Environment Setup
    In CDP, set up paths for logs and native data access to the S3 bucket or ADLS Gen2 container.

    In the Environment Creation wizard, set the following:

    1. Logs Storage and Audits
      1. Instance Profile - The IAM role or Azure identity that is attached to the master node of the Data Lake cluster. The Instance Profile enables unauthenticated access to the S3 bucket or ADLS container for logs.
      2. Logs Location Base - The location in S3 or ADLS where environment logs are saved.
      3. Ranger Audit Role - The IAM role or Azure identity that has S3 or ADLS access to write Ranger audit events. Because Ranger uses Hadoop authentication, it accesses the S3 bucket or ADLS container through IDBroker, rather than using instance profiles or Azure identities directly.
    2. Data Access
      1. Instance Profile - The IAM role or Azure identity that is attached to the IDBroker node of the Data Lake cluster. IDBroker uses this profile to assume roles on behalf of users and get temporary credentials to access S3 buckets or ADLS containers.
      2. Storage Location Base - The S3 or ADLS location where data pertaining to the environment is saved.
      3. Data Access Role - The IAM role or Azure identity that has access to read or write environment data. For example, Hive creates external tables by default in CDP environments. The table metadata is stored in the Hive Metastore (HMS) running in the Data Lake, while the data itself is stored in S3 or ADLS. Because Hive uses Hadoop authentication, it accesses S3 or ADLS through IDBroker, rather than using instance profiles or Azure identities directly. Hive uses the data access role for storage access.
      4. ID Broker Mappings - These map CDP users or groups to the AWS IAM roles or Azure roles that have appropriate S3 or ADLS access. This setting enables IDBroker to get appropriate S3 or ADLS credentials for each user based on the role mappings defined.
      This completes installation of the environment.
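The idea behind the ID Broker Mappings setting can be illustrated with a small lookup sketch in Python. It is a toy model, not IDBroker's implementation: the role ARNs and the user names are hypothetical placeholders, and the precedence rule (a direct user mapping wins, then the first mapped group in the order given) is an assumption made for this example.

```python
from typing import Dict, List, Optional

# Hypothetical mappings; every ARN and user name here is a placeholder.
USER_MAPPINGS: Dict[str, str] = {
    "alice": "arn:aws:iam::123456789012:role/example-alice-role",
}
GROUP_MAPPINGS: Dict[str, str] = {
    "ml-data-scientists": "arn:aws:iam::123456789012:role/example-ml-data-role",
}


def resolve_role(
    user: str,
    groups: List[str],
    user_mappings: Dict[str, str],
    group_mappings: Dict[str, str],
) -> Optional[str]:
    """Return the cloud role mapped to a user, or None if nothing matches.

    Assumed precedence for this sketch: user mapping first, then groups
    in the order they are listed.
    """
    if user in user_mappings:
        return user_mappings[user]
    for group in groups:
        if group in group_mappings:
            return group_mappings[group]
    return None


# A user with no direct mapping falls back to the group's role.
print(resolve_role("bob", ["ml-data-scientists"], USER_MAPPINGS, GROUP_MAPPINGS))
```

A user who has neither a direct mapping nor a mapped group resolves to `None` in this model, which corresponds to a user for whom no storage credentials can be obtained.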
  3. User Group Mappings
    In CDP, you can assign users to groups to simplify permissions management. For example, you could create a group called ml-data-scientists and assign two individual users to it. For instructions, see link.

    1. Sync users

      Whenever you make changes to user and group mappings, make sure to sync the mappings with the authentication layer. In User Management > Actions, click Sync Users, and select the environment.
  4. IDBroker
    IDBroker allows an authenticated and authorized user to exchange a set of credentials or a token for cloud vendor access tokens. To view or update the IDBroker mappings, go to Environments > Manage Access and click the IDBroker Mappings tab. Click Edit to edit or add mappings. When finished, sync the mappings to push the settings from CDP to the IDBroker instance running inside the Data Lake of the environment.

    At this point, CDP resources can access the AWS S3 buckets or Azure ADLS storage.
  5. Ranger
    To get admin access to Ranger, users need the EnvironmentAdmin role, and that role must be synced with the environment.
    1. Click Environments > Env > Actions > Manage Access > Add User.
    2. Select the EnvironmentAdmin resource role.
    3. Click Update Roles.
    4. On the Environments page for the environment, in Actions, select Synchronize Users to FreeIPA.
    The permissions are now synchronized to the Data Lake, and you have admin access to Ranger.
    Update permissions in Ranger
    1. In Environments > Env > Data Lake Cluster, click Ranger.
    2. Select the Hadoop SQL service, and check that the users and groups have sufficient permissions to access databases, tables, columns, and URLs.
    For example, a user can be part of these policies:
    • all - database,table,column
    • all - url
    This completes all configuration needed for CML to communicate with the Data Lake.
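The policy check described above can be modeled with a short Python sketch that reports which of the two example policies a user is not yet covered by. This is a toy model of the check an admin performs by eye in the Ranger UI, not a call into Ranger: the membership data is hypothetical, and users and groups are assumed to share one namespace for simplicity.

```python
from typing import Dict, List, Set

# The two policy names come from the example above.
REQUIRED_POLICIES: List[str] = ["all - database,table,column", "all - url"]


def missing_policies(
    user: str,
    groups: List[str],
    policy_members: Dict[str, Set[str]],
) -> List[str]:
    """Return the required policies that include neither the user nor
    any of the user's groups."""
    principals = {user, *groups}
    return [
        name
        for name in REQUIRED_POLICIES
        if not principals & policy_members.get(name, set())
    ]


# Hypothetical membership, as an admin might read it off the policy pages.
members = {
    "all - database,table,column": {"ml-data-scientists"},
    "all - url": {"alice"},
}
print(missing_policies("bob", ["ml-data-scientists"], members))
```

An empty list means the user is covered by both policies, either directly or through a group.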
  6. CML User Setup

    Two steps remain to get the user ready to work.

    1. In Environments > Environment name > Actions > Manage Access > Add user, the Admin selects the MLUser resource role for the user.
    2. The user logs in to the workspace: in ML Workspaces > Workspace name, the user clicks Launch Workspace.
    The user can now access the workspace.
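Once the user is in the workspace, a session typically needs to point Spark at the environment's storage location. The following Python sketch builds the relevant Spark setting from a storage base URI. Several things here are assumptions to verify for your deployment: the property name `spark.yarn.access.hadoopFileSystems` (newer Spark versions may expect a different name), the accepted URI schemes, and the example bucket path, which is a hypothetical placeholder.

```python
from typing import Dict
from urllib.parse import urlparse

# Schemes assumed for S3 (via the s3a connector) and ADLS Gen2.
SUPPORTED_SCHEMES = {"s3a", "abfs", "abfss"}


def spark_conf_for_storage(storage_base: str) -> Dict[str, str]:
    """Return Spark settings that name the Data Lake storage location.

    Raises ValueError if the URI scheme is not one this sketch expects.
    """
    scheme = urlparse(storage_base).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported storage scheme: {scheme!r}")
    # Assumed property name; check it against your Spark version.
    return {"spark.yarn.access.hadoopFileSystems": storage_base}


# Hypothetical storage location base.
conf = spark_conf_for_storage("s3a://example-bucket/warehouse")
print(conf)

# In a session where PySpark is available, each entry could be passed to
# SparkSession.builder.config(key, value) before calling getOrCreate().
```

Validating the scheme up front gives a clear error for a mistyped location, instead of a failure later when Spark first tries to read from storage.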