Setting up AWS Glue Catalog with CDP Data Catalog

You must map your Cloudera Data Catalog instance to the AWS Glue Catalog.

  1. Enable the entitlement for your Cloudera Data Catalog instance by running the following command on your Cloudera environment. For example:
    $ cdp coreadmin grant-entitlement \
        --entitlement-name DATA_CATALOG_ENABLE_AWS_GLUE \
        --account-id {account_id}
  2. Add the relevant permissions in the corresponding AWS account:
    1. Grant permission to access the AWS Glue Catalog service by editing the policy accordingly.

      Make a note of the Assumer Instance Profile role that you intend to use, and grant it full access to AWS Glue.

      Refer to the following images as guidance to complete the setup.
    2. Search for the role attached to the Instance Profile of the CDP environment. In your AWS environment creation command, use the Instance Profile that you configured above with the Glue-related policy.
      Use the following examples to set up the AWS environment and the AWS data lake as part of the Glue setup:
      cdp environments create-aws-environment --profile default --cli-input-json '
      {"environmentName":"ab-ds-cli-7321",
       "credentialName":"cd2d-1234",
       "region":"us-region-2",
       "securityAccess":{---insert the value---},
       "authentication":{---insert the value---},
       "logStorage":{"storageLocationBase":"s3a://demo-e2e-test-state-bucket/ab-ds-cli-7321/logs","instanceProfile":"arn:aws:iam::<xxxxxxxxxxx>:instance-profile/<role-name>"},
       "vpcId":"vpc-0123456",
       "subnetIds":["subnet-04fe923b902aa5cf2","subnet-099c7a631f0ebed3c"],
       "s3GuardTableName":"dc-pro-cli-7210",
       "description":"ab-ds-cli-7321",
       "enableTunnel":false,
       "workloadAnalytics":false,
       "freeIpa":{"instanceCountByGroup":1}
       }'
      
      cdp environments set-id-broker-mappings \
      --environment-name "ab-ds-cli-7321" \
      --profile default \
      --set-empty-mappings \
      --data-access-role arn:aws:iam::<xxxxxxxxxxxx>:role/add-role \
      --ranger-audit-role arn:aws:iam::<xxxxxxxxxxxx>:role/add-role
      

      Similarly, when you set up the data lake, use the Instance Profile that you configured above with the Glue-related policy in your data lake creation command:

      cdp datalake create-aws-datalake --profile default --runtime 7.2.12 --cli-input-json '
      {"datalakeName":"ab-ds-cli-7321-sdx",
       "environmentName":"ab-ds-cli-7321",
       "cloudProviderConfiguration":{"instanceProfile":"arn:aws:iam::<xxxxxxxxxxx>:instance-profile/<role-name>","storageBucketLocation":"s3a://demo-e2e-test-state-bucket/ab-ds-cli-7321"},
       "scale":"LIGHT_DUTY"
       }'
      
    3. Navigate to the attached policy for the role.
    4. When you manually create tables in AWS Glue Data Catalog, you must set the fully qualified path for the table location.
      For example: s3://my-aws-server-node-1/something/something.amazonaws.com/dc-pro-721-storage/glue/
  3. Set up the AWS Glue Data Catalog. For more information, see Populating the Glue Data Catalog. Select only the CSV format, which is the only format currently supported for Cloudera Data Catalog, and specify the delimiter that is used in the data.
  4. When you create tables in the AWS Glue Data Catalog manually, set the fully qualified path for the table location. For example: s3://my-aws-server-node-1/something/something/dc-pro-721-storage/glue/
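
The location rule in the last step can be checked before a table is created. The following is a minimal shell sketch, not part of the Cloudera or AWS tooling: the function name `is_fully_qualified_s3_location` is hypothetical, and it assumes that "fully qualified" means an `s3://` URI that names both a bucket and a key prefix.

```shell
# Hypothetical helper (not a cdp or aws command).
# Returns 0 when the argument looks like s3://<bucket>/<prefix>, 1 otherwise.
# Assumption: a fully qualified location needs the s3:// scheme, a bucket
# name, and at least one character of key prefix after the bucket.
is_fully_qualified_s3_location() {
  case "$1" in
    s3://[!/]*/?*) return 0 ;;   # scheme, bucket, "/", non-empty prefix
    *)             return 1 ;;   # relative path, missing scheme, bucket only
  esac
}
```

For instance, `is_fully_qualified_s3_location "dc-pro-721-storage/glue/"` returns a non-zero status because the scheme and bucket are missing, while the example location from the step above passes.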
AWS Glue metadata must be registered with Cloudera Data Catalog.
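
The environment and data lake commands above pass a JSON document through `--cli-input-json`. Typographic quotation marks or a trailing comma copied from a formatted page make that document invalid. The following is a minimal sketch of a syntax check to run before calling the CLI. The function name `validate_cli_json` is hypothetical, the sketch assumes `python3` is available on the PATH, and it checks JSON syntax only, not whether the keys are ones the CDP CLI accepts.

```shell
# Hypothetical helper (not a cdp command).
# Returns 0 when the argument parses as JSON, non-zero otherwise.
# Assumption: python3 is installed; its json module does the parsing.
validate_cli_json() {
  printf '%s' "$1" \
    | python3 -c 'import json, sys; json.load(sys.stdin)' 2>/dev/null
}
```

A possible use is to store the payload in a variable, run `validate_cli_json "$payload"`, and call `cdp datalake create-aws-datalake --cli-input-json "$payload"` only when the check succeeds.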