Setting up AWS Glue Catalog with CDP Data Catalog
You must map your Data Catalog instance to AWS Glue Catalog.
-
Enable the entitlement for your Data Catalog instance by running the following
command on your CDP environment. For example:
$ cdp coreadmin grant-entitlement --entitlement-name DATA_CATALOG_ENABLE_AWS_GLUE --account-id {account_id}
-
You must add relevant permissions in the corresponding AWS account:
-
Include permission to access Glue Catalog service by editing the policy
accordingly.
Make a note of the Assumer Instance Profile role that you intend to use and grant it full access to AWS Glue.
Refer to the following images as guidance to complete the setup.
-
Search for the role attached to the Instance Profile of the CDP
environment. In your AWS environment creation command, use the Instance
Profile that you configured above with the Glue-related policy.
Use the following examples to set up the AWS environment and AWS data lake as part of the Glue setup:
cdp environments create-aws-environment --profile default --cli-input-json '
{"environmentName":"ab-ds-cli-7321",
 "credentialName":"cd2d-1234",
 "Region":"us-region-2",
 "securityAccess":{---insert the value---},
 "Authentication":{---insert the value---},
 "logStorage":{"storageLocationBase":"s3a://demo-e2e-test-state-bucket/ab-ds-cli-7321/logs","instanceProfile":"arn:aws:iam::<xxxxxxxxxxx>:instance-profile/<role-name>"},
 "vpcId":"vpc-0123456",
 "subnetIds":["subnet-04fe923b902aa5cf2","subnet-099c7a631f0ebed3c"],
 "s3GuardTableName":"dc-pro-cli-7210",
 "Description":"ab-ds-cli-7321",
 "enableTunnel":false,
 "workloadAnalytics":false,
 "freeIpa":{"instanceCountByGroup":1}
}'

cdp environments set-id-broker-mappings \
  --environment-name "ab-ds-cli-7321" \
  --profile default \
  --set-empty-mappings \
  --data-access-role arn:aws:iam::<xxxxxxxxxxxx>:role/add-role \
  --ranger-audit-role arn:aws:iam::<xxxxxxxxxxxx>:role/add-role
Similarly, when you set up the data lake, use the Instance Profile that you configured above with the Glue-related policy in your data lake creation command:
cdp datalake create-aws-datalake --profile default --runtime 7.2.12 --cli-input-json '
{"datalakeName":"ab-ds-cli-7321-sdx",
 "environmentName":"ab-ds-cli-7321",
 "cloudProviderConfiguration":{"instanceProfile":"arn:aws:iam::<xxxxxxxxxxx>:instance-profile/<role-name>","storageBucketLocation":"s3a://demo-e2e-test-state-bucket/ab-ds-cli-7321"},
 "scale":"LIGHT_DUTY"
}'
For more information, see Creating an AWS environment with a medium duty data lake using the CLI.
-
Navigate to the attached policy for the role.
-
When you manually create tables in AWS Glue Data Catalog, you must set
the fully qualified path for the table location.
For example:
s3://my-aws-server-node-1/something/something.amazonaws.com/dc-pro-721-storage/glue/
-
Include permission to access Glue Catalog service by editing the policy
accordingly.
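The policy edit described in the steps above could be scripted with the AWS CLI along the following lines. This is a minimal sketch, not the documented procedure: the role name `my-cdp-assumer-role`, the policy name `cdp-glue-full-access`, and the file name `glue-access-policy.json` are hypothetical placeholders, and `glue:*` on all resources is only an assumed reading of "full access" for AWS Glue. The script writes the policy file and attaches it to the role only when `APPLY=1` is set.

```shell
#!/bin/sh
# Hypothetical names; replace with the role attached to your Instance Profile.
ROLE_NAME="${ROLE_NAME:-my-cdp-assumer-role}"
POLICY_NAME="${POLICY_NAME:-cdp-glue-full-access}"
POLICY_FILE="${POLICY_FILE:-glue-access-policy.json}"

# Inline IAM policy allowing all Glue actions (assumed meaning of "full access").
cat > "$POLICY_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "glue:*",
      "Resource": "*"
    }
  ]
}
EOF

if [ "${APPLY:-0}" = "1" ]; then
  # Attach the policy inline to the role used by the Instance Profile.
  aws iam put-role-policy \
    --role-name "$ROLE_NAME" \
    --policy-name "$POLICY_NAME" \
    --policy-document "file://$POLICY_FILE"
else
  echo "Dry run: wrote $POLICY_FILE; set APPLY=1 to attach it to $ROLE_NAME"
fi
```

Running the script without `APPLY=1` lets you review the generated policy file before anything changes in the AWS account.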
-
You must set up the AWS Glue Data Catalog. For more information, see Populating the Glue Data Catalog. Select
the CSV format, which is the only format currently supported for CDP Data
Catalog, and specify the delimiter that is used in the data.
-
While creating tables in AWS Glue Data Catalog manually, set the fully
qualified path for location. For example:
s3://my-aws-server-node-1/something/something/dc-pro-721-storage/glue/
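As an illustration of a manually created table, the sketch below builds a table definition for `aws glue create-table` that uses the CSV format, an explicit delimiter, and a fully qualified location. The database name `my_glue_db`, the table name `my_csv_table`, the columns, and the file name are hypothetical; the location reuses the example path above; and the Hive text format and SerDe classes are assumed to suit comma-delimited data. Nothing is sent to AWS unless `APPLY=1` is set.

```shell
#!/bin/sh
# Hypothetical names; adjust to your own database and file layout.
DATABASE_NAME="${DATABASE_NAME:-my_glue_db}"
TABLE_INPUT_FILE="${TABLE_INPUT_FILE:-glue-table-input.json}"
TABLE_LOCATION="s3://my-aws-server-node-1/something/something/dc-pro-721-storage/glue/"

# Reject a location that is not a fully qualified s3:// path.
case "$TABLE_LOCATION" in
  s3://?*/*) ;;
  *) echo "Location must be a fully qualified s3:// path" >&2; exit 1 ;;
esac

# Table definition: CSV classification, comma delimiter, full S3 location.
cat > "$TABLE_INPUT_FILE" <<EOF
{
  "Name": "my_csv_table",
  "TableType": "EXTERNAL_TABLE",
  "Parameters": { "classification": "csv" },
  "StorageDescriptor": {
    "Columns": [
      { "Name": "id", "Type": "int" },
      { "Name": "name", "Type": "string" }
    ],
    "Location": "$TABLE_LOCATION",
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "SerdeInfo": {
      "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
      "Parameters": { "field.delim": "," }
    }
  }
}
EOF

if [ "${APPLY:-0}" = "1" ]; then
  aws glue create-table \
    --database-name "$DATABASE_NAME" \
    --table-input "file://$TABLE_INPUT_FILE"
else
  echo "Dry run: wrote $TABLE_INPUT_FILE; set APPLY=1 to create the table"
fi
```

If the data uses a different delimiter, change the `field.delim` value to match it.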