Integrating Data Catalog with AWS Glue Data Catalog

Integrating CDP Data Catalog with AWS Glue Data Catalog enables users to browse and discover data, and to register data into SDX (through metadata translation or copy) so that it can be used with Data Hubs and other relevant experiences.

When you use AWS Glue in Data Catalog, you get a complete snapshot view of the metadata, along with other visible attributes that can power your data governance capabilities.

How integration works

Assuming that SDX is running in the user’s AWS account (the same AWS account that contains the Glue Data Catalog and the data to be discovered), the credentials must be shared with the ExternalDataDiscoveryService (which is hosted in SDX), so that the two entities can interact with each other. These credentials are used to launch SDX and other workload clusters in the user’s AWS account.

Prerequisites:
  • You must have full access to AWS Glue Data Catalog and access to the EMR cluster’s Hive Metastore instance.
  • You must set up CDP.
  • You must have access to your AWS IT Admin and CDP Admin user credentials, which are required to enable access to AWS/EMR managed data in CDP.
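
Before starting the integration, it can help to confirm that the AWS credentials you plan to use can at least read the Glue Data Catalog. The following is a minimal sketch, not part of CDP: the helper names `list_glue_databases` and `check_with_boto3` are hypothetical, and the sketch assumes the third-party `boto3` package is installed and configured with those credentials. It only shows that databases can be listed; it does not prove full access or access to the Hive Metastore.

```python
def list_glue_databases(glue_client):
    """Return the names of all Glue databases visible to the client.

    `glue_client` is assumed to behave like boto3.client("glue"):
    get_databases() returns a dict with "DatabaseList" and, when more
    pages remain, a "NextToken" to pass to the next call.
    """
    names = []
    kwargs = {}
    while True:
        page = glue_client.get_databases(**kwargs)
        names.extend(db["Name"] for db in page.get("DatabaseList", []))
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


def check_with_boto3():
    """Hypothetical entry point: list databases using the default AWS configuration."""
    import boto3  # third-party; imported lazily so the helper above has no dependencies

    client = boto3.client("glue")  # region and credentials come from the AWS config
    return list_glue_databases(client)
```

Calling `check_with_boto3()` in an environment configured with the intended credentials should return the database names those credentials can see. An access-denied error from AWS at this point suggests the Glue permissions need to be reviewed before continuing.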