The Cloud Storage Connectors

When you deploy CDP clusters on cloud Infrastructure-as-a-Service (IaaS), you can take advantage of native integration with each provider's object storage service: Amazon Simple Storage Service (S3) in AWS, Azure Data Lake Storage (ADLS) Gen2 in Azure, and Google Cloud Storage (GCS) in Google Cloud. This integration is provided by the cloud storage connectors included with CDP. Their primary function is to help you connect to, access, and work with data in the cloud storage services.

The cloud storage connectors let you access and work with data stored in Amazon S3, ADLS Gen2, and GCS. Use cases include, but are not limited to, the following:

Collect data for analysis and then load it into Hadoop ecosystem applications such as Hive or Spark directly from cloud storage services.
Persist data to cloud storage services for use outside of CDP clusters.
Copy data stored in cloud storage services to HDFS for analysis and then copy back to the cloud when done.
Share data between multiple CDP clusters, and between CDP clusters and external non-CDP systems.
Back up CDP clusters using distcp.
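
As an illustration of the copy and backup use cases, the following minimal sketch assembles a `hadoop distcp` invocation that copies a directory from HDFS to cloud storage. The helper name `build_distcp_command`, the HDFS path, and the bucket name are hypothetical and are not taken from the CDP documentation; the sketch only builds the command line and does not run it.

```python
import shlex
from typing import List


def build_distcp_command(source: str, target: str) -> List[str]:
    """Return the argument list for a basic `hadoop distcp <source> <target>` call.

    Hypothetical helper for illustration; it does not execute anything.
    """
    return ["hadoop", "distcp", source, target]


if __name__ == "__main__":
    # Both locations below are made-up examples.
    command = build_distcp_command(
        "hdfs:///warehouse/sales",
        "s3a://example-bucket/backups/sales",
    )
    # Print a shell-safe rendering of the command for review before running it.
    print(shlex.join(command))
```

Swapping the source and target arguments gives the reverse direction, copying data from cloud storage back into HDFS.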

The cloud object store connectors are implemented as individual Hadoop modules. The libraries and their dependencies are automatically placed on the classpath.

Cloud Storage Service	Connector Description	URL Prefix
Amazon S3	The S3A connector enables reading and writing files stored in the Amazon S3 object store.	`s3a://`
ADLS Gen2	The ABFS connector enables reading and writing files stored in the ADLS Gen2 object store.	`abfs://`
GCS	The GCS connector supports reading and writing data in GCS.	`gs://`
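
To make the role of the URL prefix concrete, here is a small sketch that inspects a storage URI and reports which connector would handle it. It assumes the schemes `s3a`, `abfs`, and `gs` (the scheme the Hadoop GCS connector uses); the function name `describe_location` and the example bucket and account names are hypothetical.

```python
from urllib.parse import urlparse

# URI scheme -> connector, following the connector table
# (assumption: the GCS connector is addressed with the `gs` scheme).
CONNECTORS = {
    "s3a": "S3A connector (Amazon S3)",
    "abfs": "ABFS connector (ADLS Gen2)",
    "gs": "GCS connector (Google Cloud Storage)",
}


def describe_location(uri: str) -> dict:
    """Split a cloud storage URI into connector, authority, and path."""
    parsed = urlparse(uri)
    scheme = parsed.scheme.lower()
    if scheme not in CONNECTORS:
        raise ValueError(f"No cloud storage connector known for scheme {scheme!r}")
    return {
        "connector": CONNECTORS[scheme],
        # Bucket name for S3 and GCS; container@account host for ADLS Gen2.
        "authority": parsed.netloc,
        "path": parsed.path,
    }


if __name__ == "__main__":
    # All bucket, container, and account names here are made up.
    for uri in (
        "s3a://example-bucket/logs/part-0000.parquet",
        "abfs://data@exampleaccount.dfs.core.windows.net/raw/events.csv",
        "gs://example-bucket/exports/table.orc",
    ):
        print(describe_location(uri))
```

A lookup like this can be useful in job-submission scripts that need to validate a path before handing it to Hive, Spark, or distcp.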