The Cloud Storage Connectors
When deploying CDP clusters on cloud Infrastructure-as-a-Service (IaaS), you can take advantage of native integration with each provider's object storage service: Amazon Simple Storage Service (S3) in AWS, Azure Data Lake Storage (ADLS) Gen2 in Azure, and Google Cloud Storage (GCS) in Google Cloud. This integration is provided by the cloud storage connectors included with CDP. Their primary function is to help you connect to, access, and work with data in the cloud storage services.
The cloud storage connectors let you access and work with data stored in Amazon S3, ADLS Gen2, and GCS. Use cases include, but are not limited to, the following:
- Collect data for analysis and then load it into Hadoop ecosystem applications such as Hive or Spark directly from cloud storage services.
- Persist data to cloud storage services for use outside of CDP clusters.
- Copy data stored in cloud storage services to HDFS for analysis and then copy back to the cloud when done.
- Share data between multiple CDP clusters, and between various external non-CDP systems.
- Back up CDP clusters using distcp.
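As a rough illustration of the copy and backup use cases above, the following Python sketch assembles `hadoop distcp` command lines without running them. The NameNode address, bucket name, and paths are hypothetical placeholders, and `build_distcp_command` is a helper introduced here for illustration only.

```python
import shlex


def build_distcp_command(source, destination):
    """Return the argv list for a basic `hadoop distcp <source> <destination>` call."""
    return ["hadoop", "distcp", source, destination]


# Hypothetical locations: an HDFS directory and an S3 bucket reached through S3A.
hdfs_dir = "hdfs://namenode.example.com:8020/warehouse/sales"
s3a_dir = "s3a://example-backup-bucket/warehouse/sales"

backup_cmd = build_distcp_command(hdfs_dir, s3a_dir)   # cluster to cloud
restore_cmd = build_distcp_command(s3a_dir, hdfs_dir)  # cloud to cluster

# Print the shell-quoted commands for review instead of executing them.
print(shlex.join(backup_cmd))
print(shlex.join(restore_cmd))
```

To actually run a copy, the argv list could be passed to `subprocess.run` on a host where the Hadoop client and the relevant connector credentials are configured.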
The cloud object store connectors are implemented as individual Hadoop modules. The libraries and their dependencies are automatically placed on the classpath.
Amazon S3 is an object storage service. The S3A connector implements the Hadoop filesystem interface using the AWS SDK for Java to access the web service, and provides Hadoop applications with a filesystem view of the buckets. Applications can manipulate data stored in Amazon S3 buckets with a URL starting with the s3a:// prefix.
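To make the "filesystem view of the buckets" concrete, here is a minimal Python sketch that splits an S3A URL into its bucket and object key. The bucket name and path are hypothetical, and `split_s3a_url` is an illustrative helper, not part of the connector.

```python
from urllib.parse import urlsplit


def split_s3a_url(url):
    """Split an s3a:// URL into (bucket, key); raise ValueError for other schemes."""
    parts = urlsplit(url)
    if parts.scheme != "s3a":
        raise ValueError(f"not an S3A URL: {url}")
    # The authority component names the bucket; the path, minus its leading
    # slash, is the object key inside that bucket.
    return parts.netloc, parts.path.lstrip("/")


# Hypothetical bucket and object path.
bucket, key = split_s3a_url("s3a://example-bucket/logs/2024/events.csv")
print(bucket)
print(key)
```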
ADLS Gen2 is an object storage service that combines the features of Azure Blob Storage and ADLS Gen1. It is designed for large-scale big data applications, can be treated as a hierarchical file system, and has a security model that matches that of HDFS. Applications can manipulate data stored in ADLS Gen2 with URLs starting with the abfs:// prefix.
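The sketch below, in Python, shows how an ABFS URL might be broken into its parts. It assumes the commonly documented form `abfs://<container>@<account>.dfs.core.windows.net/<path>`, which this section does not spell out. The container, account, and path are hypothetical, and `split_abfs_url` is an illustrative helper.

```python
from urllib.parse import urlsplit


def split_abfs_url(url):
    """Return (container, account_host, path) for an abfs:// URL."""
    parts = urlsplit(url)
    if parts.scheme != "abfs":
        raise ValueError(f"not an ABFS URL: {url}")
    # Assumed layout: the text before "@" is the container (filesystem) name,
    # and the text after it is the storage account endpoint.
    container, separator, account_host = parts.netloc.partition("@")
    if not separator:
        raise ValueError(f"missing '<container>@' in: {url}")
    return container, account_host, parts.path


# Hypothetical container, storage account, and directory.
container, account_host, path = split_abfs_url(
    "abfs://data@exampleaccount.dfs.core.windows.net/raw/events"
)
print(container, account_host, path)
```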
GCS is an object storage service for unstructured data, which can be accessed through URLs beginning with the gs:// prefix.
Cloud Storage Service | Connector Description | URL Prefix |
---|---|---|
Amazon S3 | The S3A connector enables reading and writing files stored in the Amazon S3 object store. | s3a:// |
ADLS Gen2 | The ABFS connector enables reading and writing files stored in the ADLS Gen2 object store. | abfs:// |
GCS | The GCS connector supports reading and writing data in GCS. | gs:// |
Amazon S3 cannot be used as a replacement for HDFS as the cluster filesystem in CDP. It can be used as a source and destination of work.