Chapter 2. The Cloud Storage Connectors
When deploying HDP clusters on cloud IaaS, you can take advantage of the native integration with the object storage services available on each cloud platform: Amazon S3 on AWS, ADLS and WASB on Azure, and GCS on Google Cloud. This integration is provided by cloud storage connectors included with HDP. Their primary function is to help you connect to, access, and work with data in the cloud storage services.
The cloud connectors allow you to access and work with data stored in Amazon S3, Azure ADLS and Azure WASB storage services, and Google Cloud Storage, including but not limited to the following use cases:
- Collect data for analysis and then load it into Hadoop ecosystem applications such as Hive or Spark directly from cloud storage services.
- Persist data to cloud storage services for use outside of HDP clusters.
- Copy data stored in cloud storage services to HDFS for analysis and then copy it back to the cloud when done.
- Share data between multiple HDP clusters, and between HDP clusters and external non-HDP systems.
- Back up Hadoop clusters using distcp, as in the example after this list.
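For instance, a backup of an HDFS directory to S3 could be run with distcp; the source path and bucket name below are hypothetical:

hadoop distcp -update /apps/hive/warehouse s3a://backup-bucket/hive-warehouse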
The cloud object store connectors are implemented as individual Hadoop modules. The libraries and their dependencies are automatically placed on the classpath.
Figure 2.1. HDP Cloud Storage Connector Architecture
Amazon S3 is an object store. The S3A connector implements the Hadoop filesystem interface using the AWS Java SDK to access the web service, and provides Hadoop applications with a filesystem view of the buckets. Applications can manipulate data stored in Amazon S3 buckets with a URL starting with the s3a:// prefix.
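For example, assuming a bucket named my-bucket exists and S3 credentials are configured, standard FS shell commands work against it:

hadoop fs -ls s3a://my-bucket/datasets/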
Azure WASB is an object store with a flat namespace. The WASB connector implements the Hadoop filesystem interface using the WASB Java SDK to access the web service, and provides Hadoop applications with a filesystem view of the data. Applications can manipulate data stored in WASB with a URL starting with the wasb:// prefix.
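A WASB URL also names the blob container and storage account. For example, with a hypothetical container datacontainer in a storage account named mystorage:

hadoop fs -mkdir -p wasb://datacontainer@mystorage.blob.core.windows.net/rawdata/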
Azure ADLS is a WebHDFS-compatible hierarchical file system, so applications can access data in ADLS directly through the WebHDFS REST API. Similarly, the ADLS connector implements the Hadoop filesystem interface using the ADLS Java SDK to access the web service. Applications can manipulate data stored in ADLS with a URL starting with the adl:// prefix.
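For example, with a hypothetical ADLS account named myadls, a local file can be staged into the store:

hadoop fs -put localfile.csv adl://myadls.azuredatalakestore.net/staging/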
Google Cloud Storage (GCS) is an object store. The GCS connector implements the Hadoop filesystem interface using the GCS Java SDK to access the web service, and provides Hadoop applications with a filesystem view of the buckets. Applications can manipulate data stored in Google Cloud Storage buckets with a URL starting with the gs:// prefix.
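For example, assuming a bucket named my-gcs-bucket and a configured GCS connector, an object can be read directly:

hadoop fs -cat gs://my-gcs-bucket/logs/part-00000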
Table 2.1. Cloud Storage Connectors
| Cloud Storage Service | Connector Description | URL Prefix |
|---|---|---|
| Amazon Simple Storage Service (S3) | The S3A connector enables reading and writing files stored in the Amazon S3 object store. | s3a:// |
| Azure Data Lake Store (ADLS) | The ADLS connector enables reading and writing files stored in the ADLS file system. | adl:// |
| Windows Azure Storage Blob (WASB) | The WASB connector enables reading and writing both block blobs and page blobs from and to the WASB object store. | wasb:// |
| Google Cloud Storage (GCS) | The GCS connector supports access to Google Cloud Storage. | gs:// |
The cluster's default filesystem, HDFS, is defined by the configuration property fs.defaultFS in core-site.xml. As a result, when running FS shell commands or DistCp against HDFS, you can, but do not need to, specify the hdfs:// URL prefix:
hadoop distcp hdfs://source-folder s3a://destination-bucket
hadoop distcp /source-folder s3a://destination-bucket
When working with cloud data using cloud URIs, do not change the value of fs.defaultFS to use a cloud storage connector as the cluster's default filesystem; this is neither recommended nor supported. Instead, when working with data stored in S3, ADLS, WASB, or GCS, use a fully qualified URL for that connector, as in the example below.
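With fs.defaultFS left pointing at HDFS, relative paths still resolve to HDFS, while cloud data is addressed explicitly; the bucket name here is hypothetical:

hadoop fs -ls /user/hadoop
hadoop fs -ls s3a://my-bucket/user/hadoop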