The Cloud Storage Connectors
When deploying CDP clusters on cloud IaaS, you can take advantage of the native integration with Amazon S3, the object storage service available on AWS. This integration is provided by the cloud storage connectors included with CDP. Their primary function is to help you connect to, access, and work with data in the cloud storage services.
The cloud connectors allow you to access and work with data stored in Amazon S3, including but not limited to the following use cases:
- Collect data for analysis and then load it into Hadoop ecosystem applications such as Hive or Spark directly from cloud storage services.
- Persist data to cloud storage services for use outside of CDP clusters.
- Copy data stored in cloud storage services to HDFS for analysis and then copy back to the cloud when done.
- Share data between multiple CDP clusters – and between various external non-CDP systems.
- Back up CDP clusters using distcp.
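As an illustration of the copy and backup use cases above, the following is a minimal sketch that assembles a `hadoop distcp` command line for copying an HDFS directory to Amazon S3 through the `s3a://` URI scheme. The helper name `build_distcp_command`, the HDFS path, and the bucket name are hypothetical and only serve the example; the sketch builds and prints the command rather than running it.

```python
import shlex


def build_distcp_command(source, destination, update=False):
    """Return the argument list for a `hadoop distcp` copy.

    Hypothetical helper for illustration; it only assembles the command.
    """
    command = ["hadoop", "distcp"]
    if update:
        # -update copies only files that are missing or differ at the destination.
        command.append("-update")
    command.extend([source, destination])
    return command


# Assumed locations: replace with your own HDFS path and S3 bucket.
backup = build_distcp_command(
    "hdfs:///warehouse/sales",
    "s3a://example-backup-bucket/warehouse/sales",
    update=True,
)

# Print the command so it can be reviewed before it is run on a cluster host.
print(shlex.join(backup))
```

On a cluster host where the `hadoop` command is available and S3 credentials are configured, the same list could be passed to `subprocess.run(backup, check=True)`. Swapping the source and destination arguments copies data from S3 back into HDFS.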
The cloud object store connectors are implemented as individual Hadoop modules. The libraries and their dependencies are automatically placed on the classpath.
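If you want to confirm that a connector library is on the classpath, one possible approach is to inspect the output of `hadoop classpath --glob`, which prints the classpath with wildcards expanded. The sketch below filters such a classpath string for JAR files of the S3A connector module, `hadoop-aws`. The function name `find_connector_jars` and the sample paths are hypothetical, and actual JAR locations and version suffixes depend on your installation.

```python
import os


def find_connector_jars(classpath, module="hadoop-aws"):
    """Return classpath entries whose file name starts with the module name.

    Hypothetical helper for illustration; `classpath` is expected to be a
    string of entries separated by the platform path separator.
    """
    entries = classpath.split(os.pathsep)
    return [
        entry
        for entry in entries
        if os.path.basename(entry).startswith(module)
    ]


# Assumed sample classpath; on a cluster host, capture the real value from
# `hadoop classpath --glob` instead.
sample_classpath = os.pathsep.join(
    [
        "/opt/example/hadoop/hadoop-common-3.1.1.jar",
        "/opt/example/hadoop/hadoop-aws-3.1.1.jar",
        "/opt/example/hadoop/lib/guava-28.1-jre.jar",
    ]
)

for jar in find_connector_jars(sample_classpath):
    print(jar)
```

An empty result for the real classpath would suggest that the connector is not visible to the `hadoop` command on that host.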