Working with Amazon S3

The Amazon S3 object store is the standard mechanism to store, retrieve, and share large quantities of data in AWS.

Cloudera recommends using unique S3 bucket names across all endpoints to avoid conflicts with other services in CDP.

The features of Amazon S3 include:

- Object store model for storing, listing, and retrieving data.
- Support for objects up to 5 terabytes, with many petabytes of data allowed in a single "bucket".
- Data is stored in buckets, and each bucket is hosted in a specific AWS region.
- Access to buckets can be restricted to specific users or IAM roles.
- Data stored in an Amazon S3 bucket is billed based on the size of the data, how long it is stored, and the operations that access it. In addition, data transfers are billed as follows:
  - Data transfers between an Amazon S3 bucket and a cluster running in the same region are free of download charges (except in the special case of buckets in which data is served on a user-pays basis).
  - Data downloaded from an Amazon S3 bucket to a location outside the region in which the bucket is hosted is billed per megabyte.
  - Data downloaded from an Amazon S3 bucket to any host over the internet is also billed per megabyte.
- Data stored in Amazon S3 can be backed up with Amazon Glacier.
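
The transfer-billing rules in the list above can be summarized as a small decision function. The following is a minimal sketch in Python, not an AWS or Cloudera API: the function name `download_is_billed` and its parameters are hypothetical, and it only reports whether a download incurs transfer charges, not how much, because actual rates are not covered here.

```python
from typing import Optional


def download_is_billed(
    bucket_region: str,
    client_region: Optional[str],
    requester_pays: bool = False,
) -> bool:
    """Return True if a download incurs transfer charges under the rules above.

    Hypothetical helper for illustration only.
    client_region is None when the host is reached over the internet
    (that is, it is not running in any AWS region).
    """
    # Special case: buckets served on a user-pays basis always bill the downloader.
    if requester_pays:
        return True
    # Downloads to any host over the internet are billed.
    if client_region is None:
        return True
    # Same-region transfers are free of download charges; cross-region ones are billed.
    return client_region != bucket_region


# A cluster in the same region as the bucket: no download charge.
print(download_is_billed("us-east-1", "us-east-1"))
# A cluster in a different region: billed.
print(download_is_billed("us-east-1", "eu-west-1"))
```

In practice, this is the reason to create clusters in the same region as the buckets they read from.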

The Hadoop client to S3, called "S3A", makes the contents of a bucket appear like a filesystem, with directories, files in the directories, and operations on directories and files. As a result, applications that can work with data stored in HDFS can also work with data stored in S3. However, because S3 is an object store and not a true filesystem, it has certain limitations that you should be aware of.
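
To illustrate what "appearing like a filesystem" means, the sketch below models a bucket as a flat collection of object keys and derives a directory listing from key prefixes. This is a simplified Python model of the general idea, not the S3A implementation: the function `list_directory` and the sample keys are hypothetical.

```python
def list_directory(keys, path):
    """Emulate a directory listing over a flat set of object keys.

    Hypothetical helper for illustration only.
    Returns (subdirectories, files) found directly under `path`.
    """
    prefix = path.strip("/")
    prefix = prefix + "/" if prefix else ""
    dirs, files = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # the key is a marker for the directory itself
        head, sep, _ = rest.partition("/")
        if sep:
            dirs.add(head)   # deeper keys imply a subdirectory
        else:
            files.add(head)  # no further "/" means a file at this level
    return sorted(dirs), sorted(files)


# Sample object keys; a bucket has no real directories, only keys.
keys = [
    "logs/2024/01/app.log",
    "logs/2024/02/app.log",
    "logs/README.txt",
    "data/part-0000.csv",
]

print(list_directory(keys, "logs"))  # (['2024'], ['README.txt'])
print(list_directory(keys, ""))      # (['data', 'logs'], [])
```

Because the "directories" here are inferred from keys and are not stored entities, operations that are cheap on a real filesystem, such as renaming a directory, can require touching every object under the prefix. This is one source of the limitations mentioned above.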