Cloud Data Access

Chapter 3. Working with Amazon S3

The following table provides an overview of tasks related to configuring and using HDP with S3. Click on the linked topics to get more information about specific tasks.


If you are looking for data sets to experiment with, you can use the Landsat 8 data sets that AWS makes available in a public Amazon S3 bucket called "landsat-pds". For more information, refer to Landsat on AWS.

The Amazon S3 object store is the standard mechanism for storing, retrieving, and sharing large quantities of data in AWS.

The features of Amazon S3 include:

  • Object store model for storing, listing, and retrieving data.

  • Support for objects up to 5 terabytes, with many petabytes of data allowed in a single bucket.

  • Data is stored in buckets, and each bucket is hosted in a particular AWS region.

  • Access to buckets can be restricted to specific users or IAM roles.

  • Data stored in an Amazon S3 bucket is billed based on the size of data and based on how long it is stored. In addition, you are billed when you transfer data between regions:

    • Data transfers between an Amazon S3 bucket and a cluster running in the same region are free of download charges (except in the special case of buckets in which data is served on a user-pays basis).

    • Data downloaded from an Amazon S3 bucket to a location outside the region in which the bucket is hosted is billed per megabyte.

  • Data stored in Amazon S3 can be backed up with Amazon Glacier.

The Hadoop client for S3, called "S3A", makes the contents of a bucket appear like a filesystem, with directories, files in those directories, and operations on both. As a result, applications that can work with data stored in HDFS can also work with data stored in S3. However, because S3 is an object store rather than a filesystem, it has certain limitations that you should be aware of.
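
To make the "appear like a filesystem" idea concrete, the following is a minimal sketch of how a one-level directory listing can be derived from a flat set of object keys by treating "/" as a delimiter. It is an illustration only, not how S3A itself is implemented: the function name `list_directory` and the sample keys are hypothetical, and no connection to S3 is made.

```python
from typing import Iterable, List, Tuple


def list_directory(keys: Iterable[str], path: str) -> Tuple[List[str], List[str]]:
    """Emulate a one-level directory listing over a flat set of object keys.

    Returns (subdirectories, files) found directly under `path`.
    """
    # Turn the path into a key prefix: "" for the root, otherwise "a/b/".
    prefix = path.strip("/")
    if prefix:
        prefix += "/"

    dirs, files = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # the key equals the prefix itself; nothing to list
        head, sep, _ = rest.partition("/")
        if sep:
            dirs.add(head)   # a deeper key exists, so `head` looks like a directory
        else:
            files.add(head)  # no further delimiter, so `head` looks like a file
    return sorted(dirs), sorted(files)


# Hypothetical object keys in a hypothetical bucket.
keys = [
    "logs/2017/01/part-0000.csv",
    "logs/2017/02/part-0000.csv",
    "logs/README.txt",
    "scene_list.gz",
]

print(list_directory(keys, "/"))
print(list_directory(keys, "logs"))
```

In this sketch a "directory" exists only because some key happens to share its prefix; nothing is stored for the directory itself. That is one way to picture why directory operations on an object store can behave differently from the same operations on HDFS.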