Chapter 3. Getting Started with Amazon S3
The following table provides an overview of tasks related to configuring and using HDP with S3. Click on the linked topics to get more information about specific tasks.
> **Note**
>
> If you are looking for data sets to experiment with, you can use the Landsat 8 data sets made available by AWS in a public Amazon S3 bucket called "landsat-pds". For more information, refer to Landsat on AWS.
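Because the "landsat-pds" bucket is public, it can be browsed without account credentials. A sketch of one way to do this, assuming the S3A connector is available in your HDP installation:

```
# List the public Landsat bucket anonymously (no AWS credentials needed);
# assumes the S3A connector is on the classpath in your HDP version.
hadoop fs \
  -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
  -ls s3a://landsat-pds/
```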
| Task | Description |
|---|---|
| Meet the prerequisites | To use S3 storage, you must meet several prerequisites. |
| Configure authentication | In order for Hadoop applications to access data stored in your private S3 buckets, you must configure authentication with your Amazon S3 account. |
| Configure optional features | You can optionally configure additional features such as bucket-specific settings. |
| Work with S3 data | Once you have configured authentication with your S3 bucket(s), you can access S3 data from Hive (via external tables) and from Spark, and perform related tasks such as copying data between HDFS and S3 when needed. |
| | You can optionally work with S3 data that is protected with server-side encryption: SSE-S3, SSE-KMS, or SSE-C. |
| | You can optionally configure and fine-tune performance-related features to optimize HDP performance for specific tasks, including accessing S3 data from Hive and Spark and copying data with DistCp. |
| Troubleshoot | Refer to this section if you experience issues while configuring or using S3 with HDP. |
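One common way to configure authentication is to set the S3A access and secret key properties in `core-site.xml`. A minimal sketch — the placeholder values are not real credentials, and your cluster may instead use a Hadoop credential provider to avoid storing keys in plain text:

```xml
<!-- core-site.xml: minimal S3A authentication sketch.
     Replace the placeholder values with your own AWS credentials. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```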
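Bucket-specific settings follow the `fs.s3a.bucket.<bucketname>.*` property pattern. For example, to point a single bucket (the name "mybucket" is hypothetical) at a different S3 endpoint while all other buckets keep the global defaults:

```xml
<!-- Per-bucket override: only the hypothetical bucket "mybucket"
     uses the Frankfurt endpoint; other buckets keep global settings. -->
<property>
  <name>fs.s3a.bucket.mybucket.endpoint</name>
  <value>s3.eu-central-1.amazonaws.com</value>
</property>
```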
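To illustrate accessing S3 data from Hive via an external table, here is a sketch with a hypothetical bucket, path, and schema; it assumes authentication for the bucket is already configured:

```sql
-- Hypothetical external table over delimited data in S3.
CREATE EXTERNAL TABLE access_logs (
  event_time STRING,
  user_id    STRING,
  url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3a://mybucket/logs/';
```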
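Copying data between HDFS and S3 is typically done with DistCp. A sketch with hypothetical paths, assuming S3A authentication is already configured cluster-wide:

```
# Copy a directory from HDFS into S3 (namenode address, bucket,
# and paths are all hypothetical placeholders).
hadoop distcp hdfs://namenode:8020/apps/data s3a://mybucket/backup/data
```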
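When working with encrypted data, the S3A connector is told which server-side encryption scheme to use through configuration properties. A sketch for SSE-KMS — the key ARN below is a placeholder, and the exact property names may vary with your Hadoop version:

```xml
<!-- SSE-KMS sketch: encryption algorithm plus the KMS key to use. -->
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>SSE-KMS</value>
</property>
<property>
  <name>fs.s3a.server-side-encryption.key</name>
  <value>arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID</value>
</property>
```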
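Performance-related S3A features are likewise plain configuration properties. A sketch of a few commonly tuned knobs — the values shown are illustrative starting points, not recommendations, and appropriate settings depend on your workload:

```xml
<!-- Illustrative S3A tuning knobs; tune per workload. -->
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>200</value> <!-- max simultaneous connections to S3 -->
</property>
<property>
  <name>fs.s3a.threads.max</name>
  <value>20</value> <!-- threads for parallel uploads -->
</property>
<property>
  <name>fs.s3a.multipart.size</name>
  <value>104857600</value> <!-- 100 MB parts for multipart uploads -->
</property>
```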