Accessing Cloud Data

Hortonworks Data Platform

Copyright © 2012-2019 Cloudera, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

2019-12-17

Abstract

The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing, and analyzing large volumes of data. It is designed to handle data from many sources and in many formats quickly, easily, and cost-effectively. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects, including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper, and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process, and installation and configuration tools are also included.

Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training, and partner-enablement services. All of our technology is, and will remain, free and open source. Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to Contact Us directly to discuss your specific needs.

Contents

1. About This Guide

2. The Cloud Storage Connectors

3. Working with Amazon S3

Limitations of Amazon S3

Configuring Access to S3

Using Instance Metadata to Authenticate
Using Configuration Properties to Authenticate
Using Environment Variables to Authenticate
Embedding Credentials in the URL to Authenticate (Deprecated)

Defining Authentication Providers

Using Temporary Session Credentials
Using Anonymous Login
Protecting S3 Credentials with Credential Providers

IAM Role Permissions for Working with S3

Referencing S3 Data in Applications

Configuring Per-Bucket Settings

Configuring Per-Bucket Settings to Access Data Around the World

Using S3Guard for Consistent S3 Metadata

Introduction to S3Guard
Configuring S3Guard
Monitoring and Maintaining S3Guard
Disabling S3Guard and Destroying a S3Guard Database
Pruning Old Data from S3Guard Tables
Importing a Bucket into S3Guard
Verifying that S3Guard is Enabled on a Bucket
Using the S3Guard CLI
S3Guard: Operational Issues
S3Guard: Known Issues

Safely Writing to S3 Through the S3A Committers

Introducing the S3A Committers
Enabling the Directory Committer in Hadoop
Configuring Directories for Intermediate Data
Using the Directory Committer in MapReduce
Enabling the Directory Committer in Spark
Verifying That an S3A Committer Was Used
Cleaning up After Failed Jobs
Using the S3Guard Command to List and Delete Uploads
Advanced Committer Configuration
Securing the S3A Committers
The S3A Committers and Third-Party Object Stores
Limitations of the S3A Committers
Troubleshooting the S3A Committers

Security Model and Operations on S3

S3A and Checksums (Advanced Feature)

A List of S3A Configuration Properties

Encrypting Data on S3

SSE-S3: Amazon S3-Managed Encryption Keys
SSE-KMS: Amazon S3-KMS Managed Encryption Keys
SSE-C: Server-Side Encryption with Customer-Provided Encryption Keys
Configuring Encryption for Specific Buckets
Mandating Encryption for an S3 Bucket
Performance Impact of Encryption

Improving Performance for S3A

Working with Local S3 Buckets
Configuring and Tuning S3A Block Upload
Optimizing S3A read performance for different file types
Improving Load-Balancing Behavior for S3
S3 Performance Checklist

Working with Third-party S3-compatible Object Stores

Troubleshooting S3

Authentication Failures
Classpath Related Errors
Connectivity Problems
Errors During Delete or Rename of Files
Errors Related to Visible S3 Inconsistency
Troubleshooting Encryption

4. Working with ADLS

Configuring Access to ADLS

Configure Access by Using Client Credential
Configure Access by Using Token-Based Authentication

Protecting the Azure Credentials for ADLS with Credential Providers

Referencing ADLS in URLs

Configuring User and Group Representation

ADLS Proxy Setup

5. Working with WASB

Configuring Access to WASB
Protecting the Azure Credentials for WASB with Credential Providers
Protecting the Azure Credentials for WASB within an Encrypted File
Referencing WASB in URLs
Configuring Page Blob Support
Configuring Atomic Folder Rename
Configuring Support for Append API
Configuring Multithread Support
Configuring WASB Secure Mode
Configuring Authorization Support in WASB

6. Working with Google Cloud Storage

Configuring Access to Google Cloud Storage

Create a GCP Service Account
Create a Custom Role
Modify GCS Bucket Permissions
Configure Access to GCS from Your Cluster

Additional Configuration Options for GCS

7. Accessing Cloud Data in Hive

Hive and S3: The Need for S3Guard
Exposing Cloud Data as Hive Tables
Populating Partition-Related Information
Analyzing Tables
Improving Hive Performance with Cloud Object Stores

8. Accessing Cloud Data in Spark

Using S3 as a Safe and Fast Destination of Work

Improving Spark Performance with Cloud Storage

Improving ORC and Parquet Read Performance
Accelerating S3 Read Performance
Accelerating Azure Read Performance
Putting it All Together: spark-defaults.conf

9. Copying Cloud Data with Hadoop

Running FS Shell Commands

Commands That May Be Slower with Cloud Object Storage
Unsupported Filesystem Operations
Deleting Files on Cloud Object Stores
Overwriting Objects on Amazon S3
Timestamps on Cloud Object Stores

Copying Data with DistCp

Using DistCp with S3
Using DistCp with Azure ADLS and WASB
DistCp and Proxy Settings
Improving DistCp Performance

List of Figures

2.1. HDP Cloud Storage Connector Architecture

List of Tables

2.1. Cloud Storage Connectors
3.1. Authentication Options for Different Deployment Scenarios
3.2. S3Guard Configuration Parameters
3.3. S3A Fast Upload Configuration Options
3.4. S3A Upload Tuning Options
6.1. Overview of Configuring Access to Google Cloud Storage
6.2. Optional Properties Related to Google Cloud Storage
7.1. Improving General Performance
7.2. Accelerating ORC Reads in Hive
7.3. Accelerating ETL Jobs