Cloud Data Access

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

2017-10-30

Abstract

The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to handle data from many sources and in many formats quickly, easily and cost-effectively. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects, including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process, and installation and configuration tools are also included.

Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain, free and open source. Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to Contact Us directly to discuss your specific needs.


Contents

1. About This Guide
2. Introducing the Cloud Storage Connectors
3. Getting Started with Amazon S3
    About Amazon S3
    Limitations of Amazon S3
    Configuring Authentication with S3
    Using Instance Metadata to Authenticate
    Using Configuration Properties to Authenticate
    Using Environment Variables to Authenticate
    Embedding Credentials in the URL to Authenticate
    Defining Authentication Providers
    Referencing S3 in the URLs
    Configuring Per-Bucket Settings
    Configuring Per-Bucket Settings to Access Data Around the World
    A List of S3A Configuration Properties
    Encrypting Data on S3
    SSE-S3: Amazon S3-Managed Encryption Keys
    SSE-KMS: Amazon S3-KMS Managed Encryption Keys
    SSE-C: Server-Side Encryption with Customer-Provided Encryption Keys
    Configuring Encryption for Specific Buckets
    Mandating Encryption for an S3 Bucket
    Performance Impact of Encryption
    Improving Performance for S3
    Improving DistCp Performance with S3
    Improving Container Allocation Performance for S3
    Optimizing HTTP Get Requests for S3
    Improving Load-Balancing Behavior for S3
    Troubleshooting S3
    Authentication Failures
    Classpath Related Errors
    Connectivity Problems
    Errors During Delete or Rename of Files
    Errors Related to Visible S3A Inconsistency
    Troubleshooting S3-SSE
4. Getting Started with ADLS
    Configuring Authentication with ADLS
    Using Client Credential
    Using Token-Based Authentication
    Protecting the Azure Credentials for ADLS with Credential Providers
    Referencing ADLS in the URLs
    Configuring User and Group Representation
5. Getting Started with WASB
    Configuring Authentication with WASB
    Protecting the Azure Credentials for WASB with Credential Providers
    Referencing WASB in the URLs
    Configuring Page Blob Support
    Configuring Atomic Folder Rename
    Configuring Support for Append API
    Configuring Multithread Support
    Configuring WASB Secure Mode
    Configuring Authorization Support in WASB
6. Accessing Cloud Data in Hive
    Exposing Cloud Data as Hive Tables
    Populating Partition-Related Information
    Analyzing Tables
    Improving Hive Performance with S3/ADLS/WASB
7. Accessing Cloud Data in Spark
    Committing Output to S3
    Improving Spark Performance with S3/ADLS/WASB
    Accelerating ORC and Parquet Reads
    Accelerating Sequential Reads Through Files in S3
8. Copying Cloud Data with Hadoop
    Copying Data with DistCp
    Improving Performance for DistCp
    Local Space Requirements for Copying to S3
    Limitations When Using DistCp with S3
    Running FS Shell Commands
    Commands That May Be Slower with S3
    Operations Unsupported for S3
    Deleting Objects on S3
    Overwriting Objects on S3
    Timestamps on S3
    Security Model and Operations on S3