Accessing Cloud Data
Copyright © 2012-2018 Hortonworks, Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
2018-11-30
Abstract
The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to handle data from many sources and in many formats quickly, easily and cost-effectively. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects, including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process, and installation and configuration tools are also included.
Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain, free and open source. Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to Contact Us directly to discuss your specific needs.
Contents
- 1. About This Guide
- 2. The Cloud Storage Connectors
- 3. Working with Amazon S3
  - Limitations of Amazon S3
  - Configuring Access to S3
  - Defining Authentication Providers
  - IAM Role Permissions for Working with S3
  - Referencing S3 Data in Applications
  - Configuring Per-Bucket Settings
  - Using S3Guard for Consistent S3 Metadata
  - Introduction to S3Guard
  - Configuring S3Guard
  - Monitoring and Maintaining S3Guard
  - Disabling S3Guard and Destroying a S3Guard Database
  - Pruning Old Data from S3Guard Tables
  - Importing a Bucket into S3Guard
  - Verifying that S3Guard is Enabled on a Bucket
  - Using the S3Guard CLI
  - S3Guard: Operational Issues
  - S3Guard: Known Issues
  - Safely Writing to S3 Through the S3A Committers
  - Introducing the S3A Committers
  - Enabling the Directory Committer in Hadoop
  - Configuring Directories for Intermediate Data
  - Using the Directory Committer in MapReduce
  - Enabling the Directory Committer in Spark
  - Verifying That an S3A Committer Was Used
  - Cleaning up After Failed Jobs
  - Using the S3Guard Command to List and Delete Uploads
  - Advanced Committer Configuration
  - Securing the S3A Committers
  - The S3A Committers and Third-Party Object Stores
  - Limitations of the S3A Committers
  - Troubleshooting the S3A Committers
  - Security Model and Operations on S3
  - S3A and Checksums (Advanced Feature)
  - A List of S3A Configuration Properties
  - Encrypting Data on S3
  - Improving Performance for S3A
  - Working with Third-party S3-compatible Object Stores
  - Troubleshooting S3
- 4. Working with ADLS
- 5. Working with WASB
  - Configuring Access to WASB
  - Protecting the Azure Credentials for WASB with Credential Providers
  - Protecting the Azure Credentials for WASB within an Encrypted File
  - Referencing WASB in URLs
  - Configuring Page Blob Support
  - Configuring Atomic Folder Rename
  - Configuring Support for Append API
  - Configuring Multithread Support
  - Configuring WASB Secure Mode
  - Configuring Authorization Support in WASB
- 6. Working with Google Cloud Storage
- 7. Accessing Cloud Data in Hive
- 8. Accessing Cloud Data in Spark
- 9. Copying Cloud Data with Hadoop
List of Figures
List of Tables
- 2.1. Cloud Storage Connectors
- 3.1. Authentication Options for Different Deployment Scenarios
- 3.2. S3Guard Configuration Parameters
- 3.3. S3A Fast Upload Configuration Options
- 3.4. S3A Upload Tuning Options
- 6.1. Overview of Configuring Access to Google Cloud Storage
- 6.2. Optional Properties Related to Google Cloud Storage
- 7.1. Improving General Performance
- 7.2. Accelerating ORC Reads in Hive
- 7.3. Accelerating ETL Jobs