Accessing Cloud Data
Cloud storage connectors overview
The Cloud Storage Connectors
Working with Amazon S3
Limitations of Amazon S3
Configuring Access to S3
Configuring Access to S3 on CDP Public Cloud
Configuring Access to S3 on CDP Private Cloud Base
Using Configuration Properties to Authenticate
Using Per-Bucket Credentials to Authenticate
Using Environment Variables to Authenticate
Using EC2 Instance Metadata to Authenticate
Referencing S3 Data in Applications
Configuring Per-Bucket Settings
Customizing Per-Bucket Secrets Held in Credential Files
Configuring Per-Bucket Settings to Access Data Around the World
Encrypting Data on S3
SSE-S3: Amazon S3-Managed Encryption Keys
Enabling SSE-S3
SSE-KMS: Amazon S3-KMS Managed Encryption Keys
Enabling SSE-KMS
IAM Role permissions for working with SSE-KMS
SSE-C: Server-Side Encryption with Customer-Provided Encryption Keys
Enabling SSE-C
CSE-KMS: Amazon S3-KMS Managed Encryption Keys
Enabling CSE-KMS
Configuring Encryption for Specific Buckets
Encrypting an S3 Bucket with Amazon S3 Default Encryption
Performance Impact of Encryption
Safely Writing to S3 Through the S3A Committers
Introducing the S3A Committers
Configuring Directories for Intermediate Data
Using the Directory Committer in MapReduce
Verifying That an S3A Committer Was Used
Cleaning up after failed jobs
Using the S3Guard Command to List and Delete Uploads
Advanced Committer Configuration
Enabling Speculative Execution
Using Unique Filenames to Avoid File Update Inconsistency
Speeding up Job Commits by Increasing the Number of Threads
Securing the S3A Committers
The S3A Committers and Third-Party Object Stores
Limitations of the S3A Committers
Troubleshooting the S3A Committers
Security Model and Operations on S3
S3A and Checksums (Advanced Feature)
A List of S3A Configuration Properties
Working with versioned S3 buckets
Working with Third-party S3-compatible Object Stores
Improving Performance for S3A
Working with S3 buckets in the same AWS region
Configuring and tuning S3A block upload
Tuning S3A Uploads
Thread Tuning for S3A Data Upload
Optimizing S3A read performance for different file types
S3 Performance Checklist
Troubleshooting S3
Working with Google Cloud Storage
Configuring Access to Google Cloud Storage
Create a GCP Service Account
Create a Custom Role
Modify GCS Bucket Permissions
Configure Access to GCS from Your Cluster
Manifest committer for ABFS and GCS
Using the manifest committer
Spark Dynamic Partition overwriting
Job summaries in _SUCCESS files
Job cleanup
Working with Google Cloud Storage
Advanced topics
Additional Configuration Options for GCS
Working with the ABFS Connector
Introduction to Azure Storage and the ABFS Connector
Feature Comparisons
Setting up and configuring the ABFS connector
Configuring the ABFS Connector
Authenticating with ADLS Gen2
Configuring Access to Azure on CDP Public Cloud
Configuring Access to Azure on CDP Private Cloud Base
ADLS Proxy Setup
Manifest committer for ABFS and GCS
Using the manifest committer
Spark Dynamic Partition overwriting
Job summaries in _SUCCESS files
Job cleanup
Working with Azure ADLS Gen2 storage
Advanced topics
Performance and Scalability
Hierarchical namespaces vs. non-namespaces
Flush options
Using ABFS from the CLI
Hadoop File System commands
Create a table in Hive
Accessing Azure Storage account container from spark-shell
Copying data with Hadoop DistCp
DistCp and Proxy Settings
ADLS Trash Folder Behavior
Troubleshooting ABFS