Managing Data Storage
Optimizing data storage
Balancing data across disks of a DataNode
Plan the data movement across disks
Parameters to configure the Disk Balancer
Run the Disk Balancer plan
Disk Balancer commands
Erasure coding overview
Understanding erasure coding policies
Comparing replication and erasure coding
Best practices for rack and node setup for EC
Prerequisites for enabling erasure coding
Limitations of erasure coding
Using erasure coding for existing data
Using erasure coding for new data
Advanced erasure coding configuration
Erasure coding CLI command
Erasure coding examples
Increasing storage capacity with HDFS compression
Enable GZipCodec as the default compression codec
Use GZipCodec with a one-time job
Setting HDFS quotas
Set quotas using Cloudera Manager
Configuring heterogeneous storage in HDFS
HDFS storage types
HDFS storage policies
Commands for configuring storage policies
Set up a storage policy for HDFS
Set up SSD storage using Cloudera Manager
Configure archival storage
The HDFS mover command
Balancing data across an HDFS cluster
Why HDFS data becomes unbalanced
Configurations and CLI options for the HDFS Balancer
Properties for configuring the Balancer
Balancer commands
Recommended configurations for the Balancer
Configuring and running the HDFS Balancer using Cloudera Manager
Configuring the Balancer threshold
Configuring concurrent moves
Recommended configurations for the Balancer
Running the Balancer
Configuring block size
Cluster balancing algorithm
Storage group classification
Storage group pairing
Block move scheduling
Block move execution
Exit statuses for the HDFS Balancer
HDFS
Optimizing performance
Improving performance with centralized cache management
Benefits of centralized cache management in HDFS
Use cases for centralized cache management
Centralized cache management architecture
Caching terminology
Properties for configuring centralized caching
Commands for using cache pools and directives
Customizing HDFS
Customize the HDFS home directory
Properties to set the size of the NameNode edits directory
Optimizing NameNode disk space with Hadoop archives
Overview of Hadoop archives
Hadoop archive components
Create a Hadoop archive
List files in Hadoop archives
Format for using Hadoop archives with MapReduce
Detecting slow DataNodes
Enable detection of slow DataNodes
Allocating DataNode memory as storage
HDFS storage types
LAZY_PERSIST memory storage policy
Configure DataNode memory as storage
Improving performance with short-circuit local reads
Prerequisites for configuring short-circuit local reads
Properties for configuring short-circuit local reads on HDFS
Configuring proxy users to access HDFS
Using DistCp to copy files
Using DistCp
DistCp syntax and examples
Using DistCp with Highly Available remote clusters
Using DistCp with Amazon S3
Using a credential provider to secure S3 credentials
Examples of DistCp commands using the S3 protocol and hidden credentials
Kerberos setup guidelines for DistCp between secure clusters
DistCp between secure clusters in different Kerberos realms
Configure source and destination realms in krb5.conf
Configure HDFS RPC protection
Configure acceptable Kerberos principal patterns
Specify truststore properties
Set HADOOP_CONF to the destination cluster
Launch distcp
Copying data between a secure and an insecure cluster using DistCp and WebHDFS
Post-migration verification
Using DistCp between HA clusters using Cloudera Manager
Using the NFS Gateway for accessing HDFS
Configure the NFS Gateway
Start and stop the NFS Gateway services
Verify validity of the NFS services
Access HDFS from the NFS Gateway
How NFS Gateway authenticates and maps users
APIs for accessing HDFS
Set up WebHDFS on a secure cluster
Using HttpFS to provide access to HDFS
Add the HttpFS role
Using a load balancer with HttpFS
HttpFS authentication
Use curl to access a URL protected by Kerberos HTTP SPNEGO
Data storage metrics
Using JMX for accessing HDFS metrics
Configure the G1GC garbage collector
Recommended settings for G1GC
Switching from CMS to G1GC
HDFS metrics