Cloud Data Access

Commands That May Be Slower with S3

Some commands tend to be significantly slower with Amazon S3 than when invoked against HDFS or other filesystems. This includes renaming files, listing files, and the find, mv, cp, and rm commands.

Renaming Files

Unlike in a normal filesystem, renaming files and directories in an object store usually takes time proportional to the size of the objects being manipulated. Because many of the filesystem shell operations use a rename as their final stage, skipping that stage can avoid long delays.

In particular, we recommend that when using the put and copyFromLocal commands, you set the -d option for a direct upload. For example:

# Upload a file from the cluster filesystem
hadoop fs -put -d /datasets/example.orc s3a://bucket1/datasets/

# Upload a file from the local filesystem
hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket1/datasets/

# Create a file from stdin
echo "hello" | hadoop fs -put -d -f - s3a://bucket1/datasets/hello.txt

Listing Files

Commands which list many files can also be significantly slower against Amazon S3, because every object under the path must be enumerated. For example:

hadoop fs -count s3a://bucket1/
hadoop fs -du s3a://bucket1/
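Recursive listings have the same cost profile; for instance:

# Recursively list every object in the bucket
hadoop fs -ls -R s3a://bucket1/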

Find

The find command can be very slow on a large store with many directories under the path supplied.

# Enumerate all files in the bucket
hadoop fs -find s3a://bucket1/ -print

# List *.txt files under the datasets directory.
# Escape the wildcard to stop the shell from expanding it
hadoop fs -find s3a://bucket1/datasets/ -name \*.txt -print
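Because the cost grows with the number of directories scanned, supply the deepest starting path you can. A sketch, using a hypothetical 2017 subdirectory:

# Narrow the search to a subtree to reduce the number of objects scanned
hadoop fs -find s3a://bucket1/datasets/2017/ -name \*.txt -print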

Rename

The time to rename a file depends on its size, because the S3A connector implements a rename as a copy of the data followed by a delete of the original. The time to rename a directory depends on the number and size of all files beneath that directory. If the operation is interrupted, the object store will be in an undefined state, with some files moved and others not.

hadoop fs -mv s3a://bucket1/datasets s3a://bucket1/historical

Copy

The copy operation reads each file and then writes it back to the object store; the time to complete depends on the amount of data to copy, and on the bandwidth in both directions between the local computer and the object store.

hadoop fs -cp s3a://bucket1/datasets s3a://bucket1/historical
Note: The further the VMs are from the object store, the longer the copy process takes.
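For large directory trees, a parallel copy with DistCp will usually outperform the single-process cp shown above; a sketch against the same paths:

hadoop distcp s3a://bucket1/datasets s3a://bucket1/historical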