Commands That May Be Slower with Cloud Object Storage
Some commands tend to be significantly slower with object stores than when invoked against HDFS or other filesystems. These include renaming files, listing files, find, mv, cp, and rm.
Renaming Files
Unlike in a normal filesystem, renaming a directory in an object store usually takes time at least proportional to the number of objects being manipulated. Since many of the filesystem shell operations use renaming as their final stage, skipping that stage can avoid long delays. In Amazon S3 the time to rename is also proportional to the amount of data being renamed, so the larger the files being worked on, the longer a rename will take. This can become a significant delay.
We recommend that when using the hadoop fs -put and hadoop fs -copyFromLocal commands, you set the -d option for a direct upload. This skips writing to a temporary file (with the suffix ._COPYING_) and the rename that would otherwise follow. For example:
# Upload a file from the cluster filesystem
hadoop fs -put -d /datasets/example.orc s3a://bucket1/datasets/

# Upload a file from the local filesystem
hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket1/datasets/

# Create a file from stdin
echo "hello" | hadoop fs -put -d -f - s3a://bucket1/datasets/hello.txt
Listing Files
Commands which list many files may be significantly slower with object stores, especially those which scan the entire directory tree:
hadoop fs -count s3a://bucket1/
hadoop fs -du s3a://bucket1/
Our recommendation is to use these commands sparingly, and to avoid them when working with buckets/containers containing many millions of entries.
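Where a full scan cannot be avoided entirely, limiting the operation to a subdirectory reduces the number of objects enumerated. A minimal sketch, assuming a hypothetical datasets/2024/ prefix within the bucket:

# Count only the entries under one prefix rather than the whole bucket
hadoop fs -count s3a://bucket1/datasets/2024/

# A single-level listing avoids a full tree scan
hadoop fs -ls s3a://bucket1/datasets/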
Find
The find command can be very slow on a large store with many directories under the path supplied.
# Enumerate all files in the bucket
hadoop fs -find s3a://bucket1/ -print

# List *.txt in the bucket.
# Remember to escape the wildcard to stop the bash shell trying to expand it
hadoop fs -find s3a://bucket1/datasets/ -name \*.txt -print
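Because the cost grows with the number of directories under the path supplied, giving find the deepest starting path you can narrows the part of the tree that must be enumerated. A sketch, assuming a hypothetical datasets/2024/ subdirectory:

# Search only one subdirectory instead of the whole bucket
hadoop fs -find s3a://bucket1/datasets/2024/ -name \*.txt -print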
Rename
In Amazon S3, the time to rename a file depends on its size. The time to rename a directory depends on the number and size of all files beneath that directory. For WASB, GCS, and ADLS, the time to rename is simply proportional to the number of files. If a rename operation is interrupted, the object store may be left in an undefined state, with some of the source files renamed and others still in their original paths. There may also be duplicate copies of the data.
hadoop fs -mv s3a://bucket1/datasets s3a://bucket1/historical
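If a rename is interrupted, one way to assess the resulting state is to list both paths and compare. A sketch, using the paths from the example above:

# Inspect which files were moved and which remain at the source
hadoop fs -ls -R s3a://bucket1/datasets
hadoop fs -ls -R s3a://bucket1/historical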
Copy
The hadoop fs -cp operation reads each file and then writes it back to the object store; the time to complete depends on the amount of data to copy, and on the bandwidth between the local computer and the object store.
As an example, this command copies the data by downloading it all and uploading it again:
hadoop fs -cp \
  adl://alice.azuredatalakestore.net/current \
  adl://alice.azuredatalakestore.net/historical
Note: The further the VMs are from the object store, the longer the copy process takes.
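For large copies, one commonly used alternative is hadoop distcp, which runs the copy as a distributed job across the cluster rather than funneling all data through a single client. A minimal sketch, assuming the same paths as the example above:

# Copy in parallel across the cluster instead of through one client
hadoop distcp \
  adl://alice.azuredatalakestore.net/current \
  adl://alice.azuredatalakestore.net/historical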