Deleting Files on Cloud Object Stores
The hadoop fs -rm command deletes objects and directories full of objects. If the object store is only eventually consistent, hadoop fs -ls commands and other accessors may briefly return the details of the now-deleted objects; this is an artifact of object stores which cannot be avoided.
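
For example, on an eventually consistent store a sequence like the following (the bucket and path are purely illustrative) may still list entries for a short time after the delete has returned:

    # delete a directory tree of objects
    hadoop fs -rm -r s3a://bucket1/dataset

    # immediately afterwards, an eventually consistent store may still
    # briefly return listings for the deleted objects
    hadoop fs -ls s3a://bucket1/dataset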
If the filesystem client is configured to copy files to a trash directory, the trash directory is in the container/bucket of the store. The rm operation will then be significantly slower, and the deleted files will continue to incur storage costs.

To make sure that your deleted files are no longer incurring costs, you can do three things:
1. Use the -skipTrash option when removing files:

       hadoop fs -rm -skipTrash s3a://bucket1/dataset
2. Regularly use the expunge command to purge any data that has been previously moved to the .Trash directory:

       hadoop fs -expunge -D fs.defaultFS=s3a://bucket1/
   As the expunge command only works with the default filesystem, you need to use the -D option to make the target object store the default filesystem for that command; the override applies only to that single invocation and does not change the stored configuration.

3. Disable the trash mechanism by setting the core-site option fs.trash.interval to 0, as sketched below.
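
A minimal core-site.xml fragment for the third option; with the interval set to 0, clients reading this configuration delete files immediately rather than moving them to the trash directory:

    <property>
      <name>fs.trash.interval</name>
      <value>0</value>
      <description>Disable the trash mechanism: deleted files are removed
        immediately and cannot be recovered from a .Trash directory.</description>
    </property>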
Skipping the trash option is of course dangerous: it exists to stop accidental destruction of data. If the object store provides its own versioning or backup mechanism, this can be used instead.
S3 only: because Amazon S3 is eventually consistent, deleted files may briefly still be visible in listing operations, or readable in operations such as hadoop fs -cat. With S3Guard enabled, the listings will be consistent, and the deleted objects will no longer be visible there.
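
As a rough sketch, S3Guard is enabled through core-site.xml properties along the following lines; the table name here is hypothetical, so consult the S3Guard documentation for the authoritative settings and the steps needed to create and manage the DynamoDB table:

    <!-- use a DynamoDB-backed metadata store for consistent listings -->
    <property>
      <name>fs.s3a.metadatastore.impl</name>
      <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
    </property>

    <!-- hypothetical table name used for the metadata store -->
    <property>
      <name>fs.s3a.s3guard.ddb.table</name>
      <value>my-s3guard-table</value>
    </property>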