The S3A committers upload data in the tasks, completing the uploads when the job is
committed.
AWS still bills you for all data held in this “pending” state, so uploads
that are never completed or aborted keep costing money. The hadoop s3guard
uploads command can be used to list and cancel such uploads. However, it is
simplest to automate cleanup with a bucket lifecycle rule.
- Go to the AWS S3 console: https://s3.console.aws.amazon.com/.
- Find the bucket you are using as a destination of work.
- Select the Management tab.
- Select Add a new lifecycle rule.
- Create a rule “cleanup uploads” with no filter, and without any “transitions”. Configure an “Expiration” action of Clean up incomplete multipart uploads.
- Select a time limit for outstanding uploads, such as 1 Day.
- Review and confirm the lifecycle rule.
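
The console steps above can also be scripted. The following is a minimal sketch, assuming the third-party boto3 library is installed and AWS credentials are already configured. The function names, the rule ID "cleanup-uploads", and the bucket name in the usage note are illustrative choices, not part of Hadoop or the S3A committers.

```python
def abort_uploads_rule(days=1, rule_id="cleanup-uploads"):
    """Build a lifecycle configuration that aborts incomplete multipart uploads.

    The rule has an empty prefix filter (it applies to the whole bucket) and
    no transitions, mirroring the console steps.
    """
    if not isinstance(days, int) or days < 1:
        raise ValueError("days must be a positive whole number")
    return {
        "Rules": [
            {
                "ID": rule_id,  # illustrative name, any unique ID works
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # no filter: match every object key
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": days
                },
            }
        ]
    }


def apply_rule(bucket, days=1):
    """Send the rule to S3. Requires boto3 and valid AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    s3 = boto3.client("s3")
    # Note: this call replaces the bucket's entire lifecycle configuration.
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=abort_uploads_rule(days),
    )
```

Calling `apply_rule("my-output-bucket", days=1)` (with a hypothetical bucket name) would install the rule. Because the call overwrites any lifecycle rules already on the bucket, a bucket with existing rules needs them merged into the configuration first.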
You need to select a limit on how long uploads can be outstanding. For Hadoop
applications, the limit must exceed both the maximum time an application may spend
writing to a single file and the maximum time a job may take. If the timeout is shorter
than either of these, then programs are likely to fail.
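
As a rough illustration of how the limit could be chosen, the sketch below takes the longer of the two durations, adds a safety margin, and rounds up to whole days, since the lifecycle rule is set in whole days. The function name and the 12-hour default margin are assumptions made for this example, not recommended values.

```python
import math
from datetime import timedelta


def min_abort_days(max_file_write, max_job_duration,
                   safety_margin=timedelta(hours=12)):
    """Return the smallest whole-day limit that covers both durations.

    max_file_write:   longest time an application writes to a single file
    max_job_duration: longest time a job may take from start to commit
    safety_margin:    extra headroom (an assumed value, adjust as needed)
    """
    longest = max(max_file_write, max_job_duration) + safety_margin
    # Round up to whole days and never go below the minimum of one day.
    return max(1, math.ceil(longest / timedelta(days=1)))
```

For example, with a six-hour maximum write time and a 30-hour maximum job duration, `min_abort_days(timedelta(hours=6), timedelta(hours=30))` gives 2, so a one-day limit would be too short for that workload.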
Once the rule is set, the cleanup is automatic.