Accessing Cloud Data

Cleaning up After Failed Jobs

The S3A committers upload data within the tasks, completing the multipart uploads only when the job is committed. If a job fails or is aborted before that point, its uploads are left outstanding in a "pending" state.

Amazon still bills you for all data held in this pending state. The hadoop s3guard uploads command can be used to list and cancel such uploads, but it is simplest to automate cleanup with a bucket lifecycle rule, which you can create in the S3 console as follows (a scripted alternative is sketched after the steps).
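
For example, listing the pending uploads under a bucket and then aborting those more than a day old might look like the following sketch; the bucket name example-bucket is a placeholder, and the exact options available depend on your Hadoop version:

  # List all pending multipart uploads under the bucket
  hadoop s3guard uploads -list s3a://example-bucket/

  # Abort pending uploads initiated more than one day ago;
  # add -force to suppress the confirmation prompt
  hadoop s3guard uploads -abort -days 1 s3a://example-bucket/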

  1. Go to the AWS S3 console at https://s3.console.aws.amazon.com/.

  2. Find the bucket you are using as the destination of your work.

  3. Select the "Management" tab.

  4. Select "Add a new lifecycle rule".

  5. Create a rule named "cleanup uploads" with no filter and without any "transitions"; configure an "Expiration" action of "Clean up incomplete multipart uploads".

  6. Select a time limit for outstanding uploads, such as 1 day.

  7. Review and confirm the lifecycle rule.
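
If you prefer to script the rule rather than use the console, the same cleanup can be configured with the AWS CLI. The following is a minimal sketch; the bucket name example-bucket and the rule ID cleanup-uploads are placeholders:

  # Apply a lifecycle rule that aborts incomplete multipart uploads
  # one day after they were initiated
  aws s3api put-bucket-lifecycle-configuration \
    --bucket example-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "cleanup-uploads",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1}
      }]
    }'

Note that put-bucket-lifecycle-configuration replaces the bucket's entire lifecycle configuration, so merge this rule with any rules already in place.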

You need to choose how long uploads may remain outstanding. For Hadoop applications, this limit must be longer than both the maximum time an application may spend writing to a single file and the maximum time a job may take to complete. If the timeout is shorter than either of these, programs are likely to fail.

Once the rule is set, the cleanup is automatic.