Learn how to clean up temporary scratch files left in S3 when an Impala daemon stops
unexpectedly during query execution involving S3 spilling.
When queries involve data spilling to S3 and an Impala daemon crashes or is
terminated outside the graceful shutdown period, temporary scratch files may remain
in S3 storage. This task outlines steps to identify and safely delete these
files.
Cleanup may be required in either of the following cases: no queries are running
but the scratch directory still contains files, or the directory contains files
that do not belong to any running query and have been present for a significant
period, such as several hours.
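The two conditions above can be expressed as a small check. The following is a minimal Python sketch, not part of Impala: the function name `cleanup_may_be_needed`, the three-hour default threshold, and the assumption that a query ID appears as a substring of the scratch path are all illustrative choices. The listing of paths and modification times is assumed to come from whatever S3 listing tool you use.

```python
from datetime import datetime, timedelta, timezone


def cleanup_may_be_needed(entries, active_query_ids, now,
                          min_age=timedelta(hours=3)):
    """Return True if leftover scratch files look like cleanup candidates.

    entries: list of (path, last_modified) tuples from a directory listing.
    active_query_ids: set of IDs of currently running queries.
    min_age: hypothetical threshold for "a significant period".
    """
    if not entries:
        return False  # the directory is empty, nothing was left behind
    if not active_query_ids:
        return True   # files remain although no query is running
    for path, last_modified in entries:
        # Assumption: the query ID is embedded in the scratch path.
        owned = any(qid in path for qid in active_query_ids)
        if not owned and now - last_modified >= min_age:
            return True
    return False


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
listing = [("s3a://dw-bucket/scratch/impala-test/q2/file", now - timedelta(hours=5))]
print(cleanup_may_be_needed(listing, {"q1"}, now))
```

A result of `True` only indicates that the directory deserves a closer look; it does not replace checking the running queries on each coordinator.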
Ensure that there are no ongoing queries accessing the S3 directories.
Determine whether action is necessary by verifying that no queries are active or
by checking for files that are not linked to active queries.
Check the scratch_dirs configuration to locate the S3 scratch
directories.
Get the path from the scratch_dirs configuration. For example:
s3a://dw-bucket/scratch/impala-test,/opt/impala/scratch/remote-buffer:150G,/opt/impala/scratch/local-scratch:150G
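To pick the S3 location out of a value like the one above, a short helper can be used. This is a minimal sketch: the function name `s3_scratch_paths` is hypothetical, and it assumes that entries are comma-separated and that any capacity limit follows the path after a colon, as in the local entries of the example.

```python
def s3_scratch_paths(scratch_dirs):
    """Return the s3a:// entries of a scratch_dirs value, without size suffixes."""
    paths = []
    for entry in scratch_dirs.split(","):
        entry = entry.strip()
        if not entry.startswith("s3a://"):
            continue  # skip local scratch and remote-buffer directories
        rest = entry[len("s3a://"):]
        # Drop an optional ":<limit>" suffix; the scheme's own colon was
        # already removed above, so the first colon starts the suffix.
        location = rest.split(":", 1)[0]
        paths.append("s3a://" + location)
    return paths


example = ("s3a://dw-bucket/scratch/impala-test,"
           "/opt/impala/scratch/remote-buffer:150G,"
           "/opt/impala/scratch/local-scratch:150G")
print(s3_scratch_paths(example))  # ['s3a://dw-bucket/scratch/impala-test']
```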
If active queries are running, filter out the directories associated with
running queries before cleanup.
Get a list of active query IDs from the /admission
page on the coordinator’s WebUI under the "Running queries" section. If
you use multiple active coordinators, get the active query IDs from each
coordinator’s /admission page.
Use the following script to delete unneeded directories while excluding
active query directories.
This example excludes two queries; you can add more as
needed:
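The exclusion logic can be illustrated with a minimal Python sketch. It is not a Cloudera-provided script and rests on several assumptions: the function names, the query IDs `query_id_1` and `query_id_2`, and the directory names are hypothetical; a query's ID is assumed to appear as a substring of its scratch directory name, which you should confirm against your own listing; and the removal uses the standard `hadoop fs -rm -r -skipTrash` command, which requires a Hadoop client configured for S3A access. The sketch only prints the commands so that they can be reviewed before anything is deleted.

```python
import shlex


def directories_to_delete(listed_dirs, active_query_ids):
    """Keep only directories whose names contain no active query ID."""
    return [d for d in listed_dirs
            if not any(qid in d for qid in active_query_ids)]


def removal_commands(dirs):
    """Build one 'hadoop fs -rm -r -skipTrash' argument list per directory."""
    return [["hadoop", "fs", "-rm", "-r", "-skipTrash", d] for d in dirs]


# Hypothetical IDs copied from the /admission pages of all coordinators.
active = {"query_id_1", "query_id_2"}

# Hypothetical listing of the S3 scratch location.
base = "s3a://dw-bucket/scratch/impala-test"
listed = [
    base + "/query_id_1_tmp",
    base + "/query_id_2_tmp",
    base + "/query_id_7_tmp",
]

for cmd in removal_commands(directories_to_delete(listed, active)):
    print(shlex.join(cmd))  # dry run: review the output before executing
```

To exclude more queries, add their IDs to the `active` set. Running the printed commands, rather than calling them from the script, keeps a manual review step between selection and deletion.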
If an Impala daemon shuts down unexpectedly and does not restart on
the original host, it may leave behind scratch files in a directory
named after the hostname. These files remain in remote storage under the
configured remote scratch path and are not automatically removed. If the
Impala daemon no longer uses the host and the directory still exists
in remote storage, Cloudera recommends that
you remove it using the above command.
Temporary scratch files in S3 are removed, freeing up storage.
Active query data remains unaffected by the cleanup process.