Managing temporary scratch files in S3

Learn how to clean up temporary scratch files left in S3 when an Impala daemon stops unexpectedly during query execution involving S3 spilling.

When queries involve data spilling to S3 and an Impala daemon crashes or is terminated outside the graceful shutdown period, temporary scratch files may remain in S3 storage. This task outlines steps to identify and safely delete these files.

Cleanup may be required in either of the following cases. First, no queries are running, but the scratch directory still contains files. Second, the directory contains files that do not belong to any running query and that have been present for a significant period, such as several hours.
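One way to check how long files have been present is to look at the timestamps in a recursive listing. The following is a minimal sketch, not a definitive procedure. It assumes that the AWS CLI is configured for the bucket, that `aws s3 ls --recursive` prints date, time, size, and key columns, and that the keys contain no spaces. The function name stale_scratch_objects and the cutoff value are hypothetical; the bucket path is the example path used in the steps of this task.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `aws s3 ls --recursive` output on stdin and print
# the keys whose last-modified timestamp is earlier than the cutoff in $1.
# The cutoff uses the listing's own format: "YYYY-MM-DD HH:MM:SS".
stale_scratch_objects() {
  local cutoff="$1"
  # Listing columns: date, time, size, key. Timestamps in this format sort
  # lexicographically, so a plain string comparison is sufficient.
  # Assumption: keys contain no spaces, so the key is always field 4.
  awk -v cutoff="$cutoff" '($1 " " $2) < cutoff { print $4 }'
}

# Example usage (the cutoff is a placeholder; choose one several hours in the past):
# aws s3 ls s3://dw-bucket/scratch/impala-test --recursive \
#   | stale_scratch_objects "2024-05-01 06:00:00"
```

The helper only lists candidate files and deletes nothing. Compare any keys it prints against the active query IDs before you remove them.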

  • Ensure that there are no ongoing queries accessing the S3 directories.
  • Determine whether action is necessary: verify that no queries are active, or check for files that are not linked to active queries.
  • Verify the scratch_dirs configuration path to locate S3 scratch directories.
  1. Get the path from the scratch_dirs configuration. In the following example, the S3 scratch directory is the first entry:
    s3a://dw-bucket/scratch/impala-test,/opt/impala/scratch/remote-buffer:150G,/opt/impala/scratch/local-scratch:150G
    
  2. If no queries are active, run the following command to delete the directory:
    aws s3 rm s3://dw-bucket/scratch/impala-test --recursive
  3. If queries are active, exclude the directories associated with the running queries from the cleanup.
    1. Get a list of active query IDs from the /admission page on the coordinator’s WebUI under the "Running queries" section. If you use multiple active coordinators, get the active query IDs from each coordinator’s /admission page.
    2. Use the following script to delete unneeded directories while excluding active query directories.
      This example excludes two queries; you can add more --exclude options as needed:
      aws s3 rm s3://dw-bucket/scratch/impala-test/impala-scratch \
        --exclude '*query_id_1*' --exclude '*query_id_2*' --recursive
    3. If an Impala daemon shuts down unexpectedly and does not restart on the original host, it may leave behind scratch files in a directory named after the hostname. These files remain in remote storage under the configured remote scratch path and are not removed automatically. If the Impala daemon no longer uses the host and the directory exists in remote storage, Cloudera recommends that you remove the directory using the following command:
      aws s3 rm s3://dw-bucket/scratch/impala-test/impala-scratch/[***HOSTNAME***] --recursive
  • Temporary scratch files in S3 are removed, freeing up storage.
  • Active query data remains unaffected by the cleanup process.
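
When many queries are active, typing one --exclude option per query ID is error-prone. The following is a minimal sketch of a wrapper that builds the exclusion list from its arguments and previews the deletion with the AWS CLI --dryrun option, so that nothing is removed until the output has been reviewed. The function name preview_scratch_cleanup is hypothetical, and the query IDs are placeholders for the IDs collected from the /admission page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: preview the cleanup of a scratch prefix while skipping
# the directories of active queries.
# Usage: preview_scratch_cleanup <s3-prefix> [active-query-id ...]
preview_scratch_cleanup() {
  local prefix="$1"
  shift
  # --dryrun makes the AWS CLI print what it would delete without deleting it.
  local args=(s3 rm "$prefix" --recursive --dryrun)
  local query_id
  for query_id in "$@"; do
    # One --exclude pattern per active query ID, as in the manual command.
    args+=(--exclude "*${query_id}*")
  done
  aws "${args[@]}"
}

# Example call (query IDs are placeholders):
# preview_scratch_cleanup s3://dw-bucket/scratch/impala-test/impala-scratch \
#   query_id_1 query_id_2
```

After you confirm that the preview lists only unneeded files, run the same aws s3 rm command without --dryrun to perform the deletion.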