Configure Impala Daemon to spill to S3

Impala occasionally needs to write intermediate files to persistent storage during large sorts, joins, aggregations, or analytic function operations. If your workload writes large volumes of intermediate data, configure the heavy-spilling queries to use a remote storage location rather than the local one. The advantage of remote storage for scratch space is that it is elastic, so spilling is not limited by the capacity of the local disks.

Before you begin

Identify the URL of the S3 bucket to which you want Impala to write the temporary data. If you use the S3 bucket that is associated with the environment, navigate to the S3 bucket and copy the URL. If you want to use an external S3 bucket, you must first configure your CDP environment to use the external S3 bucket with the correct read/write permissions.

Configuring the Start-up Option in Impala daemon

You can use the impalad start-up option scratch_dirs to specify the locations of the intermediate files. The format of the option is scratch_dirs=remote_dir, local_buffer_dir(, local_dir…).

With the option specified above:

  • You can specify only one remote directory.
    When you configure a remote directory, you must also specify a local directory to serve as its buffer. However, you can use multiple local directories with the remote directory. If you specify multiple local directories, the first one is used as the local buffer directory.
  • If you configure both remote and local directories, the remote directory is only used when the local directories are fully utilized.
  • The size of a remote intermediate file can affect query performance. You can set it with the remote_tmp_file_size start-up option. The default size of a remote intermediate file is 16MB and the maximum is 256MB.
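
The rules above can be checked before a daemon is restarted with a new value. The following Python sketch is illustrative only and is not part of Impala: the function names are hypothetical, and it assumes that any entry containing a URI scheme (for example s3a://) is the remote directory and that the file size is given as a whole number of megabytes.

```python
# Hypothetical pre-flight check for the rules described above.
# Not an Impala API; Impala performs its own validation at start-up.

DEFAULT_REMOTE_TMP_FILE_SIZE_MB = 16   # default stated in this section
MAX_REMOTE_TMP_FILE_SIZE_MB = 256      # maximum stated in this section


def is_remote(path: str) -> bool:
    """Assumption: an entry with a URI scheme such as s3a:// is remote."""
    return "://" in path


def check_scratch_dirs(value: str) -> dict:
    """Split a scratch_dirs value and check it against the listed rules."""
    dirs = [d.strip() for d in value.split(",") if d.strip()]
    remote = [d for d in dirs if is_remote(d)]
    local = [d for d in dirs if not is_remote(d)]

    if len(remote) > 1:
        raise ValueError("only one remote directory can be specified")
    if remote and not local:
        raise ValueError("a remote directory requires a local buffer directory")

    has_remote = bool(remote)
    return {
        "remote_dir": remote[0] if has_remote else None,
        # With a remote directory, the first local directory is the buffer.
        "local_buffer_dir": local[0] if has_remote else None,
        "local_dirs": local[1:] if has_remote else local,
    }


def check_remote_tmp_file_size_mb(size_mb: int) -> int:
    """Reject sizes outside the range (0, 256] MB."""
    if not 0 < size_mb <= MAX_REMOTE_TMP_FILE_SIZE_MB:
        raise ValueError(
            f"remote_tmp_file_size must be at most {MAX_REMOTE_TMP_FILE_SIZE_MB}MB"
        )
    return size_mb
```

For example, check_scratch_dirs("s3a://remote_dir, /local_buffer_dir, /local_dir") reports /local_buffer_dir as the buffer and /local_dir as an additional local directory, while a value with two remote entries raises an error.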

Examples

  • A remote scratch dir with one local buffer dir, file size 64MB.

    --scratch_dirs="s3a://remote_dir, /local_buffer_dir"
    --remote_tmp_file_size=64M
  • A remote scratch dir with one local buffer dir, and one local dir.

    --scratch_dirs="s3a://remote_dir, /local_buffer_dir, /local_dir"
  • A remote scratch dir with one local buffer dir, and multiple local dirs.

    --scratch_dirs="s3a://remote_dir, /local_buffer_dir, /local_dir_1, /local_dir_2"
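
If the flag values are generated by a deployment script rather than typed by hand, the examples above can be assembled programmatically. This is a minimal Python sketch with a hypothetical helper name, build_scratch_flags. It only formats strings in the order remote directory, local buffer directory, remaining local directories, and does not contact Impala or S3.

```python
# Hypothetical helper that formats the start-up flags shown in the examples.
# The directory names are the placeholders used in this section.

def build_scratch_flags(remote_dir, local_buffer_dir, local_dirs=(),
                        remote_tmp_file_size=None):
    """Return the flag strings for one remote dir, one buffer dir,
    and any number of extra local dirs."""
    entries = [remote_dir, local_buffer_dir, *local_dirs]
    flags = ['--scratch_dirs="' + ", ".join(entries) + '"']
    if remote_tmp_file_size is not None:
        # The size string is passed through unchanged, for example "64M".
        flags.append(f"--remote_tmp_file_size={remote_tmp_file_size}")
    return flags


if __name__ == "__main__":
    # Reproduces the first example: one buffer dir, file size 64MB.
    for flag in build_scratch_flags("s3a://remote_dir", "/local_buffer_dir",
                                    remote_tmp_file_size="64M"):
        print(flag)
    # Reproduces the third example: one buffer dir and two local dirs.
    for flag in build_scratch_flags("s3a://remote_dir", "/local_buffer_dir",
                                    ["/local_dir_1", "/local_dir_2"]):
        print(flag)
```

Keeping the buffer directory as a separate argument makes it harder to place a general-purpose local directory first by accident, since the first local entry is the one used as the buffer.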