Setting scratch space limit for spilling Impala queries in Azure environments

Certain memory-intensive SQL operations write temporary data to disk (known as spilling to disk) when Impala is close to exceeding its memory limit on a particular host. This enables a query to complete successfully instead of failing with an out-of-memory error.

The compute nodes on Azure environments use the Standard_E16ds_v4 memory-optimized Virtual Machines by default that come with temporary SSD storage of 2 * 300 GiB (600 GiB). This space is utilized for both cached data and scratch space. The scratch space available on this temporary storage is insufficient for the spilling Impala queries.

Spilling queries require a scratch space in multiples of several 100 GiBs or TiBs, without which the queries may fail. To solve this problem, you can configure the scratch space between 300 GiB and 16 TiB per node while creating an Impala Virtual Warehouse in the Cloudera Data Warehouse (CDW) service. In addition to the local SSD, CDW allows you to use managed disks (Standard SSD and Premium disks) for extra space.

In CDW, the range of options for scratch space using the combination of the local SSDs and managed disks has been tailored to optimize the cost-to-space ratio. The following table shows the pricing and throughput information for the available scratch space options:
Table 1.
Disks

(SSD + managed disk)

Size

(GiB)

Throughput

(MB/s)

Price

($/hour)

SSD only 300 968 cached / 384 uncached 0
SSD + 3 * P10 684 300 (3 *100) 0.073 (3 * 0.024)
SSD + 2 * P15 812 250 (2 * 125) 0.094 (2 * 0.047)
SSD + 3 * P15 1068 375 (3 * 125) 0.141 (3 * 0.047)
SSD + 2 * P20 1324 300 (2 * 150) 0.182 (2 * 0.091)
SSD + 3 * P20 1836 450 (3 * 150) 0.273 (3 * 0.091)
SSD + P40 2348 250 0.322
SSD + 3 * P30 3372 600 (3 * 200) 0.504 (3 * 0.168)
SSD + P50 4396 250 0.617
SSD + E60 8492 400 0.841
SSD + E70 16684 600 1.683

You can select the scratch space limit only while creating a new Impala Virtual Warehouse. To set the scratch space limit for Impala queries:

  1. Log in to Cloudera Data Warehouse.
  2. In the Data Warehouse service, click Virtual Warehouses in the left navigation panel.
  3. On the Virtual Warehouses page, click Add New.
  4. Specify a Name.
  5. Select Virtual Warehouse type as IMPALA.
  6. Select the Database Catalog that it queries.
  7. Select a Size.
    Additional configuration options are displayed along with Scratch Space Limit per node (in GiBs).
  8. Select the scratch space from the Scratch Space Limit per node (in GiBs) dropdown.
  9. Click CREATE.