Setting scratch space limit for spilling Impala queries in Azure environments

Certain memory-intensive SQL operations write temporary data to disk (known as spilling to disk) when Impala is close to exceeding its memory limit on a particular host. Consequently, a query can complete successfully instead of failing with an out-of-memory error. You must set the scratch space limit carefully because you cannot change the limit later.

The compute nodes on Azure environments use the STANDARD_E16DS_v4 or STANDARD_E16DS_V5 memory-optimized Virtual Machines by default. These nodes have a temporary SSD storage of 2 * 300 GiB (600 GiB). This space is utilized for both cached data and scratch space. The scratch space available is insufficient for the spilling Impala queries.

Spilling queries to disk require a scratch space in multiples of several 100 GiBs or TiBs, without which the queries may fail. To prevent failure, you can configure the scratch space between 300 GiB and 16 TiB per executor node while creating an Impala Virtual Warehouse. In addition to the local SSD, you can use managed disks (Standard SSD and Premium disks) for extra space.

In CDW, the range of options for scratch space using the combination of the local SSDs and managed disks has been tailored to optimize the cost-to-space ratio. The following table shows the pricing and throughput information for the available scratch space options:
Table 1.
Disks

(SSD + managed disk)

Size

(GiB)

Throughput

(MB/s)

Price

($/hour)

SSD only 300 968 cached / 384 uncached 0
SSD + 3 * P10 684 300 (3 *100) 0.073 (3 * 0.024)
SSD + 2 * P15 812 250 (2 * 125) 0.094 (2 * 0.047)
SSD + 3 * P15 1068 375 (3 * 125) 0.141 (3 * 0.047)
SSD + 2 * P20 1324 300 (2 * 150) 0.182 (2 * 0.091)
SSD + 3 * P20 1836 450 (3 * 150) 0.273 (3 * 0.091)
SSD + P40 2348 250 0.322
SSD + 3 * P30 3372 600 (3 * 200) 0.504 (3 * 0.168)
SSD + P50 4396 250 0.617
SSD + E60 8492 400 0.841
SSD + E70 16684 600 1.683

You can select the scratch space limit only while creating a new Impala Virtual Warehouse.

  • Obtain the CDW_IMPALA_EBS_SCRATCH_SPACE entitlement.
  • Activate the environment in CDW using the E16ds_v4 or E16ds_v5.
  • Disable Availability Zones for AKS during the environment activation; otherwise, you cannot set the scratch space limit.
  1. Log in to Cloudera Data Warehouse.
  2. In the Data Warehouse service, click Virtual Warehouses in the left navigation panel.
  3. On the Virtual Warehouses page, click Add New.
  4. Specify a Name.
  5. Select Virtual Warehouse type as IMPALA.
  6. Select the Database Catalog that it queries.
  7. Select a Size.
    Additional configuration options are displayed along with Scratch Space Limit per node (in GiBs).
  8. Select the scratch space from the Scratch Space Limit per node (in GiBs) dropdown.
  9. Click CREATE.