Spooling Impala Query Results
In Impala, you can control how query results are materialized and returned to clients such as impala-shell, Hue, and JDBC applications.
Result spooling is turned off by default, but can be enabled via the SPOOL_QUERY_RESULTS query option.
- When query result spooling is disabled, Impala relies on clients to fetch results to
trigger the generation of more result row batches until all the result rows have been
produced. If a client issues a query without fetching all the results then that query
continues to be in the running state indefinitely and the query fragments continue to
consume the resources until the query is cancelled and unregistered, potentially tying up
resources and causing other queries to wait for an extended period of time in admission
control.
In this mode, Impala materializes rows on demand: rows are created only when the client requests them.
For example, if a Hue user runs a complex query that returns 1000 rows, but does not scroll through all the returned rows, and then stays idle for a while, the query will remain running and will hold onto all of its resources until it is explicitly closed or the session times out.
- When query result spooling is enabled, result sets of queries are
eagerly fetched and spooled in the spooling location, either in memory
or on disk.
Once all result rows have been fetched and stored in the spooling location, the resources are freed up. Incoming client fetches can get the data from the spooled results.
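The difference between the two modes can be illustrated with a small toy model. This is only a sketch: `ToyQuery`, `produce_batches`, and `holds_resources` are hypothetical names invented for illustration and do not correspond to Impala's actual coordinator code. It assumes that "resources" can be reduced to a single flag that is cleared once every row batch has been produced.

```python
from typing import Iterator, List, Optional


def produce_batches(num_batches: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for query fragments producing row batches."""
    for i in range(num_batches):
        yield [2 * i, 2 * i + 1]


class ToyQuery:
    """Toy model only; not Impala's actual coordinator logic."""

    def __init__(self, num_batches: int, spool_query_results: bool) -> None:
        self._source = produce_batches(num_batches)
        self._spooled: List[List[int]] = []
        self._spooling = spool_query_results
        self.holds_resources = True
        if spool_query_results:
            # Spooling enabled: eagerly drain every batch into the spool,
            # then release the (modelled) backend resources.
            self._spooled = list(self._source)
            self.holds_resources = False

    def fetch(self) -> Optional[List[int]]:
        """Return the next row batch, or None when the result set is exhausted."""
        if self._spooling:
            return self._spooled.pop(0) if self._spooled else None
        # Spooling disabled: a batch is produced only because the client asked.
        batch = next(self._source, None)
        if batch is None:
            self.holds_resources = False
        return batch


idle = ToyQuery(num_batches=3, spool_query_results=False)
idle.fetch()  # the client reads one batch, then goes idle
spooled = ToyQuery(num_batches=3, spool_query_results=True)
print(idle.holds_resources, spooled.holds_resources)  # prints: True False
```

In the model, the query without spooling keeps its resources for as long as the client leaves batches unread, while the spooled query has already released them before the first fetch.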
Admission Control and Result Spooling
- MAX_RESULT_SPOOLING_MEM: The maximum amount of memory used when spooling query results. If this value is exceeded when spooling results, all memory will most likely be spilled to disk. Set to 100 MB by default.
- MAX_SPILLED_RESULT_SPOOLING_MEM: The maximum amount of memory that can be spilled to disk when spooling query results. Must be greater than or equal to MAX_RESULT_SPOOLING_MEM. If this value is exceeded, the coordinator fragment will block until the client has consumed enough rows to free up more memory. Set to 1 GB by default.
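How the two limits interact can be sketched as a simple decision function. This is a simplified model, not Impala's implementation: the function name `spooling_state` and the returned labels are hypothetical, the defaults are assumed to use binary units (1 MB = 1024 * 1024 bytes), and the model assumes that once the in-memory limit is exceeded all buffered bytes count against the spill limit.

```python
MB = 1024 * 1024
GB = 1024 * MB

# Defaults as described above (binary units are an assumption).
DEFAULT_MAX_RESULT_SPOOLING_MEM = 100 * MB
DEFAULT_MAX_SPILLED_RESULT_SPOOLING_MEM = 1 * GB


def spooling_state(
    buffered_bytes: int,
    max_result_spooling_mem: int = DEFAULT_MAX_RESULT_SPOOLING_MEM,
    max_spilled_result_spooling_mem: int = DEFAULT_MAX_SPILLED_RESULT_SPOOLING_MEM,
) -> str:
    """Classify where unread results would live in this toy model."""
    if max_spilled_result_spooling_mem < max_result_spooling_mem:
        raise ValueError(
            "MAX_SPILLED_RESULT_SPOOLING_MEM must be >= MAX_RESULT_SPOOLING_MEM"
        )
    if buffered_bytes <= max_result_spooling_mem:
        return "in-memory"
    if buffered_bytes <= max_spilled_result_spooling_mem:
        return "spilled-to-disk"
    # Beyond the spill limit the coordinator fragment waits for the client.
    return "blocked-until-client-fetches"


for size in (10 * MB, 500 * MB, 2 * GB):
    print(size // MB, "MB ->", spooling_state(size))
```

With the default limits, 10 MB of unread results stays in memory, 500 MB is spilled, and 2 GB causes the producer to wait until the client fetches rows.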
Fetch Timeout
Use the FETCH_ROWS_TIMEOUT_MS query option to set the timeout when clients fetch rows. The timeout applies both when query result spooling is enabled and when it is disabled:
- When result spooling is disabled (SPOOL_QUERY_RESULTS = FALSE), the timeout controls how long a client waits for a single row batch to be produced by the coordinator.
- When result spooling is enabled (SPOOL_QUERY_RESULTS = TRUE), a client can fetch multiple row batches at a time, so this timeout controls the total time a client waits for row batches to be produced.
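The two timeout interpretations can be compared with a small model. This is a sketch under stated assumptions: `batches_returned` is a hypothetical helper, production times are given up front as a list of milliseconds, and a fetch in the model simply stops at the first batch that would exceed the timeout. It does not describe what Impala returns to the client when a timeout expires.

```python
from typing import Sequence


def batches_returned(
    production_ms: Sequence[int],
    timeout_ms: int,
    spooling: bool,
    max_batches: int,
) -> int:
    """Number of batches one fetch call returns in this toy model."""
    if not spooling:
        # Spooling disabled: one batch per fetch, and the timeout covers
        # the wait for that single batch only.
        if production_ms and production_ms[0] <= timeout_ms:
            return 1
        return 0
    # Spooling enabled: the timeout is a budget for the whole fetch call.
    elapsed = 0
    returned = 0
    for cost in production_ms[:max_batches]:
        if elapsed + cost > timeout_ms:
            break
        elapsed += cost
        returned += 1
    return returned


times = [40, 40, 40]  # hypothetical per-batch production times in ms
print(batches_returned(times, timeout_ms=100, spooling=False, max_batches=3))
print(batches_returned(times, timeout_ms=100, spooling=True, max_batches=3))
```

In this example the first call yields one batch, because only the first 40 ms wait is measured against the timeout. The second call yields two batches, because the third would push the cumulative wait to 120 ms, past the 100 ms budget.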
Explain Plans
Example EXPLAIN plan output for result spooling:
F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
| Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1
PLAN-ROOT SINK
| mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0
- The mem-estimate for the PLAN-ROOT SINK is an estimate of the amount of memory needed to spool all the rows returned by the query.
- The mem-reservation is the number and size of the buffers necessary to spool the query results. By default, the read and write buffers are 2 MB in size each, which is why the default is 4 MB.
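The buffer arithmetic can be checked against the sample PLAN-ROOT SINK line with a few lines of Python. This is an illustrative sketch: `parse_mb` is a hypothetical helper, and it assumes that the `spill-buffer` value in the plan is the size of each of the two buffers (one read, one write).

```python
import re

# PLAN-ROOT SINK resource line copied from the sample plan output.
SINK_LINE = (
    "| mem-estimate=4.00MB mem-reservation=4.00MB "
    "spill-buffer=2.00MB thread-reservation=0"
)


def parse_mb(line: str, key: str) -> float:
    """Extract a 'key=<number>MB' value from a plan line (hypothetical helper)."""
    match = re.search(rf"{re.escape(key)}=([0-9.]+)MB", line)
    if match is None:
        raise ValueError(f"{key} not found in plan line")
    return float(match.group(1))


buffer_mb = parse_mb(SINK_LINE, "spill-buffer")
reservation_mb = parse_mb(SINK_LINE, "mem-reservation")

# One read buffer plus one write buffer, each of the spill-buffer size.
expected_reservation_mb = 2 * buffer_mb
print(reservation_mb, expected_reservation_mb)
```

Both values come out as 4.0, matching the statement that two 2 MB buffers account for the 4 MB reservation.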
PlanRootSink
In Impala, the PlanRootSink class controls the passing of batches of rows to the clients and acts as a queue of rows to be sent to clients.
- When result spooling is disabled, a single batch of rows is sent to the PlanRootSink, and then the client must consume that batch before another one can be sent.
- When result spooling is enabled, multiple batches of rows can be sent to the PlanRootSink, and multiple batches can be consumed by the client.
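The queue behaviour described above can be approximated with a bounded queue from Python's standard library. This is an analogy rather than the real PlanRootSink class, which is part of Impala's backend: the function name and the capacities are hypothetical, with capacity 1 standing in for spooling disabled and a larger capacity standing in for spooling enabled.

```python
import queue


def batches_accepted_without_fetch(capacity: int, produced: int) -> int:
    """Count how many batches the sink accepts before any client fetch."""
    sink: "queue.Queue[int]" = queue.Queue(maxsize=capacity)
    accepted = 0
    for batch_id in range(produced):
        try:
            # put_nowait raises queue.Full instead of blocking the producer.
            sink.put_nowait(batch_id)
        except queue.Full:
            break
        accepted += 1
    return accepted


# Capacity 1 models spooling disabled; capacity 4 models a small spool.
print(batches_accepted_without_fetch(capacity=1, produced=5))
print(batches_accepted_without_fetch(capacity=4, produced=5))
```

With capacity 1 the producer can hand over only one batch until the client consumes it, while the larger capacity lets production run ahead of the client.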