Scalability Considerations
The size of your cluster and the volume of data influences query performance. Typically, adding more cluster capacity reduces problems due to memory limits or disk throughput. On the other hand, larger clusters are more likely to have other kinds of scalability issues, such as a single slow node that causes performance problems for queries.
Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage
Because Hadoop I/O is optimized for reading and writing large files,
Impala is optimized for tables containing relatively few, large data
files. Schemas containing thousands of tables, or tables containing
thousands of partitions, can encounter performance issues during startup
or during DDL operations such as ALTER TABLE
statements.
Scalability Considerations for the Impala StateStore
If it takes a very long time for a cluster to start up with the
message, This Impala daemon is not ready to accept user
requests
, the StateStore might be taking too long to send the
entire catalog topic to the cluster. In this case, consider setting the
Load Catalog in Background field to false in
your Catalog Service configuration. This setting stops the StateStore
from loading the entire catalog into memory at cluster startup. Instead,
metadata for each table is loaded when the table is accessed for the
first time.
Effect of Buffer Pool on Memory Usage
Most of the memory needed is reserved at the beginning of the query, avoiding cases where a query might run for a long time before failing with an out-of-memory error. The actual memory estimates and memory buffers are typically smaller than before, so that more queries can run concurrently or process larger volumes of data than previously.
Increase the MAX_ROW_SIZE
query option setting when
querying tables with columns containing long strings, many columns, or
other combinations of factors that produce very large rows. If Impala
encounters rows that are too large to process with the default query
option settings, the query fails with an error message suggesting to
increase the MAX_ROW_SIZE
setting.
SQL Operations that Spill to Disk
Certain memory-intensive operations write temporary data to disk (known as spilling to disk) when Impala is close to exceeding its memory limit on a particular host.
What kinds of queries might spill to disk:
Several SQL clauses and constructs require memory allocations that could trigger spilling to disk:
-
when a query uses a
GROUP BY
clause for columns with millions or billions of distinct values, Impala keeps a similar number of temporary results in memory, to accumulate the aggregate results for each value in the group. -
When large tables are joined together, Impala keeps the values of the join columns from one table in memory, to compare them to incoming values from the other table.
-
When a large result set is sorted by the
ORDER BY
clause, each node sorts its portion of the result set in memory. -
The
DISTINCT
andUNION
operators build in-memory data structures to represent all values found so far, to eliminate duplicates as the query progresses.
When the spill-to-disk feature is activated for a join node within a query, Impala does not produce any runtime filters for that join operation on that host. Other join nodes within the query are not affected.
The amount data depends on the portion of the data being handled by that host, and thus the operator may end up consuming different amounts of memory on different hosts.
How Impala handles scratch disk space for spilling:
By default, intermediate files used during large sorts, joins,
aggregations, or analytic function operations are stored in the
directory /tmp/impala-scratch. These files are
removed when the operation finishes. Multiple concurrent queries can
perform operations that use the spill to disk
technique, without
any name conflicts for these temporary files.
You can specify a different location in the Cloudera Manager in the Impala Daemon Scratch Directories field. You can specify a single directory, or a comma-separated list of directories.
The scratch directories must be on the local filesystem, not in HDFS.
You might specify different directory paths for different hosts, depending on the capacity and speed of the available storage devices.
If there is less than 1 GB free on the filesystem where that directory resides, Impala still runs, but writes a warning message to its log.
If Impala encounters an error reading or writing files in a scratch directory during a query, Impala logs the error, and the query fails.
Memory usage for SQL operators:
The memory required to spill to disk is
reserved up front, and you can examine it in the
EXPLAIN
plan when the EXPLAIN_LEVEL
query option is set to 2 or higher.
If an operator accumulates more data than can fit in the reserved memory, it can either reserve more memory to continue processing data in memory or start spilling data to temporary scratch files on disk. Thus, operators with spill-to-disk support can adapt to different memory constraints by using however much memory is available to speed up execution, yet tolerate low memory conditions by spilling data to disk.
The amount of data depends on the portion of the data being handled by that host, and thus the operator may end up consuming different amounts of memory on different hosts.
Avoiding queries that spill to disk:
Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid this situation by using the following steps:
- Detect how often queries spill to disk, and how much temporary data
is written. Refer to the following sources:
- The output of the
PROFILE
command in the impala-shell interpreter. This data shows the memory usage for each host and in total across the cluster. TheWriteIoBytes
counter reports how much data was written to disk for each operator during the query. - In Impala Queries in Cloudera Manager, you can see the peak memory usage for a query, combined across all nodes in the cluster.
- In the Queries tab in the Impala debug web user interface, select the query to examine and click the corresponding Profile link. This data breaks down the memory usage for a single host within the cluster, the host whose web interface you are connected to.
- The output of the
- Use one or more techniques to reduce the possibility of the queries
spilling to disk:
- Increase the Impala memory limit if practical. For example,
using the
SET MEM_LIMIT
SQL statement, increase the available memory by more than the amount of temporary data written to disk on a particular node. - Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and reduce the amount of memory required on each node.
- On a cluster with resources shared between Impala and other Hadoop components, use resource management features to allocate more memory for Impala.
- If the memory pressure is due to running many concurrent queries rather than a few memory-intensive ones, consider using the Impala admission control feature to lower the limit on the number of concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in memory usage and improve overall response times.
- Tune the queries with the highest memory requirements, using
one or more of the following techniques:
- Run the
COMPUTE STATS
statement for all tables involved in large-scale joins and aggregation queries. - Minimize your use of
STRING
columns in join columns. Prefer numeric values instead. - Examine the
EXPLAIN
plan to understand the execution strategy being used for the most resource-intensive queries. - If Impala still chooses a suboptimal execution strategy even with statistics available, or if it is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints to the most resource-intensive queries to select the right execution strategy.
- Run the
- If your queries experience substantial performance overhead due
to spilling, enable the
DISABLE_UNSAFE_SPILLS
query option. This option prevents queries whose memory usage is likely to be exorbitant from spilling to disk. As you tune problematic queries using the preceding steps, fewer and fewer will be cancelled by this option setting.
- Increase the Impala memory limit if practical. For example,
using the
When to use DISABLE_UNSAFE_SPILLS:
The DISABLE_UNSAFE_SPILLS
query option is suitable for
an environment with ad hoc queries whose performance characteristics and
memory usage are not known in advance. It prevents worst-case
scenario
queries that use large amounts of memory unnecessarily.
Thus, you might turn this option on within a session while developing
new SQL code, even though it is turned off for existing applications.
Organizations where table and column statistics are generally
up-to-date might leave this option turned on all the time, again to
avoid worst-case scenarios for untested queries or if a problem in the
ETL pipeline results in a table with no statistics. Turning on
DISABLE_UNSAFE_SPILLS
lets you fail fast
in
this case and immediately gather statistics or tune the problematic
queries.
Some organizations might leave this option turned off. For example,
you might have tables large enough that the COMPUTE
STATS
takes substantial time to run, making it impractical to
re-run after loading new data. If you have examined the
EXPLAIN
plans of your queries and know that they are
operating efficiently, you might leave
DISABLE_UNSAFE_SPILLS
turned off. In that case, you
know that any queries that spill will not go overboard with their memory
consumption.
Limits on Query Size and Complexity
There are hard-coded limits on the maximum size and complexity of queries. Currently, the maximum number of expressions in a query is 2000. You might exceed the limits with large or deeply nested queries produced by business intelligence tools or other query generators.
If you have the ability to customize such queries or the query
generation logic that produces them, replace sequences of repetitive
expressions with single operators such as IN
or
BETWEEN
that can represent multiple values or ranges.
For example, instead of a large number of OR
clauses:
WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ...
use a single IN
clause:
WHERE val IN (1,2,6,100,...)
Scalability Considerations for Impala I/O
Impala parallelizes its I/O operations aggressively, therefore the more disks you can attach to each host, the better. Impala retrieves data from disk so quickly using bulk read operations on large blocks, that most queries are CPU-bound rather than I/O-bound.
Because the kind of sequential scanning typically done by Impala queries does not benefit much from the random-access capabilities of SSDs, spinning disks typically provide the most cost-effective kind of storage for Impala data, with little or no performance penalty as compared to SSDs.
Resource management features such as YARN, Llama, and admission control typically constrain the amount of memory, CPU, or overall number of queries in a high-concurrency environment. Currently, there is no throttling mechanism for Impala I/O.
Scalability Considerations for Table Layout
Due to the overhead of retrieving and updating table metadata in the Metastore database, try to limit the number of columns in a table to a maximum of approximately 2000. Although Impala can handle wider tables than this, the Metastore overhead can become significant, leading to query performance that is slower than expected based on the actual data volume.
To minimize overhead related to the Metastore database and Impala query planning, try to limit the number of partitions for any partitioned table to a few tens of thousands.
If the volume of data within a table makes it
impractical to run exploratory queries, consider using the
TABLESAMPLE
clause to limit query processing to only
a percentage of data within the table. This technique reduces the
overhead for query startup, I/O to read the data, and the amount of
network, CPU, and memory needed to process intermediate results during
the query.
Kerberos-Related Network Overhead for Large Clusters
When Impala starts up, or after each kinit
refresh,
Impala sends a number of simultaneous requests to the KDC. For a cluster
with 100 hosts, the KDC might be able to process all the requests within
roughly 5 seconds. For a cluster with 1000 hosts, the time to process
the requests would be roughly 500 seconds. Impala also makes a number of
DNS requests at the same time as these Kerberos-related requests.
While these authentication requests are being processed, any submitted Impala queries will fail. During this period, the KDC and DNS may be slow to respond to requests from components other than Impala, so other secure services might be affected temporarily.
Avoiding CPU Hotspots for HDFS Cached Data
You can use the HDFS caching feature to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions.
To avoid hotspots, include the WITH REPLICATION
clause with the CREATE TABLE
or ALTER
TABLE
statements for tables that use HDFS caching. This
clause allows more than one host to cache the relevant data blocks, so
the CPU load can be shared, reducing the load on any one host.
The work for HDFS cached data is divided better among all the hosts
that have cached replicas for a particular data block. When more than
one host has a cached replica for a data block, Impala assigns the work
of processing that block to whichever host has done the least work (in
terms of number of bytes read) for the current query. If hotspots
persist even with this load-based scheduling algorithm, you can enable
the query option SCHEDULE_RANDOM_REPLICA=TRUE
to
further distribute the CPU load. This setting causes Impala to randomly
pick a host to process a cached data block if the scheduling algorithm
encounters a tie when deciding which host has done the least work.
Scalability Considerations for File Handle Caching
One scalability aspect that affects heavily loaded clusters is the
load on the metadata layer from looking up the details as each file is
opened. On HDFS, that can lead to increased load on the NameNode, and on
S3, this can lead to an excessive number of S3 metadata requests. For
example, a query that does a full table scan on a partitioned table may
need to read thousands of partitions, each partition containing multiple
data files. Accessing each column of a Parquet file also involves a
separate open
call, further increasing the load on the NameNode.
High NameNode overhead can add startup time (that is, increase latency)
to Impala queries, and reduce overall throughput for non-Impala
workloads that also require accessing HDFS files.
You can reduce the number of calls made to your file system's metadata
layer by enabling the file handle caching feature. Data files that are
accessed by different queries, or even multiple times within the same
query, can be accessed without a new open
call and without
fetching the file details multiple times.
- HDFS
The
cache_remote_file_handles
flag controls local and remote file handle caching for animpalad
. It is recommended that you use the default value oftrue
as this caching prevents your NameNode from overloading when your cluster has many remote HDFS reads. - S3
The
cache_s3_file_handles
impalad
flag controls the S3 file handle caching. The feature is enabled by default with the flag set totrue
.
The feature is enabled by default with 20,000 file handles to be
cached. To change the value, set the configuration option
Maximum Cached File Handles
(max_cached_file_handles
) to a non-zero value for
each Impala daemon (impalad
). From the initial default
value of 20000, adjust upward if NameNode request load is still
significant, or downward if it is more important to reduce the extra
memory usage on each host. Each cache entry consumes 6 KB, meaning that
caching 20,000 file handles requires up to 120 MB on each Impala
executor. The exact memory usage varies depending on how many file
handles have actually been cached; memory is freed as file handles are
evicted from the cache.
If a manual operation moves a file to the trashcan while the file
handle is cached, Impala still accesses the contents of that file. This
is a change from prior behavior. Previously, accessing a file that was
in the trashcan would cause an error. This behavior only applies to
non-Impala methods of removing files, not the Impala mechanisms such as
TRUNCATE TABLE
or DROP TABLE
.
If files are removed, replaced, or appended by operations outside of
Impala, the way to bring the file information up to date is to run the
REFRESH
statement on the table.
File handle cache entries are evicted as the cache fills up, or based on a timeout period when they have not been accessed for some time.
To evaluate the effectiveness of file handle caching for a particular
workload, issue the PROFILE
statement in
impala-shell or examine query profiles in the
Impala Web UI. Look for the ratio of
CachedFileHandlesHitCount
(ideally, should be high)
to CachedFileHandlesMissCount
(ideally, should be low).
Before starting any evaluation, run several representative queries to
warm up
the cache because the first time each data file is
accessed is always recorded as a cache miss.
To see metrics about file handle caching for each impalad instance, examine the following fields on the /metrics page in the Impala Web UI:
- impala-server.io.mgr.cached-file-handles-miss-count
- impala-server.io.mgr.num-cached-file-handles