Impala Requirements
To perform as expected, Impala depends on the availability of the software, hardware, and configurations described in the following sections.
Product Compatibility Matrix
The ultimate source of truth about compatibility between various versions of Cloudera Runtime, Cloudera Manager, and various Runtime components is the Product Compatibility Matrix.
Supported Operating Systems
The relevant supported operating systems and versions for Impala are the same as for the corresponding Cloudera Runtime platforms. For details, see the Operating System Requirements topic.
Hive Metastore and Related Configuration
Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking metadata about schema objects such as tables and columns. The following components are prerequisites for Impala:
To install the metastore:
- Install a MySQL or PostgreSQL database. Start the database if it is not started after installation.
- Download the MySQL Connector or the PostgreSQL connector and place it in the
/usr/share/java/
directory. - Use the appropriate command line tool for your database to create the metastore database.
- Use the appropriate command line tool for your database to grant privileges for the
metastore database to the
hive
user. - Modify
hive-site.xml
to include information matching your particular database: its URL, username, and password. You will copy thehive-site.xml
file to the Impala Configuration Directory later in the Impala installation process.
Java Dependencies
Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop components:
- The officially supported JVMs for Impala are the OpenJDK JVM and Oracle JVM. Other JVMs might cause issues, typically resulting in a failure at impalad startup. In particular, the JamVM used by default on certain levels of Ubuntu systems can cause impalad to fail to start.
- Internally, the impalad daemon relies on the
JAVA_HOME
environment variable to locate the system Java libraries. Make sure the impalad service is not run from an environment with an incorrect setting for this variable. - All Java dependencies are packaged in the
impala-dependencies.jar
file, which is located at/usr/lib/impala/lib/
. These map to everything that is built underfe/target/dependency
.
Networking Configuration Requirements
As part of ensuring best performance, Impala attempts to complete tasks on local data, as opposed to using network connections to work with remote data. To support this goal, Impala matches the hostname provided to each Impala daemon with the IP address of each DataNode by resolving the hostname flag to an IP address. For Impala to work with local data, use a single IP interface for the DataNode and the Impala daemon on each machine. Ensure that the Impala daemon's hostname flag resolves to the IP address of the DataNode. For single-homed machines, this is usually automatic, but for multi-homed machines, ensure that the Impala daemon's hostname resolves to the correct interface. Impala tries to detect the correct hostname at start-up, and prints the derived hostname at the start of the log in a message of the form:
Using hostname: impala-daemon-1.example.com
--hostname
flag.Glibc Version Requirement
Impala daemons should be deployed on nodes using the same Glibc version since different Glibc version supports different Unicode standard version. Not using the same Glibc version might result in inconsistent UTF-8 behavior when UTF8_MODE is set to true.
Hardware Requirements
The memory allocation should be consistent across Impala executor nodes. A single Impala executor with a lower memory limit than the rest can easily become a bottleneck and lead to suboptimal performance.
This guideline does not apply to coordinator-only nodes.
Hardware Requirements for Optimal Join Performance
During join operations, portions of data from each joined table are loaded into memory. Data sets can be very large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate completing.
While requirements vary according to data set size, the following is generally recommended:
- CPU
Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer processors.
- Memory
128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query processing on a particular node exceed the amount of memory available to Impala on that node, the query writes temporary work data to disk, which can lead to long query times. Note that because the work is parallelized, and intermediate results for aggregate queries are typically smaller than the original data, Impala can query and join tables that are much larger than the memory available on an individual node.
- JVM Heap Size for Catalog Server
4 GB or more recommended, ideally 8 GB or more, to accommodate the maximum numbers of tables, partitions, and data files you are planning to use with Impala.
- Storage
DataNodes with 12 or more disks each. I/O speeds are often the limiting factor for disk performance with Impala. Ensure that you have sufficient disk space to store the data Impala will be querying.
OS Memory Map Area Requirement for Impala
Default settings for max_map_count
can be insufficient for Impala
installations that run with many concurrent queries. We recommend increasing
max_map_count
to avoid potential failures due to exhausting memory
mapping limits under heavy load.
max_map_count
is the OS virtual memory parameter and defines the maximum
number of memory map areas that a process can use.
To increase the max_map_count
parameter in your OS and to make the above
setting durable, refer to your OS documentation.
-
For example, if you are using RHEL 9, add the following line to /etc/sysctl.conf:
/etc/sysctl.conf
:vm.max_map_count=8000000 -
Reload the config as root: sysctl -p.
-
Check the new value: cat /proc/sys/vm/max_map_count.
-
Restart.
User Account Requirements
For user account requirements, see the topic User Account Requirements in the Impala documentation.