Known Issues in Apache Impala

This topic describes the known issues and technical limitations for Impala in Cloudera Runtime 7.3.2, its service packs, and cumulative hotfixes.

Known issues identified in Cloudera Runtime 7.3.2

CDPD-90807: Thrift protocol limitation during Impala zero downtime upgrade (ZDU)
7.3.1.500 through 7.3.1.706, 7.3.2
Zero Downtime Upgrades (ZDU) for Impala are not supported when upgrading to version 7.3.2, because of Thrift protocol incompatibilities that can cause queries to fail during the upgrade.
None
CDPD-90250: Incorrect file storage location for Impala tables when using S3 as default filesystem
7.3.2
When you create an external table in Impala using the LOCATION parameter, the files are stored in S3 instead of HDFS. This occurs even if the provided path is intended for HDFS, as long as S3 is configured as the default filesystem (fs.defaultFS) in the cluster. This inconsistency leads to AccessDeniedException errors if the Impala service does not have the necessary Ranger permissions to write to the S3 bucket.
None
IMPALA-14472: Writing arrays to Kudu tables
7.3.2
Impala does not currently support writing array data into Kudu tables.
None

Apache Jira: IMPALA-14472

Known issues identified before Cloudera Runtime 7.3.2

DWX-20490: Impala queries fail with "Caught exception The read operation timed out, type=<class 'socket.timeout'> in ExecuteStatement"
7.3.1.500
Queries in impala-shell fail with a socket timeout error in the ExecuteStatement call, which submits the query to the coordinator. The error occurs when query execution takes longer to start, mainly when query planning is slow due to frequent metadata changes.
Increase the socket timeout on the client side by setting --client_connect_timeout_ms to a higher value. For example, add --client_connect_timeout_ms=600000 to the impala-shell command line.
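As a sketch, the workaround on the impala-shell command line looks like the following (the coordinator host name and port are placeholders):

```shell
# Raise the client-side socket timeout to 10 minutes (600000 ms).
# The coordinator address shown is illustrative.
impala-shell -i coordinator-host.example.com:21000 \
    --client_connect_timeout_ms=600000
```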
DWX-20491: Impala queries fail with EOFException: End of file reached before reading fully
7.3.1.500
Impala queries fail with an EOFException when reading a file stored in an S3A location after the file has been removed. If the file was removed using SQL commands such as DROP PARTITION, there may be a significant lag in Hive Metastore event processing. If it was removed by non-SQL operations, run REFRESH or INVALIDATE METADATA on the table to resolve the issue.
Run REFRESH/INVALIDATE METADATA <table>;
CDPD-94720: Impala startup failure due to invalid TLS v1.3 ciphers
7.3.1 and its higher versions
When running Impala on a machine with OpenSSL 1.1.1, providing an invalid ciphersuite or a TLS v1.2 ciphersuite in the --tls_ciphersuites startup flag causes the process to fail during startup. While OpenSSL 3.x ignores invalid ciphers, OpenSSL 1.1.1 returns an error if any ciphersuite in the list is invalid, even if other valid TLS v1.3 ciphersuites are present.
Ensure that the list in the --tls_ciphersuites startup flag contains only valid TLS v1.3 ciphersuites and does not contain any TLS v1.2 ciphersuites.
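For illustration, a flag value containing only standard OpenSSL names for TLS v1.3 ciphersuites would look like the following (the exact list you allow is a deployment choice, not a recommendation):

```shell
# Startup flag containing only valid TLS v1.3 ciphersuite names
# (OpenSSL naming); values shown are illustrative.
--tls_ciphersuites=TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256
```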

Apache Jira: IMPALA-14625

IMPALA-532: Impala should tolerate bad locale settings
7.3.1 and its higher versions
If the LC_* environment variables specify an unsupported locale, Impala does not start.
Add LC_ALL="C" to the environment settings for both the Impala daemon and the Statestore daemon.
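Assuming the daemon environment can be edited directly (in Cloudera Manager this is typically done through an environment safety valve for the affected roles), the setting is simply:

```shell
# Use the portable "C" locale so Impala can start regardless of the
# host's LC_* settings.
export LC_ALL="C"
```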
IMPALA-691: Process mem limit does not account for the JVM's memory usage
7.3.1 and its higher versions
Some memory allocated by the JVM used internally by Impala is not counted against the memory limit for the impalad daemon.
To monitor overall memory usage, use the top command, or add the memory figures in the Impala web UI /memz tab to JVM memory usage shown on the /metrics tab.
IMPALA-635: Avro Scanner fails to parse some schemas
7.3.1 and its higher versions
The default value in an Avro schema must match the type of the first branch of the union; for example, if the default value is null, then the first type in the UNION must be "null".
Swap the order of the fields in the schema specification. For example, use ["null", "string"] instead of ["string", "null"]. Note that the files written with the problematic schema must be rewritten with the new schema because Avro files have embedded schemas.
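A hedged sketch of a schema with the corrected union order, expressed through the avro.schema.literal table property (the table, record, and field names are made up for illustration):

```sql
-- The union lists "null" first because the default value is null.
-- All identifiers below are illustrative.
CREATE TABLE avro_example
STORED AS AVRO
TBLPROPERTIES ('avro.schema.literal'='
  {"type": "record", "name": "rec",
   "fields": [
     {"name": "s", "type": ["null", "string"], "default": null}
   ]}');
```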
IMPALA-1024: Impala BE cannot parse Avro schema that contains a trailing semi-colon
7.3.1 and its higher versions
If an Avro table has a schema definition with a trailing semicolon, Impala encounters an error when the table is queried.
Remove the trailing semicolon from the Avro schema.
IMPALA-1652: Incorrect results with basic predicate on CHAR typed column
7.3.1 and its higher versions
When comparing a CHAR column value to a string literal, the literal value is not blank-padded and so the comparison might fail when it should match.
Use the RPAD() function to blank-pad literals compared with CHAR columns to the expected length.
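For example, a comparison against a hypothetical CHAR(5) column might be written as follows (the table and column names are illustrative):

```sql
-- Pad the literal to the declared CHAR length before comparing.
SELECT * FROM t WHERE char5_col = rpad('foo', 5, ' ');
```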
IMPALA-1821: Casting scenarios with invalid/inconsistent results
7.3.1 and its higher versions
Using a CAST() function to convert large literal values to smaller types, or to convert special values such as NaN or Inf, produces values not consistent with other database systems. This could lead to unexpected results from queries.
None
IMPALA-2005: A failed CTAS does not drop the table if the insert fails
7.3.1 and its higher versions
If a CREATE TABLE AS SELECT operation successfully creates the target table but an error occurs while querying the source table or copying the data, the new table is left behind rather than being dropped.
Drop the new table manually after a failed CREATE TABLE AS SELECT.
IMPALA-3509: Breakpad minidumps can be very large when the thread count is high
7.3.1 and its higher versions
The size of the breakpad minidump files grows linearly with the number of threads. By default, each thread adds 8 KB to the minidump size. Minidump files could consume significant disk space when the daemons have a high number of threads.
Add --minidump_size_limit_hint_kb=size to set a soft upper limit on the size of each minidump file. If the minidump file would exceed that limit, Impala reduces the amount of information for each thread from 8 KB to 2 KB. (Full thread information is captured for the first 20 threads, then 2 KB per thread after that.) The minidump file can still grow larger than the "hinted" size. For example, if you have 10,000 threads, the minidump file can be more than 20 MB.
IMPALA-4978: Impala requires FQDN from hostname command on Kerberized clusters
7.3.1 and its higher versions
The method Impala uses to retrieve the host name while constructing the Kerberos principal is the gethostname() system call. This function might not always return the fully qualified domain name, depending on the network configuration. If the daemons cannot determine the FQDN, Impala does not start on a Kerberized cluster.
Test whether a host is affected by checking that the output of the hostname command includes the FQDN. On hosts where hostname returns only the short name, pass the command-line flag --hostname=fully_qualified_domain_name in the startup options of all Impala-related daemons.
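A quick way to check a host, with the flag to use if it is affected (the FQDN in the comment is a placeholder):

```shell
# If the first command prints only the short name, the host is affected;
# "hostname -f" usually shows the fully qualified name for comparison.
hostname
hostname -f
# Affected hosts need this startup flag on all Impala-related daemons:
#   --hostname=impalad01.example.com
```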
IMPALA-6671: Metadata operations block read-only operations on unrelated tables
7.3.1 and its higher versions
Metadata operations that change the state of a table, such as COMPUTE STATS or ALTER TABLE ... RECOVER PARTITIONS, can delay metadata propagation for unrelated tables that are not yet loaded, even when that propagation is triggered by statements like DESCRIBE or SELECT queries.
None
IMPALA-7072: Impala does not support Heimdal Kerberos
None
CDPD-28139: Set spark.hadoop.hive.stats.autogather to false by default
As an Impala user, if you submit a query against a table containing data ingested using Spark and you are concerned about the quality of the query plan, run COMPUTE STATS against that table after the ETL operation, because the numRows value recorded by Spark could be incorrect. COMPUTE STATS also produces other statistics needed for good selectivity estimates, such as the number of distinct values (NDV) and the NULL count.
For example, when a user ingests data from a file into a partition of an existing table using Spark, if spark.hadoop.hive.stats.autogather is not explicitly set to false, the numRows value associated with this partition is 0 even though the file contains at least one row. To avoid this, set spark.hadoop.hive.stats.autogather=false in the "Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf" in the Spark service's Cloudera Manager configuration.
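Assuming the job is launched with spark-submit rather than configured through Cloudera Manager, the same property can be passed on the command line (the script name is hypothetical):

```shell
# Prevent Spark from recording an incorrect numRows stat during ingestion.
spark-submit \
    --conf spark.hadoop.hive.stats.autogather=false \
    my_etl_job.py
```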
IMPALA-2422: % escaping does not work correctly when it occurs at the end of a LIKE clause
7.3.1 and its higher versions
If the final character in the RHS argument of a LIKE operator is an escaped \% character, it does not match a literal % as the final character of the LHS argument.
None
IMPALA-2603: Crash: impala::Coordinator::ValidateCollectionSlots
A query could encounter a serious error if it includes multiple nested levels of INNER JOIN clauses involving subqueries.
None
CDPD-59625: Impala shell in RHEL 9 with Python 2 as default does not work
7.1.9, 7.3.1 and its higher versions
If you try to run impala-shell on RHEL 9 with the default python executable in PATH set to Python 2, it fails because RHEL 9 is compatible only with Python 3.
Set the IMPALA_PYTHON_EXECUTABLE environment variable to point to Python 3: IMPALA_PYTHON_EXECUTABLE=python3.
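A minimal sketch of the workaround (the coordinator address in the comment is a placeholder):

```shell
# Make impala-shell use a Python 3 interpreter instead of the default.
export IMPALA_PYTHON_EXECUTABLE=python3
# Then start the shell as usual, for example:
#   impala-shell -i coordinator-host:21000
```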
Impala cannot update a table if the 'external.table.purge' property is not set to true
Impala cannot update a table using DDL statements if the 'external.table.purge' property is FALSE. ALTER TABLE statements return success without changing the table.
Issue the ALTER TABLE statement twice if 'external.table.purge' was initially FALSE.
Impala's known limitation when querying compacted tables
7.3.1 and its higher versions
When the compaction process deletes the files for a table from the underlying HDFS location, the Impala service does not detect the changes because the compaction does not allocate new write IDs. When the same table is queried from Impala, it throws a 'File does not exist' exception similar to the following:
Query Status: Disk I/O error on <node>:22000: Failed to open HDFS file hdfs://nameservice1/warehouse/tablespace/managed/hive/<database>/<table>/xxxxx
Error(2): No such file or directory Root cause: RemoteException: File does not exist: /warehouse/tablespace/managed/hive/<database>/<table>/xxxx
Use the REFRESH/INVALIDATE statements on the affected table to overcome the 'File does not exist' exception.
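For example, on a hypothetical table db1.t1 the workaround is:

```sql
-- Reload file metadata for the compacted table.
REFRESH db1.t1;
-- If the error persists, discard and reload the metadata entirely.
INVALIDATE METADATA db1.t1;
```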
Impala API calls through Knox require configuration if the Knox customized Kerberos principal name is a default service user name
To access Impala API calls through Knox when the Knox customized Kerberos principal name is a default service user name, configure authorized_proxy_user_config by selecting Clusters > Impala > Configuration. Include the Knox customized Kerberos principal name in the comma-separated list of values, as <knox_custom_kerberos_principal_name>=*, where <knox_custom_kerberos_principal_name> is the value of the Kerberos Principal in the Knox service. To display this value, select Clusters > Knox > Configuration and search for Kerberos Principal.
CDPD-28431: Intermittent errors might be encountered when the Impala UI is accessed from multiple Knox nodes
7.1.7
You must use a single Knox node to access Impala UI.
CDPD-21828: Multiple permission assignment through GRANT is not working
7.1.7
None
IMPALA-11871: INSERT statement does not respect Ranger policies for HDFS
7.3.1, 7.3.1.300 and its higher version
7.3.1.100, 7.3.1.200

In a cluster with Ranger authorization (and with legacy catalog mode), even if you grant RWX on cm_hdfs -> all-path to the user impala, inserting into a table whose HDFS POSIX permissions exclude impala access results in "AnalysisException: Unable to INSERT into target table (default.t1) because Impala does not have WRITE access to HDFS location: hdfs://XXXXXXXXXXXX"

OPSAPS-46641: A single parameter exists in Cloudera Manager for specifying the Impala Daemon Load Balancer. Because BDR and Hue need to use different ports when connecting to the load balancer, it is not possible to configure the load balancer value so that BDR and Hue will work correctly in the same cluster.
The workaround is to use the load balancer configuration either without a port specification or with the Beeswax port; this configures BDR. To configure Hue, use the "Hue Server Advanced Configuration Snippet (Safety Valve) for impalad_flags" to specify the load balancer address with the HiveServer2 port.
Impala Virtual Warehouses might produce an error when querying transactional (ACID) tables
If you are querying transactional (ACID) tables with an Impala Virtual Warehouse and compaction is run on the compacting Hive Virtual Warehouse, the query might fail. The compaction process deletes files, and the Impala Virtual Warehouse might not be aware of the deletion. When the Impala Virtual Warehouse then attempts to read a deleted file, an error can occur. This situation occurs randomly.
Run the INVALIDATE METADATA statement on the transactional (ACID) table to refresh the metadata. This fixes the problem until the next compaction occurs.
IMPALA-5605: Configuration to prevent crashes caused by thread resource limits
Impala could encounter a serious error due to resource usage under very high concurrency. The error message is similar to:

F0629 08:20:02.956413 29088 llvm-codegen.cc:111] LLVM hit fatal error: Unable to allocate section memory!
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::thread_resource_error> >'
          
To prevent such errors, configure each host running an impalad daemon with the following settings:

            echo 2000000 > /proc/sys/kernel/threads-max
            echo 2000000 > /proc/sys/kernel/pid_max
            echo 8000000 > /proc/sys/vm/max_map_count
          
Add the following lines to /etc/security/limits.conf:

            impala soft nproc 262144
            impala hard nproc 262144
          
IMPALA-9350: Ranger audit logs for applying column masking policies are missing
Impala does not produce audit log entries when column masking policies are applied.
None
IMPALA-1792: ImpalaODBC: Can not get the value in the SQLGetData(m-x th column) after the SQLBindCol(m th column)
If the ODBC SQLGetData is called on a series of columns, the function calls must follow the same order as the columns. For example, if data is fetched from column 2 then column 1, the SQLGetData call for column 1 returns NULL.
Fetch columns in the same order they are defined in the table.