Known Issues in Apache Impala
This topic describes known issues and workarounds for using Impala in this release of Cloudera Runtime.
Technical Service Bulletins
- TSB 2021-485: Impala returns fewer rows from parquet tables on S3
- IMPALA-10310 was an issue in Impala's Parquet page filtering
code where the scanner did not reset state appropriately when transitioning from the
first row group to subsequent row groups in a single split. This caused data from the
subsequent row groups to be skipped incorrectly, leading to incorrect query results.
This issue cannot occur when Parquet page filtering is disabled by setting the query
option PARQUET_READ_PAGE_INDEX=false.
The issue is more likely to be encountered on object stores such as S3, ADLS, and ABFS, because Spark is sometimes configured to write 128 MB row groups while PARQUET_OBJECT_STORE_SPLIT_SIZE is 256 MB. This makes it more likely that Impala processes two row groups in a single split.
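The arithmetic behind this can be sketched in Python. The helper names below are hypothetical, and the sketch assumes equally sized row groups laid out back to back with the split aligned to a row-group boundary, which is a simplification of how Impala actually assigns row groups to splits.

```python
def row_groups_per_split(split_size_mb: int, row_group_size_mb: int) -> int:
    """Approximate number of whole row groups covered by one scan split.

    Hypothetical helper for illustration only; not an Impala API.
    """
    if split_size_mb <= 0 or row_group_size_mb <= 0:
        raise ValueError("sizes must be positive")
    # A split always processes at least one row group, even when the
    # row group is larger than the split itself.
    return max(1, split_size_mb // row_group_size_mb)


def split_spans_multiple_row_groups(split_size_mb: int, row_group_size_mb: int) -> bool:
    """True when a single split would cover more than one row group,
    the precondition for the bug described above."""
    return row_groups_per_split(split_size_mb, row_group_size_mb) > 1


# Values quoted in the text: 256 MB split size, 128 MB row groups.
print(row_groups_per_split(256, 128))
print(split_spans_multiple_row_groups(256, 128))
```

With the values from the text, one split covers two row groups, so the transition from the first row group to the next happens inside a single split.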
Parquet page filtering relies only on min/max statistics, so the comparison operators it supports are “=”, “<”, “>”, “<=”, and “>=”. These operators are impacted by this bug. Expressions that use “!=” or LIKE, or that include UDFs, do not use Parquet page filtering.
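As a rough illustration of why only these operators can be evaluated against min/max statistics, the following Python sketch decides whether a page could contain a matching row. The function name page_may_match and its signature are assumptions made for this example; it is not Impala's implementation.

```python
def page_may_match(op: str, value, page_min, page_max) -> bool:
    """Return True if a page whose column values lie in [page_min, page_max]
    could contain a row satisfying `column <op> value`.

    Hypothetical sketch of min/max-based filtering, not Impala code.
    """
    if op == "=":
        return page_min <= value <= page_max
    if op == "<":
        return page_min < value
    if op == "<=":
        return page_min <= value
    if op == ">":
        return page_max > value
    if op == ">=":
        return page_max >= value
    # Any other operator (for example "!=" or LIKE) is not evaluated
    # against the statistics in this sketch, so the page is always read.
    return True


# A page holding values between 10 and 20:
print(page_may_match("=", 5, 10, 20))    # 5 lies outside [10, 20]
print(page_may_match(">=", 20, 10, 20))  # the maximum itself satisfies >= 20
print(page_may_match("!=", 15, 10, 20))  # not decided from statistics
```

A page is skipped only when the statistics prove that no row can match. For operators outside the supported set the sketch never skips, which mirrors the statement that such expressions do not use page filtering.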
The PARQUET_OBJECT_STORE_SPLIT_SIZE parameter was introduced in Impala 3.3 by IMPALA-5843. This means that older versions of Impala do not have this issue.
- Upstream JIRA
- IMPALA-10310
- Knowledge article
- For the latest update on this issue see the corresponding Knowledge article: TSB 2021-485: Impala returns fewer rows from parquet tables on S3
- TSB 2021-502: Impala logs the session / operation secret on most RPCs at INFO level
- Impala logs contain the session / operation secret. With this information, a person who
has access to the Impala logs might be able to hijack other users' sessions. An attacker
could then execute statements for which they would otherwise lack the necessary
privileges. Impala deployments where Apache Sentry or Apache Ranger
authorization is enabled may be vulnerable to privilege escalation. Impala deployments
where audit logging is enabled may be vulnerable to incorrect audit
logging.
Restricting access to the Impala logs that expose secrets reduces the risk of an attack, as does restricting access to the Impala deployment to trusted users. Log redaction techniques can be used to redact secrets from the logs. For more information, see the Cloudera Manager documentation.
For log redaction, users can create a rule with a search pattern: secret \(string\) [=:].* The replacement could be, for example: secret=LOG-REDACTED
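To see what such a rule does to a line of text, the search pattern and replacement above can be tried out with Python's re module. This is only a sketch: the sample line is made up for illustration and is not real Impala log output, the redact helper is hypothetical, and the regex engine used by the log redaction feature may differ from Python's in other cases.

```python
import re

# Search pattern and replacement quoted from the text above.
SECRET_PATTERN = re.compile(r"secret \(string\) [=:].*")
REPLACEMENT = "secret=LOG-REDACTED"


def redact(line: str) -> str:
    """Replace the secret and everything after it on the line.

    Hypothetical helper for illustration; not part of Impala or Cloudera Manager.
    """
    return SECRET_PATTERN.sub(REPLACEMENT, line)


# Made-up sample line, not actual Impala log output.
sample = 'example text secret (string) = "not-a-real-secret" trailing text'
print(redact(sample))
```

Because the pattern ends in `.*`, everything from the word "secret" to the end of the line is replaced, including any text that follows the secret value. Lines without a match are returned unchanged.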
- Upstream JIRA
- IMPALA-10600
- Knowledge article
- For the latest update on this issue see the corresponding Knowledge article: TSB 2021-502: Impala logs the session / operation secret on most RPCs at INFO level