Behavioral Changes in Impala

Cloudera Runtime 7.3.2, its service packs, and its cumulative hotfixes introduce functional adjustments and behavioral updates for Impala.

Cloudera Runtime 7.3.2

Cloudera Runtime 7.3.2 introduces functional adjustments and behavioral updates for Impala, and includes all service packs and cumulative hotfixes from 7.3.1.100 through 7.3.1.706. For a comprehensive record of all functional adjustments in Cloudera Runtime 7.3.1.x, see Behavioral Changes.

Summary:
Impala no longer allows unauthenticated user impersonation through Knox JDBC gateway
Previous behavior:
When connecting to Impala using the Knox JDBC gateway URL (e.g., jdbc:impala://[knoxhost]:[knoxport]/...), you could specify any Impala user through the impala.doas.user property and successfully impersonate that user, regardless of whether your Knox user had explicit Impala proxy user permissions. This unintended behavior bypassed security checks and allowed delegation to users that your Knox user was not authorized to impersonate.
New behavior:
The Knox JDBC gateway authenticates as itself and delegates to the logged-in user, so one level of delegation has already taken place. If the client passes impala.doas.user through the Knox JDBC gateway, the request now results in an error because secondary delegation is not supported.
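As a client-side precaution, you might scan connection strings for the property before routing them through Knox. The following is a minimal sketch, not part of Impala or Knox: find_doas_user is a hypothetical helper, and it assumes connection properties appear as key=value pairs separated by ";", "?", or "&".

```python
import re

DOAS_KEY = "impala.doas.user"


def find_doas_user(jdbc_url):
    """Return the value of impala.doas.user if the URL sets it, else None.

    Assumption: properties are key=value pairs separated by ';', '?' or '&'.
    """
    for part in re.split(r"[;?&]", jdbc_url):
        key, sep, value = part.partition("=")
        # Only segments that actually contain '=' are treated as properties.
        if sep and key.strip().lower() == DOAS_KEY:
            return value.strip()
    return None
```

A URL for which this helper returns a value would be rejected when sent through the Knox JDBC gateway, so the property should be removed from such connection strings.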
Summary:
Query cancellation during frontend planning
Previous behavior:
You could not cancel queries while they were in the analysis or planning stages. If a query was waiting for metadata to be loaded from the Catalog Server or the Hive Metastore (HMS), it continued to run until the frontend planning process was complete. You had to wait for the query to reach the execution stage before a cancellation request could take effect.
New behavior:
You can now cancel queries during the frontend planning and metadata operation stages. If a user or a timeout interrupts a query while it is in the planning stage, Impala interrupts the thread that is performing the planning. This is particularly effective when planning is waiting on external metadata services. The cancellation request blocks until the frontend reaches an interruption point and returns to the backend, which then finalizes the query status.
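The blocking cancellation described above can be pictured with a small analogy. This sketch is not Impala code: PlanningTask and its members are hypothetical names, and a polling loop stands in for the frontend's interruption points while it waits on metadata.

```python
import threading


class PlanningTask:
    """Toy model of a planner thread that waits for metadata (hypothetical)."""

    def __init__(self):
        self._cancelled = threading.Event()
        self._metadata_ready = threading.Event()
        self.status = "PLANNING"
        self._thread = threading.Thread(target=self._plan)

    def start(self):
        self._thread.start()

    def _plan(self):
        # Each timed-out wait is an interruption point where cancellation is checked.
        while not self._metadata_ready.wait(timeout=0.01):
            if self._cancelled.is_set():
                self.status = "CANCELLED"
                return
        self.status = "PLANNED"

    def cancel(self):
        self._cancelled.set()
        # Block until the planner thread reaches an interruption point and exits.
        self._thread.join()
        return self.status
```

In this model, cancel() returns only after the planner thread has stopped, which mirrors the way the cancellation blocks until the query status can be finalized.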
Summary:
Default value for fe_service_threads increased to improve concurrency
Previous behavior:
The default value for the fe_service_threads setting was 64.
New behavior:
Starting with Cloudera on premises 7.3.2, the default value is 128.
Summary:
Clean up subdirectories in TRUNCATE and INSERT OVERWRITE if recursive listing is enabled
Previous behavior:
Impala did not consistently delete files located in subdirectories of external tables during TRUNCATE and INSERT OVERWRITE operations, even when recursive listing was enabled. This led to leftover data in subdirectories after these operations, resulting in data corruption.
New behavior:
Impala now deletes subdirectories in addition to non-hidden data files. Hidden and ignored directories are left in place. By default, setting DELETE_STATS_IN_TRUNCATE=false is no longer supported when truncating non-transactional tables, and attempting it results in an exception. If you absolutely require the old behavior, set the --truncate_external_tables_with_hms flag to false, but be aware that this also reintroduces the bug that this change fixes.
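The cleanup rule can be illustrated on a local file system. This is a simplified sketch, not Impala's implementation: clean_table_dir is a hypothetical helper, the "." and "_" prefixes used to detect hidden entries are an assumption, and only top-level entries of the table directory are checked.

```python
import shutil
from pathlib import Path

# Assumption: entries whose names start with '.' or '_' count as hidden.
HIDDEN_PREFIXES = (".", "_")


def is_hidden(path):
    return path.name.startswith(HIDDEN_PREFIXES)


def clean_table_dir(table_dir):
    """Delete non-hidden files and subdirectories; return the removed names."""
    removed = []
    for entry in sorted(Path(table_dir).iterdir()):
        if is_hidden(entry):
            continue  # hidden files and directories are left in place
        if entry.is_dir():
            shutil.rmtree(entry)  # subdirectories are removed recursively
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed
```

How nested hidden directories inside a deleted subdirectory are treated is not covered by this sketch.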

Apache Jira: IMPALA-14189, IMPALA-14224

Summary:
Parquet late materialization behavior has changed
Previous behavior:
The Parquet late materialization feature was disabled by default. The parquet_late_materialization_threshold query option set the minimum number of consecutive filtered rows required to trigger late materialization, and its default value was -1. The feature was not supported for collection columns.
New behavior:
The Parquet late materialization feature is enabled by default for all types, including collections. The effective threshold is 1 if the parquet_late_materialization_threshold query option is greater than or equal to 0 and there is a collection value that can be skipped. Otherwise, the effective threshold is the value of the query option, which defaults to 20.
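The threshold rule can be written as a short function. This is an illustrative sketch of the rule as stated here, not Impala source code; the function and parameter names are hypothetical.

```python
def effective_late_materialization_threshold(query_option=20,
                                             has_skippable_collection=False):
    """Return the threshold used for a scan under the rule described above.

    query_option stands for parquet_late_materialization_threshold (default 20).
    """
    if query_option >= 0 and has_skippable_collection:
        return 1
    return query_option
```

For example, with the default option value the function returns 1 when a collection value can be skipped and 20 otherwise, while a negative option value is returned unchanged in both cases.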

Apache Jira: IMPALA-3841

Summary:
TCP Keepalive is now enabled by default for client connections
Previous behavior:
TCP keepalive was disabled by default for client connections. Idle connections dropped by load balancers remained active in Impala, consuming service threads (fe_service_threads).
New behavior:
TCP keepalive is now enabled by default for all client connections, enhancing stability and availability. Impala is configured to check idle connections aggressively, every 10 minutes.
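For readers unfamiliar with TCP keepalive, the following sketch shows what enabling it means at the socket level, using Python's standard socket module. It is not Impala's server code: enable_keepalive is a hypothetical helper, and mapping the 10-minute figure to the keepalive idle time (600 seconds) is an assumption. TCP_KEEPIDLE is platform-specific, so the sketch sets it only where it is available.

```python
import socket


def enable_keepalive(sock, idle_seconds=600):
    """Turn on TCP keepalive for a socket and report whether it is enabled."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Assumption: 600 seconds of idle time before the first keepalive probe.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

With keepalive enabled, the operating system probes idle connections and closes those whose peer no longer responds, which is how connections dropped by a load balancer stop occupying service threads.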

Apache Jira: IMPALA-14031

Summary:
Support for load-based routing in impala-proxy
Previous behavior:
The impala-proxy used a random selection policy to choose a coordinator. This approach did not consider the current load on each coordinator, which led to an uneven distribution of connections and potential performance bottlenecks.
New behavior:
The impala-proxy now uses load-based routing to decide which coordinator should handle a new session request. It directs the new session to the coordinator with the minimum calculated load. You can customize how this load is calculated using the following parameters:
  • IMPALA_PROXY_COORDINATOR_LOAD_CPU_WEIGHT: Determines the weight applied to the current percentage of CPU utilization when calculating the coordinator's load.
  • IMPALA_PROXY_COORDINATOR_LOAD_MEMORY_WEIGHT: Determines the weight applied to the current percentage of memory utilization when calculating the coordinator's load.
By adjusting these weights, you can tune the impala-proxy to prioritize CPU or memory headroom when routing new sessions.
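The effect of the two weights can be sketched as follows. The exact load formula is not given here, so this example assumes a simple weighted sum of the CPU and memory percentages; Coordinator, coordinator_load, and pick_coordinator are hypothetical names, not impala-proxy APIs.

```python
from dataclasses import dataclass


@dataclass
class Coordinator:
    name: str
    cpu_pct: float  # current CPU utilization, in percent
    mem_pct: float  # current memory utilization, in percent


def coordinator_load(c, cpu_weight, mem_weight):
    # Assumption: load is a weighted sum of the two utilization percentages.
    return cpu_weight * c.cpu_pct + mem_weight * c.mem_pct


def pick_coordinator(coordinators, cpu_weight=1.0, mem_weight=1.0):
    """Return the coordinator with the minimum calculated load."""
    return min(coordinators,
               key=lambda c: coordinator_load(c, cpu_weight, mem_weight))
```

Under this assumption, a coordinator at 80% CPU and 20% memory loses to one at 30% CPU and 60% memory when both weights are equal, but wins when the CPU weight is lowered to 0.2, because memory headroom then dominates the calculation.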
Summary:
New catalogd flag for HMS event sync defaults
Previous behavior:
Disabling event processing globally required you to manually set the impala.disableHmsSync property for every individual database and table.
New behavior:
The new catalogd flag disable_hms_sync_by_default defines the global default for event processing. If it is set to true, Impala skips event processing for all tables and databases unless the impala.disableHmsSync property is explicitly set to false at the table or database level. When determining the sync status, Impala checks the table property first, then the database property, and finally the global default flag.
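The precedence order can be expressed as a small lookup. This is an illustrative sketch rather than catalogd code: is_hms_sync_disabled is a hypothetical helper, and it assumes that an unset property is represented as None and that property values are the strings "true" or "false".

```python
from typing import Optional


def is_hms_sync_disabled(table_prop: Optional[str],
                         db_prop: Optional[str],
                         disable_hms_sync_by_default: bool) -> bool:
    """Resolve whether event processing is skipped for a table.

    table_prop and db_prop stand for impala.disableHmsSync at each level.
    """
    # The table property wins, then the database property.
    for prop in (table_prop, db_prop):
        if prop is not None:
            return prop.strip().lower() == "true"
    # Neither level sets the property, so the global flag applies.
    return disable_hms_sync_by_default
```

For example, with the global flag set to true, a table whose property is "false" still has its events processed, while a table with no property at either level is skipped.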

Apache Jira: IMPALA-14085