Fixed issues

Review the fixed issues in this release of the Cloudera Data Warehouse service on cloud.

CDPD-89414: Incorrect results for window functions with IGNORE NULLS
When you used the FIRST_VALUE and LAST_VALUE window functions with the IGNORE NULLS clause while vectorization was enabled, the results were incorrect. This occurred because the vectorized execution engine did not properly handle the IGNORE NULLS setting for these functions.
This issue is addressed by modifying the vectorized processing for FIRST_VALUE and LAST_VALUE to correctly respect the IGNORE NULLS clause, ensuring the same results are produced whether vectorization is enabled or disabled.
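For example, a window query of the following shape (table and column names are illustrative) now returns the same results whether vectorized execution is enabled or disabled:

```sql
-- Hypothetical table of sensor readings where some values are NULL.
-- FIRST_VALUE/LAST_VALUE with IGNORE NULLS now skip NULL readings
-- correctly under vectorization.
SELECT
  device_id,
  FIRST_VALUE(reading) IGNORE NULLS OVER (
    PARTITION BY device_id ORDER BY event_time
  ) AS first_reading,
  LAST_VALUE(reading) IGNORE NULLS OVER (
    PARTITION BY device_id ORDER BY event_time
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last_reading
FROM sensor_events;
```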

Apache Jira: HIVE-29122

CDPD-60770: Beeline connections fail when passwords contain special characters
When you used a password containing special characters like #, ^, or ; in a JDBC URL for a Beeline connection, the connection failed with a 401 error. This happened because Beeline did not correctly interpret these special characters in the password.
This issue is resolved by introducing a new method to reparse the password from the original JDBC URL, allowing Beeline to correctly handle and authenticate passwords containing special characters.
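As an illustration (the endpoint and credentials below are placeholders), a connection of this shape now authenticates correctly; quoting the arguments also keeps the shell itself from interpreting the special characters:

```shell
# Hypothetical HiveServer2 endpoint and credentials. Quote the URL and
# password so the shell does not interpret characters such as # ^ ;
beeline -u "jdbc:hive2://hs2.example.com:10001/default;transportMode=http;httpPath=cliservice" \
        -n myuser -p 'p#ss^word;1'
```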

Apache Jira: HIVE-28805

CDPD-85600: Select queries with ORDER BY fail due to compression error
When you ran a Hive SELECT query with an ORDER BY clause, it failed with a java.io.IOException and java.lang.UnsatisfiedLinkError related to the zlib decompressor.
The issue was addressed by ensuring the zlib native library is correctly loaded.

Apache Jira: HIVE-28805

CDPD-90301: Stack overflow error from queries with OR and MIN filters
Queries caused a stack overflow error when they contained multiple OR conditions on the same expression, such as MINUTE(date_) = 2 OR MINUTE(date_) = 10.
This issue is addressed by modifying the HivePointLookupOptimizerRule to keep the original order of expressions and to check if a merge can be performed before creating a new expression.

Apache Jira: HIVE-29208

CDPD-90303: Incorrect results from a CASE expression
A query that used a CASE expression to conditionally return values produced an incorrect result. The query plan incorrectly folded the CASE statement into a COALESCE function, which led to a logic error that filtered out some of the expected results.
This issue is addressed by adding a more strict check when converting CASE expressions into COALESCE during query optimization.
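As a simplified illustration (table and column names are hypothetical), a CASE of the first form below is equivalent to COALESCE(a, b) and may safely be rewritten, while the second form is not, because the THEN branch returns something other than the tested expression:

```sql
-- Safe to fold: equivalent to COALESCE(a, b).
SELECT CASE WHEN a IS NOT NULL THEN a ELSE b END FROM t;

-- Not safe to fold: the THEN branch differs from the tested column,
-- so rewriting it as a COALESCE would change the result.
SELECT CASE WHEN a IS NOT NULL THEN a + 1 ELSE b END FROM t;
```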

Apache Jira: HIVE-24902

CDPD-80655: Compile error with ambiguous column reference
A Hive query using CREATE TABLE AS SELECT with a GROUP BY clause and a window function failed with an "Ambiguous column reference" error. This happened because the query plan couldn't correctly handle redundant keys in the GROUP BY clause.
This issue is fixed by improving the query planner's logic to properly handle complex expressions and their aliases within window functions, allowing the query to compile and run successfully.
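A query of the following shape (table and column names are illustrative) previously failed to compile with the ambiguous-reference error and now succeeds:

```sql
-- CTAS combining GROUP BY (with a redundant key) and a window function
-- over an aggregate, the pattern that previously failed to compile.
CREATE TABLE dept_summary AS
SELECT
  dept,
  SUM(salary) AS total_salary,
  RANK() OVER (ORDER BY SUM(salary) DESC) AS dept_rank
FROM employees
GROUP BY dept, dept;
```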

Apache Jira: HIVE-28878

DWX-20754: Invalid column reference in lateral view queries
The virtual column BLOCK__OFFSET__INSIDE__FILE failed to be correctly referenced in queries using lateral views, resulting in the error:
FAILED: SemanticException Line 0:-1 Invalid column reference 'BLOCK__OFFSET__INSIDE__FILE'.
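For example, a query of this shape (table and array column names are illustrative) previously failed with that error:

```sql
-- Referencing the virtual column alongside a lateral view previously
-- raised the SemanticException above.
SELECT BLOCK__OFFSET__INSIDE__FILE, item
FROM src_table
LATERAL VIEW explode(items) lv AS item;
```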

This issue is now resolved.

Apache Jira: HIVE-28938

DWX-21855: Impala Executors fail to shut down gracefully
During graceful shutdown, Impala executors wait for running queries to finish, up to the graceful shutdown deadline (--shutdown_deadline_s). However, the istio-proxy container on the Impala executor pod was terminated immediately, so the executors became unreachable and were removed from the Impala cluster membership, causing running queries to be cancelled.
This issue is now resolved by ensuring that the istio-proxy container's lifecycle does not affect the executor's cluster membership.

IMPALA-14263: Enhanced join strategy for large clusters
The query planner's cost model for broadcast joins can be skewed by the number of nodes in a cluster. This could lead to suboptimal join strategy choices, especially in large clusters with skewed data, where a partitioned join was chosen over a more efficient broadcast join.
This issue is now resolved by introducing the broadcast_cost_scale_factor query option as an additional tuning knob, besides query hints, for overriding the query planner's decision. To set it cluster-wide for all queries, add the following key-value pair to the default_query_options startup option:
broadcast_cost_scale_factor=<less than 1.0>
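To apply it to a single session instead, the option can be set like any other Impala query option; the value below is only an example:

```sql
-- Scale down the perceived cost of broadcast joins for this session.
SET broadcast_cost_scale_factor=0.5;
```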

Apache Jira: IMPALA-14263

IMPALA-11402: Fetching metadata for tables with huge numbers of files no longer fails with OutOfMemoryError
Previously, when Impala Coordinator tried to fetch file metadata for extremely large tables (those with millions of files or partitions), the Impala Catalog service would attempt to return all the file details at once. This often exceeded the Java memory limits, causing the service to crash with an OutOfMemoryError.
This issue is addressed by configuring the Catalog service to limit the number of file descriptors included in a single getPartialCatalogObject response. A new configuration flag, catalog_partial_fetch_max_files, is introduced to define the maximum number of file descriptors allowed per response (with a default of 1,000,000 files).
If a request exceeds this limit, the Catalog service will truncate the response and return metadata for only a subset of the requested partitions. The coordinator is now designed to detect this truncated response and automatically send new batch requests to fetch the remaining partitions until all required metadata is retrieved. This change ensures that the coordinator can successfully fetch and process the metadata for extremely large tables without crashing due to memory limits.

Apache Jira: IMPALA-11402

CDPD-83031: Client connections are now more stable with TCP keepalive enabled
Previously, TCP keepalive was not enabled by default for client connections. As a result, connections silently severed, for example by a load balancer, could linger and tie up service threads.
TCP keepalive is now enabled by default for all client connections. Impala checks the status of idle connections aggressively, every 10 minutes, which is much faster than the standard system default. This ensures that dead connections, such as those severed by a load balancer, are detected and cleaned up quickly, freeing up service threads faster and improving the overall stability and availability of the Impala service.

Apache Jira: IMPALA-14031

CDPD-77261: Impala can now read Parquet integer data as DECIMAL after schema changes
Previously, if you changed a column type from an integer (INT or BIGINT) to a DECIMAL using ALTER TABLE, Impala could fail to read the original Parquet data files. This happened because the files lacked the specific metadata (logical types) Impala expected for decimals, resulting in an error.
Impala is now more flexible when reading Parquet files following schema evolution. If Impala encounters an integer type but the schema expects a DECIMAL, it automatically assumes a suitable decimal precision and scale, allowing you to successfully query the updated table:
  • INT32 is read as DECIMAL(9, 0).
  • INT64 is read as DECIMAL(18, 0).
This change supports common schema evolution practices by allowing you to update column types without manually rewriting old data files.
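For example (table and column names are illustrative), after a type change like the following, old Parquet files written with the BIGINT type remain readable and their values surface as DECIMAL(18, 0):

```sql
-- Existing data files store the column as INT64 (BIGINT).
ALTER TABLE sales CHANGE amount amount DECIMAL(18, 0);

-- Old files are now read as DECIMAL(18, 0) without being rewritten.
SELECT SUM(amount) FROM sales;
```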

Apache Jira: IMPALA-13625

IMPALA-12927: Impala can now correctly read BINARY columns in JSON tables
Previously, Impala couldn't correctly read BINARY columns in JSON tables, often resulting in errors or incorrect data. This happened because Impala assumed the data was always Base64 encoded, which wasn't true for files written by older Hive versions.
Impala now supports a new table property, 'json.binary.format' (BASE64 or RAWSTRING), and a query option, JSON_BINARY_FORMAT, to explicitly define the binary encoding. This ensures Impala reads the data correctly. If no format is specified, Impala will now return an error instead of risking silent data corruption.
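For example (the table name is illustrative), the encoding can be declared at the table level or overridden per session:

```sql
-- Table-level: declare how BINARY columns are encoded in the JSON files.
ALTER TABLE json_tbl SET TBLPROPERTIES ('json.binary.format'='BASE64');

-- Session-level override via the query option.
SET JSON_BINARY_FORMAT=RAWSTRING;
```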

Apache Jira: IMPALA-12927

IMPALA-13631: Impala cluster responsiveness during table renames
Performing ALTER TABLE RENAME operations caused Impala to hold a critical internal lock for too long, which blocked other DDL and DML operations.
This issue is resolved by ensuring that the critical internal lock is no longer held during long-running external calls initiated by ALTER TABLE RENAME operations. This prevents the entire Impala cluster from being blocked, allowing other queries and catalog operations to proceed without interruption.

Apache Jira: IMPALA-13631

Catalogd and Event Processor Improvements
  • Faster Inserts for Partitioned Tables (IMPALA-14051): Inserting data into very large partitioned tables is now much faster. Previously, Impala communicated with the Hive Metastore (HMS) one partition at a time, which was a major slowdown. Impala now uses the batch insert API to send all insert information to the HMS in one highly efficient call, significantly boosting the performance of your INSERT statements into transactional tables.
  • Quicker Table Administration (IMPALA-13599): Administrative tasks, such as running DROP STATS or changing the CACHED status of a table, are now much faster on tables with many partitions. Impala previously made thousands of individual calls to the HMS for these operations. The system now batches these updates, making far fewer calls to the HMS and speeding up these essential administrative commands.
  • Reliable Table Renames (IMPALA-13989): The ALTER TABLE RENAME command no longer fails when an INVALIDATE METADATA command runs at the same time. Previously, this caused the rename to succeed in the Hive Metastore but fail in Impala's Catalog Server. Impala now includes automatic error handling that instantly runs an internal metadata refresh if the rename is interrupted, ensuring the rename completes successfully without requiring any manual user steps.
  • Efficient Partition Refreshes (IMPALA-13453): Running REFRESH <table> PARTITION <partition> is now much more efficient. Previously, this command always fully reloaded the partition's metadata and column statistics, even if the partition was unchanged. Impala now checks if the partition data has changed before reloading, avoiding the unnecessary drop-add sequence and significantly improving the efficiency of partition metadata updates.
  • Reduced Partition API Calls (IMPALA-13599): Impala has reduced unnecessary API interactions with the HMS during table-level operations. Commands like ALTER TABLE... SET CACHED/UNCACHED or DROP STATS on large tables previously generated thousands of single alter_partition() calls. Impala now utilizes the HMS's bulk-update functionality, batching these partition updates to drastically reduce the total number of required API calls.
  • REFRESH on multiple partitions (IMPALA-14089): Impala now supports using the REFRESH statement on multiple partitions within a single command, which significantly speeds up metadata updates by processing partitions in parallel, reduces lock contention in the Catalog service, and avoids unnecessary increases to the table version. See Impala REFRESH Statement.
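For example (table and partition column names are illustrative), several partitions can now be refreshed in one statement:

```sql
-- Refresh two partitions in a single REFRESH statement.
REFRESH sales_data
  PARTITION (year=2024, month=11)
  PARTITION (year=2024, month=12);
```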

Apache Jira: IMPALA-14051, IMPALA-13599, IMPALA-13989, IMPALA-13453, IMPALA-14089

CDPD-81076: LEFT ANTI JOIN fails on Iceberg V2 tables with Delete files
Queries using a LEFT ANTI JOIN fail with an AnalysisException if the right-side table is an Iceberg V2 table containing delete files. For example, consider the following query:
SELECT * FROM table_a a
LEFT ANTI JOIN iceberg_v2_table b
ON a.id = b.id;

The error Illegal column/field reference 'b.input_file_name' of semi-/anti-joined table 'b' is displayed because semi-joined tuples need to be explicitly made visible for paths pointing inside them to be resolvable.

The fix updates the IcebergScanPlanner to ensure that the tuple containing the virtual fields is made visible when it is semi-joined.

Apache Jira: IMPALA-13888

CDPD-81053: Enable MERGE statement for Iceberg tables with equality deletes
This patch fixes an issue that caused MERGE statements to fail on Iceberg tables that use equality deletes.

The failure occurred because the delete expression calculation was missing the data sequence number, even though the underlying data description included it. This mismatch caused row evaluation to fail.

The fix ensures the data sequence number is correctly included in the result expressions, allowing MERGE operations to complete successfully on these tables.
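For example (table and column names are illustrative), a statement of this shape now completes successfully on an Iceberg table maintained with equality deletes:

```sql
-- Upsert staged rows into an Iceberg table that uses equality deletes.
MERGE INTO iceberg_target t
USING staged_updates s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.val = s.val
WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val);
```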

Apache Jira: IMPALA-13674

CDPD-77773: Tolerate missing data files during Iceberg table loading
This fix addresses an issue where an Iceberg table would fail to load completely if any of its data files were missing from the file system. The resulting TableLoadingException left the table in an incomplete state, blocking all operations on it.

Impala now tolerates missing data files during the table loading process. An exception will only be thrown if a query subsequently attempts to read one of the specific files that is missing.

This change allows other operations that do not depend on the missing data—such as ROLLBACK, DROP PARTITION, or SELECT statements on valid partitions—to execute successfully.

Apache Jira: IMPALA-13654

CDPD-78508: Skip reloading Iceberg tables when metadata JSON file is the same
This patch optimizes metadata handling for Iceberg tables, particularly those that are updated frequently.

Previously, if an event processor was lagging, Impala might receive numerous update events for the same table (for example, 100 events). Impala would attempt to reload the table 100 times, even if the table's state was already up-to-date after processing the first event.

With this fix, Impala now compares the path of the incoming metadata JSON file with the one that is currently loaded. If the metadata file location is the same, Impala skips the reload, since the table state is already up-to-date. This significantly reduces unnecessary metadata processing.

Apache Jira: IMPALA-13718

Fixed Common Vulnerabilities and Exposures

Common Vulnerabilities and Exposures (CVE) that are fixed in this release:

CVE Description
CVE-2025-30065 Code execution vulnerability in schema parsing of Apache Parquet-avro module in versions lower than 1.15.1.
CVE-2020-20703 Buffer overflow vulnerability in VIM v.8.1.2135 allows a remote attacker to execute arbitrary code using the operand parameter.
CVE-2024-53990 Cookie handling vulnerability in AsyncHttpClient (AHC) library leading to cross-user cookie misuse.
CVE-2024-52533 Buffer overflow vulnerability in GNOME GLib SOCKS4 proxy handling (gio/gsocks4aproxy.c).
CVE-2024-52046 Apache MINA ObjectSerializationDecoder vulnerability leading to Remote Code Execution (RCE).
CVE-2017-6519 Avahi-daemon IPv6 unicast query handling vulnerability leading to DoS and information leakage.