Parallelizing JDBC External Table queries

Learn about how Impala uses row count estimation to enable parallel processing and improve performance for queries on JDBC tables.

Previously, the Impala planner generated a single-node plan with one scanner thread for queries accessing JDBC tables because table statistics were unavailable. This resulted in serial execution for large JDBC tables, which caused suboptimal performance for scans, joins, and aggregations over millions of rows.

To address this, Impala now estimates the number of rows in a JDBC table by running a COUNT query during the query preparation phase. The planner uses this cardinality estimate to:

Assign multiple scanner threads to JDBC scan nodes.
Introduce exchange nodes to parallelize data fetches across multiple JDBC connections.
Produce more efficient join orders by comparing JDBC row counts against native Impala tables.

Query estimation configuration

You can provide a lower bound for these cardinality estimates by using the following backend flag: --min_jdbc_scan_cardinality=<value>. The default value is 10. This flag ensures the planner does not produce unrealistically low cardinality values.

Query execution optimization comparison

The following table compares the query performance before and after the implementation of JDBC row count estimation:


Feature	Before optimization	After optimization
Cardinality	Fixed at 1 for all JDBC scans.	Derived from the COUNT query (for example, 150,000 rows).
Parallelism	Single fragment and single scanner thread.	Multiple fragments and multiple scanner threads (for example, 7 threads).
Operators	No `EXCHANGE` or `MERGING EXCHANGE` operators.	Includes `EXCHANGE` and `MERGING EXCHANGE` operators.
Join execution	Serial execution with no broadcast distribution.	Broadcast joins on small sides to improve parallelism.
Example runtime	Approximately 77 seconds.	Approximately 38 seconds (2x faster).