Parallelizing JDBC queries
Learn about how Impala uses row count estimation to enable parallel processing and improve performance for queries on JDBC tables.
Previously, the Impala planner generated a single-node plan with one scanner thread for queries accessing JDBC tables because table statistics were unavailable. This resulted in serial execution for large JDBC tables, which caused suboptimal performance for scans, joins, and aggregations over millions of rows.
To address this, Impala now estimates the number of rows in a JDBC table by running a COUNT query during the query preparation phase. The planner uses this cardinality estimate to:
- Assign multiple scanner threads to JDBC scan nodes.
- Introduce exchange nodes to parallelize data fetches across multiple JDBC connections.
- Produce more efficient join orders by comparing JDBC row counts against native Impala tables.
Query estimation configuration
You can provide a lower bound for these cardinality estimates by using the following backend flag: --min_jdbc_scan_cardinality=<value>. The default value is 10. This flag ensures the planner does not produce unrealistically low cardinality values.
Query execution optimization comparison
The following table compares the query performance before and after the implementation of JDBC row count estimation:
| Feature | Before optimization | After optimization |
|---|---|---|
| Cardinality | Fixed at 1 for all JDBC scans. | Derived from the COUNT query (for example, 150,000 rows). |
| Parallelism | Single fragment and single scanner thread. | Multiple fragments and multiple scanner threads (for example, 7 threads). |
| Operators | No EXCHANGE or MERGING EXCHANGE operators. | Includes EXCHANGE and MERGING EXCHANGE operators. |
| Join execution | Serial execution with no broadcast distribution. | Broadcast joins on small sides to improve parallelism. |
| Example runtime | Approximately 77 seconds. | Approximately 38 seconds (2x faster). |
