Latitude and longitude point filtering use case

Optimize performance for selective geospatial queries that use separate latitude and longitude columns.

In selective queries, filtering points stored as separate numeric columns is critical for performance. Impala accelerates these queries by using optimized Parquet and Iceberg reads alongside native code generation to bypass Java Virtual Machine (JVM) overhead. You can use this optimization when your data contains separate double columns for latitude and longitude rather than a single binary-encoded geometry column. This approach allows the engine to skip irrelevant data at the storage level before it reaches the processing layer.

Example

The following example shows a typical selective query that benefits from native acceleration:

create table lon_lat_tbl (lon double, lat double, s string) stored as iceberg;
insert into lon_lat_tbl values (1, 1, "abc");
explain select * from lon_lat_tbl where st_intersects(st_point(lon, lat), st_polygon("polygon ((2 2, 3 2, 3 3, 2 3, 2 2 ))")); -- it can be seen that the bounding rect check (lat <= 3, lat >= 2, lon <= 3, lon >= 2) is added to the predicate, which allows Parquet stat filtering, but Iceberg still doesn't skip the file during planning.

Iceberg fails to skip the file during planning happens because Impala pushing predicates down to Iceberg only if a predicate exists on a partitioning column. You can force predicate pushdown by setting the impala.iceberg.push_down_hint table property..

Example

alter table lon_lat_tbl set tblproperties("impala.iceberg.push_down_hint"="lon,lat"); -- repeating the same query leads to dropping the file during planning:
explain select * from lon_lat_tbl where st_intersects(st_point(lon, lat), st_polygon("polygon ((2 2, 3 2, 3 3, 2 3, 2 2 ))")); -- HDFS partitions=0/1 files=0 size=0B

This approach has the following benefits:

Storage-level filtering – Enables partition, file, and page-level filtering, which prevents the engine from evaluating every row.
Reduced overhead – Minimizes slow function calls caused by JNI transitions and data serialization.
Constant geometry optimization – Deserializes constant arguments only once, improving efficiency when the first argument of a function is constant.

You can further optimize the data by partitioning or sorting it based on a value that correlates with the location, such as a state or geohash.