Transparent query retries in Impala

This feature allows Impala to automatically retry queries that fail due to cluster membership changes, improving fault tolerance without user intervention.

Transparent query retries will automatically retry any queries that fail due to cluster membership changes. A cluster membership change typically entails a node leaving the cluster before it crashed or for some other reason stopped responding to statestore heartbeats.

Traditionally, if a query runs on a node in the Impala cluster, and that node crashes, then the query will fail and it is up to the user to retry the query. With transparent query retries, the query will be automatically retried.

Queries are only retried if the query failed due to a cluster membership change. Trivial failures, like SQL parsing exceptions are not retried.
Cluster membership changes fall into two categories: membership updates from the statestore or node blacklisting events. Impalads periodically send heartbeats to the statestore, if an impalad stops sending heartbeats to the statestore then that impalad is removed from the cluster membership. Node blacklisting events occur when a query fails and as a result, an impalad in the cluster is added to the Coordinator's node blacklist. In this scenario, the query is retried.
For most users, query retries will be completely transparent, but users who want to know why a retry was necessary can use runtime profiles. Each query attempt is modelled as a completely new query. Thus, each query attempt has its own runtime profiles. Users can look through the profiles of the failed query attempts to determine why the query was retried.

Transparent query retries are turned off by default, but can be enabled via the RETRY_FAILED_QUERIES query option.