Failure detection for Catalog and StateStore

Learn how the Impala High Availability (HA) system maintains continuous operation by switching from primary to standby instances when a heartbeat check fails.

Catalog Failure Detection

The StateStore instance continuously sends heartbeats to its registered clients, including the primary and standby Catalog instances, to track the Impala daemons in the cluster and determine whether each daemon is healthy. If the StateStore finds that the primary Catalog instance is unhealthy but the standby Catalog instance is healthy, it promotes the standby Catalog instance to primary and notifies all coordinators of the change. The coordinators then switch over to the new primary Catalog instance.
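The promotion rule described above can be sketched in a few lines of Python. This is only an illustrative model, not Impala's implementation: the class names, the instance names, and the boolean health inputs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Coordinator:
    """Hypothetical coordinator that remembers which Catalog instance it uses."""
    name: str
    active_catalog: str


@dataclass
class CatalogHaSketch:
    """Toy model of the StateStore's view of the two Catalog instances."""
    active: str
    standby: str
    coordinators: List[Coordinator]

    def on_health_check(self, primary_healthy: bool, standby_healthy: bool) -> bool:
        """Apply the promotion rule; return True if a failover happened."""
        # Promote only when the primary is unhealthy AND the standby is healthy.
        if primary_healthy or not standby_healthy:
            return False
        # Swapping the roles is a simplification of this sketch; the text only
        # states that the standby is promoted to primary.
        self.active, self.standby = self.standby, self.active
        # Notify every coordinator so it switches to the new primary.
        for coordinator in self.coordinators:
            coordinator.active_catalog = self.active
        return True
```

Calling `on_health_check(False, True)` on such an object promotes the standby and updates every coordinator, while any other combination of inputs leaves the roles unchanged.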

When the system detects that the active catalogd is unhealthy, it initiates a failover to the standby catalogd. During this brief transition, some nodes might not immediately recognize the new active catalogd, so queries that are running at the time can fail because they cannot access metadata. Rerun these failed queries after the failover completes and the new active catalogd is operational.
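One way a client application might rerun such queries is a small retry wrapper. The sketch below is generic Python and does not use any Impala client API: `run_query` is a caller-supplied function, and `MetadataUnavailableError` is a hypothetical exception standing in for whatever error your client raises when a query fails during failover. The attempt count and delay are arbitrary placeholders.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class MetadataUnavailableError(Exception):
    """Hypothetical error for a query that failed during a catalogd failover."""


def run_with_retry(
    run_query: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run run_query, retrying after a pause if it fails during failover."""
    for attempt in range(1, attempts + 1):
        try:
            return run_query()
        except MetadataUnavailableError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(delay_s)  # give the failover time to complete
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists only so that the wait can be replaced in tests; in normal use the default `time.sleep` applies.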

StateStore Failure Detection

The primary StateStore instance continuously sends heartbeats to its registered clients and to the standby StateStore instance. Each StateStore client registers with both the active and standby StateStore instances and maintains the following information about each StateStore server: the server IP and port, the service role (primary or standby), the time the last heartbeat request was received, and the number of missed heartbeats.

A missing heartbeat response from a StateStore client indicates an unhealthy daemon. A flag defines MAX_MISSED_HEARTBEAT_REQUEST_NUM, the number of consecutive missed heartbeat requests after which a client considers communication with the StateStore server lost and marks that server as down.

The standby StateStore instance collects the connection states between the clients (Catalog, coordinators, and executors) and the primary StateStore instance through its heartbeat messages to the clients. If the standby StateStore instance misses MAX_MISSED_HEARTBEAT_REQUEST_NUM heartbeat requests from the primary StateStore instance and a majority of clients lose their connections to the primary StateStore, the standby takes over the primary role.
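The client-side bookkeeping and the standby's takeover condition can be modeled roughly as follows. This is a sketch under stated assumptions, not Impala code: the threshold value of 5 is a placeholder, the class and function names are invented for illustration, and "majority" is interpreted here as strictly more than half of the known clients.

```python
from dataclasses import dataclass
from typing import Dict

# Placeholder value for illustration; the real threshold comes from a flag.
MAX_MISSED_HEARTBEAT_REQUEST_NUM = 5


@dataclass
class StateStoreServerView:
    """What one client tracks about one StateStore server (hypothetical type)."""
    address: str                      # server IP and port, for example "10.0.0.1:24000"
    role: str                         # "primary" or "standby"
    last_heartbeat_time: float = 0.0  # when the last heartbeat request arrived
    missed_heartbeats: int = 0        # consecutive missed heartbeat requests

    def on_heartbeat(self, now: float) -> None:
        """Record a received heartbeat and reset the consecutive-miss counter."""
        self.last_heartbeat_time = now
        self.missed_heartbeats = 0

    def on_missed_heartbeat(self) -> None:
        """Record one more consecutive missed heartbeat request."""
        self.missed_heartbeats += 1

    @property
    def is_down(self) -> bool:
        """True once the client should mark this StateStore server as down."""
        return self.missed_heartbeats >= MAX_MISSED_HEARTBEAT_REQUEST_NUM


def standby_should_take_over(
    standby_missed: int, client_connected_to_primary: Dict[str, bool]
) -> bool:
    """Both conditions must hold: the standby missed enough heartbeat requests
    from the primary, and a majority of clients lost the primary as well."""
    if standby_missed < MAX_MISSED_HEARTBEAT_REQUEST_NUM:
        return False
    disconnected = sum(1 for ok in client_connected_to_primary.values() if not ok)
    # Assumption: "majority" means strictly more than half of the clients.
    return disconnected * 2 > len(client_connected_to_primary)
```

Requiring both conditions means that the standby does not promote itself when only its own link to the primary is broken while most clients can still reach the primary.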