Failure detection for Catalog and StateStore
Learn how the Impala High Availability (HA) system maintains continuous operation by switching from primary to standby instances when a heartbeat check fails.
Catalog Failure Detection
The StateStore instance continuously sends heartbeat to its registered clients, including the primary and standby Catalog instances, to track Impala daemons in the cluster to determine if the daemon is healthy. If the StateStore finds the primary Catalog instance is not healthy but the standby Catalog instance is healthy, StateStore promotes the standby Catalog instance as primary instance and notifies all coordinators about this change. Coordinators switch over to the new primary Catalog instance.
When the system detects that the active catalogd is unhealthy, it initiates a failover to the standby Catalogd. During this brief transition, some nodes might not immediately recognize the new active Catalogd, causing currently running queries to fail due to lack of access to metadata. These failed queries need to be rerun after the failover is complete and the new active Catalogd is operational.
StateStore Failure Detection
The primary StateStore instance continuously sends heartbeat to its registered clients, and
the standby StateStore instance. Each StateStore client registers with both active and standby
statestore instances, and maintains the following information about the StateStore servers: the
server IP and port, service role - primary/standby, the last time the heartbeat request was
received, or number of missed heartbeats. A missing heartbeat response from the StateStore’s
client indicates an unhealthy daemon. There is a flag that defines
MAX_MISSED_HEARTBEAT_REQUEST_NUM
as the consecutive number of missed heartbeat
requests to indicate losing communication with the StateStore server from the client's point of
view so that the client marks the StateStore server as down. Standby StateStore instance
collects the connection states between the clients (Catalog, coordinators and executors) and
primary StateStore instance in its heartbeat messages to the clients. If the standby StateStore
instance misses MAX_MISSED_HEARTBEAT_REQUEST_NUM
of heartbeat requests from the
primary StateStore instance and majority of clients lose connections with the primary
statestore, it takes over the primary role.