Configuring Hive and Impala for high availability with Hue
To configure Hive for high availability with Hue, you must have two or more
HiveServer2 roles. For Impala, you must have two or more Impala daemon (impalad) roles.
- SSH network access to host machines with a HiveServer2 or Impala Daemon role.
- External database configured for each HiveServer2 and Impala Daemon.
- Hue Load Balancer and Hive/Impala Load Balancer configured with Source IP Persistence.
Source IP Persistence
Without Source IP Persistence, you may encounter the error “Results have expired, rerun the query if needed.”
Hue supports high availability through a load balancer in front of HiveServer2 and Impala. Because the underlying Hue Thrift libraries reuse TCP connections in a pool, a single user session may not always use the same TCP connection. If a TCP connection is balanced away from a HiveServer2 or impalad instance, the user session and its queries (running or returned) can be lost, which triggers the “Results have expired” error.
To prevent sessions from being lost, configure the Hive/Impala Load Balancer with Source IP Persistence so that each Hue instance sends all traffic to a single HiveServer2/Impala instance. This is not true load balancing; it is a failover configuration for high availability.
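As a rough illustration of what Source IP Persistence means, the following Python sketch pins each client address to one backend and moves to another backend only when the preferred one is unhealthy. The host names, the `pick_backend` function, and the hash-based selection are assumptions made for this example; a real load balancer uses its own persistence mechanism.

```python
import hashlib

# Hypothetical HiveServer2 backends behind the load balancer.
BACKENDS = ["hs2-1.example.com:10000", "hs2-2.example.com:10000"]


def pick_backend(source_ip, backends, healthy=None):
    """Return the backend that all connections from source_ip should use.

    The same source IP always maps to the same preferred backend. If that
    backend is not healthy, the next healthy one in list order is used,
    which models failover rather than load balancing.
    """
    healthy = set(backends) if healthy is None else set(healthy)
    digest = hashlib.sha256(source_ip.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(backends)
    # Walk the list from the preferred backend until a healthy one is found.
    for offset in range(len(backends)):
        candidate = backends[(start + offset) % len(backends)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy backend available")
```

Because the choice depends only on the source address, every pooled TCP connection from one Hue host lands on the same HiveServer2 instance, so the session state stays reachable.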
To prevent sessions from timing out while in use, add more Hue Server instances so that each can be pinned to a different HiveServer2/Impala instance. For both HiveServer2 and Impala, set the affinity timeout (that is, the timeout to close persisted sessions) to be longer than the Impala query and session timeouts.
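The timeout rule can be written as a simple check. This is a minimal sketch; the function name, the parameter names, and the sample values are hypothetical, and all values are assumed to be in seconds.

```python
def affinity_timeout_ok(affinity_s, query_timeout_s, session_timeout_s):
    """Return True when the load balancer affinity timeout is longer than
    both the query timeout and the session timeout (all in seconds)."""
    return affinity_s > max(query_timeout_s, session_timeout_s)


# Illustrative values only: a 2-hour affinity timeout against 1-hour timeouts.
print(affinity_timeout_ok(7200, 3600, 3600))
```

If the check fails, the load balancer may drop the persisted mapping while a query or session is still valid, and the next request can be routed to a different instance.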
For the best load distribution, create multiple profiles in your load balancer, one per port, for both non-Hue clients and Hue clients. Have non-Hue clients distribute load in a round robin, and configure Hue clients with Source IP Persistence on dedicated ports: for example, 21000 for impala-shell, 21050 for impala-jdbc, and 21051 for Hue.
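The per-port layout described here can be modeled as a small table. The sketch below uses the example ports from this section; the `Profile` class, the policy names, and `policy_for_port` are assumptions for illustration and do not correspond to any particular load balancer's configuration syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    port: int     # listening port on the load balancer
    client: str   # client type expected on this port
    policy: str   # "round_robin" or "source_ip" (hypothetical policy names)


# Example ports taken from the text above.
PROFILES = [
    Profile(21000, "impala-shell", "round_robin"),
    Profile(21050, "impala-jdbc", "round_robin"),
    Profile(21051, "hue", "source_ip"),
]


def policy_for_port(port):
    """Look up the balancing policy configured for a load balancer port."""
    for profile in PROFILES:
        if profile.port == port:
            return profile.policy
    raise KeyError(f"no profile defined for port {port}")
```

Keeping Hue on its own port lets the persistence setting apply to Hue traffic only, while other clients continue to be spread across all instances.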