This is the documentation for CDH 5.1.x. Documentation for other versions is available at Cloudera Documentation.

Using Impala through a Proxy for High Availability

For most clusters that have multiple users and production availability requirements, you might set up a proxy server to relay requests to and from Impala. This configuration has the following advantages:

- Applications connect to a single well-known host and port, rather than keeping track of the hosts where the impalad daemon is running.
- If any host running the impalad daemon becomes unavailable, application connection requests will still succeed because you always connect to the proxy server.
- The "coordinator node" for each Impala query potentially requires more memory and CPU cycles than the other nodes that process the query. The proxy server can issue queries using round-robin scheduling, so that each connection uses a different coordinator node. This load-balancing technique lets the Impala nodes share this additional work, rather than concentrating it on a single machine.
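
To illustrate the round-robin idea, the following Python sketch hands out coordinator hosts in rotation, one per incoming connection. It only models the scheduling behavior and does not reflect how any particular proxy is implemented; the host names and the assign_coordinators function are hypothetical.

```python
from collections import Counter
from itertools import cycle

# Hypothetical host names, used only for illustration.
IMPALAD_HOSTS = [
    "impala-host-1.example.com",
    "impala-host-2.example.com",
    "impala-host-3.example.com",
]


def assign_coordinators(hosts, num_connections):
    """Return the host chosen for each connection under round-robin."""
    rotation = cycle(hosts)
    return [next(rotation) for _ in range(num_connections)]


# Nine connections spread over three hosts.
assignments = assign_coordinators(IMPALAD_HOSTS, 9)
per_host = Counter(assignments)
for host in IMPALAD_HOSTS:
    print(host, per_host[host])
```

Because each new connection goes to the next host in the list, no single impalad ends up acting as coordinator for every query.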

The following setup steps are a general outline that applies to any load-balancing proxy software.

- Download the load-balancing proxy software. It should only need to be installed and configured on a single host.
- Configure the software (typically by editing a configuration file). Set up a port that the load balancer will listen on to relay Impala requests back and forth.
- Specify the host and port settings for each Impala node. These are the hosts that the load balancer will choose from when relaying each Impala query. See Appendix A - Ports Used by Impala for when to use port 21000, 21050, or another value depending on what type of connections you are load balancing.
- Run the load-balancing proxy server, pointing it at the configuration file that you set up.
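
Before you list the Impala nodes in the proxy configuration, it can help to confirm that each host and port accepts TCP connections from the proxy host. The following is a minimal Python sketch of such a check, using only the standard library. The function names are hypothetical, and a successful connection shows only that something is listening on the port, not that it is a healthy impalad.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_backends(backends, timeout=3.0):
    """Return the (host, port) pairs that did not accept a connection."""
    return [(host, port) for host, port in backends
            if not is_reachable(host, port, timeout)]
```

For example, calling unreachable_backends([("impala-host-1.example.com", 21000), ("impala-host-2.example.com", 21000)]) from the proxy host returns the entries that could not be reached; an empty list means every listed port accepted a connection.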

Special Proxy Considerations for Clusters Using Kerberos

In a cluster using Kerberos, applications check host credentials to verify that the host they are connecting to is the same one that is actually processing the request, to prevent man-in-the-middle attacks. To clarify that the load-balancing proxy server is legitimate, perform these extra Kerberos setup steps:

- This section assumes you are starting with a Kerberos-enabled cluster. See Enabling Kerberos Authentication for Impala for instructions for setting up Impala with Kerberos. See the CDH Security Guide for general steps to set up Kerberos: CDH 4 instructions or CDH 5 instructions.
- Choose the host you will use for the proxy server. Based on the Kerberos setup procedure, it should already have an entry impala/proxy_host@realm in its keytab. If not, go back over the initial Kerberos configuration steps.
- Copy the keytab file from the proxy host to all other hosts in the cluster that run the impalad daemon. (For optimal performance, impalad should be running on all DataNodes in the cluster.) Put the keytab file in a secure location on each of these other hosts.
- On systems not managed by Cloudera Manager, add an entry impala/actual_hostname@realm to the keytab on each host running the impalad daemon.
- For each impalad node, merge the existing keytab with the proxy’s keytab using ktutil, producing a new keytab file. For example:
```
$ ktutil
ktutil: read_kt proxy.keytab
ktutil: read_kt impala.keytab
ktutil: write_kt proxy_impala.keytab
ktutil: quit
```
Make sure that the impala user has permission to read this merged keytab file.
- Change some configuration settings for each host in the cluster that participates in the load balancing. In the impalad option definition, or the Cloudera Manager safety valve (Cloudera Manager 4) or advanced configuration snippet (Cloudera Manager 5), add:
```
--principal=impala/proxy_host@realm
--be_principal=impala/actual_host@realm
--keytab_file=path_to_merged_keytab
```
  Note: Every host has a different --be_principal because the actual host name is different on each host.
- On a cluster managed by Cloudera Manager, create a role group to set the configuration values from the preceding step on a per-host basis.
- On a cluster not managed by Cloudera Manager, see Modifying Impala Startup Options for the procedure to modify the startup options.
- On a cluster managed by Cloudera Manager, restart the Impala service.
- On a cluster not managed by Cloudera Manager, restart the impalad daemons on all hosts in the cluster, as well as the statestored and catalogd daemons.
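
Because --be_principal differs on every host while --principal stays the same, it can be convenient to generate the three options per host rather than typing them by hand. The following Python sketch shows one way to do that. The proxy host name, realm, keytab path, and the impalad_kerberos_flags function are all hypothetical placeholders; substitute the values from your own cluster.

```python
def impalad_kerberos_flags(proxy_host, actual_host, realm, keytab_path):
    """Build the three Kerberos-related impalad options for one host."""
    return [
        f"--principal=impala/{proxy_host}@{realm}",
        f"--be_principal=impala/{actual_host}@{realm}",
        f"--keytab_file={keytab_path}",
    ]


# Hypothetical values, for illustration only.
PROXY_HOST = "proxy-host.example.com"
REALM = "EXAMPLE.COM"
MERGED_KEYTAB = "/path/to/proxy_impala.keytab"
IMPALAD_HOSTS = [
    "impala-host-1.example.com",
    "impala-host-2.example.com",
]

# Print the options that belong on each impalad host.
for host in IMPALAD_HOSTS:
    print(host)
    for flag in impalad_kerberos_flags(PROXY_HOST, host, REALM, MERGED_KEYTAB):
        print("   ", flag)
```

In the output, the --principal line is identical for every host and names the proxy, while the --be_principal line names the host itself.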

Example of Configuring HAProxy Load Balancer for Impala

If you are not already using a load-balancing proxy, you can experiment with HAProxy, a free, open source load balancer. This example shows how you might install and configure that load balancer on a Red Hat Enterprise Linux system.

- Install the load balancer: yum install haproxy
- Set up the configuration file, /etc/haproxy/haproxy.cfg. See below for a sample configuration file.
- Run the load balancer (on a single host, preferably one not running impalad): /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg
- In impala-shell, JDBC applications, or ODBC applications, connect to haproxy_host:25003, rather than to port 21000 or 21050 on a host actually running impalad.

This is the sample haproxy.cfg used in this example.

global
    # To have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events.  This is done
    #    by adding the '-r' option to the SYSLOGD_OPTIONS in
    #    /etc/sysconfig/syslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    #   file. A line like the following can be added to
    #   /etc/sysconfig/syslog
    #
    #    local2.*                       /var/log/haproxy.log
    #
    log         127.0.0.1 local0
    log         127.0.0.1 local1 notice
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#
# You might need to adjust timing values to prevent timeouts.
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    maxconn                 3000
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000

#
# This sets up the admin page for HAProxy at port 25002.
#
listen stats :25002
    balance
    mode http
    stats enable
    stats auth username:password

# This is the setup for Impala. Impala clients connect to load_balancer_host:25003.
# HAProxy balances connections among the servers listed below.
# Each impalad listens on port 21000 for Beeswax (impala-shell) or the original ODBC driver.
# For JDBC or the ODBC version 2.x driver, use port 21050 instead of 21000.
listen impala :25003
    mode tcp
    option tcplog
    balance leastconn

    server symbolic_name_1 impala-host-1.example.com:21000
    server symbolic_name_2 impala-host-2.example.com:21000
    server symbolic_name_3 impala-host-3.example.com:21000
    server symbolic_name_4 impala-host-4.example.com:21000
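
The sample above relays connections to port 21000 only. As the comments note, JDBC and ODBC 2.x clients need port 21050, so you might add a second listen block for them. The following Python sketch renders such a block from a host list, mirroring the layout of the sample. The listener name impalajdbc, the listening port 21051, and the render_listen_block function are hypothetical choices for illustration, not values taken from this guide.

```python
BEESWAX_PORT = 21000  # impala-shell and the original ODBC driver
HS2_PORT = 21050      # JDBC and ODBC version 2.x drivers

# Host names copied from the sample configuration above.
IMPALAD_HOSTS = [
    "impala-host-1.example.com",
    "impala-host-2.example.com",
    "impala-host-3.example.com",
    "impala-host-4.example.com",
]


def render_listen_block(name, listen_port, backend_port, hosts):
    """Render an HAProxy 'listen' block in the same style as the sample."""
    lines = [
        f"listen {name} :{listen_port}",
        "    mode tcp",
        "    option tcplog",
        "    balance leastconn",
        "",
    ]
    for index, host in enumerate(hosts, start=1):
        lines.append(f"    server symbolic_name_{index} {host}:{backend_port}")
    return "\n".join(lines)


# Hypothetical listener name and listening port for JDBC/ODBC 2.x clients.
jdbc_block = render_listen_block("impalajdbc", 21051, HS2_PORT, IMPALAD_HOSTS)
print(jdbc_block)
```

You would then paste the printed block into haproxy.cfg alongside the existing impala listener and point JDBC or ODBC 2.x clients at the new listening port on the proxy host.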

Page generated September 3, 2015.