Configuring HiveServer2 High Availability in CDH

To enable high availability for multiple HiveServer2 hosts, configure a load balancer to manage them. To increase stability and security, configure the load balancer on a proxy server. The following sections describe how to enable high availability by using Cloudera Manager or how to enable it manually for unmanaged clusters.

Enabling HiveServer2 High Availability Using Cloudera Manager

Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)

  1. Add multiple HiveServer2 instances to your cluster:
    1. Go to the Hive service.
    2. Click the Instances tab, and then click Add Role Instances.
    3. On the Add Role Instances to Hive page under the HiveServer2 column heading, click Select hosts, and select the hosts that should have a HiveServer2 instance.
    4. Click OK, and then click Continue. You will be brought to the Instances page where you can start the new HiveServer2 instances.
  2. Click the Configuration tab.
  3. Select Scope > HiveServer2.
  4. Select Category > Main.
  5. Locate the HiveServer2 Load Balancer property or search for it by typing its name in the Search box.
  6. Enter values for <hostname>:<port number>. For example, hs2load_balancer.example.com:10015.
  7. Click Save Changes to commit the changes.
  8. Restart the Hive service.

Configuring HiveServer2 to Load Balance Behind a Proxy on Unmanaged Clusters

For unmanaged clusters with multiple users and availability requirements, you can configure a proxy server to relay requests to and from each HiveServer2 host. Applications connect to a single well-known host and port, and connection requests to the proxy succeed even when hosts running HiveServer2 become unavailable.
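As a rough illustration of the single well-known host and port, the following sketch builds a HiveServer2 JDBC URL that targets the proxy. The helper name hs2_jdbc_url, the default database, and the host, port, and principal values are assumptions made for this example. The optional principal argument applies only to Kerberos-enabled clusters.

```shell
#!/bin/sh
# Hypothetical helper: build a HiveServer2 JDBC URL that targets the proxy
# instead of an individual HiveServer2 host.
#   $1 = proxy host, $2 = proxy port, $3 = optional Kerberos service principal
hs2_jdbc_url() {
    if [ -n "${3:-}" ]; then
        # Kerberos-enabled cluster: append the service principal to the URL.
        printf 'jdbc:hive2://%s:%s/default;principal=%s\n' "$1" "$2" "$3"
    else
        printf 'jdbc:hive2://%s:%s/default\n' "$1" "$2"
    fi
}

# Placeholder values; substitute your own proxy host and port.
hs2_jdbc_url hs2loadbalancer.example.com 10001
hs2_jdbc_url hs2loadbalancer.example.com 10001 hive/hs2loadbalancer.example.com@EXAMPLE.COM
```

The resulting string can be passed to Beeline, for example with beeline -u "$(hs2_jdbc_url hs2loadbalancer.example.com 10001)".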

Unmanaged Clusters with Kerberos Enabled

  1. Create the hive/<load_balancer_fully_qualified_domain_name> principal and the merged keytab.

    • If you are using MIT Kerberos, connect to the KDC as root and run the following command in kadmin.local. Replace <load_balancer_fully_qualified_domain_name> with the fully qualified domain name of the load balancer host:

      kadmin.local:  addprinc -randkey hive/<load_balancer_fully_qualified_domain_name>
                
    • If you are using Microsoft Active Directory for your KDC, see Microsoft documentation to create a principal and keytab for Hive. The principal must be named hive/<load_balancer_fully_qualified_domain_name> and the keytab must contain all of the Hive host keytabs for your cluster.

      For example, if your load balancer is hs2loadbalancer.example.com and you have two HiveServer2 instances on hosts hs2-host-1.example.com and hs2-host-2.example.com, running klist -ekt on hive-proxy.keytab should return the following:

      [root@cdh_user-linux named]# klist -ekt /tmp/hive-proxy.keytab
      Keytab name: FILE:/tmp/hive-proxy.keytab
      KVNO Timestamp           Principal
      ---- ------------------- ------------------------------------------------------
         1 09/08/2015 12:46:25 hive/hs2loadbalancer.example.com@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
         2 09/08/2015 12:46:37 hive/hs2-host-1.example.com@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
         2 09/08/2015 12:46:42 hive/hs2-host-2.example.com@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
                
  2. While you are still connected to kadmin.local, list the hive/<hs2_hostname> principals:
    kadmin.local:  listprincs hive/*
    hive/hs2-host-1.example.com@EXAMPLE.COM
    hive/hs2-host-2.example.com@EXAMPLE.COM
            
  3. While you are still connected to kadmin.local, create a hive-proxy.keytab, which contains the load balancer principal and all of the hive/<hs2_hostname> principals:

    kadmin.local:  xst -k /tmp/hive-proxy.keytab -norandkey hive/hs2loadbalancer.example.com
    kadmin.local:  xst -k /tmp/hive-proxy.keytab -norandkey hive/hs2-host-1.example.com
    kadmin.local:  xst -k /tmp/hive-proxy.keytab -norandkey hive/hs2-host-2.example.com
              

    Run xst once per principal; each run appends that principal's entry to the keytab. The -norandkey parameter is required: without it, xst generates new random keys for the principal, which invalidates any existing keytabs that contain it.

  4. Validate the keytab by running klist:
    [root@cdh_user-linux named]# klist -ekt /tmp/hive-proxy.keytab
    Keytab name: FILE:/tmp/hive-proxy.keytab
    KVNO Timestamp           Principal
    ---- ------------------- ------------------------------------------------------
       1 09/08/2015 12:46:25 hive/hs2loadbalancer.example.com@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
       2 09/08/2015 12:46:37 hive/hs2-host-1.example.com@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
       2 09/08/2015 12:46:42 hive/hs2-host-2.example.com@EXAMPLE.COM (aes256-cts-hmac-sha1-96)
              
  5. Distribute the hive-proxy.keytab to all HiveServer2 hosts. Make sure that /var/lib/hive exists on each node and copy the hive-proxy.keytab to /var/lib/hive on each node. Then confirm that ownership is set to hive:hive on the directory and the keytab:
    [root@cdh5xx-1 ~]# rm -f /var/lib/hive/hive-proxy.keytab
    [root@cdh5xx-1 ~]# mkdir -p /var/lib/hive
    [root@cdh5xx-1 ~]# cp /tmp/hive-proxy.keytab /var/lib/hive
    [root@cdh5xx-1 ~]# chown -R hive:hive /var/lib/hive
    [root@cdh5xx-1 ~]# ls -lart /var/lib/hive/
    total 16
    drwxr-xr-x. 49 root root 4096 Jun  7 17:39 ..
    -rw-r--r--   1 hive hive  983 Jun  7 17:40 hive.keystore
    -rw-------   1 hive hive 1412 Sep  8 14:38 hive-proxy.keytab
    drwxr-xr-x   2 hive hive 4096 Sep  8 14:38 .
              
  6. Configure HiveServer2 to use the new keytab and load balancer principal by setting the hive.server2.authentication.kerberos.principal and the hive.server2.authentication.kerberos.keytab properties in the hive-site.xml file. For example, to set these properties for the examples used in the above steps, your hive-site.xml is set as follows:
    <property>
       <name>hive.server2.authentication.kerberos.principal</name>
       <value>hive/hs2loadbalancer.example.com@EXAMPLE.COM</value>
    </property>
    <property>
       <name>hive.server2.authentication.kerberos.keytab</name>
       <value>/var/lib/hive/hive-proxy.keytab</value>
    </property>
              
  7. Restart the Hive service.
  8. Download load-balancing proxy software of your choice on a single host. For example, see Example HAProxy Configuration.
  9. Configure the software, typically by editing a configuration file. Usually this configuration includes:
    1. Setting the port for the load balancer to listen on and to relay HiveServer2 requests back and forth.
    2. Setting the port and hostname for each HiveServer2 host. These are the hosts from which the load balancer chooses when relaying each query.
  10. Run the load-balancing proxy server and point it at the configuration file.
  11. Point all scripts, jobs, or application configurations to the new proxy server instead of any specific HiveServer2 instance.
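
The keytab validation in step 4 can be scripted before the keytab is distributed. The following is a minimal sketch, not part of the documented procedure: the function name check_keytab_principals is hypothetical, and it assumes klist -ekt output in the format shown above, where each principal is followed by its encryption type.

```shell
#!/bin/sh
# Hypothetical helper: check_keytab_principals REALM HOST_FQDN...
# Reads `klist -ekt <keytab>` output on stdin and reports every expected
# hive/<host>@REALM principal that is missing. Returns non-zero if any is missing.
check_keytab_principals() {
    realm=$1
    shift
    listing=$(cat)
    missing=0
    for host in "$@"; do
        # Fixed-string match; the surrounding spaces rely on the -e output
        # format, in which the encryption type follows the principal.
        if ! printf '%s\n' "$listing" | grep -qF -- " hive/${host}@${realm} "; then
            echo "missing: hive/${host}@${realm}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example invocation (requires the Kerberos client tools and the keytab):
#   klist -ekt /tmp/hive-proxy.keytab | check_keytab_principals EXAMPLE.COM \
#       hs2loadbalancer.example.com hs2-host-1.example.com hs2-host-2.example.com
```

Pass the load balancer host name along with every HiveServer2 host name, because the merged keytab must contain all of them.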

Unmanaged Clusters WITHOUT Kerberos

To configure HiveServer2 for high availability for unmanaged clusters, use the following steps.

  1. Download load-balancing proxy software of your choice on a single host. For example, see Example HAProxy Configuration.
  2. Configure the software, typically by editing a configuration file. Usually this configuration includes:
    1. Setting the port for the load balancer to listen on and to relay HiveServer2 requests back and forth.
    2. Setting the port and hostname for each HiveServer2 host. These are the hosts from which the load balancer chooses when relaying each query.
  3. Run the load-balancing proxy server and point it at the configuration file.
  4. Point all scripts, jobs, or application configurations to the new proxy server instead of any specific HiveServer2 instance.
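
Repointing existing scripts at the proxy, as in the last step above, is often a matter of replacing the host and port in each JDBC URL. The following sketch shows one way to do that with sed. The function name repoint_jdbc_url and the host names are hypothetical. It assumes plain jdbc:hive2://host:port URLs, so review the output before overwriting any file.

```shell
#!/bin/sh
# Hypothetical helper: repoint_jdbc_url PROXY_HOST:PORT
# Filter that rewrites the host:port part of every jdbc:hive2:// URL read on
# stdin so that it targets the proxy. The rest of the URL is left unchanged.
repoint_jdbc_url() {
    # The host:port part ends at the first '/', ';', quote, or space.
    sed -E "s#jdbc:hive2://[^/;\"' ]+#jdbc:hive2://$1#g"
}

# Placeholder input line; in practice, pipe a script or configuration file.
echo 'beeline -u "jdbc:hive2://hs2-host-1.example.com:10000/default"' |
    repoint_jdbc_url hs2loadbalancer.example.com:10001
```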

Example HAProxy Configuration

If you are not already using a load-balancing proxy, you can experiment with HAProxy, a free, open source load balancer.

To install and configure HAProxy, perform the following steps.

  1. Download the appropriate version from the HAProxy web site.
  2. As the root user, install HAProxy:

    sudo yum -y install haproxy
                  
  3. Edit the HAProxy configuration file to listen on the client-facing port (10001 in this example) and point to each HiveServer2 instance (port 10000 in this example). Make sure to configure for sticky sessions. Here is an example configuration file:

    global
        # To have these messages end up in /var/log/haproxy.log you will
        # need to:
        #
        # 1) configure syslog to accept network log events.  This is done
        #    by adding the '-r' option to the SYSLOGD_OPTIONS in
        #    /etc/sysconfig/syslog
        #
        # 2) configure local2 events to go to the /var/log/haproxy.log
        #   file. A line like the following can be added to
        #   /etc/sysconfig/syslog
        #
        #    local2.*                       /var/log/haproxy.log
        #
        log         127.0.0.1 local0
        log         127.0.0.1 local1 notice
        chroot      /var/lib/haproxy
        pidfile     /var/run/haproxy.pid
        maxconn     4000
        user        haproxy
        group       haproxy
        daemon
    
        # turn on stats unix socket
        #stats socket /var/lib/haproxy/stats
    
    #---------------------------------------------------------------------
    # common defaults that all the 'listen' and 'backend' sections will
    # use if not designated in their block
    #
    # You might need to adjust timing values to prevent timeouts.
    #---------------------------------------------------------------------
    defaults
        mode                    http
        log                     global
        option                  httplog
        option                  dontlognull
        option http-server-close
        option forwardfor       except 127.0.0.0/8
        option                  redispatch
        retries                 3
        maxconn                 3000
        timeout connect         5000
        timeout client          50000
        timeout server          50000
    
    #
    # This sets up the admin page for HA Proxy at port 25002.
    #
    listen stats :25002
        balance
        mode http
        stats enable
        stats auth username:password
    
    # This is the setup for HS2. Beeline clients connect to load_balancer_host:10001.
    # HAProxy balances connections among the servers listed below.
    listen hiveserver2 :10001
        mode tcp
        option tcplog
        balance source

        server hiveserver2_1 hs2-host-1.example.com:10000
        server hiveserver2_2 hs2-host-2.example.com:10000
        server hiveserver2_3 hs2-host-3.example.com:10000
        server hiveserver2_4 hs2-host-4.example.com:10000
                  
  4. Set HAProxy to start when the system starts:

    chkconfig haproxy on
                  
  5. Start HAProxy:

    service haproxy start
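
When the list of HiveServer2 hosts changes often, the hiveserver2 section of the configuration file can be generated rather than edited by hand. The following sketch prints a section in the same form as the example above. The function name hs2_listen_stanza is hypothetical, and the host names and ports are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: hs2_listen_stanza LISTEN_PORT HS2_HOST:PORT...
# Prints a 'listen' section that mirrors the example configuration:
# TCP mode with source-based balancing for sticky sessions.
hs2_listen_stanza() {
    listen_port=$1
    shift
    printf 'listen hiveserver2 :%s\n' "$listen_port"
    printf '    mode tcp\n'
    printf '    option tcplog\n'
    printf '    balance source\n'
    i=1
    for backend in "$@"; do
        # One 'server' line per HiveServer2 instance, numbered from 1.
        printf '    server hiveserver2_%s %s\n' "$i" "$backend"
        i=$((i + 1))
    done
}

# Placeholder hosts; the output is written to stdout for review.
hs2_listen_stanza 10001 hs2-host-1.example.com:10000 hs2-host-2.example.com:10000
```

After merging the output into the configuration file, you can check the syntax before restarting with haproxy -c -f /etc/haproxy/haproxy.cfg, assuming the configuration file is at that path.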