How to optimally configure your Analytic Database for High Availability with Hue and other SQL clients

How to optimally configure your Analytic Database for High Availability with Hue and other SQL clients

Hi Big Data Explorers,

HiveServer2 and Impala support High Availability through a “load balancer”. One caveat is that Hue’s underlying Thrift libraries reuse TCP connections in a pool, a single user session may not have the same Impala or Hive TCP connection. If a TCP connection is balanced away from the previously selected  HiveServer2 or Impalad instance, the user session and its queries can be lost and trigger the “Results have expired” or “Invalid session Id” errors.

To prevent sessions from being lost, we need configure the load balancer with “source” algorithm to ensure each Hue instance sends all traffic to a single HiveServer2/Impalad instance. Yes, this is not true load balancing, but a configuration for failover High Availability. HiveSever2 or Impala coordinators already distribute the work across the cluster so this is not an issue.

To enable an optimal load distribution that works for everybody, we can create multiple profiles in our load balancer, per port for Hue clients and non-Hue clients like Hive or Impala. We can configure non-Hue clients to distribute loads with “roundrobin” or “leastconn” and configure Hue clients with “source” (source IP Persistence) on dedicated ports, for example, 10015 for Hive beeline commands, 10016 for Hue, 21051 for Hue-Impala interactions while 25003 for Impala shell.

As shown in above diagram, you can configure the HaProxy to have two different ports associated with different load balancing algorithms. Here is a sample configuration (haproxy.cfg) for Hive and Impala HA on a secure cluster.

#--------------------------------------------------------------------
# main frontend which proxys to the backends
#--------------------------------------------------------------------
frontend hiveserver2_front
bind *:10015 ssl crt /path/to/cert_key.pem
mode tcp
option tcplog
default_backend hiveserver2
#--------------------------------------------------------------------
# round robin balancing between the various backends
#--------------------------------------------------------------------
# This is the setup for HS2. beeline client connect to load_balancer_host:load_balancer_port.
# HAProxy will balance connections among the list of servers listed below.
backend hiveserver2
    balance                     roundrobin
    mode                        tcp
    server hiveserver2_1 host-2.com:10000 ssl ca-file /path/to/truststore.pem check
    server hiveserver2_2 host-3.com:10000 ssl ca-file /path/to/truststore.pem check
    server hiveserver2_3 host-1.com:10000 ssl ca-file /path/to/truststore.pem check

# Setup for Hue or other JDBC-enabled applications.
# In particular, Hue requires sticky sessions.
# The application connects to load_balancer_host:10016, and HAProxy balances
# connections to the associated hosts, where Hive listens for JDBC requests on port 10015.
#--------------------------------------------------------------------
# main frontend which proxys to the backends
#--------------------------------------------------------------------
frontend  hivejdbc_front
    bind                        *:10016 ssl crt /path/to/cert_key.pem
    mode                        tcp
    option                      tcplog
    stick                       match src
    stick-table type ip size 200k expire 30m
    default_backend             hivejdbc

#--------------------------------------------------------------------
# source balancing between the various backends
#--------------------------------------------------------------------
# HAProxy will balance connections among the list of servers listed below.
backend hivejdbc
    balance                     source
    mode                        tcp
    server hiveserver2_1 host-2.com:10000 ssl ca-file /path/to/truststore.pem check
    server hiveserver2_2 host-3.com:10000 ssl ca-file /path/to/truststore.pem check
    server hiveserver2_3 host-1.com:10000 ssl ca-file /path/to/truststore.pem check

And here is an example for impala HA configuration on a secure cluster.

# The list of Impalad is listening at port 21000 for beeswax (impala-shell) or original ODBC driver.
# For JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.
#--------------------------------------------------------------------
# main frontend which proxys to the backends
#--------------------------------------------------------------------
frontend  impala_front
    bind                        *:25003 ssl crt /path/to/cert_key.pem
    mode                        tcp
    option                      tcplog
    default_backend             impala
#--------------------------------------------------------------------
# round robin balancing between the various backends
#--------------------------------------------------------------------
backend impala
    balance                     leastconn
    mode                        tcp
    server impalad1 host-3.com:21000 ssl ca-file /path/to/truststore.pem check
    server impalad2 host-2.com:21000 ssl ca-file /path/to/truststore.pem check
    server impalad3 host-4.com:21000 ssl ca-file /path/to/truststore.pem check

# Setup for Hue or other JDBC-enabled applications.
# In particular, Hue requires sticky sessions.
# The application connects to load_balancer_host:21051, and HAProxy balances
# connections to the associated hosts, where Impala listens for JDBC requests on port 21050.
#--------------------------------------------------------------------
# main frontend which proxys to the backends
#--------------------------------------------------------------------
frontend  impalajdbc_front
    bind                        *:21051 ssl crt /path/to/cert_key.pem
    mode                        tcp
    option                      tcplog
    stick                       match src
    stick-table type ip size 200k expire 30m
    default_backend             impalajdbc
#--------------------------------------------------------------------
# source balancing between the various backends
#--------------------------------------------------------------------
# HAProxy will balance connections among the list of servers listed below.
backend impalajdbc
    balance                     source
    mode                        tcp
    server impalad1 host-3.com:21050 ssl ca-file /path/to/truststore.pem check
    server impalad2 host-2.com:21050 ssl ca-file /path/to/truststore.pem check
    server impalad3 host-4.com:21050 ssl ca-file /path/to/truststore.pem check
Note: “check” is required at end of each line to ensure HaProxy can detect any unreachable Impalad/HiveServer2 server, so HA failover can be successful. Without TCP check, you may hit the “TSocket reads 0 byte” error when the Impalad/HiveServer2 server Hue tries to connect is down.

After editing the /etc/haproxy/haproxy.cfg file, run following commands to restart HaProxy service and check the service restarts successfully.

service haproxy restart
service haproxy status

Also we need add following blocks into hue.ini or if you are using Cloudera Manager, adding following block inside Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini

[impala]
server_port=21051

[beeswax]
hive_server_port=10016

Now you are all set for Hive/Impala High Availability setups! In the future, this configuration will be managed by Cloudera Manager so that your Analytic Database is HA out of the box.

6 Comments

  1. Ruslan 12 months ago

    That’s great. Thanks for sharing.
    Is there is a jira to make Hue possible to work with HA-enabled HS2 directly, without requirement to have a standalone load balancer?
    I guess Hue could read Hive config and implement failover in its own lagic rather than relying on additional software package to do load balancing / failover.

  2. impalaben 12 months ago

    Hello,
    Thanks for sharing.
    When i start de haproxy i got this errors
    parsing [/etc/haproxy/haproxy.cfg:81] : ‘stick’ ignored because frontend ‘hivejdbc_front’ has no backend capability.
    [WARNING] 004/150700 (26772) : parsing [/etc/haproxy/haproxy.cfg:124] : ‘stick’ ignored because frontend ‘impalajdbc_front’ has no backend capability.
    [WARNING] 004/150700 (26772) : config : ‘option forwardfor’ ignored for frontend ‘hiveserver2_front’ as it requires HTTP mode.
    [WARNING] 004/150700 (26772) : config : ‘option forwardfor’ ignored for backend ‘hiveserver2’ as it requires HTTP mode.
    [WARNING] 004/150700 (26772) : config : ‘option forwardfor’ ignored for frontend ‘hivejdbc_front’ as it requires HTTP mode.
    [WARNING] 004/150700 (26772) : config : ‘option forwardfor’ ignored for backend ‘hivejdbc’ as it requires HTTP mode.
    [WARNING] 004/150700 (26772) : config : ‘option forwardfor’ ignored for frontend ‘impala_front’ as it requires HTTP mode.
    [WARNING] 004/150700 (26772) : config : ‘option forwardfor’ ignored for backend ‘impala’ as it requires HTTP mode.
    [WARNING] 004/150700 (26772) : config : ‘option forwardfor’ ignored for frontend ‘impalajdbc_front’ as it requires HTTP mode.
    [WARNING] 004/150700 (26772) : config : ‘option forwardfor’ ignored for backend ‘impalajdbc’ as it requires HTTP mode.

    and when i excute an impala query with Hue i have
    Query 944953c633087c57:d9c5315300000000 expired due to client inactivity (timeout is 3m)
    any suggestions

  3. Author
    Hue Team 12 months ago

    Did you configure the common defaults to have http as mode? It should look like this:
    #———————————————————————
    # common defaults that all the ‘listen’ and ‘backend’ sections will
    # use if not designated in their block
    #———————————————————————
    defaults
    mode http
    log global

    Also the frontend and backend should be proper mapped, something like this:
    frontend hiveserver2_front
    bind *:10015 ssl crt /path/to/cert_key.pem
    mode tcp
    option tcplog
    default_backend hiveserver2

    #———————————————————————
    # round robin balancing between the various backends
    #———————————————————————
    # This is the setup for HS2. beeline client connect to load_balancer_host:load_balancer_port.
    # HAProxy will balance connections among the list of servers listed below.
    backend hiveserver2

    Hope this helps.

  4. impalaben 11 months ago

    hello Team ,
    Yes i have configure the common defaults to have http as mode and i don’t use secure mode ssl
    this is an example for my config

    global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    chroot /var/lib/haproxy
    pidfile /var/run/haproxy.pid
    maxconn 100000
    user haproxy
    group haproxy
    daemon
    defaults
    mode http
    log global
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    retries 3
    maxconn 100000
    contimeout 15m
    clitimeout 60m
    srvtimeout 60m
    #——————————————————————–
    # main frontend which proxys to the backends
    #——————————————————————–
    frontend hiveserver2_front
    bind :10015
    mode tcp
    option tcplog
    default_backend hiveserver2
    backend hiveserver2
    balance roundrobin
    mode tcp
    server hiveserver2_1 myserver1:10000 check
    server hiveserver2_2 myserver2:10000 check
    ………

  5. DashV 9 months ago

    When load balancing hive server2 retrieving query results from the history only works intermittently. Is there a fix for this?

Leave a reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.