Impala HA: how to distribute Impala query load

Hue provides an interface for Impala, the next generation SQL engine for Hadoop. To improve performance further, Hue can distribute the query load across all of the Impala workers.

Tutorial

This tutorial demonstrates how to set up Hue to query multiple Impalads (Impala daemons):

  1. Configure Hue 3.6 on one node in a 4-node RedHat 6 cluster to work with multiple Impalads.
  2. Load balance the connections to the Impalads using HAProxy 1.4. Any load balancer that persists connections should work.

Configuring Hue

There are two ways to configure Hue to communicate with multiple Impalads.

Configuration via Cloudera Manager

  1. From Cloudera Manager, click on “Clusters” in the menu and find your Hue service.
  2. From the Hue service, go to “Configuration -> View and Edit”
  3. We must provide a safety valve configuration in Cloudera Manager to use the appropriate load balancer and socket timeout. Go to “Service-Wide -> Advanced” and click on the value for “Hue Service Advanced Configuration Snippet (Safety Valve)”. You can use the following as a template for the value:
    [impala]
    server_host=<hostname running HAProxy>
    server_port=<port HAProxy is bound to>
    server_conn_timeout=<timeout in seconds>
    

 

For more information on configuring Hue via Cloudera Manager, see Managing Clusters.

Manual configuration

  1. Open /etc/hue/hue.ini with your favorite text editor.
  2. Change the “server_conn_timeout” option under the “impala” section to a large value (e.g. 1 hour). This value is in seconds (1 hour = 3600 seconds). See step 3 in “Configuration via Cloudera Manager” for a template of this option.
  3. Next, set the host and port of the load balancer in the “impala” section of hue.ini. The hostname is defined in “server_host” and the port in “server_port”. See step 3 in “Configuration via Cloudera Manager” for an example configuration.
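
As a quick sanity check of the three settings described above, the following minimal sketch parses a standalone “impala” snippet and verifies that the timeout is a positive number of seconds. The host name and port are placeholder values, not taken from a real cluster, and the helper name read_impala_settings is hypothetical. It only handles a flat snippet like the template; a full hue.ini with nested sections may not parse with Python's configparser.

```python
import configparser

# Placeholder values for illustration only (assumptions, not a real cluster).
SAMPLE_HUE_INI = """
[impala]
server_host=haproxy.example.com
server_port=10001
server_conn_timeout=3600
"""


def read_impala_settings(ini_text):
    """Return (host, port, timeout_seconds) from the [impala] section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["impala"]
    host = section["server_host"]
    port = section.getint("server_port")
    timeout = section.getint("server_conn_timeout")
    if timeout <= 0:
        raise ValueError("server_conn_timeout must be a positive number of seconds")
    return host, port, timeout


host, port, timeout = read_impala_settings(SAMPLE_HUE_INI)
print(f"Hue will connect to {host}:{port} with a {timeout // 60} minute timeout")
```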

 

HAProxy Installation/Configuration

  1. Download and unzip the binary distribution of HAProxy 1.4 on a node that doesn’t have Hue installed.
  2. Add the following HAProxy configuration to /etc/impala/haproxy-impala.conf:
global
  daemon
  nbproc 1
  maxconn 100000
  log /dev/log local6

defaults
  log        global
  mode       tcp
  option     tcplog
  option     tcpka
  timeout connect 3600000ms
  timeout client 3600000ms
  timeout server 3600000ms

listen impala
  bind 0.0.0.0:10001
  balance leastconn

  server impala1 server1.cloudera.com:21050 check
  server impala2 server2.cloudera.com:21050 check
  server impala3 server3.cloudera.com:21050 check
  3. Start HAProxy:
haproxy -f /etc/impala/haproxy-impala.conf
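
Once the proxy is started, one way to confirm that it accepts connections is a plain TCP connect to the bound port. This is a minimal sketch: the function name proxy_is_listening is hypothetical, and “localhost” assumes the check runs on the HAProxy node itself. A successful connect only shows that the proxy is listening, not that any impalad behind it is healthy.

```python
import socket


def proxy_is_listening(host, port, timeout_seconds=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # 10001 is the port from the bind line of the configuration above.
    print(proxy_is_listening("localhost", 10001))
```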

 

The key configuration options are balance and server in the listen section, as well as the timeout options in the defaults section. When the balance parameter is set to leastconn, HAProxy routes each new connection from Hue to the impalad with the fewest connections currently open through the proxy. The server parameters define which servers will be used for load balancing and take the form:

 

server <name> <address>[:port] [settings ...]
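
To make the form concrete, here is a small illustrative parser that splits a server line into its parts. It is a sketch written for this tutorial, not HAProxy's own parser: the function name parse_server_line is hypothetical, and IPv6 addresses are not handled.

```python
def parse_server_line(line):
    """Split 'server <name> <address>[:port] [settings ...]' into its parts."""
    tokens = line.split()
    if len(tokens) < 3 or tokens[0] != "server":
        raise ValueError(f"not a server line: {line!r}")
    name, endpoint, settings = tokens[1], tokens[2], tokens[3:]
    # rpartition returns an empty separator when no ':' is present,
    # in which case the whole token is the address and the port is omitted.
    address, sep, port = endpoint.rpartition(":")
    if not sep:
        address, port = endpoint, None
    else:
        port = int(port)
    return {"name": name, "address": address, "port": port, "settings": settings}


print(parse_server_line("server impala1 server1.cloudera.com:21050 check"))
```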

 

In the configuration above, the server “impala1” is available at “server1.cloudera.com:21050”, “impala2” is available at “server2.cloudera.com:21050”, and “impala3” is available at “server3.cloudera.com:21050”. The timeout parameters define how long HAProxy waits: timeout connect limits how long it waits for a connection to an impalad to be established, while timeout client and timeout server limit how long either side of a TCP connection may stay idle. In this example, all three are set to 1 hour (3600000 ms).
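
The arithmetic behind these values can be checked with a short sketch. HAProxy's values are in milliseconds while Hue's server_conn_timeout is in seconds, so 3600000ms corresponds to 3600 seconds. The comparison rule used here (the proxy timeouts should not be shorter than Hue's) is an assumption made for illustration, and the helper name is hypothetical.

```python
def haproxy_ms_to_seconds(value):
    """Convert a value such as '3600000ms' to whole seconds."""
    if not value.endswith("ms"):
        raise ValueError("this sketch only handles values with an 'ms' suffix")
    return int(value[:-2]) // 1000


# Values copied from the defaults section of the configuration above.
haproxy_timeouts = {
    "connect": "3600000ms",
    "client": "3600000ms",
    "server": "3600000ms",
}
hue_server_conn_timeout = 3600  # assumed hue.ini value, in seconds (1 hour)

for name, raw in haproxy_timeouts.items():
    seconds = haproxy_ms_to_seconds(raw)
    status = "ok" if seconds >= hue_server_conn_timeout else "shorter than Hue's timeout"
    print(f"timeout {name}: {seconds}s ({status})")
```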

 

HAProxy is configured to bind to “0.0.0.0:10001”. Hue can therefore point to port 10001 on the HAProxy host, and HAProxy will transparently pick one of the least utilized Impalads.
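
The selection behaviour can be illustrated with a simplified simulation. This is a sketch of the least-connections idea only, not HAProxy's actual implementation (which also considers weights and has its own tie-breaking). It also only counts connections made through the proxy, so clients that connect to an impalad directly are invisible to it. The function names are hypothetical.

```python
def pick_least_connections(active):
    """Return the server name with the fewest active connections.

    Ties go to the first server in insertion order.
    """
    return min(active, key=active.get)


def simulate(servers, new_connections):
    """Open new_connections one by one and return the final counts."""
    active = {name: 0 for name in servers}
    for _ in range(new_connections):
        active[pick_least_connections(active)] += 1
    return active


print(simulate(["impala1", "impala2", "impala3"], 7))
```

With no connections ever closing, this behaves like round robin; the difference appears when long-running sessions keep some servers busier than others.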

 

 

Conclusion

Load balancing Impala queries distributes the load across all the Impalads (where the final result aggregation happens, for example). Impala currently requires persistent network connections by design, so the load balancer must let Hue keep its connections open. We hope this helps you make the most of your Hadoop cluster!

 

Have any suggestions? Feel free to tell us what you think through hue-user or @gethue.

11 Comments

  1. Impala User 4 years ago

    The following is only true if all Impala users are going through HAProxy, since HAProxy is simply maintaining its own log of connections and isn’t using anything like a web service interface to actually query the status of all of the Impala Daemons on the cloud:

    “When the balance parameter is set to leastconn, Hue is guaranteed to create new connections with the impalad with the least number of connections”

    My experience is that most Impala users are NOT accessing Impala via Hue and are not accessing Impala via HAProxy, so setting up HAProxy this way isn’t very useful.

    • Hue Team 4 years ago

      FYI we know tens of organizations using the Impala Editor to query Impala.

  2. Jake 3 years ago

    When using the “leastconn” balance method Hue seems to have an issue when running multiple long-running queries. The long running queries seem to cancel each other because the thrift session is not properly maintained for each query. Using “balance source” will resolve this issue, but doesn’t distribute the load as well. In order to distribute the load, you will need to run multiple instances of Hue and use a load balancer in front of hue as well.

    • Hue Team 3 years ago

      Yes, in the current system an Impalad requires one connection per session per user, so the count can go up quickly, and “fixed round robin” is better right now (though it is not truly balancing).

  3. Beyound 3 years ago

    If I have already set up Impala HA, can I also set everything up like this?
    I put the haproxy conf in /etc/haproxy/haproxy.cfg, listening on port 25005; the others are 21000.
    If I add Hue to manage Impala, what settings should I use?
    I have tried this web tutorial, but my Hue Impala queries didn’t work.

    • Hue Team 3 years ago

      You should point to the proxy, so port 25005

  4. Beyound 3 years ago

    thanks,but I have a question

    If I already set the configuration for impala , should I create a new config file for hue? or use the same config file with the same content?

    this is my conf:
    ————————————for impala HA (/etc/haproxy/haproxy.cfg)————————-
    # To have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events. This is done
    # by adding the ‘-r’ option to the SYSLOGD_OPTIONS in
    # /etc/sysconfig/rsyslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    # file. A line like the following can be added to
    # /etc/sysconfig/syslog
    #
    # local2.* /var/log/haproxy.log
    #
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    chroot /var/lib/haproxy
    pidfile /var/run/haproxy.pid
    maxconn 4000
    user haproxy
    group haproxy
    daemon
    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats
    # common defaults that all the ‘listen’ and ‘backend’ sections will
    # use if not designated in their block
    #
    # You might need to adjust timing values to prevent timeouts.
    #———————————————————————
    defaults
    mode http
    log global
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    retries 3
    maxconn 3000
    timeout server 30s
    timeout connect 30s
    timeout client 30s
    #
    # This sets up the admin page for HA Proxy at port 25002.
    #
    #listen stats 172.16.180.197:25002
    #
    # balance
    # mode http
    # stats enable
    #stats auth root:root123

    # This is the setup for Impala. Impala client connect to load_balancer_host:25003.
    # HAProxy will balance connections among the list of servers listed below.
    # the list of Impalad is listening at port 21000 for beeswax (impala-shell) or original ODBC driver.
    # For JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.

    listen impala
    bind 0.0.0.0:25003
    mode tcp
    option tcplog
    balance leastconn

    server master 172.16.180.197:21000 check

    server slave1 172.16.180.198:21000 check

    server slave2 172.16.180.199:21000 check

    listen hue
    bind 0.0.0.0:10001
    mode tcp

    option tcplog

    balance leastconn

    server master 172.16.180.197:21050 check

    server slave1 172.16.180.198:21050 check

    server slave2 172.16.180.199:21050 check

  5. Beyound 3 years ago

    If I use the same config (like /etc/haproxy/haproxy.cfg) for Hue and Impala HA with port 21000, the command line works (like impala-shell -i 172.16.180.197:25003) but Hue doesn’t. If I use port 21050, the command line doesn’t connect but Hue works. So where is the problem? Or do I have to use separate configs for Impala HA and Hue? The deployment is:

    # To have these messages end up in /var/log/haproxy.log you will
    # need to:
    #
    # 1) configure syslog to accept network log events. This is done
    # by adding the ‘-r’ option to the SYSLOGD_OPTIONS in
    # /etc/sysconfig/rsyslog
    #
    # 2) configure local2 events to go to the /var/log/haproxy.log
    # file. A line like the following can be added to
    # /etc/sysconfig/syslog
    #
    # local2.* /var/log/haproxy.log
    #
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    chroot /var/lib/haproxy
    pidfile /var/run/haproxy.pid
    maxconn 4000
    user haproxy
    group haproxy
    daemon
    # turn on stats unix socket
    #stats socket /var/lib/haproxy/stats
    # common defaults that all the ‘listen’ and ‘backend’ sections will
    # use if not designated in their block
    #
    # You might need to adjust timing values to prevent timeouts.
    #———————————————————————
    defaults
    mode http
    log global
    option httplog
    option dontlognull
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    retries 3
    maxconn 3000
    timeout server 30s
    timeout connect 30s
    timeout client 30s
    #
    # This sets up the admin page for HA Proxy at port 25002.
    #
    #listen stats 172.16.180.197:25002
    #
    # balance
    # mode http
    # stats enable
    #stats auth root:root123

    # This is the setup for Impala. Impala client connect to load_balancer_host:25003.
    # HAProxy will balance connections among the list of servers listed below.
    # the list of Impalad is listening at port 21000 for beeswax (impala-shell) or original ODBC driver.
    # For JDBC or ODBC version 2.x driver, use port 21050 instead of 21000.

    listen impala
    bind 0.0.0.0:25003
    mode tcp
    option tcplog
    balance leastconn

    server master 172.16.180.197:21000 check

    server slave1 172.16.180.198:21000 check

    server slave2 172.16.180.199:21000 check

  6. Dr.Rizz 2 years ago

    I enabled HA proxy in our dev cluster. I had to setup “Impala Daemons Load Balancer” with proxy_server:port and restart impala before using proxy. Only then I was able to use impala shell to connect to daemons through proxy.

    However, I was no longer able to connect to impala daemons directly using Impala shell.

    Are proxy and direct connections mutually exclusive?
