Impala HA: how to distribute Impala query load

19 May 2014 in Administration / Querying - 3 minutes read

Hue provides an interface for Impala, the next generation SQL engine for Hadoop. To improve performance further, Hue can distribute the query load across all of the Impala workers.

Tutorial

This tutorial demonstrates how to set up Hue to query multiple Impalads (Impala daemons):

  1. Configure Hue 3.6 on one node of a 4-node RedHat 6 cluster to work with multiple Impalads.
  2. Load balance the connections to the Impalads using HAProxy 1.4; any load balancer that persists connections should work.

Here is a quick video demonstrating how to communicate with multiple Impalads in Hue!

Configuring Hue

There are two ways to configure Hue to communicate with multiple Impalads.

Configuration via Cloudera Manager

  1. From Cloudera Manager, click on “Clusters” in the menu and find your Hue service.

  2. From the Hue service, go to “Configuration -> View and Edit”

  3. We must provide a safety valve configuration in Cloudera Manager to use the appropriate load balancer and socket timeout. Go to “Service-Wide -> Advanced” and click on the value for “Hue Service Advanced Configuration Snippet (Safety Valve)”. You can use the following as a template for the value:

    [impala]
    server_host=<hostname running HAProxy>
    server_port=<port HAProxy is bound to>
    server_conn_timeout=<timeout in seconds>

For more information on configuring Hue via Cloudera Manager, see Managing Clusters.

Manual configuration

  1. Open /etc/hue/hue.ini with your favorite text editor.
  2. Set “server_conn_timeout” in the “impala” section to a large value, expressed in seconds (e.g. 1 hour = 3600 seconds). See step 3 in “Configuration via Cloudera Manager” for a template of this option.
  3. Next, set the new host and port in the “impala” section of hue.ini. The hostname is defined in “server_host” and the port in “server_port”. See step 3 in “Configuration via Cloudera Manager” for an example configuration.
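
To make the expected values concrete, here is a minimal Python sketch that parses a flat “impala” fragment and checks the three options. The host name `haproxy.example.com`, the port value, and the function name `check_impala_section` are placeholders chosen for illustration. The sketch only handles a small flat fragment with the standard-library parser; it is not meant to read a complete hue.ini.

```python
import configparser

# Hypothetical fragment: the host name and port are placeholders for illustration.
IMPALA_FRAGMENT = """
[impala]
server_host=haproxy.example.com
server_port=10001
server_conn_timeout=3600
"""


def check_impala_section(text):
    """Parse a flat [impala] fragment and return (host, port, timeout_seconds)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["impala"]
    host = section["server_host"].strip()
    port = section.getint("server_port")
    timeout = section.getint("server_conn_timeout")
    if not host:
        raise ValueError("server_host must not be empty")
    if not 0 < port < 65536:
        raise ValueError("server_port is out of range")
    if timeout <= 0:
        raise ValueError("server_conn_timeout must be a positive number of seconds")
    return host, port, timeout


if __name__ == "__main__":
    print(check_impala_section(IMPALA_FRAGMENT))
```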

 

HAProxy Installation/Configuration

  1. Download and unpack the binary distribution of HAProxy 1.4 on a node that does not have Hue installed.
  2. Add the following HAProxy configuration to /etc/impala/haproxy-impala.conf:

    global
        daemon
        nbproc 1
        maxconn 100000
        log /dev/log local6

    defaults
        log global
        mode tcp
        option tcplog
        option tcpka
        timeout connect 3600000ms
        timeout client 3600000ms
        timeout server 3600000ms

    listen impala
        bind 0.0.0.0:10001
        balance leastconn
        server impala1 server1.cloudera.com:21050 check
        server impala2 server2.cloudera.com:21050 check
        server impala3 server3.cloudera.com:21050 check

  3. Start HAProxy:

    haproxy -f /etc/impala/haproxy-impala.conf
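
Once the proxy is started, one quick sanity check is to confirm that something accepts TCP connections on the bound port. The following is a minimal sketch using only the Python standard library; the function name `accepts_connections` and the `localhost` target are assumptions for illustration. A successful connection only shows that a listener is up, not that the Impalads behind it are healthy.

```python
import socket


def accepts_connections(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # "localhost" is a placeholder; 10001 is the port used in the bind line.
    print(accepts_connections("localhost", 10001))
```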

 

The key configuration options are balance and server in the listen section, along with the timeout options in the defaults section. When the balance parameter is set to leastconn, each new connection from Hue is routed to the impalad with the fewest connections. The server parameters define which servers are used for load balancing and take the form:

 

server <name> <address>[:port] [settings ...]

 

In the configuration above, the server “impala1” is available at “server1.cloudera.com:21050”, “impala2” is available at “server2.cloudera.com:21050”, and “impala3” is available at “server3.cloudera.com:21050”. The timeout parameters bound how long a TCP connection may stay idle on the client and server sides, and how long a connection attempt to a server may take. In this example, the client timeout, server timeout, and connect timeout are all set to 1 hour (3600000 ms).
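
As an illustration of the server line format and the timeout arithmetic, the Python sketch below splits the server lines into their parts and converts the millisecond timeouts into hours. The helper names `parse_server` and `timeout_hours` are hypothetical, and the parsing is deliberately simplified: it handles only plain `host:port` addresses and values with an `ms` suffix, not the full HAProxy syntax.

```python
import re

# Lines copied from the listen and defaults sections of the configuration above.
CONFIG_LINES = [
    "server impala1 server1.cloudera.com:21050 check",
    "server impala2 server2.cloudera.com:21050 check",
    "server impala3 server3.cloudera.com:21050 check",
    "timeout connect 3600000ms",
    "timeout client 3600000ms",
    "timeout server 3600000ms",
]


def parse_server(line):
    """Split 'server <name> <address>[:port] [settings ...]' into its parts."""
    keyword, name, address, *settings = line.split()
    if keyword != "server":
        raise ValueError("not a server line: " + line)
    host, _, port = address.partition(":")
    return {
        "name": name,
        "host": host,
        "port": int(port) if port else None,
        "settings": settings,
    }


def timeout_hours(line):
    """Convert a 'timeout <kind> <N>ms' line into (kind, hours)."""
    match = re.fullmatch(r"timeout\s+(\w+)\s+(\d+)ms", line.strip())
    if match is None:
        raise ValueError("not a millisecond timeout line: " + line)
    # 1 hour = 60 * 60 * 1000 = 3,600,000 milliseconds.
    return match.group(1), int(match.group(2)) / 3_600_000


if __name__ == "__main__":
    for line in CONFIG_LINES:
        if line.startswith("server"):
            print(parse_server(line))
        else:
            print(timeout_hours(line))
```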

 

HAProxy is configured to bind to “0.0.0.0:10001”. Hue can therefore point to HAProxy, which will transparently pick one of the least utilized Impalads.
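
The idea behind picking the least utilized Impalad can be sketched as a toy simulation in Python. This is not HAProxy's implementation: the function names are made up for illustration, connections never close in this model, and ties are broken by insertion order, which may differ from what HAProxy does.

```python
def pick_least_connections(active):
    """Return the backend name with the fewest active connections.

    Ties go to the first backend in insertion order, a simplification
    made for this sketch.
    """
    return min(active, key=active.get)


def simulate(backends, new_connections):
    """Open connections one at a time, always choosing the least-loaded backend."""
    active = {name: 0 for name in backends}
    for _ in range(new_connections):
        active[pick_least_connections(active)] += 1
    return active


if __name__ == "__main__":
    print(simulate(["impala1", "impala2", "impala3"], 7))
```

In this model the connection counts of any two backends never differ by more than one, which is the even spread the leastconn setting aims for.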

 

 

Conclusion

Load balancing Impala queries distributes the load across all the Impalads (where the final result aggregation happens, for example). Impala currently requires stable, long-lived network connections by design, so Hue keeps its connections open. We hope this helps you make the most of your Hadoop cluster!

 

Have any suggestions? Feel free to tell us what you think through hue-user or @gethue.

