Hive and Impala queries life cycle

17 September 2014 in Administration / Querying - 4 minutes read

Last update March 29th 2017

The editor is used a lot for querying Hive and Impala. Here are some tips on how to manage their resource.

But what happens to the query results? How long are they kept? Why do they disappear sometimes? Why are some Impala queries are still “in flight” even if they are completed?   Each query is using some resources in Impala or HiveServer2. When the users submit a lot of queries, they are going to add up and crash the servers if nothing is done. Here are the latest settings that you can tweak:

Impala

Hue tries to close the query when the user navigates away from the result page (as queries are generally fast, it is ok to close them quick). However, if the user never comes back checking the result of the query or never close the page, the query is going to stay. Impala is going to automatically expire the queries idle for than 10 minutes with the query_timeout_s property.



[impala]

\# If > 0, the query will be timed out (i.e. cancelled) if Impala does not do any work

\# (compute or send back results) for that query within QUERY_TIMEOUT_S seconds.

query_timeout_s=600

\# If > 0, the session will be timed out (i.e. cancelled) if Impala does not do any work

\# (compute or send back results) for that session within QUERY_TIMEOUT_S seconds (default 1 hour).

session_timeout_s=3600 

Until this version, the only alternative workaround to close all the queries, is to restart Hue (or Impala).

Note: Impala currently only cancels the query but does not close it. It will be improved in a future version with IMPALA-1575. In the meantime specify a -idle_session_timeout=20 in the Impala flags (“Command Line Argument Advanced Configuration Snippet (Safety Valve)”). This setting is also available in the Hue configuration.

Hive

Hue never closes the Hive queries by default (as some queries can take hours of processing time). Also if your query volume is low (e.g. < a few hundreds a day) and you restart HiveServer2 every week, you are probably not affected. To get the same behavior as Impala (and close the query when the user leaves the page), switch on in the hue.ini:

[beeswax]

\# Hue will try to close the Hive query when the user leaves the editor page.

\# This will free all the query resources in HiveServer2, but also make its results inaccessible.

close_queries=true

Starting in CDH5 and CDH4.6 (with HiveServer2), some close_query and close_session commands were added to Hue.

build/env/bin/hue close_queries -help

Usage: build/env/bin/hue close_queries [options] &amp;amp;lt;age_in_days&amp;amp;gt; (default is 7)

Closes the non running queries older than 7 days. If is specified, close the ones of any types.   To run them while using Cloudera Manager, be sure to export these two environment variables:

export HUE_CONF_DIR="/var/run/cloudera-scm-agent/process/\`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'\`"

./build/env/bin/hue close_queries 0

Closing (all=False) HiveServer2 queries older than 0 days...

1 queries closed.

./build/env/bin/hue close_sessions 0 hive

Closing (all=False) HiveServer2 sessions older than 0 days...

1 sessions closed.

You can then add this commands into a crontab and expire the queries older than N days.

Note

When using Kerberos you also need:

export HIVE_CONF_DIR="/var/run/cloudera-scm-agent/process/\`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'\`/hive-conf"

A cleaner solution comes with HIVE-5799 (available in Hive 0.14 or C5.2). Like Impala, HiveServer2 can now automatically expires queries. So tweak hive-site.xml with:

<property>

<name>hive.server2.session.check.interval</name>

<value>3000</value>

<description>The check interval for session/operation timeout, which can be disabled by setting to zero or negative value.</description>

</property>

<property>

<name>hive.server2.idle.session.timeout</name>

<value>3000</value>

<description>Session will be closed when it's not accessed for this duration, which can be disabled by setting to zero or negative value.</description>

</property>

<property>

<name>hive.server2.idle.operation.timeout</name>

<value>0</value>

<description>Operation will be closed when it's not accessed for this duration of time, which can be disabled by setting to zero value. With positive value, it's checked for operations in terminal state only (FINISHED, CANCELED, CLOSED, ERROR). With negative value, it's checked for all of the operations regardless of state.</description>

</property>

Note

This is the recommended solution for Hive. User wishing to keep some result for longer can issue a CREATE TABLE AS SELECT … or export the results in Hue.

Sum-up

The query servers are becoming much more stable with these changes as their resources do not need to grow infinitely. One tradeoff though is that the user will lose his query results after a certain time. To make the experience better, several ideas are being explored, like automatically downloading N rows of the resultset and keeping them for longer.

As usual feel free to comment and send feedback on the hue-user list or @gethue!


comments powered by Disqus

More recent stories

13 November 2019
Visually surfacing SQL information like Primary Keys, Foreign Keys, Views and Complex Types
Read More
31 October 2019
Missing some color? How to improve or add your own SQL syntax Highlighter
Read More
24 October 2019
How to create a HBase table on Kerberized Hadoop clusters
Read More