Hive and Impala queries life cycle

Hive and Impala queries life cycle

Last update March 29th 2017

The editor is used a lot for querying Hive and Impala. Here are some tips on how to manage their resource.

But what happens to the query results? How long are they kept? Why do they disappear sometimes? Why are some Impala queries are still “in flight” even if they are completed?   Each query is using some resources in Impala or HiveServer2. When the users submit a lot of queries, they are going to add up and crash the servers if nothing is done. Here are the latest settings that you can tweak:

Impala

Hue tries to close the query when the user navigates away from the result page (as queries are generally fast, it is ok to close them quick). However, if the user never comes back checking the result of the query or never close the page, the query is going to stay. Impala is going to automatically expire the queries idle for than 10 minutes with the query_timeout_s property.


[impala]
  # If > 0, the query will be timed out (i.e. cancelled) if Impala does not do any work
 # (compute or send back results) for that query within QUERY_TIMEOUT_S seconds.
 query_timeout_s=600

 # If > 0, the session will be timed out (i.e. cancelled) if Impala does not do any work
 # (compute or send back results) for that session within QUERY_TIMEOUT_S seconds (default 1 hour).
 session_timeout_s=3600 

Until this version, the only alternative workaround to close all the queries, is to restart Hue (or Impala).

Note: Impala currently only cancels the query but does not close it. It will be improved in a future version with IMPALA-1575. In the meantime specify a -idle_session_timeout=20 in the Impala flags (“Command Line Argument Advanced Configuration Snippet (Safety Valve)”). This setting is also available in the Hue configuration.

Hive

Hue never closes the Hive queries by default (as some queries can take hours of processing time). Also if your query volume is low (e.g. < a few hundreds a day) and you restart HiveServer2 every week, you are probably not affected. To get the same behavior as Impala (and close the query when the user leaves the page), switch on in the hue.ini:

[beeswax]
# Hue will try to close the Hive query when the user leaves the editor page.
# This will free all the query resources in HiveServer2, but also make its results inaccessible.
close_queries=true

Starting in CDH5 and CDH4.6 (with HiveServer2), some close_query and close_session commands were added to Hue.

build/env/bin/hue close_queries --help

Usage: build/env/bin/hue close_queries [options] &amp;amp;amp;lt;age_in_days&amp;amp;amp;gt;  (default is 7)

Closes the non running queries older than 7 days. If <all> is specified, close the ones of any types.   To run them while using Cloudera Manager, be sure to export these two environment variables:

export HUE_CONF_DIR="/var/run/cloudera-scm-agent/process/`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'`"

./build/env/bin/hue close_queries 0
Closing (all=False) HiveServer2 queries older than 0 days...
1 queries closed.

./build/env/bin/hue close_sessions 0 hive
Closing (all=False) HiveServer2 sessions older than 0 days...
1 sessions closed.

You can then add this commands into a crontab and expire the queries older than N days.

Note

When using Kerberos you also need:

export HIVE_CONF_DIR="/var/run/cloudera-scm-agent/process/`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'`/hive-conf"

A cleaner solution comes with HIVE-5799 (available in Hive 0.14 or C5.2). Like Impala, HiveServer2 can now automatically expires queries. So tweak hive-site.xml with:

<property>
<name>hive.server2.session.check.interval</name>
<value>3000</value>
<description>The check interval for session/operation timeout, which can be disabled by setting to zero or negative value.</description>
</property>

<property>
<name>hive.server2.idle.session.timeout</name>
<value>3000</value>
<description>Session will be closed when it's not accessed for this duration, which can be disabled by setting to zero or negative value.</description>
</property>

<property>
<name>hive.server2.idle.operation.timeout</name>
<value>0</value>
<description>Operation will be closed when it's not accessed for this duration of time, which can be disabled by setting to zero value. With positive value, it's checked for operations in terminal state only (FINISHED, CANCELED, CLOSED, ERROR). With negative value, it's checked for all of the operations regardless of state.</description>
</property>

Note

This is the recommended solution for Hive. User wishing to keep some result for longer can issue a CREATE TABLE AS SELECT … or export the results in Hue.

Sum-up

The query servers are becoming much more stable with these changes as their resources do not need to grow infinitely. One tradeoff though is that the user will lose his query results after a certain time. To make the experience better, several ideas are being explored, like automatically downloading N rows of the resultset and keeping them for longer.

As usual feel free to comment and send feedback on the hue-user list or @gethue!

26 Comments

  1. Marek 2 years ago

    > To run them while using Cloudera Manager, be sure to export these two environment variables (…)

    How can I run close_queries in CM?

    What are the *two* environment variables that need to be set? I can only see HUE_CONF_DIR in the post.

  2. sudheer 2 years ago

    Hi Team,

    I am facing Two problems in HUE. Could you please help in resolving.

    1) Not able to create Table/run SQL in HIVE through HUE.

    I am able to create Tables in Warehouse.db in Hive through Command Promt.

    Also I am able to run the Map reduce and Pig programs through Command promt.

    But I am not able to run anything from HUE.( Not even HIVE, PIG).

    The databases in HIVE are not listing out in HUE interface for my ID ( Warehouse.db). Only Databases present in Impala are listing in HUE . Also In Impala I see only few databases are listing out.

  3. sudheer 2 years ago

    Thanks for your update.

    I am not getting any errors.
    I could see all the databases from HIVE command prompt but whereas I could see only few of the databases from HUE WEB interface after selecting query editor tab where Hive.
    How to see the remaining databases from HUE WEB interface.

    • Hue Team 2 years ago

      This probably just means that you are not pointing to the same HiveServer2, you can see the one used by Hue on the /desktop/dump_config page, [beeswax] tab

  4. ankush 2 years ago

    Hi, How to cleaning up hue sessions from server,

  5. ankush 2 years ago

    yes, In the hue server, i want to cleaning up hue sessions from server.
    is any process or command to cleaning up the session in the hue browser ?

  6. Suresh 2 years ago

    Hi,
    how many ways to export MySQL DATA to Cloudera Hadoop file system?
    Shall we connect Tableau and Impala?

  7. pranav 1 year ago

    Hi

    prior to my upgrade ; In the previous version of Hue query results(executed in hive) were still available at least 24 hours after the query was run. Now they are expiring in 7 hours or less. Many of my queries are run overnight and i need the results to be available the next day.
    i currently have cdh 5.7 and bda 4.5

    Please help in retaining my queries for a longer duration

    • Hue Team 1 year ago

      Hive query timeout are set at the HiveServer2 level or when you restart Hue.

  8. zhuixun 6 months ago

    Hi
    I user hue notebook hive query. once in a while HiveServer2 Java heap space.
    ./build/env/bin/hue close_sessions 0 hive .Memory is getting higher and higher.
    thanks.

    hue 3.10
    hive 1.2.1

    • Author
      Hue Team 6 months ago

      What is the idle session timeout for queries and sessions in HiveServer?

  9. arjun 2 months ago

    Hi Team,

    I have a question regarding impala settings option available in impala-editor. I was looking for information on where it gets persisted ( in hue database?). Does setting parameters set remain active even after restart?

    Thanks in advance !

    • Author
      Hue Team 2 months ago

      There is a jira about it but it is not done yet. Usually admins sets the settings at the Impala level. Which ones are you looking at in particular?

  10. arjun 2 months ago

    Thanks for the response! My version of impala does not support setting default memory limit for a query in queue. I was looking if it can be set for all hue users using backend db.

  11. arjun 2 months ago

    Hi Team

    Below documentation says ‘In CDH 4, admission control will only work if you don’t have Hue deployed; unclosed Hue queries will accumulate and exceed the queue size limit’. Are there any restrictions or best practises on using Hue 3.9 or later and Impala 2.2 or later with admission control enabled ?

    https://www.cloudera.com/documentation/cdh/5-0-x/Impala/Installing-and-Using-Impala/ciiu_admission.html

Leave a reply

Your email address will not be published. Required fields are marked *

*