Season II: 2. Execute Hive queries and schedule them with Oozie

Published on 11 September 2013 in Tutorial - 5 minutes read - Last modified on 06 March 2021

In the previous episode, we saw how to transfer some file data into Apache Hadoop. In order to query the data easily, the next step is to create some Hive tables. This will enable quick interaction through high-level languages like SQL and Pig.

We first experiment with the SQL queries, then parameterize them and insert them into a workflow in order to run them together in parallel. Including Hive queries in an Oozie workflow is a pretty common use case, with recurrent pitfalls as seen on the user group. We can do it with Hue in a few clicks.

Get prepared

First, based on the data of the previous episode, we create two tables in the Hive Metastore. We use the Metastore app and its create table wizard. Then, it is time to study the data!

We previously created the Hive tables in https://gethue.com/tutorial-analyzing-data-with-hue-and-hive/

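As a quick reminder of the shape of the data, here is a sketch of what the two table definitions could look like. Only the columns used in this episode are shown, and the types and storage format are assumptions; the authoritative definitions come from the previous episode:

-- Sketch only: the column list is limited to what this episode uses,
-- and the types and delimited-text format are assumptions.
CREATE TABLE review (
  business_id STRING,
  cool INT,
  `date` STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE business (
  business_id STRING,
  name STRING,
  categories STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';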

Hive

Goal: we want to get the 10 coolest restaurants for a day.

Let's open the Beeswax Hive Editor and explore the range of dates that we have:

SELECT DISTINCT `date` FROM review ORDER BY `date` DESC;

Notice that you need backticks in order to use date as a column name, since it is a reserved keyword in Hive.

The data is a bit old, so let’s pick 2012-12-01 as our target date. We can join the two tables in order to get the name of the restaurant and its average ‘cool’ score of the day. Submit this parameterized query and enter 2012-12-01 when prompted for the date:


SELECT r.business_id, name, AVG(cool) AS coolness
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
AND `date` = '$date'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 10

We have a good Hive query. Let’s create a result table ‘top_cool’ that will contain the top 10:

CREATE TABLE top_cool AS
SELECT r.business_id, name, SUM(cool) AS coolness, '$date' AS `date`
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
AND `date` = '$date'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 10

Later, replace 'CREATE TABLE top_cool AS' with 'INSERT INTO TABLE top_cool' in the Hive script, as we want to create the table only the first time:

INSERT INTO TABLE top_cool
SELECT r.business_id, name, SUM(cool) AS coolness, '${date}' AS `date`
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
AND `date` = '${date}'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 10

Hive action in Apache Oozie


First we create a new workflow and add a Hive action. We need to specify which script we want to run, and the script needs to be uploaded to HDFS. In our case we open up the 'workspace' of the workflow, create a new file, and copy-paste the query. We then pick this query file as the 'Script name'.
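If you prefer the command line, the upload can also be done with an HDFS command; the script name and workspace path below are examples and will differ on your cluster:

# Alternative to the Hue file browser: upload the script to the
# workflow workspace (example names; adjust to your setup).
hadoop fs -put top_cool.sql /user/hue/oozie/workspaces/top_cool_wf/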


Important

Then comes a crucial step: our Hive action needs to talk to the Hive Metastore, and so needs to know its location. This is done by copying /etc/hive/conf/hive-site.xml to HDFS as 'hive-conf.xml', including it as a 'File' resource, and telling Oozie to use it as the 'Job XML' configuration.
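For example, the copy could be done like this (the destination path is an assumption; use your workflow's workspace):

# Copy the Hive client configuration next to the workflow on HDFS.
hadoop fs -put /etc/hive/conf/hive-site.xml /user/hue/oozie/workspaces/top_cool_wf/hive-conf.xml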


Note: when using a demo VM or a pseudo-distributed cluster (everything on one machine), you might hit the error explained in the 'Warning!' section of the HCatalog post.


Note: when using a real cluster, as the workflow is going to run somewhere on the cluster, we need the Metastore to be remote. A remote Metastore can be contacted from any other host.
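Concretely, the hive-conf.xml shipped with the action should point at a Thrift Metastore URI reachable from every node; the hostname below is a placeholder and 9083 is the default Metastore port:

<!-- In hive-conf.xml: location of the remote Hive Metastore.
     The hostname is a placeholder for your Metastore host. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
</property>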


Let's specify that we are using a 'date' parameter in the Hive script. In our case we add the parameter to the Hive action:



date=${date}
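For reference, the Hive action that ends up in the workflow.xml looks roughly like the sketch below; the action name and script name are the ones assumed earlier in this post, and the XML that Hue generates may differ slightly:

<!-- Sketch of the generated Hive action (names are examples). -->
<action name="top-cool">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <job-xml>hive-conf.xml</job-xml>
        <script>top_cool.sql</script>
        <param>date=${date}</param>
        <file>hive-conf.xml#hive-conf.xml</file>
    </hive>
    <ok to="end"/>
    <error to="kill"/>
</action>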

Then we save the workflow, fill in the date when prompted, and watch the dynamic progress of the workflow! The output of the query will appear when you click on the 'View the logs' button on the action graph. In practice, INSERT or LOAD DATA statements would be used instead of a plain SELECT in order to persist the result.


You can now monitor the workflow in the dashboard and stop or rerun it.


Note: if you are seeing the error below, it means that the input file or the destination directory of the table is not writable by your user (or by the 'hive' user if you are going through HiveServer2):


Failed with exception copyFiles: error while moving files!!!
 FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
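A common fix is to open up the permissions on the table's directory; the path below assumes the default Hive warehouse location and is only an example:

# Make the table directory writable by the querying user
# (example path and permissions; adjust to your setup).
hadoop fs -chmod -R o+w /user/hive/warehouse/top_cool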


Sum-up

Hive queries can easily be tested in the Beeswax Hive Editor before being inserted into an Oozie workflow, all without touching the command line.


One of the Hue 3 goals is to remove both the duplication of the Hive script on HDFS and the manual creation of the Hive action. With the new document model, one will be able to refer to the Hive query saved in Beeswax and create the action with just one click.


Creating a workflow lets you group several scripts together and run them as a single unit. Another advantage is that the workflow can then be executed repeatedly (e.g. running a query every day at midnight) with an Oozie coordinator.

This is what we will cover in the next episode!


