Season II: 3. Schedule Hive queries with Oozie coordinators

In the previous episode we saw how to create a Hive action in an Oozie workflow. Such workflows can then be repeated automatically with an Oozie coordinator. This post describes how to schedule Hadoop jobs (e.g. "run this job every day at midnight").

Oozie Coordinators

Our goal: compute the 10 coolest restaurants of the day, every day, for one month:
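As a rough sketch, the daily query might look like the following HiveQL (the table and column names here are illustrative assumptions, not taken from the episode; `${date}` is the workflow parameter the coordinator will fill in):

```sql
-- Hypothetical sketch: table and column names are illustrative.
-- ${date} is substituted by Oozie before the query runs.
SELECT r.name, AVG(r.cool) AS coolness
FROM review r
WHERE r.`date` = '${date}'
GROUP BY r.name
ORDER BY coolness DESC
LIMIT 10;
```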


From episode 2 we now have a workflow ready to be run every day. We create a ‘daily_top’ coordinator and select our previous Hive workflow. The frequency is daily, running from November 1st, 2012 12:00 PM to November 30th, 2012 12:00 PM.
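Behind Hue's wizard, the generated coordinator definition looks roughly like this (a hedged sketch; the app name, paths, and schema version are assumptions):

```xml
<!-- Sketch of the coordinator Hue generates; names and paths are illustrative. -->
<coordinator-app name="daily_top" frequency="${coord:days(1)}"
                 start="2012-11-01T12:00Z" end="2012-11-30T12:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <!-- ${wf_application_path} points at the episode 2 Hive workflow -->
      <app-path>${wf_application_path}</app-path>
    </workflow>
  </action>
</coordinator-app>
```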


The most important part is to recreate a URI that represents the date of the data. Note that there are more efficient ways to do this, but this example is easier to understand.


As our data is already present, we just need to create an output dataset named ‘daily_days’ (which, contrary to an input dataset, won’t check whether the input is available). We pick the URI of the dataset to match the date format of episode one (e.g. $YEAR-$MONTH-$DAY). These parameters are filled in automatically in our workflow by the coordinator.
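In coordinator XML terms, the dataset definition would look roughly like this (a sketch; the HDFS path and initial instance are assumptions, and the empty done-flag means Oozie does not wait for a `_SUCCESS` file):

```xml
<!-- Sketch of the 'daily_days' dataset; the HDFS path is illustrative. -->
<datasets>
  <dataset name="daily_days" frequency="${coord:days(1)}"
           initial-instance="2012-11-01T12:00Z" timezone="UTC">
    <uri-template>hdfs:///user/hue/data/${YEAR}-${MONTH}-${DAY}</uri-template>
    <done-flag></done-flag> <!-- empty: no availability check -->
  </dataset>
</datasets>
```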


We now link our ‘daily_days’ dataset to our workflow variable ‘date’ and save the coordinator.
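The linking step corresponds to something like the following in the coordinator (a hedged sketch: the event name is an assumption, and `coord:dataOut` resolves the dataset instance URI that is handed to the workflow's ‘date’ parameter):

```xml
<!-- Sketch: mapping the dataset instance to the workflow 'date' parameter. -->
<output-events>
  <data-out name="daily_days" dataset="daily_days">
    <instance>${coord:current(0)}</instance>
  </data-out>
</output-events>
...
<property>
  <name>date</name>
  <value>${coord:dataOut('daily_days')}</value>
</property>
```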


Notice the ‘Oozie parameters’ list on Step 5, which is the equivalent of a coordinator.properties file. These values appear in the submission pop-up and can be overridden. There are also ‘Workflow properties’ for filling in workflow parameters directly (which can themselves be parameterized by ‘Oozie parameters’, EL functions, or constants). We will have more on this in the upcoming Oozie bundle episode.
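As a hedged sketch, the equivalent coordinator.properties might contain entries like these (all names and paths are assumptions; `oozie.coord.application.path` is the standard property pointing at the coordinator definition):

```properties
# Illustrative coordinator.properties; paths are assumptions.
oozie.coord.application.path=hdfs:///user/hue/oozie/deployments/daily_top
start=2012-11-01T12:00Z
end=2012-11-30T12:00Z
wf_application_path=hdfs:///user/hue/oozie/workspaces/hive_top
```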


Now submit the coordinator and watch the 30 instances (one for each day of November) being created, each triggering the workflow with the Hive query for the corresponding day. Coordinators can also be stopped and re-run through the UI. Each workflow can be accessed individually by simply clicking on its date instance.


Sum-up

With their input and output datasets, coordinators are great for scheduling repetitive workflows in a few clicks. Hue offers a UI and wizard that let you avoid writing any Oozie XML. At some point, Hue will make this even simpler by automating the creation of the workflow and coordinator: HUE-1389.

Next, let’s do fast SQL with Impala!

13 Comments

  1. Daniel 3 years ago

It’s a pity you didn’t show how you link the variables you defined to the Hive script itself…

  2. Amar 3 years ago

I generally run my queries on data for yesterday, and I want to schedule such jobs through Oozie. Could someone please direct me to a tutorial with all the date-related constants I can use?

  3. Anas A 2 years ago

I’m not able to make a coordinator run every 5 minutes.

  4. Arjun 2 years ago

    Hello,
    I want to execute 10 Hive actions or shell actions in parallel. I tried building a fork and added two actions; when I tried adding a third action, I expected it to be on the same level, but instead it gets pushed a level lower. Is this a limitation in Hue? My requirement is to have 10 actions under a fork.

    • Hue Team 2 years ago

      It should be on the same level; your screen might just not be wide enough. Have you tried Hue 3.8 / CDH 5.4? The page now scrolls horizontally.

  5. Manish 2 years ago

    How can I pass arguments from a Perl script to a Hive script in Oozie?

    I have two files:
    1. A script file (Perl or Bash), which generates the ‘where’ clause dynamically (e.g. the date for which data is required).
    2. A Hive script, which consumes those dates as arguments and produces the dataset.

    I know that using “day>=${fromdate}” the Hive script will accept arguments from the outside world.
    My question: how can I pass those arguments to the Hive script from another script, in an Oozie workflow?

    I am using Hue for designing the Oozie workflow.

  6. sashi 11 months ago

    Hi romain,

    i have a workflow like below :

    export table to ‘/path in hdfs/test_${today}’;

    I would like to pass the current date as output, so the output file is named like test_2016_10_24. I tried using the timestamp function but it fails to load because of an invalid file format unsupported by HDFS.

    How do I do it from a coordinator? Which variables should I pass as parameters in the coordinator? My current version of Hue is 3.9.
    It would be great if you can let me know how to set it up so that the daily date is appended to the output filename.

    Thanks,
    Sashi

    • Author
      Hue Team 11 months ago

      It could be an outputParameter with the format ‘test_${YEAR}_${MONTH}_${DAY}’

      http://gethue.com/new-apache-oozie-workflow-coordinator-bundle-editors/

      • sashi 11 months ago

        Thanks for your reply… however, when I set it, it says “The frequency of the workflow parameter %s cannot be guessed from the frequency of the coordinator. It so needs to be specified manually.”

        My current Hue version is 3.9. Any ideas would be highly appreciated.

        Thanks.

  7. sashi 11 months ago

    Thanks Team…I was stuck on it for 2-3 days and based on your inputs and the above video I could successfully resolve it.

    Really appreciate it.

    Thanks,
    Sashi
