Analyzing Data with Hue and Hive

In the first installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how file operations are simplified via the File Browser application. In this installment, we’ll focus on analyzing data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications (based on Hue 2.3 and later).

The Yelp Dataset Challenge provides a good use case. This post explains, through a video and tutorial, how you can get started doing some analysis and exploration of Yelp data with Hue. The goal is to find the coolest restaurants in Phoenix!

Dataset Challenge with Hue

The demo below shows how the “business” and “review” datasets are cleaned and then loaded into Hive tables before being queried with SQL.

Now, let’s step through a tutorial based on this demo. The queries and scripts are available on GitHub.

Getting Started & Normalization

First, get the dataset from the Yelp Challenge webpage. Then, clean the data using this script.

  1. Retrieve the data and extract it.
    tar -xvf yelp_phoenix_academic_dataset.tar
    cd yelp_phoenix_academic_dataset
    yelp_phoenix_academic_dataset$ ls
    notes.txt
    READ_FIRST-Phoenix_Academic_Dataset_Agreement-3-11-13.pdf
    yelp_academic_dataset_business.json
    yelp_academic_dataset_checkin.json
    yelp_academic_dataset_review.json
    yelp_academic_dataset_user.json
  2. Convert it to TSV.
    chmod +x


  3. The script prints the column headers of the business and review files, in that order.
    ["city", "review_count", "name", "neighborhoods", "type", "business_id", "full_address", "state", "longitude", "stars", "latitude", "open", "categories"]
    ["funny", "useful", "cool", "user_id", "review_id", "text", "business_id", "stars", "date", "type"]
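
The conversion script itself is not reproduced in this post. As a rough illustration only, the sketch below shows one way to turn one-JSON-object-per-line data into tab-separated rows and print the column headers. The function name `json_lines_to_tsv` and the sample records are hypothetical, and the real script may clean the data differently.

```python
import io
import json


def json_lines_to_tsv(lines, out):
    """Write one tab-separated row per JSON object and return the column names.

    Hypothetical helper: a sketch, not the script used in the demo.
    """
    columns = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if columns is None:
            # Column order is taken from the first record.
            columns = list(record.keys())
        row = []
        for col in columns:
            value = record.get(col, "")
            if isinstance(value, (list, dict)):
                # Keep nested values (e.g. categories) in a single cell.
                value = json.dumps(value)
            # Tabs and newlines inside text would break the TSV layout.
            text = str(value)
            for ch in ("\t", "\n", "\r"):
                text = text.replace(ch, " ")
            row.append(text)
        out.write("\t".join(row) + "\n")
    return columns


# Made-up sample records, not taken from the Yelp dataset.
sample = [
    '{"business_id": "b1", "name": "Taco Place", "stars": 4.5, "categories": ["Mexican", "Restaurants"]}',
    '{"business_id": "b2", "name": "Tab\\there", "stars": 3.0, "categories": []}',
]
buffer = io.StringIO()
header = json_lines_to_tsv(sample, buffer)
print(header)
print(buffer.getvalue(), end="")
```

Replacing embedded tabs and newlines with spaces matters here because a stray tab in a review text would shift every following column when the file is loaded.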

Create the Tables

Next, create the Hive tables with the “Create a new table from a file” screen in the Catalog app or Beeswax “Tables” tab.

Creating a new table

Upload the data files yelp_academic_dataset_business_clean.json and yelp_academic_dataset_review_clean.json. Hue detects the tab separator and lets you name each column of the tables. (Tip: in Hue 2.3, you can paste the column names in directly.)

Naming columns

You can then see the table and browse it.

Browsing the table


Open up Hue’s Hive editor (Beeswax) and run one of these queries:

Top 25: businesses with the most reviews

SELECT name, review_count
FROM business
ORDER BY review_count DESC
LIMIT 25

Top 25: coolest restaurants

SELECT r.business_id, name, SUM(cool) AS coolness
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 25
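
To make the logic of the coolness query concrete, here is a small in-memory sketch of the same idea in Python. It assumes the two tables are joined on business_id, the only column the two header lists share. The function name `coolest_restaurants` and the sample rows are made up for illustration; Hive does the real work in the demo.

```python
from collections import defaultdict


def coolest_restaurants(businesses, reviews, limit=25):
    """Sum the "cool" votes per restaurant and return the top entries."""
    # Equivalent of: WHERE categories LIKE '%Restaurants%'
    restaurants = {
        b["business_id"]: b["name"]
        for b in businesses
        if "Restaurants" in b["categories"]
    }
    coolness = defaultdict(int)
    for r in reviews:
        # Inner join on business_id (an assumption, see the lead-in).
        if r["business_id"] in restaurants:
            coolness[r["business_id"]] += r["cool"]
    # Equivalent of: ORDER BY coolness DESC LIMIT <limit>
    ranked = sorted(coolness.items(), key=lambda item: item[1], reverse=True)
    return [(bid, restaurants[bid], total) for bid, total in ranked[:limit]]


# Made-up rows, not taken from the Yelp dataset.
businesses = [
    {"business_id": "b1", "name": "Taco Place", "categories": '["Mexican", "Restaurants"]'},
    {"business_id": "b2", "name": "Nail Salon", "categories": '["Beauty"]'},
    {"business_id": "b3", "name": "Pizza Spot", "categories": '["Pizza", "Restaurants"]'},
]
reviews = [
    {"review_id": "r1", "business_id": "b1", "cool": 1},
    {"review_id": "r2", "business_id": "b1", "cool": 2},
    {"review_id": "r3", "business_id": "b2", "cool": 9},
    {"review_id": "r4", "business_id": "b3", "cool": 5},
]
top = coolest_restaurants(businesses, reviews)
print(top)
```

Grouping by business rather than by review is what turns per-review votes into a per-restaurant score; the salon's nine votes are ignored because it does not match the category filter.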

Query editor with SQL syntax highlighting and auto-complete

Watch the query run

See the results with an infinite scroll

Now let your imagination run wild and execute some of your own queries!

Note: This demo is about quick data analytics and exploration. Running more machine learning oriented jobs like the Yelp Examples would deserve a separate blog post on how to run MrJob. Hue users would need to create an Apache Oozie workflow with a Shell action (see below). Note that a ‘mapred’ user would first need to be created in the User Admin app.

Running MrJob Wordcount example in the Oozie app with a Shell action

What’s Next

As you can see, getting started with data analysis is simple with the interactive Hive query editor and Table browser in Hue.

Moreover, all the SELECT queries can also be run in Hue’s Cloudera Impala application for a real-time experience. You would need more data than this sample for a fair comparison, but the improved interactivity is noticeable.

In upcoming episodes, you’ll see how to use Apache Pig for a similar data analysis, and how Oozie can glue everything together in schedulable workflows.

Thank you for watching, and hurry up: only one month remains before the end of the Yelp contest!