In the first installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how file operations are simplified via the File Browser application. In this installment, we’ll focus on analyzing data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications (based on Hue 2.3 and later).
The Yelp Dataset Challenge provides a good use case. This post explains, through a video and tutorial, how you can get started doing some analysis and exploration of Yelp data with Hue. The goal is to find the coolest restaurants in Phoenix!
Dataset Challenge with Hue
The demo below shows how the “business” and “review” datasets are cleaned and converted to Hive tables before being queried with SQL.
Now, let’s step through a tutorial based on this demo. The queries and scripts are available on GitHub.
### Getting Started & Normalization
First, get the dataset from the Yelp Challenge webpage. Then, clean the data using this script.
- Retrieve the data and extract it.

<pre><code class="bash">tar -xvf yelp_phoenix_academic_dataset.tar
cd yelp_phoenix_academic_dataset
wget https://raw.github.com/romainr/yelp-data-analysis/master/convert.py
yelp_phoenix_academic_dataset$ ls
convert.py notes.txt READ_FIRST-Phoenix_Academic_Dataset_Agreement-3-11-13.pdf yelp_academic_dataset_business.json yelp_academic_dataset_checkin.json yelp_academic_dataset_review.json yelp_academic_dataset_user.json
</code></pre>

- Convert it to TSV.

<pre><code class="bash">chmod +x convert.py
./convert.py
</code></pre>

- The script above prints the following column headers.

<pre><code>["city", "review_count", "name", "neighborhoods", "type", "business_id", "full_address", "state", "longitude", "stars", "latitude", "open", "categories"]
["funny", "useful", "cool", "user_id", "review_id", "text", "business_id", "stars", "date", "type"]
</code></pre>
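As a rough illustration of this kind of normalization (this is not the actual `convert.py`), the sketch below assumes each input line holds one JSON object and flattens it into a tab-separated row. The function names `clean_value` and `json_lines_to_tsv`, the list-joining rule, and the sample record are all hypothetical.

```python
import json


def clean_value(value):
    """Flatten one JSON value into a single TSV-safe string."""
    if isinstance(value, list):
        # Assumption: lists (e.g. categories) are joined with ", ".
        value = ", ".join(str(item) for item in value)
    # Tabs and newlines inside a field would break the row layout, so every
    # run of whitespace (including plain repeated spaces) becomes one space.
    return " ".join(str(value).split())


def json_lines_to_tsv(lines, columns):
    """Turn JSON-lines records into tab-separated rows in a fixed column order."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Missing keys become empty fields so every row keeps the same width.
        rows.append("\t".join(clean_value(record.get(col, "")) for col in columns))
    return rows


# Hypothetical record; the embedded tab in the name is cleaned away.
sample = [
    '{"business_id": "b1", "name": "Taco\\tStand", "stars": 4.5, '
    '"categories": ["Mexican", "Restaurants"]}',
]
columns = ["business_id", "name", "stars", "categories"]

for row in json_lines_to_tsv(sample, columns):
    print(row)
```

Fixing the column order up front keeps every row aligned with the column names you later give Hue, which matters because a TSV file carries no field names of its own.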
### Create the Tables
Next, create the Hive tables with the “Create a new table from a file” screen in the Catalog app or Beeswax “Tables” tab.
[<img title="hue1" src="http://www.cloudera.com/wp-content/uploads/2013/04/hue1.png"/>][8]
<p class="center-align">
<strong>Creating a new table</strong>
</p>
Upload the cleaned data files `yelp_academic_dataset_business_clean.json` and `yelp_academic_dataset_review_clean.json`. Hue then guesses the tab separator and lets you name each column of the tables. (Tip: in Hue 2.3, you can paste the column names in directly.)
[<img title="hue2" src="http://www.cloudera.com/wp-content/uploads/2013/04/hue2.png"/>][9]
<p class="center-align">
<strong>Naming columns</strong>
</p>
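Because the column names are matched to fields purely by position, it can be worth checking before the upload that every row has as many tab-separated fields as there are names. The following is a minimal sketch under that assumption; the helper `rows_with_unexpected_width` and the sample rows are hypothetical, and in practice you would pass an open file instead of a list.

```python
def rows_with_unexpected_width(lines, expected, delimiter="\t"):
    """Return 1-based line numbers whose field count differs from `expected`."""
    bad = []
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\r\n")
        # Empty lines are ignored; anything else must have the expected width.
        if row and len(row.split(delimiter)) != expected:
            bad.append(number)
    return bad


# Column names for the review table, as printed by the conversion script.
review_columns = ["funny", "useful", "cool", "user_id", "review_id",
                  "text", "business_id", "stars", "date", "type"]

# Hypothetical rows: the second one has a stray tab inside the review text.
sample = [
    "0\t1\t2\tu1\tr1\tGreat tacos\tb1\t5\t2013-01-01\treview\n",
    "0\t0\t0\tu2\tr2\tBroken\trow\tb1\t3\t2013-01-02\treview\n",
]

print(rows_with_unexpected_width(sample, len(review_columns)))
```

Any line number reported here points at a row whose values would land in the wrong columns once the table is created.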
You can then see the table and browse it.
[<img title="hue3" src="http://www.cloudera.com/wp-content/uploads/2013/04/hue3.png"/>][10]
<p class="center-align">
<strong>Browsing the table</strong>
</p>
### Queries
Open up Hue’s Hive editor (Beeswax) and run one of these queries:
**Top 25: businesses with the most reviews**
<pre><code class="sql">SELECT name, review_count
FROM business
ORDER BY review_count DESC
LIMIT 25
</code></pre>
**Top 25: coolest restaurants**
<pre><code class="sql">SELECT r.business_id, name, SUM(cool) AS coolness
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 25
</code></pre>
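To see what the coolest-restaurants aggregation computes, the same join-and-sum pattern can be tried on a few toy rows. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for Hive (an assumption made for illustration; Hue itself sends the query to Hive), joins the two tables on `business_id`, the column both header lists share, and uses made-up rows and trimmed-down table definitions.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive tables (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE business (business_id TEXT, name TEXT, categories TEXT);
CREATE TABLE review (review_id TEXT, business_id TEXT, cool INTEGER);
""")

# Hypothetical sample rows: two restaurants and one non-restaurant.
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", [
    ("b1", "Taco Stand", "Mexican, Restaurants"),
    ("b2", "Noodle Bar", "Asian, Restaurants"),
    ("b3", "Car Wash", "Automotive"),
])
conn.executemany("INSERT INTO review VALUES (?, ?, ?)", [
    ("r1", "b1", 3),
    ("r2", "b1", 2),
    ("r3", "b2", 4),
    ("r4", "b3", 9),
])

# Sum the "cool" votes of all reviews per business, restaurants only.
COOLEST = """
SELECT r.business_id, name, SUM(cool) AS coolness
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 25
"""

coolest = conn.execute(COOLEST).fetchall()
for business_id, name, coolness in coolest:
    print(business_id, name, coolness)
```

Grouping by the business, not by the individual review, is what lets `SUM(cool)` add up votes across all reviews of the same restaurant; the car wash is dropped by the `LIKE` filter despite having the most votes.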
[<img title="hue4" src="http://www.cloudera.com/wp-content/uploads/2013/04/hue4.png"/>][11]
<p class="center-align">
<strong>Query editor with SQL syntax highlighting and auto-complete</strong>
</p>
[<img title="hue5" src="http://www.cloudera.com/wp-content/uploads/2013/04/hue5.png"/>][12]
<p class="center-align">
<strong>Watch the query run</strong>
</p>
[<img title="hue6" src="http://www.cloudera.com/wp-content/uploads/2013/04/hue61.png"/>][13]
<p class="center-align">
<strong>See the results with an infinite scroll</strong>
</p>
Now let your imagination run wild and execute some of your own queries!
Note: This demo covers quick data analytics and exploration. Running more machine-learning-oriented jobs like the [Yelp Examples][14] would deserve a separate blog post on how to run [MrJob][15]. Hue users would need to create an Apache Oozie workflow with a Shell action (see below). A ‘mapred’ user must also be created first in User Admin.
[<img title="hue7" src="http://www.cloudera.com/wp-content/uploads/2013/04/hue71.png"/>][16]
<p class="center-align">
<strong>Running MrJob Wordcount example in the Oozie app with a Shell action</strong>
</p>
### What’s Next
As you can see, getting started with data analysis is simple with the interactive Hive query editor and Table browser in Hue.
Moreover, all the `SELECT` queries can also be run in Hue’s Cloudera [Impala][17] application for a real-time experience. A fair comparison would need more data than this sample, but the improved interactivity is already noticeable.
In upcoming episodes, you’ll see how to use Apache Pig for doing a similar data analysis, and how Oozie can glue everything together in schedulable workflows.
Thank you for watching and hurry up, only one month before the end of the [Yelp contest][6]!