Season II: 6. Use Pig and Hive with HBase

21 October 2013 in Browsing / Querying / Tutorial - 3 minutes read

The HBase app is an elegant way to visualize and search a lot of data. Apache HBase tables can be tricky to update, as they require a lower-level API. A good alternative for simplifying data management or access is to use Apache Pig or Hive.

In this post we are going to show how to load our Yelp data from the Oozie Bundles episode into HBase with Hive. Then we will use the HBase Browser to visualize it and Pig to compute some statistics.

Access HBase with Hive

 

First, let’s use Beeswax to create a Hive table that is persisted as an HBase table. The script works as intended when using HiveServer2 as the Hive backend. Some HBase jars need to be registered, as shown in the video.

For our Yelp data, map is the right data type for the value column of our HBase-backed table, which will be created as EXTERNAL.

Here is the create table statement for a table that is going to store the top N coolest restaurants for every day:

SET hbase.zookeeper.quorum=my-hbase.com;

CREATE TABLE top_cool_hbase (key string, value map<string, int>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,review:")
TBLPROPERTIES ("hbase.table.name" = "top_cool");
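
Note that the statement above is a plain CREATE TABLE, so in the Hive versions contemporary with this post Hive creates the HBase table top_cool itself. If top_cool already exists in HBase, the usual variant is an EXTERNAL table that only maps onto it. Here is a minimal sketch, assuming the existing HBase table already has a review column family:

```sql
-- Sketch: map a Hive table onto an HBase table that already exists.
-- Assumption: 'top_cool' exists in HBase with a 'review' column family.
-- ':key' binds the first Hive column to the HBase row key.
-- 'review:' binds the map to the whole 'review' family,
-- with one column qualifier per map key (here, one per date).
CREATE EXTERNAL TABLE top_cool_hbase (key string, value map<string, int>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,review:")
TBLPROPERTIES ("hbase.table.name" = "top_cool");
```

Dropping an EXTERNAL table only removes the Hive metadata and leaves the HBase data in place.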

 

To let Hive use HBase, some jars need to be registered (once per session). Upload them to HDFS and add them as resources in the first create table query:

/usr/lib/hive/lib/zookeeper.jar
/usr/lib/hive/lib/hbase.jar
/usr/lib/hive/lib/hive-hbase-handler-0.XX.0-cdhX.X.X.jar
/usr/lib/hive/lib/guava-11.0.2.jar
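
Outside of Beeswax, for example in the Hive CLI or Beeline, the same registration can be sketched with ADD JAR statements. This assumes the jars are readable at the paths listed above from wherever the statements are resolved. The XX and X.X.X placeholders are kept as in the list and must be replaced with your installed versions:

```sql
-- Register the HBase integration jars for the current session.
ADD JAR /usr/lib/hive/lib/zookeeper.jar;
ADD JAR /usr/lib/hive/lib/hbase.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler-0.XX.0-cdhX.X.X.jar;
ADD JAR /usr/lib/hive/lib/guava-11.0.2.jar;

-- Check which jars are registered in this session.
LIST JARS;
```

Because the registration is per session, these statements need to run again in every new session before the HBase-backed table is touched.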

 

Then let’s add data to our new table. We copy it from the top_cool table of the previous episode.

INSERT OVERWRITE TABLE top_cool_hbase SELECT name, map(`date`, cast(coolness as int)) FROM top_cool;

 

If you don’t have the table from the past episode, you can still use the one from episode one as a workaround:

INSERT OVERWRITE TABLE top_cool_hbase SELECT name, map(`date`, cast(r.stars as int)) FROM review r JOIN business b ON r.business_id = b.business_id;
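
Once the insert has finished, the HBase-backed table can be queried like any other Hive table. Here is a minimal sketch, reusing the date from the Pig script further down (any date present in your data works):

```sql
-- Look up one map key per row. Each map key is stored as an HBase
-- column qualifier in the 'review' family.
SELECT key AS name, value['2012-12-02'] AS coolness
FROM top_cool_hbase
WHERE value['2012-12-02'] IS NOT NULL
LIMIT 10;
```

The rows returned should match what the HBase Browser shows for the same date column.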

Access HBase with HBase Browser

As seen in the video, the HBase app provides a slick new Web interface to HBase.

 

Access HBase with Pig

Pig comes with a built-in HBaseStorage that can both load from and store to HBase. After registering two jars, you will be able to use it. Here is the script for dumping all the counts of a particular day:

REGISTER /usr/lib/zookeeper/zookeeper-3.4.5-cdhX.X.X.jar
REGISTER /usr/lib/hbase/hbase-0.94.6-cdhX.X.X-security.jar

set hbase.zookeeper.quorum 'localhost'

data = LOAD 'hbase://top_cool'
       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('review:*', '-loadKey true')
       as (name:CHARARRAY, dates:MAP[]);

counts =
    FOREACH data
    GENERATE name, dates#'2012-12-02';

DUMP counts;

 

Sum-up

Hive and Pig are excellent tools for manipulating HBase data. All combinations are possible: the sky is the limit! For example, you could load from HBase and save into a Hive table with Pig, or use Hive SQL to query HBase tables. You can even pull HDFS or Hive data from Pig with HCatalog, save it into HBase (or vice versa) and browse it with HBase Browser!

Next time, let’s see how to create a search engine from the Yelp data!

As usual, if you have questions or feedback, feel free to contact the Hue community on hue-user or @gethue.com!

