Season II: 6. Use Pig and Hive with HBase

The HBase app is an elegant way to visualize and search large amounts of data. Apache HBase tables can be tricky to update, however, as they require a lower-level API. A good alternative for simplifying data management and access is Apache Pig or Hive.

In this post we are going to show how to load our Yelp data from the Oozie Bundles episode into HBase with Hive. Then we will use the HBase Browser to visualize it and Pig to compute some statistics.

 

Access HBase with Hive

 

First, let’s use Beeswax to create a Hive table that is persisted as an HBase table. The script works as intended when HiveServer2 is the Hive backend. Some HBase jars need to be registered, as shown in the video.

In our Yelp use case, map is the correct data type for our HBase table, which will be created as EXTERNAL.

Here is the statement for creating a table that will store the top N coolest restaurants for every day:

 

SET hbase.zookeeper.quorum=my-hbase.com;

CREATE TABLE top_cool_hbase (key string, value map<string, int>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,review:")
TBLPROPERTIES ("hbase.table.name" = "top_cool");
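To make the ":key,review:" mapping more concrete, here is a minimal Python sketch of how one Hive row with a map column could be laid out as HBase cells: the key column becomes the row key and each map entry becomes one column in the review family. This is only an illustrative model, not the storage handler's actual code. The function name row_to_cells and the sample restaurant are made up, and values are shown as strings on the assumption that the default string storage is used.

```python
def row_to_cells(key, value, family="review"):
    """Model the ':key,review:' mapping: one cell per map entry.

    Returns a list of (row_key, 'family:qualifier', value) tuples,
    sorted by qualifier for readability.
    """
    return [
        (key, f"{family}:{qualifier}", str(v))
        for qualifier, v in sorted(value.items())
    ]


# Hypothetical sample row: a restaurant name and its coolness per date.
cells = row_to_cells("Some Restaurant", {"2012-12-02": 5, "2012-12-01": 3})
for cell in cells:
    print(cell)
```

Seen this way, the date is the column qualifier, which is why a map is a natural fit when the set of dates is not known in advance.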

 

To allow Hive to use HBase, some jars need to be registered (once per session). Upload them to HDFS and add them as resources in the first create table query:

/usr/lib/hive/lib/zookeeper.jar
/usr/lib/hive/lib/hbase.jar
/usr/lib/hive/lib/hive-hbase-handler-0.XX.0-cdhX.X.X.jar
/usr/lib/hive/lib/guava-11.0.2.jar

 

Then let’s add data to our new table. We copy it from the top_cool table of the previous episode.

INSERT OVERWRITE TABLE top_cool_hbase SELECT name, map(`date`, cast(coolness as int)) FROM top_cool;
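Each selected row carries a one-entry map, so the rows for one restaurant on different days should end up as different columns of the same HBase row. Here is a rough Python model of that behaviour, assuming ordinary HBase put semantics where the latest write wins for the same row and column. The put helper and the sample data are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def put(table, key, value):
    """Merge a one-row map into the table.

    New qualifiers are added to the row; an existing qualifier is overwritten.
    """
    table[key].update(value)


table = defaultdict(dict)
put(table, "Restaurant A", {"2012-12-01": 3})
put(table, "Restaurant A", {"2012-12-02": 5})
put(table, "Restaurant A", {"2012-12-02": 7})  # same row and column: replaces 5
print(dict(table))
```

If a query produces several values for the same name and date, only one of them is visible afterwards, so it is worth checking that the source rows are unique per name and date.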

 

If you don’t have the table from the past episode, you can still use the one from episode one as a workaround:

INSERT OVERWRITE TABLE top_cool_hbase SELECT name, map(`date`, cast(r.stars as int)) FROM review r JOIN business b ON r.business_id = b.business_id;

Access HBase with HBase Browser

As seen in the video, the HBase app provides a slick new Web interface to HBase.

 

Access HBase with Pig

Pig comes with built-in HBaseStorage and HBaseLoader functions. After registering two jars, you will be able to use them. Here is a script that dumps all the counts for a particular day:

 

REGISTER /usr/lib/zookeeper/zookeeper-3.4.5-cdhX.X.X.jar
REGISTER /usr/lib/hbase/hbase-0.94.6-cdhX.X.X-security.jar

set hbase.zookeeper.quorum 'localhost'

data = LOAD 'hbase://top_cool'
       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('review:*', '-loadKey true')
       as (name:CHARARRAY, dates:MAP[]);

counts =
    FOREACH data
    GENERATE name, dates#'2012-12-02';

DUMP counts;
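For readers less familiar with Pig's map lookup, the FOREACH step can be pictured with a small Python sketch. It is a model under assumptions, not Pig itself: the function name counts_for_day and the sample rows are invented, and the values are plain strings here, whereas Pig would hand them over as bytearrays unless they are cast.

```python
def counts_for_day(rows, day):
    """Mimic FOREACH data GENERATE name, dates#'<day>'.

    rows: iterable of (name, dates), where dates maps a date string to a value.
    A missing date yields None, much like Pig returns null for an absent map key.
    """
    return [(name, dates.get(day)) for name, dates in rows]


# Hypothetical rows, shaped like (name, dates) tuples loaded with '-loadKey true'.
rows = [
    ("Restaurant A", {"2012-12-01": "3", "2012-12-02": "5"}),
    ("Restaurant B", {"2012-12-01": "4"}),
]
counts = counts_for_day(rows, "2012-12-02")
print(counts)
```

Restaurants with no entry for that day still appear in the result, just with an empty value.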

 

Sum-up

Hive and Pig are excellent tools for manipulating HBase data. All combinations are possible; the sky is the limit! For example, you could load from HBase and save into a Hive table with Pig, or use Hive SQL to query HBase tables. You can even pull HDFS or Hive data from Pig with HCatalog, save it into HBase (or vice versa) and browse it with HBase Browser!

Next time, let’s see how to create a search engine from the Yelp data!

As usual, if you have questions or feedback, feel free to contact the Hue community on hue-user or @gethue.com!

5 Comments

  1. Nam 3 years ago

    This tutorial can not be applied with CDH 5.1.2. Because I couldn’t find “hbase-0.98.1-cdh5.1.2-security.jar” in Hbase 0.98.1.
    Was it removed or replaced by other file?
    If the answer is yes, what hbase file do i need to register to allow pig script to use hbase ?

    • Hue Team 3 years ago

      The HBase lib has changed a bit, you can execute this command to get your specific version:

      find /usr/lib/hbase/ -name '*.jar' -exec grep -Hls TableInputFormat {} \;

      it should be something in this format:

      /usr/lib/hbase/hbase-server-[hbase_version]-cdh[cdh_version].jar

  2. sanyog 8 months ago

    Hello thanks for article and support Hue Team
    ** above hbase security jar file is not available in hbase lib folder.**
    even not available in this link:(it has all the available jar files.)
    http://www.java2s.com/Code/Jar/h/Downloadhbaseclient0985hadoop2jar.htm

    I am using hbase version-0.98 and pig version -0.15

    I am trying to communicate hbase data in pig but its show an error :

    pig script failed to validate: java.lang.RuntimeException: could not instantiate ‘org.apache.pig.backend.hadoop.hbase.HBaseStorage’ with arguments ‘[f:cnt, -loadKey true]’

    Please reverts as you see what can i use in alternative of it.
    Thanks in advance.
    Regards
    ST
