How to access Hive in Pig with HCatalog in Hue

How to access Hive in Pig with HCatalog in Hue

This blog post is about accessing the Hive Metastore from Hue, the open source Hadoop UI and clearing up some confusion about HCatalog usage.

What is HCatalog?

Apache HCatalog is a project enabling non-Hive scripts to access Hive tables. You can then directly load tables with Pig or MapReduce without having to worry about re-defining the input schemas, caring about the data location or duplicating it.

 

Hue comes with an application for accessing the Hive metastore within your browser: Metastore Browser. Databases and tables can be navigated through and created or deleted with some wizards.

 

The wizards were demonstrated in the previous tutorial about how to Analyse Yelp data. Hue uses HiveServer2 for accessing the Hive Metastore instead of HCatalog. This is because HiveServer2 is the new secure and multi concurrent server for Hive and it already includes a fast Hive Metastore API.

 

HCatalog connectors are however useful for accessing Hive data from Pig. Here is a demo about accessing the Hive example tables from the Pig Editor.

 

Here is a video summary of the new features:

 

Tutorial

First you need to install HCatalog from here or Cloudera Manager. If you are using a non-pseudo-distributed cluster (e.g. not on a demo VM) make sure that the Hive Metastore is remote or you will have an error like below. Then, upload the 3 jars from /usr/lib/hcatalog/share/hcatalog/ and all the Hive ones from /usr/lib/hive/lib to the Oozie Pig sharelib in /user/oozie/share/lib/pig. This can be done in a few clicks while being logged as ‘oozie’ or ‘hdfs’ in the File Browser. Beware than all the jars will be included in all the future Pig script, which might be unnecessary.

 

Update!

In Hue 3.6 or CDH5, no need to copy the jars anymore. Just include the hive-site.xml file as File in the Properties of the script, e.g, /user/test/hive-site.xml

 

Then make sure the Beeswax examples are installed (Step #2 in the Quick Start Wizard) and open up the Pig Editor and compute the average salary in the table (equivalent of this Hive query):

 

-- Load table 'sample_07'
sample_07 = LOAD 'sample_07' USING org.apache.hcatalog.pig.HCatLoader();

-- Compute the average salary of the table
salaries = GROUP sample_07 ALL;
out = FOREACH salaries GENERATE AVG(sample_07.salary);
DUMP out;

 

As HCatalog needs to access the metastore, we need to specify the hive-site.xml. Go in ‘Properties’, ‘Resources’ and add a ‘File’ pointing to the hive-site.xml uploaded on HDFS.

 

Then submit the script by pressing CTRL + ENTER! The result (47963.62637362637)

will appear at the end of the log output.

 

Notice that we don’t need to redefine the schema as it is automatically picked-up by the loader. If you use the Oozie App, you can now freely use HCatalog in your Pig actions.

 

Warning!

If you are getting this error, it means that your metastore belong to the Hive user and is not remote.

Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.

2013-07-24 23:20:04,969 [main] INFO  DataNucleus.Persistence  - DataNucleus Persistence Factory initialised for datastore URL="jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true" driver="org.apache.derby.jdbc.EmbeddedDriver" userName="APP"

 

sudo rm /var/lib/hive/metastore/metastore_db/*lck
sudo chmod 777 -R /var/lib/hive/metastore/metastore_db

 

Similarly as HCatLoader, use HCatStorer for updating back the table, e.g.:

STORE alias INTO 'sample_07' USING org.apache.hcatalog.pig.HCatStorer();

 

Summary

We saw that Hue makes Hive Metastore easy to access and supports the HCatalog connectors for Pig. Hue 3.0 with simplify it even more by automatically copying the required jar files and making the table names auto-completable!

As usual, we welcome any feedback on the user group!

13 Comments

  1. mark 3 years ago

    To make this work requires uploading a huge number of jar files into oozie’s hdfs lib directory. The video does this quickly and no mention is made about where to get these files. In the video, these are shown as coming from the /downloads/ directory – but where did they come from?

    I assume that these are somewhere on the VM? Is there a list?

    Why weren’t these in the oozie lib to begin with if Pig needs them?

  2. mark 3 years ago

    The same is true for the hive-site.xml file. Where did this come from? Does it need to be configured?

    These are critical things – yet the tutorial just breezes through them – and assumes that somehow we all have a bunch of jars in our downloads directory and that we’ve already uploaded a critical config file.

    For what it’s worth, it took 1 minute to get a simple Pig script running on Hortonworks.

  3. Hue Team 3 years ago

    Did you read this part?
    “Update!
    In Hue 3.6 or CDH5, no need to copy the jars anymore. Just include the hive-site.xml file as File in the Properties of the script, e.g, /user/test/hive-site.xml”

    Hue is an open source software that is shipped on the Hadoop VM, we don’t control the VM themselves. We can see if Hue could automatically pickup its local hive-site.xml to help and also contact the Cloudera VM guys.

  4. mahesh 3 years ago

    I want to store the Pig output data in Hive table but in Hue it only supports for org.apache.hcatalog.pig.HCatLoader but it not supported for org.apache.hcatalog.pig.HCatStorer . How can i reslove this problem ? please help me asap

    • Hue Team 3 years ago

      One possible solution could be saving your output as a CSV file and creating a Hive table with that from the Metastore app

  5. Ajith 3 years ago

    Can you please specify how to access the parameters from job.properties in hue. In my case i want some parameters to be passed dynamically to the workflow like input path or date parameter instead of giving it manually.

  6. Yelah 2 years ago

    Hi
    I have the following error:
    pig script failed to validate: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve org.apache.hcatalog.pig.HCatLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]

    It seems that it can use the HCatalogLoader, what can I do?

  7. Yelah 2 years ago

    ok so i tried cloudera quickstart 5.4.0 and 5.3.0 and it seems that the error doesn’t occurs in cloudera 5.3.0 unlike the version 5.4.0

    • Hue Team 2 years ago

      Recently org.apache.hcatalog.pig.HCatLoader –> org.apache.hive.hcatalog.pig.HCatLoader

      We updated Hue to detect both and add the required jars: https://issues.cloudera.org/browse/HUE-2716

      A hack would be to put a comment in the file that looks like:
      — org.apache.hcatalog.pig.HCatLoader

      or upgrade to CDH5.4.1 that has the fix!

Leave a reply

Your email address will not be published. Required fields are marked *

*