Welcome to season 2 of the Hue video series. In this new chapter we are going to demonstrate how Hue can simplify Hadoop usage and let you focus on your business rather than on the underlying technology. Through a real-life scenario, we will use various Hadoop tools within the Hue UI to explore some data and extract competitive insights from it.
Let’s go surf the Big Data wave, directly from your Browser!
We want to open a new restaurant. In order to optimize our future business, we would like to learn more about existing restaurants: which tastes are trending, what diners are looking for, and what they are positive or negative about… To answer these questions, we are going to need some data.
Luckily, Yelp provides datasets of restaurants and reviews, so we download them. What’s next? Let’s move the data into Hadoop and make it queryable!
Convert JSON data with Pig
The current format is JSON, which is easy to store but difficult to query, as each row consists of one big record and requires a more sophisticated loader. We are also going to clean up the data a bit in the process.
To do this in a scalable way, we are going to use the query tool Apache Pig and, to make it easy, the Pig Editor in Hue. We explain two ways to do it.
All the code is available on the Hadoop Tutorial github.
Method 1: Pig JsonLoader/JsonStorage
Pig natively provides a JsonLoader. We load our data and map it to a schema, then explode the votes into 3 columns. Notice the clean-up of the text of the reviews.
Here is the script:
reviews = LOAD 'yelp_academic_dataset_review.json'
    USING JsonLoader('votes:map[],user_id:chararray,review_id:chararray,stars:int,date:chararray,text:chararray,type:chararray,business_id:chararray');

tabs = FOREACH reviews GENERATE
    (INT) votes#'funny', (INT) votes#'useful', (INT) votes#'cool',
    user_id, review_id, stars,
    REPLACE(REPLACE(text, '\\n', ''), '\\t', ''),
    date, type, business_id;

STORE tabs INTO 'yelp_academic_dataset_review.tsv';
Note: if the script fails with a ClassNotFound exception, you might need to log in as ‘oozie’ or ‘hdfs’ and upload /usr/lib/pig/lib/json-simple-1.1.jar to /user/oozie/share/lib/pig on HDFS with File Browser.
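Before running the job on the cluster, it can help to sanity-check the transformation locally. Here is a minimal Python sketch of the same mapping, assuming a record with the fields from the schema above (the function name and sample record are made up for illustration, not part of the tutorial):

```python
import json

def review_to_tsv(line):
    """Mirror the Pig script: explode votes into 3 columns and strip tabs/newlines from text."""
    r = json.loads(line)
    votes = r.get("votes", {})
    fields = [
        votes.get("funny", 0), votes.get("useful", 0), votes.get("cool", 0),
        r["user_id"], r["review_id"], r["stars"],
        r["text"].replace("\n", " ").replace("\t", " "),
        r["date"], r["type"], r["business_id"],
    ]
    return "\t".join(str(f) for f in fields)

# A made-up record with the same shape as the Yelp review schema:
sample = '{"votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "u1", "review_id": "r1", "stars": 4, "date": "2012-01-01", "text": "Great\\ttacos\\nhere", "type": "review", "business_id": "b1"}'
print(review_to_tsv(sample))  # → 0	2	1	u1	r1	4	Great tacos here	2012-01-01	review	b1
```

This is only a local check; the Pig version does the same work in parallel across the cluster.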
Method 2: Pig Python UDF
Let’s convert the business data to TSV with one of Pig’s great features: Python UDFs. We are going to process each row with a UDF that loads the JSON records one by one and prints them with tabs as delimiters.
As Pig currently uses Jython 2.5 for executing Python UDFs and there is no built-in json lib, we need to download jyson from http://downloads.xhaus.com/jyson/. Grab the jyson-1.0.2 version, extract it, and upload jyson-1.0.2.jar to /user/oozie/share/lib/pig with File Browser.
We need to import our Python UDF into Pig. Open up the Pig Editor and upload a file resource named converter.py. You can also create the file directly on HDFS with File Browser, then edit it and add this script:
from com.xhaus.jyson import JysonCodec as json

@outputSchema("business:chararray")
def tsvify(line):
    business_json = json.loads(line)
    business = map(unicode, business_json.values())
    return '\t'.join(business).replace('\n', ' ').encode('utf-8')
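Since jyson only exists under Jython, you can sanity-check the UDF’s logic locally with CPython’s standard json module. This is just a sketch of the same transformation (the function name and sample record are ours, not part of the tutorial):

```python
import json  # CPython's json stands in for jyson, which only runs under Jython

def tsvify_local(line):
    # Same logic as the UDF, with str replacing Python 2's unicode
    business_json = json.loads(line)
    business = map(str, business_json.values())
    return '\t'.join(business).replace('\n', ' ')

# Hypothetical record shaped like a Yelp business entry:
print(tsvify_local('{"business_id": "b1", "name": "Taco Place", "stars": 4.5}'))
```

Note that the real UDF relies on the dict’s value ordering, so the TSV column order follows the key order of each JSON record.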
Go to ‘Properties’ > ‘Resources’ and specify the path to converter.py on HDFS.
You are then ready to type the following Pig script:
REGISTER 'converter.py' USING jython AS converter;

reviews = LOAD '/user/romain/yelp/yelp_academic_dataset_business.json' AS (line:CHARARRAY);
tsv = FOREACH reviews GENERATE converter.tsvify(line);
STORE tsv INTO 'yelp_academic_dataset_business.tsv'
What’s next?
Pig is a powerful tool for processing terabytes of data, and the Hue Pig Editor makes it easier to play around. Python UDFs will become part of the editor when HUE-1136 is finished. In episode 3, we will see how to convert to even better formats.
In the next episode, let’s see how to query the data and learn more about the restaurant market!
19 Comments
-
This is awesome. I will try to play around with it.
-
I am using the cdh5 quickstart virtual machine; the path /user/oozie/share/lib/pig is not available. And I face errors when running the UDF.
-
You need to install the Oozie sharelib: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_oozie_configure.html
Feel free to ask on http://groups.google.com/a/cloudera.org/group/hue-user for more details if needed.
-
I’m also using the cdh5 quickstart and in Cloudera Manager I can see that Oozie is installed. But when I run a script with JsonLoader, it works without the “STORE” statement… but if I add it, the script fails… Do I have to change something in the installation?
-
How about following up on the Hue user list for this problem: https://groups.google.com/a/cloudera.org/forum/#!topic/hue-user/Ig9OHH_EgcU
-
-
I tried to upload the JSON to the Cloudera Live demo. It says upload is disabled in the live demo. How can we do a POC if it is disabled?
-
The online demo is read-only for security reasons. We have a second version planned that will fix this.
In the meantime we recommend a VM or local install for any serious POC: http://www.cloudera.com/content/support/en/downloads.html
-
-
Thanks! this is a very useful article.
I’m using the new CDH5 distribution – there is an issue with Oozie and the shareliblist (which displays as empty).
Be sure to:
$ export OOZIE_URL=http://localhost:11000/oozie
then move the HDFS share directory for oozie to old-share:
$ hadoop fs -mv /user/oozie/share /user/oozie/old-share
and re-create the share like this:
$ hadoop fs -mkdir /user/oozie/share/lib
$ sudo oozie-setup sharelib create -fs hdfs://localhost:8020 -locallib /usr/lib/oozie/oozie-sharelib-yarn.tar.gz
Your share will now use the new timestamp format, e.g. lib_20140918162047.
So to put jyson-1.0.2.jar in place, I did this:
$ hadoop fs -put jyson-1.0.2/lib/jyson-1.0.2.jar /user/oozie/share/lib/lib_20140918162047/pig/
and then restart oozie:
$ sudo service oozie stop
$ sudo service oozie start
Now I can see the shareliblist (and run the code in this article):
$ oozie admin -shareliblist pig
[Available ShareLib]
pig
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/ant-1.6.5.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/antlr-2.7.7.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/antlr-runtime-3.4.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/automaton-1.11-8.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/commons-collections-3.2.1.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/commons-el-1.0.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/commons-httpclient-3.1.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/commons-io-2.1.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/core-3.1.1.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/findbugs-annotations-1.3.9-1.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/guava-11.0.2.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/hbase-client-0.98.1-cdh5.1.0-tests.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/hbase-client-0.98.1-cdh5.1.0.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/hbase-common-0.98.1-cdh5.1.0.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/hbase-protocol-0.98.1-cdh5.1.0.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/hsqldb-1.8.0.10.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/htrace-core-2.04.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jansi-1.9.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jasper-compiler-5.5.23.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jasper-runtime-5.5.23.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jets3t-0.6.1.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jetty-6.1.14.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jetty-util-6.1.26.cloudera.2.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jline-0.9.94.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/joda-time-1.6.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jsch-0.1.42.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jsp-2.1-6.1.14.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jsp-api-2.1-6.1.14.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jsr305-1.3.9.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jyson-1.0.2.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/jython-standalone-2.5.3.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/kfs-0.3.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/netty-3.6.6.Final.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/oozie-sharelib-pig-4.0.0-cdh5.1.0.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/oro-2.0.8.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/parquet-pig-bundle-1.2.5-cdh5.1.0.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/pig-0.12.0-cdh5.1.0.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/protobuf-java-2.5.0.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/servlet-api-2.5-6.1.14.jar
hdfs://quickstart.cloudera:8020/user/oozie/share/lib/lib_20140918162047/pig/stringtemplate-3.2.1.jar
-
Indeed, this is correct: in CDH5 the sharelib layout changed and needs to be updated as you did!
-
-
Hi
I am getting the below error when I run the Pig script with ‘converter.py’. Can you please help me fix this?
Apache Pig version 0.12.0-cdh5.1.0 (rexported)
compiled Jul 12 2014, 08:41:26
Run pig script using PigRunner.run() for Pig version 0.8+
2014-10-14 18:39:53,744 [main] INFO org.apache.pig.Main – Apache Pig version 0.12.0-cdh5.1.0 (rexported) compiled Jul 12 2014, 08:41:26
2014-10-14 18:39:53,745 [main] INFO org.apache.pig.Main – Logging error messages to: /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/cloudera/appcache/application_1413332198888_0004/container_1413332198888_0004_01_000002/pig-job_1413332198888_0004.log
2014-10-14 18:39:54,067 [main] INFO org.apache.pig.impl.util.Utils – Default bootup file /var/lib/hadoop-yarn/.pigbootup not found
2014-10-14 18:39:54,419 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-10-14 18:39:54,419 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
2014-10-14 18:39:54,420 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://quickstart.cloudera:8020
2014-10-14 18:39:54,436 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to map-reduce job tracker at: localhost:8032
2014-10-14 18:39:54,594 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
2014-10-14 18:39:54,594 [main] INFO org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-10-14 18:39:54,656 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine – created tmp python.cachedir=/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/cloudera/appcache/application_1413332198888_0004/container_1413332198888_0004_01_000002/tmp/pig_jython_6753505879123046001
2014-10-14 18:40:05,207 [main] WARN org.apache.pig.scripting.jython.JythonScriptEngine – pig.cmd.args.remainders is empty. This is not expected unless on testing.
2014-10-14 18:40:08,185 [main] INFO org.apache.pig.scripting.jython.JythonScriptEngine – Register scripting UDF: converter.tsvify
2014-10-14 18:40:08,996 [main] INFO org.apache.pig.scripting.jython.JythonFunction – Schema ‘business:chararray’ defined for func tsvify
2014-10-14 18:40:09,119 [main] ERROR org.apache.pig.tools.grunt.Grunt – ERROR 1000: Error during parsing. Lexical error at line 8, column 0. Encountered: <EOF> after : ""
-
converter.py is not a Pig script but a Python UDF (user-defined function) that will be used by the Pig script itself, which you can find here: https://github.com/romainr/hadoop-tutorials-examples/blob/master/pig-json-python-udf/clean_json.pig
-
-
Hi
Thanks for your reply. I agree with your comments above. I followed the steps as you mentioned, but I am still unable to process the data when I run a Python UDF used by a Pig script.
2014-10-20 10:25:03,155 [main] INFO org.apache.pig.scripting.jython.JythonFunction – Schema ‘business:chararray’ defined for func tsvify
2014-10-20 10:25:03,595 [main] ERROR org.apache.pig.tools.grunt.Grunt – ERROR 1000: Error during parsing. Lexical error at line 11, column 0. Encountered: <EOF> after : ""
Can you please help me fix the issue?
-
Srinivas, the last line, currently:
STORE tsv INTO ‘yelp_academic_dataset_business.tsv’
just needs a semicolon at the end:
STORE tsv INTO ‘yelp_academic_dataset_business.tsv’;
-
-
Hello: I am running into an error while running converter.py on the Yelp business dataset. The log is below. What should I do to resolve this issue? I have been trying to look up solutions on the internet since yesterday.
Apache Pig version 0.12.0-cdh5.5.0 (rexported)
compiled Nov 09 2015, 12:41:48Run pig script using PigRunner.run() for Pig version 0.8+
2016-01-26 09:25:00,093 [uber-SubtaskRunner] INFO org.apache.pig.Main – Apache Pig version 0.12.0-cdh5.5.0 (rexported) compiled Nov 09 2015, 12:41:48
2016-01-26 09:25:00,107 [uber-SubtaskRunner] INFO org.apache.pig.Main – Logging error messages to: /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/cloudera/appcache/application_1453822911018_0001/container_1453822911018_0001_01_000001/pig-job_1453822911018_0001.log
2016-01-26 09:25:00,470 [uber-SubtaskRunner] INFO org.apache.pig.impl.util.Utils – Default bootup file /var/lib/hadoop-yarn/.pigbootup not found
2016-01-26 09:25:01,021 [uber-SubtaskRunner] INFO org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2016-01-26 09:25:01,021 [uber-SubtaskRunner] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
2016-01-26 09:25:01,021 [uber-SubtaskRunner] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: hdfs://quickstart.cloudera:8020
2016-01-26 09:25:01,053 [uber-SubtaskRunner] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to map-reduce job tracker at: localhost:8032
2016-01-26 09:25:01,060 [uber-SubtaskRunner] WARN org.apache.pig.PigServer – Empty string specified for jar path
2016-01-26 09:25:01,558 [uber-SubtaskRunner] INFO org.apache.hadoop.conf.Configuration.deprecation – fs.default.name is deprecated. Instead, use fs.defaultFS
2016-01-26 09:25:01,575 [uber-SubtaskRunner] INFO org.apache.hadoop.conf.Configuration.deprecation – mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2016-01-26 09:25:01,653 [uber-SubtaskRunner] INFO org.apache.pig.scripting.jython.JythonScriptEngine – created tmp python.cachedir=./tmp/pig_jython_1452569248124020506
2016-01-26 09:25:03,071 [communication thread] INFO org.apache.hadoop.mapred.TaskAttemptListenerImpl – Progress of TaskAttempt attempt_1453822911018_0001_m_000000_0 is : 1.0
2016-01-26 09:25:08,268 [uber-SubtaskRunner] WARN org.apache.pig.scripting.jython.JythonScriptEngine – pig.cmd.args.remainders is empty. This is not expected unless on testing.
-
Do you have more info in the MapReduce logs in Job Browser or Oozie logs?
-
-
I’m new at this. I wonder if there is a step-by-step tutorial for setting up the database; I would like something more illustrative on the use of HUE, since for beginners it is very difficult.
Greetings from Ecuador -
I’m new to this as well. I wonder what factor is impacting my speed. As the demo showed, it’s very fast when the job is processing the logs, whereas in my VM it takes really long to see them get started, especially for the 2nd part of the demo, processing the business table.
The jobs are accepted but never succeed.
-
Author
Did you turn off the unused services in Cloudera Manager in the VM (e.g. HBase, Spark, etc.)? It should free up some resources.
-
-
Hi, I am seeing an error in Pig 0.14.1 with jyson 1.0.2. Stack trace below:
at com.xhaus.jyson.JysonCodec.loads(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
java.lang.ClassCastException: java.lang.ClassCastException: org.python.core.PyReflectedFunction cannot be cast to org.python.core.PyString
at org.python.core.Py.JavaError(Py.java:495)
at org.python.core.Py.JavaError(Py.java:488)
at org.python.core.PyReflectedFunction.__call__(PyReflectedFunction.java:188)
at org.python.core.PyReflectedFunction.__call__(PyReflectedFunction.java:204)
at org.python.core.PyObject.__call__(PyObject.java:387)
at org.python.core.PyObject.__call__(PyObject.java:391)
at org.python.pycode._pyx7.bag_of_tags$1(/homes/rsegu/Quasar/fileds_to_tags.py:43)
at org.python.pycode._pyx7.call_function(/homes/rsegu/Quasar/fileds_to_tags.py)
at org.python.core.PyTableCode.call(PyTableCode.java:165)
at org.python.core.PyBaseCode.call(PyBaseCode.java:301)
at org.python.core.PyFunction.function___call__(PyFunction.java:376)
at org.python.core.PyFunction.__call__(PyFunction.java:371)
at org.python.core.PyFunction.__call__(PyFunction.java:361)
at org.python.core.PyFunction.__call__(PyFunction.java:356)
at org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:117)
… 22 more
Caused by: java.lang.ClassCastException: org.python.core.PyReflectedFunction cannot be cast to org.python.core.PyString
at com.xhaus.jyson.JysonCodec.loads(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.python.core.PyReflectedFunction.__call__(PyReflectedFunction.java:186)
… 34 more