Hue 2.2, the open-source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from your browser without going to a command-line interface. It features applications such as an Apache Hive editor and an Apache Oozie dashboard and workflow builder.
This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples from the previous series have been updated for the recent CDH 4.2 release.
Collecting Data
The first step is to create the “flume” user and its home directory on HDFS, where the data will be stored. This can be done via the User Admin application.
The second step consists of collecting some tweet data from the live Twitter stream.
Apache Flume is an elegant solution for taking care of this. The configuration of Flume is detailed in the readme and the previous blog post. However, if you want to skip this step, some data is available on GitHub. Just upload it as a zip file to the home directory of the flume user, and the “tweets” directory will show up after a few seconds.
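For reference, the agent definition from the sample app looks roughly like this (a sketch only; the source class, keywords, and HDFS path follow the cdh-twitter-example and may differ in your setup, so consult the readme for the authoritative version; the credential placeholders are yours to fill in):

```properties
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Custom source that reads the live Twitter stream as JSON events
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>

# Buffer events in memory between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

# Write raw JSON into hourly buckets under /user/flume/tweets
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
```

The `%Y/%m/%d/%H` escape sequences in the sink path are what produce the per-hour directories that the Hive partitions map onto later in this post.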
If you are not taking this shortcut, create the tweets directory in the File Browser with the New Folder action.
Then, when the Flume agent is started, the data will start appearing:
Clicking on a file will display its content in the built-in viewer:
Preparing Hive
It is time to prepare the analysis of the tweet data. We’ll use Apache Hive, which can query the data with SQL-like syntax in a scalable way. The Hive setup is described in detail in the readme.
When Hive is ready, the tweet table can be created in the query editor of Beeswax. Notice that the Hive SerDe (to download or compile here) must be included as a jar in the query. You can read more about Hive SerDe in this previous post.
To do this, just click on “Add” > “File Resources”, click on the path chooser button, click on the “Home” button, and upload hive-serdes-1.0-SNAPSHOT.jar.
Then just enter the CREATE TABLE statement and execute it:
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweet_count INT,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING, name:STRING>>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets'
Now that the table is created, let’s insert some data in the table. First, select the table in the “Table” tab and click “Import data”. Enter the path “/user/flume/tweets/2013/02/25/17” and “2013022517” as the key:
Depending on the partition picked, a query similar to this will be generated:
LOAD DATA INPATH '/user/flume/tweets/2013/02/25/16' INTO TABLE `default.tweets` PARTITION (datehour='2013022516')
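The datehour partition key is simply the %Y/%m/%d/%H bucket path written by Flume collapsed into a single integer. A small helper (a hypothetical illustration, not part of Hue) makes the mapping explicit:

```python
def datehour_from_path(path):
    """Map a Flume bucket path like '/user/flume/tweets/2013/02/25/16'
    to the Hive partition key 2013022516 (YYYYMMDDHH)."""
    # The last four path components are year/month/day/hour
    year, month, day, hour = path.rstrip("/").split("/")[-4:]
    return int(year + month + day + hour)

print(datehour_from_path("/user/flume/tweets/2013/02/25/16"))  # 2013022516
```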
After the query executes, the table ‘tweets’ will be available.
Beeswax can access the Hive metastore and its list of tables. Descriptions of their schemas and partitions, along with sample data from each table, are helpful while designing your queries. Moreover, a wizard can guide you step-by-step through creating new tables.
Analysis with Beeswax
It now becomes possible to perform SELECT queries on the data. Here is an example; most of the interesting ones are described in Parts 1 and 3 of the “Analyzing Twitter Data with Hadoop” series.
SELECT
  t.retweeted_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name AS retweeted_screen_name,
        retweeted_status.text,
        max(retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
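To see what the query computes, the same two-level aggregation can be mimicked in plain Python over a small hypothetical sample (the names mirror the query’s aliases; this is an illustration of the logic, not how Hive executes it):

```python
from collections import defaultdict

tweets = [  # hypothetical rows: who was retweeted, the text, and a retweet_count
    {"screen_name": "alice", "text": "hello hadoop", "retweet_count": 5},
    {"screen_name": "alice", "text": "hello hadoop", "retweet_count": 9},
    {"screen_name": "alice", "text": "hue rocks", "retweet_count": 3},
    {"screen_name": "bob", "text": "flume tips", "retweet_count": 7},
]

# Inner query: GROUP BY (screen_name, text), keeping MAX(retweet_count),
# so each distinct retweeted tweet counts its final retweet total once.
max_per_tweet = defaultdict(int)
for t in tweets:
    key = (t["screen_name"], t["text"])
    max_per_tweet[key] = max(max_per_tweet[key], t["retweet_count"])

# Outer query: GROUP BY screen_name, summing those maxima and counting
# the distinct retweeted tweets per user.
totals = defaultdict(lambda: [0, 0])  # screen_name -> [total_retweets, tweet_count]
for (name, _), retweets in max_per_tweet.items():
    totals[name][0] += retweets
    totals[name][1] += 1

# ORDER BY total_retweets DESC LIMIT 10
top = sorted(totals.items(), key=lambda kv: -kv[1][0])[:10]
print(top)  # [('alice', [12, 2]), ('bob', [7, 1])]
```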
Beeswax offers multiple features that provide a better user experience than the command-line shell. For example, you can save queries and share them with other users. The result of a query can be exported into a new table, saved as an HDFS file, or downloaded to your desktop. Some other good examples are:
- Ajax refresh of the logs
- Quick column navigation on the result page
- MapReduce jobs listing with a direct access to their logs
- ‘Email me on completion’ setting
- Multi-database support
Example of the screen while running a query:
Seeing the result of the query:
Note: if your queries are failing and you see an error like the one below, it means that you forgot to add the ‘/user/flume/hive-serdes-1.0-SNAPSHOT.jar’ to the query:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
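In Beeswax the jar is registered through “File Resources” as described earlier in this post; if you hit the same error on the Hive CLI, the session-level equivalent is an ADD JAR before the query (a sketch; the path is wherever you placed the jar, and the SELECT is just an illustrative query):

```sql
ADD JAR /path-to/hive-serdes-1.0-SNAPSHOT.jar;
-- re-run the failing query in the same session
SELECT text FROM tweets LIMIT 10;
```

Note that ADD JAR is per-session: a new shell or a restarted HiveServer needs the jar registered again.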
Conclusion
In this post we focused on how the Beeswax application can make it easy to execute Hive queries. New features such as multi-query (HUE-159), autocomplete, and syntax highlighting (HUE-1063) are going to improve the usability even more.
The next article in this series will elaborate on this topic and describe how Hue’s Apache Oozie application can be used for scheduling Hive queries in a few clicks.
Thank you for reading and feel free to post comments here or on the hue-user list. We also hope to see you at the first Hue meetup (this Wednesday, March 27)!
35 Comments
-
I hope you can outline the steps needed to recreate the Oozie workflows for the Cloudera tutorial from the Hue UI rather than from XML files.
-
Hey,
when I’m executing the CREATE TABLE in Hive, I always get a timeout. I did everything as described there.
How can I fix that? Thanks!!
-
Which Hadoop distribution are you using? How did you install Hive? Could you followup on http://groups.google.com/a/cloudera.org/group/hue-user with more details?
-
I’m using CDH 5 on the Cloudera Quick Start VM. So everything was preinstalled.
I was able to create the table now, but when I try to open it, it says: “Hive error”
Bad status for request TExecuteStatementReq(confOverlay={}, sessionHandle=TSessionHandle(sessionId=THandleIdentifier(secret='+"\xc0\nr\x0bD\xfe\x9d\x89pC\x1aI#\t', guid='\x1ep\x10\x03\xbe\xf2B\xca\xaarA\x90\xe2{\xdd\xb8')), runAsync=False, statement='DESCRIBE EXTENDED tweets'): TExecuteStatementResp(status=TStatus(errorCode=40000, errorMessage='Error while compiling statement: FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found)', sqlState='42000', infoMessages=None, statusCode=3), operationHandle=None)
-
‘ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found’ means the Serde was not added as a jar in the ‘Settings’ tab.
-
thanks for the reply.
It was added both when I created the table and when I tried to open it.
-
I got the exact same error.
I do have the hive-serdes-1.0-SNAPSHOT.jar as jar in the Settings tab, both when creating the table and when querying.
The create statement gives no error, but the query “select * from twitter_test;” returns:
Error while compiling statement: FAILED: RuntimeException org.apache.hadoop.hive.ql.metadata.HiveException: Failed with exception java.lang.ClassNotFoundException: com.cloudera.hive.serde.JSONSerDe
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.cloudera.hive.serde.JSONSerDe
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:68)
-
Apparently the JSON SerDe doesn’t allow for “SELECT *” queries.
It works when I query specific fields.
-
Nice article. I hope you have published the next article in this series about using Hue’s Apache Oozie application for scheduling Hive queries in a few clicks. Can you share the link? I was not able to locate it.
Thanks and keep up the great work.
-
Yes, different content but next step is http://gethue.com/hadoop-tutorials-ii-2-execute-hive-queries-and/ and the full season is http://gethue.com/hadoop-tutorials-ii-1-prepare-the-data-for-analysis/
-
Hi,
I have followed the above mentioned steps and still getting an error when I run the following query:
Select station from metar_data;
I get the following error:
FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found)
I am doing the following:
1. Using Oracle Big Data Lite VM
2. NOT USING the Flume component to get the Twitter data; since I already have data in JSON format, I am trying to query the JSON data through Hive.
2.1 Copied the file hive-serdes-1.0-SNAPSHOT.jar to the /usr/lib/hive/lib directory
3. Set the following env variables:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/hive-serdes-1.0-SNAPSHOT.jar:.
4. copied the json file to /user/oracle/moviework/applog_avro directory
5. used the following HQL to create the table:
create external table metar_data (
station string,
type string)
row format SERDE ‘com.cloudera.hive.serde.JSONSerDe’
location ‘/user/oracle/moviework/applog_avro/’;
6. Have successfully created the table by using the Hue interface
Please let me know what other parameters I need to set, where exactly the JAR file needs to be copied, etc…
-
Hi, I have copied the log for your reference. The log was extracted when the query was executed in HUE:
Query: select station from metar_data;
================ Log starts here
15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Completed phase 1 of Semantic Analysis
15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Get metadata for source tables
15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Get metadata for subqueries
15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Get metadata for destination tables
15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Completed getting MetaData in Semantic Analysis
15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Completed plan generation
15/02/03 00:32:17 INFO ql.Driver: Semantic Analysis Completed
15/02/03 00:32:17 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:station, type:string, comment:null)], properties:null)
15/02/03 00:32:17 INFO ql.Driver: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
15/02/03 00:32:17 INFO ql.Driver: Starting command: select station from metar_data
15/02/03 00:32:17 INFO ql.Driver: Total MapReduce jobs = 1
15/02/03 00:32:17 INFO ql.Driver: Launching Job 1 out of 1
15/02/03 00:32:17 INFO exec.Task: Number of reduce tasks is set to 0 since there’s no reduce operator
15/02/03 00:32:17 INFO mr.ExecDriver: Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
15/02/03 00:32:17 INFO mr.ExecDriver: adding libjars: file:///tmp/f3c130e6-ae2d-4ad1-89ee-844c4f77a6fa_resources/hive-serdes-1.0-SNAPSHOT.jar
15/02/03 00:32:17 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/02/03 00:32:18 INFO exec.Task: Starting Job = job_1422925424797_0003, Tracking URL = http://bigdatalite.localdomain:8088/proxy/application_1422925424797_0003/
15/02/03 00:32:18 INFO exec.Task: Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1422925424797_0003
15/02/03 00:32:30 INFO exec.Task: Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
15/02/03 00:32:30 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
15/02/03 00:32:30 INFO exec.Task: 2015-02-03 00:32:30,759 Stage-1 map = 0%, reduce = 0%
15/02/03 00:33:07 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
15/02/03 00:33:07 INFO exec.Task: 2015-02-03 00:33:07,309 Stage-1 map = 100%, reduce = 0%
15/02/03 00:33:08 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
15/02/03 00:33:08 ERROR exec.Task: Ended Job = job_1422925424797_0003 with errors
15/02/03 00:33:08 ERROR ql.Driver: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
15/02/03 00:33:08 INFO ql.Driver: MapReduce Jobs Launched:
15/02/03 00:33:08 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
15/02/03 00:33:08 INFO ql.Driver: Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
15/02/03 00:33:08 INFO ql.Driver: Total MapReduce CPU Time Spent: 0 msec
15/02/03 00:33:08 ERROR operation.Operation: Error:
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:146)
at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:64)
at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:177)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
================ Log Ends here
Thanks
Shekhar
-
Hi, just to update you: I am able to query the table successfully from Hue now; the problem was a mismatch in a column name.
But now the problem I am facing is that I am unable to query the same table metar_data from the hive command prompt. I get the following error:
FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found)
So,
1. How / where do I mention the location of this hive-serdes-1.0-SNAPSHOT.jar file?
2. Which environment variable needs to be changed?
I really appreciate your help.
Thank you,
Shekhar
-
In the ‘Settings’ tab on the top left, you just need to add it as a ‘jar’ ‘File Resource’ one time
-
I have already done that. That’s how I was able to create the external table. Once the external table is created through Hue, if I query the same table in hive> (the Hive CLI), I get the error mentioned in my previous post.
FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found)
That’s why I asked: please let me know if there is anything else that needs to be set. Any env variable, path, etc.?
Shekhar
-
If you use the CLI shell, you will need to register the jar there too, e.g. ADD JAR /path-to/hive-json-serde.jar;
-
hi! I’m using Hive in a Java project:
private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
private static final String SQL_STATEMENT = "select text from tweets1";
public static void main(String[] args) throws SQLException, ClassNotFoundException {
Class.forName(driverName);
System.out.println("driverName1");
Connection con = DriverManager.getConnection("jdbc:hive://localhost:10001/default", "", "");
I added hive-json-serde.jar to my referenced libraries in Eclipse, but I get this error:
Exception in thread "main" java.sql.SQLException: Query returned non-zero code: 40000, cause: FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found)
at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:189)
at hive.HiveQLWhere.main(HiveQLWhere.java:19)
I really need your help.
-
Did you register the serder like explained in the blog post? (add the jar as a resource in the setting tab)
-
Thanks for the reply. I solved the problem, but it takes a lot of time to get data: I tested the program with 1,500 tweets and it takes 20 minutes or more.
-
We have an upcoming post about collecting tweets in one click at 100 tweets/s 😉
-
I’m not talking about collecting tweets; I already have them, and I wanted to select them via the query “select * from tweets”
-
So it takes 20 minutes to get the results back from Hive? Just from Hue, or from the command line too?
-
Yep, exactly! From both the command line and Hue; the system takes a long time to respond.
-
hello
I have a new problem: when I want to select data from a table with Hue, the system shows me a red error:
“Your query has the following error(s):
Could not connect to quickstart.cloudera:10000”
and it worked before today.
Please, I need your help.
-
Please restart the Hive Service in Cloudera manager (quickstart.cloudera:7180)
-
It works, thank you Hue team!!
-
Hi,
I am trying a Hive external table on CDH 5 and am only able to retrieve null values using com.cloudera.hive.serde.JSONSerDe
Data sample:
{"Foo":"ABC","Bar":"20090101100000","Quux":{"QuuxId":1234,"QuuxName":"Sam"}}
Hive external table:
create external table jexample(
Foo string,
Bar string,
Quux struct<QuuxId:int, QuuxName:string>
)
Row format serde ‘com.cloudera.hive.serde.JSONSerDe’
LOCATION ‘/user/root/data’;
Can anyone help me resolve this issue?
-
Team,
Curious to know if anyone has succeeded in executing the query?
I added the mentioned jar in File Resources as a jar, and I am still getting the error for any query.
Query:
SELECT
t.retweeted_screen_name,
sum(retweets) AS total_retweets,
count(*) AS tweet_count
FROM (SELECT
retweeted_status.user.screen_name as retweeted_screen_name,
retweeted_status.text,
max(retweet_count) as retweets
FROM tweets
GROUP BY retweeted_status.user.screen_name,
retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
Your query has the following error(s):
Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
INFO : Number of reduce tasks not specified. Estimated from input data size: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO : number of splits:1
INFO : Submitting tokens for job: job_1432914212475_0002
INFO : The url to track the job: http://quickstart.cloudera:8088/proxy/application_1432914212475_0002/
INFO : Starting Job = job_1432914212475_0002, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1432914212475_0002/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1432914212475_0002
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2015-05-29 10:20:59,400 Stage-1 map = 0%, reduce = 0%
INFO : 2015-05-29 10:21:35,687 Stage-1 map = 100%, reduce = 100%
ERROR : Ended Job = job_1432914212475_0002 with errors
-
Did you check the MapReduce logs of the failed tasks to get the real cause of the problem?
-
I added the hive-serdes-1.0-SNAPSHOT.jar to the Hive shell, but I am still getting FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask. Please let me know how to fix the error.
Thanks
-
Could you look at the errors in the MapReduce job logs?
When you added the serde, did you just do a ‘show tables’ the first time?
-
Hello,
I added the hive-serdes-1.0-SNAPSHOT.jar to the Hive lib folder, but I am still getting:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe
How do I fix this?
-
Author
Did you do the ‘ADD JAR …jar’?
-
Hey, I am trying this but I’m getting this error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: com.cloudera.hive.SERDE.JSONSerDe
I have successfully added both the hive-serdes-1.0-SNAPSHOT and the json-serde-1.3.7 jar files:
hive> ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
Added [/usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar] to class path
Added resources: [/usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar]
hive> ADD JAR /usr/lib/hive/lib/json-serde-1.3.7-jar-with-dependencies.jar;
Added [/usr/lib/hive/lib/json-serde-1.3.7-jar-with-dependencies.jar] to class path
Added resources: [/usr/lib/hive/lib/json-serde-1.3.7-jar-with-dependencies.jar]
Currently I have the json-serde-1.3.7 jar in /lib/hive/; should it go somewhere else?
-
Author
Could it be that you need more dependencies?
e.g. https://www.google.com/search?q=Cannot+validate+serde%3A+com.cloudera.hive.SERDE.JSONSerDe&oq=Cannot+validate+serde%3A+com.cloudera.hive.SERDE.JSONSerDe&aqs=chrome..69i57.147j0j7&sourceid=chrome&ie=UTF-8
-