How-to: Analyze Twitter Data with Hue

Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser, without having to go to a command-line interface. It features applications such as an Apache Hive editor and an Apache Oozie dashboard and workflow builder.

This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples from the previous series have been updated to the recent CDH4.2 release.

Collecting Data

The first step is to create the “flume” user and its home directory on HDFS, where the data will be stored. This can be done via the User Admin application.
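If you prefer the command line over the User Admin application, the setup boils down to a home directory owned by the flume user. A rough CLI sketch (assumes shell access to the cluster as an HDFS superuser; not needed if you use User Admin):

```shell
# Create the flume user's home directory on HDFS and hand over ownership
hadoop fs -mkdir /user/flume
hadoop fs -chown -R flume:flume /user/flume
```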

The second step consists of collecting some tweet data from the live Twitter stream.

Apache Flume is an elegant solution for taking care of this. The configuration of Flume is detailed in the readme and in the previous blog post. However, if you want to skip this step, some data is available on GitHub: just upload it as a zip file into the home directory of the flume user, and the “tweets” directory will show up after a few seconds.
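For orientation, the sample app's flume.conf follows the general shape below. This is an abridged sketch only; the authoritative version lives in the readme linked above, and the OAuth values are placeholders you must fill in with your own Twitter API credentials:

```properties
# Agent with a custom Twitter source, a memory channel, and an HDFS sink
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, hive, hue

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
# The %Y/%m/%d/%H escapes produce the tweets/<year>/<month>/<day>/<hour> layout
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://<namenode>:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
```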

If you are not taking this shortcut, create the tweets directory in the File Browser with the New Folder action.

Then, when the Flume agent is started, the data will start appearing:

Clicking on a file will display its content in the built-in viewer:

Preparing Hive

It is time to prepare the analysis of the tweet data. We’ll use Apache Hive, which can query the data with a SQL-like syntax in a scalable way. The Hive setup is detailed in the readme.

When Hive is ready, the tweets table can be created in the query editor of Beeswax. Notice that the Hive SerDe (available to download or compile here) must be included as a jar in the query. You can read more about Hive SerDes in this previous post.

To do this, click “Add” > “File Resources”, click the path chooser button, click the “Home” button, and upload hive-serdes-1.0-SNAPSHOT.jar.

Then just enter the CREATE TABLE statement and execute it:

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweet_count INT,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING,name:STRING>>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
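If you work from the Hive CLI instead of Beeswax, the SerDe has to be registered there too; the equivalent of the “File Resources” step is an ADD JAR from a local path (a sketch; adjust the path to wherever you placed the jar):

```sql
ADD JAR /path/to/hive-serdes-1.0-SNAPSHOT.jar;
```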

Now that the table is created, let’s insert some data into it. First, select the table in the “Table” tab and click “Import data”. Enter the path "/user/flume/tweets/2013/02/25/17" and "2013022517" as the key:
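The partition key is simply the path’s date components concatenated into a datehour value. A quick sanity check with GNU date (assumes GNU coreutils):

```shell
# /user/flume/tweets/2013/02/25/17  ->  datehour 2013022517
date -d "2013-02-25 17:00" +%Y%m%d%H   # prints 2013022517
```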

Depending on the partition picked, a query similar to this will be generated:

LOAD DATA INPATH '/user/flume/tweets/2013/02/25/16'
INTO TABLE `default.tweets`
PARTITION (datehour='2013022516')
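Once a partition is loaded, you can restrict queries to it via the partition column, which avoids scanning the whole tweets directory (a sketch using the example partition above):

```sql
SELECT COUNT(*) FROM tweets WHERE datehour = 2013022516;
```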

After the query executes, the data will be available in the 'tweets' table.

Beeswax can access the Hive metastore and its list of tables. Descriptions of their schemas and partitions, with some sample data from each table, are helpful while designing your queries. Moreover, a wizard can guide you step by step through creating new tables.

Analysis with Beeswax

It is now possible to perform some SELECT queries on the data. Here is an example below, but most of the interesting ones are described in Parts 1 and 3 of the “Analyzing Twitter Data with Hadoop” series.

SELECT
    t.retweeted_screen_name,
    sum(retweets) AS total_retweets,
    count(*) AS tweet_count
  FROM (SELECT
          retweeted_status.user.screen_name as retweeted_screen_name,
              retweeted_status.text,
              max(retweet_count) as retweets
        FROM tweets
        GROUP BY retweeted_status.user.screen_name,
                 retweeted_status.text) t
  GROUP BY t.retweeted_screen_name
  ORDER BY total_retweets DESC
  LIMIT 10;
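Another query in the same spirit, adapted from that series, counts the most popular hashtags by exploding the entities.hashtags array from the CREATE TABLE above (a sketch; table and column names as defined earlier in this post):

```sql
SELECT
  LOWER(hashtag.text) AS tag,
  COUNT(*) AS total
FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) hashtag_table AS hashtag
GROUP BY LOWER(hashtag.text)
ORDER BY total DESC
LIMIT 15;
```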

Beeswax offers multiple features that provide a better user experience than the command-line shell. For example, you can save queries and share them with other users. The result of a query can be exported into a new table or an HDFS file, or downloaded to your desktop. Some other good examples are:

  • Ajax refresh of the logs
  • Quick column navigation on the result page
  • MapReduce jobs listing with a direct access to their logs
  • ‘Email me on completion’ setting
  • Multi-database support

Example of the screen while running a query:

Seeing the result of the query:

Note: if your queries are failing and you are seeing an error like the one below, it means that you forgot to add '/user/flume/hive-serdes-1.0-SNAPSHOT.jar' to the query:

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

Conclusion

In this post we focused on how the Beeswax application can make it easy to execute Hive queries. New features such as multi-query (HUE-159), autocomplete, and syntax highlighting (HUE-1063) are going to improve the usability even more.

The next article in this series will elaborate on this topic and describe how Hue’s Apache Oozie application can be used for scheduling Hive queries in a few clicks.

Thank you for reading and feel free to post comments here or on the hue-user list. We also hope to see you at the first Hue meetup (this Wednesday, March 27)!

35 Comments

  1. vivek 3 years ago

    Hope you can outline the steps needed to recreate the Oozie workflows for the Cloudera tutorial from the Hue UI rather than from XML files.

  2. Ben 3 years ago

    Hey,

    when I’m executing the CREATE TABLE in Hive, I always get a timeout. I did everything as it’s said there.
    How can I fix that? Thanks!!

  3. Hue Team 3 years ago

    Which Hadoop distribution are you using? How did you install Hive? Could you followup on http://groups.google.com/a/cloudera.org/group/hue-user with more details?

  4. Ben 3 years ago

    I’m using CDH 5 on the Cloudera Quick Start VM, so everything was preinstalled.
    I was able to create the table now, but when I try to open it, it says:

    Hive error

    Bad status for request TExecuteStatementReq(confOverlay={}, sessionHandle=TSessionHandle(sessionId=THandleIdentifier(secret='+"\xc0\nr\x0bD\xfe\x9d\x89pC\x1aI#\t', guid='\x1ep\x10\x03\xbe\xf2B\xca\xaarA\x90\xe2{\xdd\xb8')), runAsync=False, statement='DESCRIBE EXTENDED tweets'): TExecuteStatementResp(status=TStatus(errorCode=40000, errorMessage='Error while compiling statement: FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found)', sqlState='42000', infoMessages=None, statusCode=3), operationHandle=None)

  5. Hue Team 3 years ago

    'ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found' means the SerDe was not added as a jar in the 'Settings' tab.

    • Ben 3 years ago

      thanks for the reply.
      it was added both when I created the table and when I tried to open it..

      • Stefano 2 years ago

        I got the exact same error.
        I do have hive-serdes-1.0-SNAPSHOT.jar as a jar in the Settings tab, both when creating the table and when querying.
        The create statement gives no error, but the query "select * from twitter_test;" returns:

        Error while compiling statement: FAILED: RuntimeException org.apache.hadoop.hive.ql.metadata.HiveException: Failed with exception java.lang.ClassNotFoundException: com.cloudera.hive.serde.JSONSerDe
        java.lang.RuntimeException: java.lang.ClassNotFoundException: com.cloudera.hive.serde.JSONSerDe at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:68)

        • Stefano 2 years ago

          Apparently the JSON SerDe doesn't allow "SELECT *" queries.
          It works when I query specific fields.

  6. kkraman 2 years ago

    Nice article. I hope you have published the next article in this series, about using Hue's Apache Oozie application for scheduling Hive queries in a few clicks. Can you share the link? I was not able to locate it.

    Thanks and keep up the great work.

  7. Shekhar 2 years ago

    Hi,
    I have followed the above mentioned steps and still getting an error when I run the following query:
    Select station from metar_data;

    I get the following error:
    FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found)

    I am doing the following:
    1. Using Oracle Big Data Lite VM
    2. NOT USING the flume component to get the twitter data. But since I already have data in JSON format, I am trying to query the JSON data through Hive

    2.1 Copied the file hive-serdes-1.0-SNAPSHOT.jar to /usr/lib/hive/lib directory
    3. Set the following env variables:
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*
    export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/hive-serdes-1.0-SNAPSHOT.jar:.
    4. copied the json file to /user/oracle/moviework/applog_avro directory
    5. used the following HQL to create the table:
    create external table metar_data (
    station string,
    type string)
    row format SERDE 'com.cloudera.hive.serde.JSONSerDe'
    location '/user/oracle/moviework/applog_avro/';

    6. Have successfully created the table by using the HUE interface

    I request you to let me know what other parameters I need to set, where exactly the JAR file needs to be copied, etc.

  8. Shekhar 2 years ago

    Hi, I have copied the log for your reference. The log was extracted when the query was executed in HUE:

    Query: select station from metar_data;
    ================ Log starts here
    15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
    15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Completed phase 1 of Semantic Analysis
    15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Get metadata for source tables
    15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Get metadata for subqueries
    15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Get metadata for destination tables
    15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Completed getting MetaData in Semantic Analysis
    15/02/03 00:32:17 INFO parse.SemanticAnalyzer: Completed plan generation
    15/02/03 00:32:17 INFO ql.Driver: Semantic Analysis Completed
    15/02/03 00:32:17 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:station, type:string, comment:null)], properties:null)
    15/02/03 00:32:17 INFO ql.Driver: Creating lock manager of type org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager
    15/02/03 00:32:17 INFO ql.Driver: Starting command: select station from metar_data
    15/02/03 00:32:17 INFO ql.Driver: Total MapReduce jobs = 1
    15/02/03 00:32:17 INFO ql.Driver: Launching Job 1 out of 1
    15/02/03 00:32:17 INFO exec.Task: Number of reduce tasks is set to 0 since there’s no reduce operator
    15/02/03 00:32:17 INFO mr.ExecDriver: Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
    15/02/03 00:32:17 INFO mr.ExecDriver: adding libjars: file:///tmp/f3c130e6-ae2d-4ad1-89ee-844c4f77a6fa_resources/hive-serdes-1.0-SNAPSHOT.jar
    15/02/03 00:32:17 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    15/02/03 00:32:18 INFO exec.Task: Starting Job = job_1422925424797_0003, Tracking URL = http://bigdatalite.localdomain:8088/proxy/application_1422925424797_0003/
    15/02/03 00:32:18 INFO exec.Task: Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1422925424797_0003
    15/02/03 00:32:30 INFO exec.Task: Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    15/02/03 00:32:30 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
    15/02/03 00:32:30 INFO exec.Task: 2015-02-03 00:32:30,759 Stage-1 map = 0%, reduce = 0%
    15/02/03 00:33:07 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
    15/02/03 00:33:07 INFO exec.Task: 2015-02-03 00:33:07,309 Stage-1 map = 100%, reduce = 0%
    15/02/03 00:33:08 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
    15/02/03 00:33:08 ERROR exec.Task: Ended Job = job_1422925424797_0003 with errors
    15/02/03 00:33:08 ERROR ql.Driver: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
    15/02/03 00:33:08 INFO ql.Driver: MapReduce Jobs Launched:
    15/02/03 00:33:08 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
    15/02/03 00:33:08 INFO ql.Driver: Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
    15/02/03 00:33:08 INFO ql.Driver: Total MapReduce CPU Time Spent: 0 msec
    15/02/03 00:33:08 ERROR operation.Operation: Error:
    org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
    at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:146)
    at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:64)
    at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:177)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

    ================ Log Ends here

    Thanks
    Shekhar

  9. Shekhar 2 years ago

    Hi, just to update you: I am able to query the table successfully from HUE now. The problem was a mismatch in a column name.

    But now the problem I am facing is, I am unable to query the same table metar_data from hive command prompt. I get the following error:

    FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found)

    So,
    1. how / where do I specify the location of this hive-serdes-1.0-SNAPSHOT.jar file?
    2. Which environment variable needs to be changed?

    I really appreciate your help.

    Thank you,

    Shekhar

    • Hue Team 2 years ago

      In the 'Settings' tab on the top left, you just need to add it as a 'jar' 'File Resource' one time

  10. Shekhar 2 years ago

    I have already done that; that's how I was able to create the external table. Once the external table is created through HUE, if I query the same table in hive> (the Hive CLI), I get the error mentioned in my previous post.

    FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found)

    That's the reason I requested you to please let me know if there is anything that needs to be set. Any env variable, path, etc.?

    Shekhar

    • Hue Team 2 years ago

      If you use the CLI shell, you will need to register the jar there too, e.g. ADD JAR /path-to/hive-json-serde.jar;

  11. IMANE 2 years ago

    hi! I'm using Hive in a Java project:

    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
    private static final String SQL_STATEMENT = "select text from tweets1";
    public static void main(String[] args) throws SQLException, ClassNotFoundException {
    Class.forName(driverName);
    System.out.println("driverName1");
    Connection con = DriverManager.getConnection("jdbc:hive://localhost:10001/default", "", "");

    I added hive-json-serde.jar to my referenced libraries in Eclipse but I have this error:

    Exception in thread "main" java.sql.SQLException: Query returned non-zero code: 40000, cause: FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class com.cloudera.hive.serde.JSONSerDe not found)
    at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:189)
    at hive.HiveQLWhere.main(HiveQLWhere.java:19)
    I really need your help

    • Hue Team 2 years ago

      Did you register the serder like explained in the blog post? (add the jar as a resource in the setting tab)

  12. Imane 2 years ago

    thanks for the reply. I solved the problem, but it takes a lot of time to get the data: I tested the program with 1500 tweets and it takes 20 minutes or more..

    • Hue Team 2 years ago

      We have an upcoming post about collecting tweets in one click at 100 tweets/s 😉

  13. IMANE 2 years ago

    I'm not talking about collecting tweets, I already have them; I wanted to select them via the query "select * from tweets"

    • Hue Team 2 years ago

      so it takes 20 minutes to get the results back from Hive? just from Hue, or from the command line too?

  14. IMANE 2 years ago

    Yep, exactly! From both the command line and Hue, the system takes a lot of time to respond..

  15. IMANE 2 years ago

    hello
    I have a new problem: when I want to select data from a table with Hue, the system shows me a red error:
    "Your query has the following error(s):
    Could not connect to quickstart.cloudera:10000"
    and it worked before today.
    plz I need your help

    • Hue Team 2 years ago

      Please restart the Hive Service in Cloudera manager (quickstart.cloudera:7180)

  16. Imane 2 years ago

    it works , thank you Hue team !!

  17. Latha 2 years ago

    Hi,

    I am trying a Hive external table on CDH5 and I am only able to retrieve null values using com.cloudera.hive.serde.JSONSerDe

    Data sample:
    {"Foo":"ABC","Bar":"20090101100000","Quux":{"QuuxId":1234,"QuuxName":"Sam"}}
    Hive external table :

    create external table jexample(
    Foo string,
    Bar string,
    Quux struct
    )
    Row format serde 'com.cloudera.hive.serde.JSONSerDe'
    LOCATION '/user/root/data';

    Can anyone help me to resolve this issue

  18. Hari Prasad 2 years ago

    Team,

    Curious to know if anyone succeeded in executing the query?

    I added the mentioned jar in File Resources as a jar, and I still get the error for any query.

    Query:

    SELECT
    t.retweeted_screen_name,
    sum(retweets) AS total_retweets,
    count(*) AS tweet_count
    FROM (SELECT
    retweeted_status.user.screen_name as retweeted_screen_name,
    retweeted_status.text,
    max(retweet_count) as retweets
    FROM tweets
    GROUP BY retweeted_status.user.screen_name,
    retweeted_status.text) t
    GROUP BY t.retweeted_screen_name
    ORDER BY total_retweets DESC
    LIMIT 10;

    Your query has the following error(s):

    Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

    INFO : Number of reduce tasks not specified. Estimated from input data size: 1
    INFO : In order to change the average load for a reducer (in bytes):
    INFO : set hive.exec.reducers.bytes.per.reducer=<number>
    INFO : In order to limit the maximum number of reducers:
    INFO : set hive.exec.reducers.max=<number>
    INFO : In order to set a constant number of reducers:
    INFO : set mapreduce.job.reduces=<number>
    WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    INFO : number of splits:1
    INFO : Submitting tokens for job: job_1432914212475_0002
    INFO : The url to track the job: http://quickstart.cloudera:8088/proxy/application_1432914212475_0002/
    INFO : Starting Job = job_1432914212475_0002, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1432914212475_0002/
    INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1432914212475_0002
    INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    INFO : 2015-05-29 10:20:59,400 Stage-1 map = 0%, reduce = 0%
    INFO : 2015-05-29 10:21:35,687 Stage-1 map = 100%, reduce = 100%
    ERROR : Ended Job = job_1432914212475_0002 with errors

    • Hue Team 2 years ago

      Did you check the MapReduce logs of the failed tasks to get the real cause of the problem?

  19. Sunil 2 years ago

    i added the hive-serdes-1.0-SNAPSHOT.jar to the hive shell, and still i am getting FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask. Please let me know how to fix the error.
    Thanks

    • Hue Team 2 years ago

      Could you look at the errors in the MapReduce job logs?
      When you added the SerDe, did you just do a 'show tables' the first time?

  20. Ayushi 6 months ago

    Hello,
    I added the hive-serdes-1.0-SNAPSHOT.jar to the hive lib folder but i am still getting:
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe
    How do I fix this?

    • Hue Team 6 months ago

      Did you do the 'ADD JAR …jar'?

  21. mike 5 months ago

    Hey, I am trying this but I'm getting this error:

    Error while processing statement: FAILED: Execution Error, return code 1
    from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde:
    com.cloudera.hive.SERDE.JSONSerDe

    I have successfully added both the hive-serdes-1.0-SNAPSHOT and the json-serde-1.3.7 jar files:

    hive> ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
    Added [/usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar] to class path
    Added resources: [/usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar]

    hive> ADD JAR /usr/lib/hive/lib/json-serde-1.3.7-jar-with-dependencies.jar;
    Added [/usr/lib/hive/lib/json-serde-1.3.7-jar-with-dependencies.jar] to class path
    Added resources: [/usr/lib/hive/lib/json-serde-1.3.7-jar-with-dependencies.jar]

    Currently i have the json-serde-1.3.7 jar in /lib/hive/; should it go somewhere else?
