Better file formats for Impala and quick SQL in Hadoop

Impala File Formats

Using the best file format is crucial for getting great performance. This is one reason why JSON is not supported in the Impala application: parsing the whole text record even to retrieve a single field would badly hurt performance. Impala recommends a series of alternative formats.

 

We show here how to create a Hive table in the Avro format from JSON data, and a table in the new Parquet format. We are using the same Yelp data as in Episode 2 of Season 2 of the Hadoop Tutorial series.

 

Avro

The first step is to convert our JSON data into Avro with the help of a Pig script. Open up the Pig Editor and run:

 

REGISTER piggybank.jar

data = load '/user/hive/warehouse/review/yelp_academic_dataset_review_clean.json'
    AS (funny:INT, useful:INT, cool:INT, user_id:CHARARRAY, review_id:CHARARRAY, text:CHARARRAY, business_id:CHARARRAY, stars:INT, date:CHARARRAY, type:CHARARRAY);

data_clean = FILTER data BY business_id IS NOT NULL AND text IS NOT NULL;    

STORE data_clean INTO 'impala/reviews_avro'
USING org.apache.pig.piggybank.storage.avro.AvroStorage(
'{
"schema": {
  "name": "review",
  "type": "record",
  "fields": [
     {"name":"funny", "type":"int"},
     {"name":"useful", "type":"int"},
     {"name":"cool", "type":"int"},
     {"name":"user_id", "type":"string"},
     {"name":"review_id", "type":"string"},
     {"name":"text", "type":"string"},
     {"name":"business_id", "type":"string"},
     {"name":"stars", "type":"int"},
     {"name":"date", "type":"string"},
     {"name":"type", "type":"string"}
  ]}
}');

 

Then, in the Hive Editor create the table with:

 

CREATE TABLE review_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/romain/impala/reviews_avro'
TBLPROPERTIES ('avro.schema.literal'='{
  "name": "review",
  "type": "record",
  "fields": [
     {"name":"business_id", "type":"string"},
     {"name":"cool", "type":"int"},
     {"name":"date", "type":"string"},
     {"name":"funny", "type":"int"},
     {"name":"review_id", "type":"string"},
     {"name":"stars", "type":"int"},
     {"name":"text", "type":"string"},
     {"name":"type", "type":"string"},
     {"name":"useful", "type":"int"},
     {"name":"user_id", "type":"string"}]}'
);

 

You can now go back to Impala and use the table, after refreshing the metadata with:

 

REFRESH review_avro
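
 

Once the metadata is refreshed, a quick aggregation is an easy way to check that Impala can read the Avro files. This is just an illustrative query over the schema defined above, not part of the original tutorial:

```sql
-- Sanity check: count reviews per star rating from the Avro-backed table
SELECT stars, COUNT(*) AS num_reviews
FROM review_avro
GROUP BY stars
ORDER BY stars;
```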

 

Parquet

Parquet is a new column-oriented binary file format, particularly efficient in Impala. Here is how to create a table from the Impala app:

 

CREATE TABLE review_parquet LIKE review STORED AS PARQUETFILE;

 

And then load data:

 

INSERT OVERWRITE review_parquet SELECT * FROM review;

 

Take the time to read about the goals of each format and how to enable compression. If you want to know more, the Impala performance tuning guide is a good reference too.
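
 

As a quick sketch of what enabling compression can look like, Snappy compression for Parquet inserts is typically turned on with a query option before loading the data. Note that the exact option name depends on your Impala version (older releases used PARQUET_COMPRESSION_CODEC), so check the documentation for your release:

```sql
-- Assumes the COMPRESSION_CODEC query option (PARQUET_COMPRESSION_CODEC in older Impala)
SET COMPRESSION_CODEC=snappy;
INSERT OVERWRITE review_parquet SELECT * FROM review;
```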

 

As usual feel free to comment on the hue-user list or @gethue!

 
