Better file formats for Impala and quick SQL in Hadoop

23 October 2013 in Querying - 2 minute read

Impala File Formats

Using the best file format is crucial for getting good performance. This is one reason why JSON is not supported in the Impala application: parsing the whole text record to retrieve even a single field hurts performance badly. The Impala documentation recommends a series of alternative formats.
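
To see why this matters, consider a query that touches only one field. With a text format like JSON, every record still has to be read and parsed in full; a columnar format can answer from a single column. A minimal illustration, assuming the review table used later in this post:

-- Only the stars column is needed, yet a JSON/text table must
-- scan and parse every complete record to compute this
SELECT AVG(stars) FROM review;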

 

We show here how to create a Hive table in the Avro format from JSON data, and a table in the new Parquet format. We are using the same Yelp data as in Episode 2 of Season 2 of the Hadoop Tutorial series.

 

Avro

The first step is to convert our JSON data into Avro with the help of a Pig script. Open up the Pig Editor and run:

 

REGISTER piggybank.jar

data = LOAD '/user/hive/warehouse/review/yelp_academic_dataset_review_clean.json'
    AS (funny:INT, useful:INT, cool:INT, user_id:CHARARRAY, review_id:CHARARRAY, text:CHARARRAY, business_id:CHARARRAY, stars:INT, date:CHARARRAY, type:CHARARRAY);

data_clean = FILTER data BY business_id IS NOT NULL AND text IS NOT NULL;

STORE data_clean INTO 'impala/reviews_avro'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
    '{
       "schema": {
         "name": "review",
         "type": "record",
         "fields": [
           {"name":"funny", "type":"int"},
           {"name":"useful", "type":"int"},
           {"name":"cool", "type":"int"},
           {"name":"user_id", "type":"string"},
           {"name":"review_id", "type":"string"},
           {"name":"text", "type":"string"},
           {"name":"business_id", "type":"string"},
           {"name":"stars", "type":"int"},
           {"name":"date", "type":"string"},
           {"name":"type", "type":"string"}
         ]}
    }');

 

Then, in the Hive Editor, create the table with the statement below. Note that the relative STORE path above resolves under the running user's HDFS home directory, hence the /user/romain/... LOCATION:

 

CREATE TABLE review_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/romain/impala/reviews_avro'
TBLPROPERTIES ('avro.schema.literal'='{
  "name": "review",
  "type": "record",
  "fields": [
    {"name":"business_id", "type":"string"},
    {"name":"cool", "type":"int"},
    {"name":"date", "type":"string"},
    {"name":"funny", "type":"int"},
    {"name":"review_id", "type":"string"},
    {"name":"stars", "type":"int"},
    {"name":"text", "type":"string"},
    {"name":"type", "type":"string"},
    {"name":"useful", "type":"int"},
    {"name":"user_id", "type":"string"}]}'
);
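
As a quick sanity check (not part of the original steps), you can verify in the Hive Editor that the columns were picked up from the Avro schema literal:

-- The column list is derived from the 'avro.schema.literal' property
DESCRIBE review_avro;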

 

You can now go back to Impala and query the table, after refreshing the metadata with:

 

REFRESH review_avro;
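
The Avro table can then be queried like any other. An illustrative aggregation:

-- Distribution of ratings, served directly from the Avro files
SELECT stars, COUNT(*) FROM review_avro GROUP BY stars;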

 

Parquet

Parquet is a new column-oriented binary file format that is particularly efficient for Impala. Here is how to create a Parquet table from the Impala app:

 

CREATE TABLE review_parquet LIKE review STORED AS PARQUETFILE;

 

And then load data:

 

INSERT OVERWRITE review_parquet SELECT * FROM review;
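
A quick way to check that the load went through (illustrative only):

-- Row counts should match between the source and Parquet tables
SELECT COUNT(*) FROM review_parquet;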

 

Take the time to read about the goals of each format and how to enable compression. If you want to know more, the Impala performance tuning guide is a good reference too.
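
As a sketch of the compression part: Snappy compression for Parquet can be enabled per session from impala-shell before running the INSERT. The query option is assumed here to be PARQUET_COMPRESSION_CODEC, the name used by Impala releases of this era; later versions renamed it to COMPRESSION_CODEC:

-- Assumed option name for this Impala version; newer releases use COMPRESSION_CODEC
SET PARQUET_COMPRESSION_CODEC=snappy;
INSERT OVERWRITE review_parquet SELECT * FROM review;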

 

As usual feel free to comment on the hue-user list or @gethue!

 

