Better file formats for Impala and quick SQL in Hadoop

Published on 23 October 2013 - 2 minutes read - Last modified on 06 March 2021

Impala File Formats

Using the best file format is crucial for getting great performance. This is one reason why JSON is not supported in the Impala application: parsing or retrieving the whole text record, even for a single field, would badly damage performance. Impala recommends a series of alternative formats instead.

We show here how to create a Hive table in Avro format containing JSON data, and a table in the new Parquet format. We are using the same Yelp data as in Episode 2 of Season 2 of the Hadoop Tutorial series.

Avro

The first step is to convert our data to Avro with the help of a Pig script. Open up the Pig Editor and run:

REGISTER piggybank.jar;

-- Load the cleaned Yelp reviews and declare the field schema
data = LOAD '/user/hive/warehouse/review/yelp_academic_dataset_review_clean.json'
    AS (funny:INT, useful:INT, cool:INT, user_id:CHARARRAY, review_id:CHARARRAY, text:CHARARRAY, business_id:CHARARRAY, stars:INT, date:CHARARRAY, type:CHARARRAY);

-- Drop records missing the fields we care about
data_clean = FILTER data BY business_id IS NOT NULL AND text IS NOT NULL;

-- Write the records out as Avro, embedding the schema in the files
STORE data_clean INTO 'impala/reviews_avro'
USING org.apache.pig.piggybank.storage.avro.AvroStorage(
'{
  "schema": {
    "name": "review",
    "type": "record",
    "fields": [
      {"name":"funny", "type":"int"},
      {"name":"useful", "type":"int"},
      {"name":"cool", "type":"int"},
      {"name":"user_id", "type":"string"},
      {"name":"review_id", "type":"string"},
      {"name":"text", "type":"string"},
      {"name":"business_id", "type":"string"},
      {"name":"stars", "type":"int"},
      {"name":"date", "type":"string"},
      {"name":"type", "type":"string"}
    ]}
}');

 

Then, in the Hive Editor, create the table with:

 

CREATE TABLE review_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/romain/impala/reviews_avro'
TBLPROPERTIES ('avro.schema.literal'='{
  "name": "review",
  "type": "record",
  "fields": [
    {"name":"business_id", "type":"string"},
    {"name":"cool", "type":"int"},
    {"name":"date", "type":"string"},
    {"name":"funny", "type":"int"},
    {"name":"review_id", "type":"string"},
    {"name":"stars", "type":"int"},
    {"name":"text", "type":"string"},
    {"name":"type", "type":"string"},
    {"name":"useful", "type":"int"},
    {"name":"user_id", "type":"string"}]}'
);

 

You can now go back to Impala and use the table after refreshing the metadata with:

 

REFRESH review_avro;
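
For example, a quick aggregation over the new table (column names taken from the Avro schema above) should return the distribution of star ratings and confirms the table is queryable:

SELECT stars, COUNT(*) AS reviews
FROM review_avro
GROUP BY stars
ORDER BY stars;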

 

Parquet

Parquet is a new column-oriented binary file format that is particularly efficient in Impala, since queries can read only the columns they need. Here is how to create a table from the Impala app:

 

CREATE TABLE review_parquet LIKE review STORED AS PARQUETFILE;

 

And then load the data:

 

INSERT OVERWRITE review_parquet SELECT * FROM review;
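
Comparing row counts on the source and destination tables is a quick way to check that the copy completed:

SELECT COUNT(*) FROM review;
SELECT COUNT(*) FROM review_parquet;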

 

Take the time to read about the goals of each format and how to enable compression. If you want to know more, the Impala tuning guide is a good reference too.
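
For example, the Parquet compression codec can be changed before re-running the load. This is a minimal sketch assuming the Impala 1.x PARQUET_COMPRESSION_CODEC query option (renamed COMPRESSION_CODEC in later releases):

-- trade some CPU for smaller files by switching from the default Snappy to GZIP
SET PARQUET_COMPRESSION_CODEC=gzip;
INSERT OVERWRITE review_parquet SELECT * FROM review;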

 

As usual, feel free to comment on the hue-user list or @gethue!

 

