In the previous episodes, we presented how to schedule repetitive worflows on the grid with Oozie Coordinator. Let’s now look at a fast way to query some data with Impala.
Hue, the Hadoop UI, has been supporting Impala closely since its first version and brings fast interactive queries within your browser. If not familiar with Impala, we recommend you to check the documentation of the fastest SQL engine for Hadoop.
Most of Hive SQL is compatible with Impala and we are going to compare the queries of episode one in both Hive and Impala applications. Notice that this comparison is not 100% scientific but it demonstrates what would happen in common cases.
Using Impala through the Hue app is easier in many ways than using it through the command-line impala-shell. For example, table names, databases, columns, built-in functions are auto-completable and the syntax highlighting shows the potential typos in your queries. Multiple queries or a selected portion of a query can be executed from the editor. Parameterized queries are supported and the user will be prompted for values at submission time. Impala queries can be saved and shared between users or deleted and then restored from trash in case of mistakes.
Impala uses the same Metastore as Hive so you can browse tables with the Metastore app. You can also pick a database with a drop-down in the editor. After submission, progress and logs are reported and you can browse the result with infinite scroll or download the data with your browser.
Query speed comparison
Let’s start with the Hue examples as they are easily accessible. They are very small but show the lightning speed of Impala and the inefficiency of the series of MapReduce jobs created by Hive.
Make sure the Hive and Impala examples are installed in Hue and then in each app, go to ‘Saved Queries’, copy the query ‘Sample: Top salaries’ and submit it.
Then we are back to our Yelp data. Let’s take the query from episode one and execute it in both apps:
SELECT r.business_id, name, SUM(cool) AS coolness FROM review r JOIN business b ON (r.business_id = b.business_id) WHERE categories LIKE '%Restaurants%' AND \`date\` = '$date' GROUP BY r.business_id, name ORDER BY coolness DESC LIMIT 10
Again, you can see the benefits of Impala’s architecture and optimization.
This post described how Impala query execution makes data analysis interactive and more productive than Hive’s batch architecture. Results come back fast, and in our Yelp data case, instantaneously. Impala and Hue combined are a recipe for fast analytics. Moreover, Hue’s Python API can also be reused if you want to build your own client.