Querying & Exploring the Instacart dataset Part 1: Ingesting the data

Published on 22 March 2019 in Browsing / Querying / Version 4 - 2 minutes read - Last modified on 04 February 2020

Self-service exploratory analytics is one of the most common use cases for Hue users. In this tutorial, let's see how to get started with such an analysis. We will use the free Instacart dataset and start with the Importer feature.

Getting the data

This step was made particularly easy by Instacart. Just go to their dataset page of 3 million orders and download the 200 MB archive.

Making it queryable

The next step is not always trivial. In our case, there is no data team adding the dataset to the Data Catalog for us, but fortunately we can use the Data Importer of Hue.

Upload to the object store

First upload the dataset to the cluster. This is easy via the File Browser.

Then, the next step is to uncompress the archive. This can also be done in two clicks via the File Browser. Note that the extraction runs in the cluster, not on your machine, so uploading a single archive is an efficient way to transfer multiple files.

Load via the importer

Click the hamburger icon at the top left to open the menu, then click the entry at the very bottom. Alternatively, use the ‘+’ icon at the top of the left SQL Assist panel. Either way opens the Importer.

From there, select the ‘orders’ file that was extracted from the Instacart archive. File and table previews are shown automatically.

Click ‘Next’ to go to step 2. Hue auto-detects the types of the columns and checks if the names are valid. In more advanced scenarios, the user could also change the type of the table (e.g. by selecting the Apache Parquet or Apache Kudu format).
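To give an intuition for what column type detection involves, here is a minimal Python sketch. It is not Hue's actual implementation: the sample rows, the column names, and the helper names `infer_type` and `infer_schema` are assumptions made for illustration only. The idea is simply to try the narrowest type first and fall back to a string.

```python
import csv
import io

# Hypothetical sample resembling the first rows of the 'orders' file.
# Column names and values are assumptions for illustration.
SAMPLE = """order_id,user_id,eval_set,order_number,days_since_prior_order
2539329,1,prior,1,
2398795,1,prior,2,15.0
473747,1,prior,3,21.0
"""


def infer_type(values):
    """Return the narrowest of bigint, double, string fitting all non-empty values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for sql_type, cast in (("bigint", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return sql_type
        except ValueError:
            continue  # this type does not fit, try the next wider one
    return "string"


def infer_schema(text):
    """Read CSV text with a header row and return [(column_name, sql_type), ...]."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose rows into columns
    return [(name, infer_type(col)) for name, col in zip(header, columns)]


schema = infer_schema(SAMPLE)
for name, sql_type in schema:
    print(f"{name}: {sql_type}")
```

Empty cells are ignored during detection, which is why a column with missing values can still be detected as numeric. A real importer would typically look at more rows and handle more types, such as timestamps and booleans.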

Click ‘Submit’ and afterwards the table will appear in the Data Catalog!

Note: for advanced users, the SQL command to create the table and import the data can also be printed.
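As a rough idea of what such a command could look like, the Python sketch below assembles a Hive-style `CREATE TABLE` and `LOAD DATA` pair for a comma-separated file with a header row. The exact SQL that Hue prints may differ. The helper name `build_import_sql`, the column list, the `default` database, and the file path are all hypothetical.

```python
def build_import_sql(database, table, columns, source_path):
    """Build Hive-style statements to create a text table and load a CSV into it."""
    column_lines = ",\n".join(
        f"  `{name}` {sql_type}" for name, sql_type in columns
    )
    create = (
        f"CREATE TABLE `{database}`.`{table}` (\n{column_lines}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        # Skip the CSV header line when reading the data.
        "TBLPROPERTIES ('skip.header.line.count'='1');"
    )
    load = (
        f"LOAD DATA INPATH '{source_path}' "
        f"INTO TABLE `{database}`.`{table}`;"
    )
    return create, load


# Assumed columns and path, for illustration only.
columns = [
    ("order_id", "bigint"),
    ("user_id", "bigint"),
    ("eval_set", "string"),
    ("order_number", "bigint"),
]
create, load = build_import_sql(
    "default", "orders", columns, "/user/demo/instacart/orders.csv"
)
print(create)
print(load)
```

Keeping the statements as text makes it easy to review them, or to adapt them by hand, before running them in the SQL editor.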

In the next episode

Repeat with the ‘products’ file and now you are ready to start querying! We will start from there in the upcoming post of this series.

Note: the importer also supports other outputs, such as Solr Dashboards, and other inputs, such as regular databases.


As usual feel free to comment here or to send feedback to the hue-user list or @gethue!
