Querying & Exploring the Instacart dataset Part 1: Ingesting the data

Published on 22 March 2019 in Version 4 - 2 minutes read - Last modified on 06 March 2021

Self-service exploratory analytics is one of the most common use cases among Hue users. In this tutorial, let's see how to get started with the analysis. We will use the free Instacart dataset and start with the Importer feature.

Getting the data

This step was made particularly easy by Instacart. Just go to their dataset page of 3 million orders and download the 200 MB archive.

Making it queryable

The next step is not always trivial. In our case, there is no data team adding the dataset to the Data Catalog for us, but fortunately we can use Hue's Data Importer.

Upload to the object store

First upload the dataset to the cluster. This is easy via the File Browser.

Then uncompress the archive, which also takes just two clicks in the File Browser. Note that the processing happens in the cluster, not on your machine, which makes this an efficient way to upload multiple files.

Load via the importer

Open the menu via the hamburger icon in the top left and click the entry at the very bottom, or use the ‘+’ icon at the top of the left SQL Assist panel. Either one opens the Importer.

From there, select the ‘orders’ file that was extracted from the Instacart archive. File and table previews are shown automatically.

Click ‘Next’ to go to step 2. Hue auto-detects the column types and checks that the column names are valid. In more advanced scenarios, you can also change the table format (e.g. by selecting Apache Parquet or Apache Kudu).
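To give a feel for what type auto-detection involves, here is a minimal Python sketch. It is not Hue's implementation: the function names, the three-type ladder (bigint, double, string), the name-validity rule and the sample rows are all assumptions made for illustration. The column names are the ones we expect in the ‘orders’ file.

```python
import csv
import io
import re
from itertools import islice

# Assumed rule for a "valid" column name: letters, digits, underscores,
# not starting with a digit. Real engines may be stricter or looser.
VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def infer_type(values):
    """Return a rough SQL type for a column of strings (empty string = NULL)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    # Try the narrowest type first, then widen.
    for sql_type, cast in (("bigint", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return sql_type
        except ValueError:
            continue
    return "string"

def infer_schema(csv_text, sample_size=1000):
    """Infer (name, type, name_is_valid) for each column from a sample of rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(islice(reader, sample_size))
    columns = list(zip(*rows)) if rows else [() for _ in header]
    return [
        (name, infer_type(list(col)), bool(VALID_NAME.match(name)))
        for name, col in zip(header, columns)
    ]

# Made-up sample rows, for illustration only.
SAMPLE = """order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
1001,7,prior,1,2,08,
1002,7,prior,2,3,07,15.0
"""

schema = infer_schema(SAMPLE)
for name, sql_type, valid in schema:
    print(f"{name}: {sql_type} (valid name: {valid})")
```

Sampling only the first rows keeps detection fast on large files, at the cost of occasionally guessing too narrow a type, which is why the Importer lets you review the types before submitting.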

Click ‘Submit’ and afterwards the table will appear in the Data Catalog!

Note: for advanced users, the SQL command to create the table and import the data can also be printed.
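As a rough idea of what such a command can look like, the Python sketch below assembles a Hive-style CREATE TABLE and LOAD DATA pair for the ‘orders’ file. This is an assumption-laden illustration, not the SQL Hue actually prints: the helper name, the `default` database, the column types and the file path `/user/demo/instacart/orders.csv` are all hypothetical.

```python
def build_import_sql(database, table, columns, csv_path):
    """Build Hive-style statements that create a CSV-backed table and load a file into it."""
    column_lines = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in columns)
    create = (
        f"CREATE TABLE `{database}`.`{table}` (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        # Skip the CSV header row when reading the file.
        "TBLPROPERTIES ('skip.header.line.count'='1');"
    )
    load = f"LOAD DATA INPATH '{csv_path}' INTO TABLE `{database}`.`{table}`;"
    return create, load

# Assumed column types for the 'orders' file.
ORDERS_COLUMNS = [
    ("order_id", "bigint"),
    ("user_id", "bigint"),
    ("eval_set", "string"),
    ("order_number", "bigint"),
    ("order_dow", "bigint"),
    ("order_hour_of_day", "bigint"),
    ("days_since_prior_order", "double"),
]

create_sql, load_sql = build_import_sql(
    "default", "orders", ORDERS_COLUMNS,
    "/user/demo/instacart/orders.csv",  # hypothetical path on the cluster
)
print(create_sql)
print(load_sql)
```

Having the statements in hand is useful when you want to version the table definition or rerun the import from a script instead of clicking through the wizard.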

In the next episode

Repeat with the ‘products’ file and now you are ready to start querying! We will start from there in the upcoming post of this series.

Note: the importer also supports other outputs, such as Solr Dashboards, and other inputs, such as regular databases.

 

As usual, feel free to comment here or to send feedback to the hue-user list or @gethue!

