Browsing ADLS data, querying it with SQL and exporting the results back in Hue 4.2

Browsing ADLS data, querying it with SQL and exporting the results back in Hue 4.2

We’re happy to present Microsoft Azure Data Lake Store (ADLS) integration in Hue with Hue 4.2 release. Similarly to the S3 integration, Hue can be setup to read and write to a configured ADLS, and users can directly query from and save data to ADLS without any intermediate moving / copying to HDFS.

For a detailed video walkthrough of file manipulation using ADLS in Hue, have a look:

In case you missed the one for S3, here is the link to the post.

 

Exploring ADLS in Hue’s file browser

Once Hue is successfully configured to connect to ADLS, we can view all accessible folders within the account by clicking on the ADLS root. From here, we can view the existing keys (both directories and files) and create, rename, move, copy, or delete existing directories and files. Additionally, we can directly upload files to ADLS.

Create Hive Tables Directly From ADLS

Hue’s table browser import wizard can create external Hive tables directly from files in ADLS. This allows ADLS data to be queried via SQL from Hive or Impala, without moving or copying the data into HDFS or the Hive Warehouse. To create an external Hive table from ADLS, navigate to the table browser, select the desired database and then click the plus icon in the upper right. Select a file using the file picker and browse to a file on ADLS.

Choose your input files’ delimiter and press next. Keep unchecked “Store in Default location” if you want the file to stay intact on ADLS, update the column definition options and finally click “Submit” when you’re ready to create the Hive table. Once created, you should see the newly created table details in the table browser.

Save Query Results to ADLS

Now that we have created external Hive tables created from our ADLS data, we can jump into either the Hive or Impala editor and start querying the data directly from ADLS seamlessly. These queries can join tables and objects that are backed either by ADLS, HDFS, or both. Query results can then easily be saved back to ADLS.

 

ADLS Configuration in Hue

Hue’s file browser can now allow users to explore, manage, and upload data in an ADLS, in addition to HDFS and S3.

In order to add an ADLS account to Hue, you’ll need to configure Hue with valid ADLS credentials, including the client ID, client secret and tenant ID.
These keys can securely stored in a script that outputs the actual access key and secret key to stdout to be read by Hue (this is similar to how Hue reads password scripts). In order to use script files, add the following section to your hue.ini configuration file:

[adls]
[[azure_accounts]]
[[[default]]]
client_id_script=/path/to/client_id_script.sh
client_secret_script=/path/to/client_secret_script.sh
tenant_id_script=/path/to/tenant_id_script.sh

[[adls_clusters]]
[[[default]]]
fs_defaultfs=adl://<account_name>.azuredatalakestore.net
webhdfs_url=https://<account_name>.azuredatalakestore.net

Alternatively (but not recommended for production or secure environments), you can set the client_secret value in plain-text:

[adls]
[[azure_accounts]]
[[[default]]]
client_id_script=/path/to/client_id_script.sh
client_secret_script=/path/to/client_secret_script.sh
tenant_id_script=/path/to/tenant_id_script.sh

[[adls_clusters]]
[[[default]]]
fs_defaultfs=adl://<account_name>.azuredatalakestore.net
webhdfs_url=https://<account_name>.azuredatalakestore.net

Alternatively (but not recommended for production or secure environments), you can set the client_secret value in plain-text:

[adls]
[[azure_account]]
[[[default]]]
client_id=adlsclientid
client_secret=adlsclientsecret
tenant_id=adlstenantid

[[adls_clusters]]
[[[default]]]
fs_defaultfs=adl://<account_name>.azuredatalakestore.net
webhdfs_url=https://<account_name>.azuredatalakestore.net

Integrating Hadoop with ADLS

In addition to configuring Hue with your ADLS credentials, Hadoop will also need to be configured with the ADLS authentication credentials in order to read from and save to ADLS. This can be done by setting the following properties in your core-site.xml file:

<property>
  <name>fs.adl.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property/>
 
<property>
  <name>fs.adl.oauth2.refresh.url</name>
  <value>https://login.microsoftonline.com/<tenant_id>/oauth2/token</value>
</property/>

<property>
  <name>fs.adl.oauth2.client.id</name>
  <value>adlsclientid</value>
</property/>

<property>
  <name>fs.adl.oauth2.credential</name>
  <value>adlsclientsecret</value>
</property/>

With Hue and Hadoop configured, we can verify that Hue is able to successfully connect to ADLS by restarting Hue and checking the configuration page. You should not see any errors related to ADLS, and you should notice an additional option in the menu from the main navigation.

For performance numbers on ADLS, please read: A look at ADLS Performance.
As always, if you have any questions, feel free to comment here or on the hue-user list or @gethue!

0 Comments

Leave a reply

Your email address will not be published. Required fields are marked *

*