Introducing Amazon S3 Support in Hue

Introducing Amazon S3 Support in Hue

We’re very excited to officially introduce Amazon S3 (Amazon Simple Storage Service) integration in Hue with Hue’s 3.11 release. Hue can be setup to read and write to a configured S3 account, and users can directly query from and save data to S3 without any intermediate moving/copying to HDFS.

S3 Configuration in Hue

Hue’s filebrowser can now allow users to explore, manage, and upload data in an S3 account, in addition to HDFS.

In order to add an S3 account to Hue, you’ll need to configure Hue with valid S3 credentials, including the access key ID and secret access key: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html

These keys can securely stored in a script that outputs the actual access key and secret key to stdout to be read by Hue (this is similar to how Hue reads password scripts). In order to use script files, add the following section to your hue.ini configuration file:

[aws]
[[aws_accounts]]
[[[default]]]
access_key_id_script=/path/to/access_key_script
secret_access_key_script= /path/to/secret_key_script
allow_environment_credentials=false
region=us-east-1

Alternatively (but not recommended for production or secure environments), you can set the access_key_id and secret_access_key values to the plain-text values of your keys:

[aws]
[[aws_accounts]]
[[[default]]]
access_key_id=s3accesskeyid
secret_access_key=s3secretaccesskey
allow_environment_credentials=false
region=us-east-1

The region should be set to the AWS region corresponding to the S3 account. By default, this region will be set to ‘us-east-1’.

Integrating Hadoop with S3

In addition to configuring Hue with your S3 credentials, Hadoop will also need to be configured with the S3 authentication credentials in order to read from and save to S3. This can be done by setting the following properties in your core-site.xml file:

<property>
  <name>fs.s3a.awsAccessKeyId</name>
  <value>AWS access key ID</value>
</property/>

<property>
  <name>fs.s3a.awsSecretAccessKey</name>
  <value>AWS secret key</value>
</property/>

For more information see http://wiki.apache.org/hadoop/AmazonS3

With Hue and Hadoop configured, we can verify that Hue is able to successfully connect to your S3 account by restarting Hue and checking the configuration page. You should not see any errors related to AWS, and you should notice an additional dropdown option in the Filebrowser menu from the main navigation:

Hue S3 Configuration

Exploring S3 in Hue’s Filebrowser

Once Hue is successfully configured to connect to S3, we can view all accessible buckets within the account by clicking on the S3 root.

Users can also create new buckets or delete existing buckets from this view.

NOTE: Unique Bucket Names

❗️ S3 bucket names must be unique across all regions. Hue will raise an error if you attempt to create or rename a bucket with a reserved name.

However, in most cases users will be working directly with keys within a bucket. From the buckets-view, users can click on a bucket to expand its contents. From here, we can view the existing keys (both directories and files) and create, rename, move, copy, or delete existing directories and files. Additionally, we can directly upload files to S3

S3 in Filebrowser

Create Hive Tables Directly From S3

Hue’s Metastore Import Data Wizard can create external Hive tables directly from data directories in S3. This allows S3 data to be queried via SQL from Hive or Impala, without moving or copying the data into HDFS or the Hive Warehouse.

To create an external Hive table from S3, navigate to the Metastore app, select the desired database and then click the “Create a new table from a file” icon in the upper right.

Enter the table name and optional description, and in the “Input File or Directory” filepicker, select the S3A filesystem and navigate to the parent directory containing the desired data files and click the “Select this folder” button. The “Load Data” dropdown should automatically select the “Create External Table” option which indicates that this table will directly reference an external data directory.

Choose your input files’ delimiter and column definition options and finally click “Create Table” when you’re ready to create the Hive table. Once created, you should see the newly created table details in the Metastore.

Hive Table from S3

Save Query Results to S3

Now that we have created external Hive tables created from our S3 data, we can jump into either the Hive or Impala editor and start querying the data directly from S3 seamlessly. These queries can join tables and objects that are backed either by S3, HDFS, or both. Query results can then easily be saved back to S3.

Query S3 Data and Save

TIP: Impala and S3

💡 For further advanced use-cases with Impala and S3, read: Analytics and BI on Amazon S3 with Apache Impala (Incubating).

What’s Next

Hue 3.11’s seamless support for S3 as an additional filesystem is just the beginning of a long-term roadmap for greater data flexibility and portability in the Cloud. Stay tuned for future enhancements like cross file transfers, execution and schedule of queries directly from the object store… that will provide a tighter integration between HDFS, S3, and additional filesystems.

As always, if you have any questions, feel free to comment here or on the hue-user list or @gethue!

15 Comments

  1. Miguel 5 months ago

    Hi

    After making changes in config files “hue.ini” and “core-site.xml”, how do I restart Hue?

    Thanks

    • Author
      Hue Team 5 months ago

      Which version do you use? If upstream, just kill its process, if using CM just restart it in the UI, if using packages use the restart command

      • PLM 4 months ago

        What is the proper command to restart hue on aws emr? I run “sudo stop/start hue” but the web admin site does not go back online and is unavailable. I’ve been scratching my head over this. Thanks

        • Author
          Hue Team 4 months ago

          Have you asked on the EMR forum too?

  2. Michael 5 months ago

    Can you make a video on S3 Configuration in Hue? It is not clear from the instructions.
    Is the only thing I change the access key and secret access key. If so, it is not working. Do I put in my own account under aws_accounts?

    [aws]
    [[aws_accounts]]
    [[[default]]]
    access_key_id=s3accesskeyid
    secret_access_key=s3secretaccesskey
    allow_environment_credentials=false
    region=us-east-1

  3. Nick 4 months ago

    I have an aws emr hue installed cluster . I want to create a user in hue that only have read access to my S3 . Is this possible?

    • Author
      Hue Team 4 months ago

      Yes, you can give keys from an IAM user that has only read only S3 perms.

  4. Ben 3 months ago

    I have edited hue.ini and core-site.xml as instructed above and restarted Hue, but the File Browser dropdown menu has not changed. There is no S3 Browser option. I have verified that Hue is using the updated hue.ini and I have checked that I entered my AWS keys etc correctly. How might I troubleshoot this, please?

    • Ben 3 months ago

      Nevermind. I’m using version 3.10, not 3.11.

  5. Saugata ghosh 2 months ago

    How can I install hue 3.11 on aws EMR cluster, this by default comes with 3.10

    • Author
      Hue Team 2 months ago

      EMR ships Hue 3.10. If you want 3.11 you will have to install it yourself on the machines

  6. Evan 1 month ago

    Thanks for the great work on HUE!

    I am using HUE. I have a question:

    I have 2 clusters. Cluster A is for computing, Cluster B is for storing data.

    I would like the file browser to use cluster B’s HDFS, while submitting jobs to cluster A (oozie’s workspace is in cluster A).

    How can I do it? Is there a config in Hue, or I need to do some coding changes?

    Thanks a lot!

  7. Hrishikesh Khatavkar 2 weeks ago

    I am using hue version 3.8.1 and trying to connect with s3.

    also i have done changes in config files “hue.ini” and “core-site.xml” but unable to make connectivity with s3.

    Error-: FAILED: RuntimeException java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found Intercepting System.exit(40000) Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [40000]

    Thanks a lot!

    • Author
      Hue Team 2 weeks ago

      Did you configure Oozie too?

      <property>
         <name>oozie.service.HadoopAccessorService.supported.filesystems</name>
            <value>*</value>
            <description>
               Enlist the different filesystems supported for federation. If wildcard "*" is specified,
               then ALL file schemes will be allowed.
         </description>
      </property>
      

Leave a reply

Your email address will not be published. Required fields are marked *

*