Introducing Amazon S3 Support in Hue

We’re very excited to officially introduce Amazon S3 (Amazon Simple Storage Service) integration in Hue with Hue’s 3.11 release. Hue can be set up to read and write to a configured S3 account, and users can directly query from and save data to S3 without any intermediate moving or copying of data to HDFS.

S3 Configuration in Hue

Hue’s filebrowser now allows users to explore, manage, and upload data in an S3 account, in addition to HDFS.

In order to add an S3 account to Hue, you’ll need to configure Hue with valid S3 credentials, including the access key ID and secret access key: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html

These keys can be securely stored in a script that outputs the actual access key and secret key to stdout, to be read by Hue (this is similar to how Hue reads password scripts). To use script files, add the following section to your hue.ini configuration file:

[aws]
[[aws_accounts]]
[[[default]]]
access_key_id_script=/path/to/access_key_script
secret_access_key_script=/path/to/secret_key_script
allow_environment_credentials=false
region=us-east-1
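
For example, a minimal secret key script could look like the following. This is only a sketch: the path and the key value are placeholders for your own, the script must be executable by the user running Hue, and it should print nothing but the key:

#!/usr/bin/env python
# Print only the secret access key to stdout; Hue reads it the same way it reads password scripts.
print("s3secretaccesskey")

A matching access key script would print the access key ID in the same way.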

Alternatively (but not recommended for production or secure environments), you can set the access_key_id and secret_access_key values to the plain-text values of your keys:

[aws]
[[aws_accounts]]
[[[default]]]
access_key_id=s3accesskeyid
secret_access_key=s3secretaccesskey
allow_environment_credentials=false
region=us-east-1

The region should be set to the AWS region corresponding to the S3 account. By default, this region will be set to ‘us-east-1’.

Integrating Hadoop with S3

In addition to configuring Hue with your S3 credentials, Hadoop also needs the S3 authentication credentials in order to read from and write to S3. Set the following properties in your core-site.xml file:

<property>
  <name>fs.s3a.awsAccessKeyId</name>
  <value>AWS access key ID</value>
</property>

<property>
  <name>fs.s3a.awsSecretAccessKey</name>
  <value>AWS secret key</value>
</property>

For more information see http://wiki.apache.org/hadoop/AmazonS3
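
Independently of Hue, a quick way to confirm that Hadoop can reach S3 with these credentials is to list a bucket from the command line (substitute a bucket your keys can read):

hadoop fs -ls s3a://your-bucket/

If this returns the bucket’s contents without errors, the core-site.xml configuration is in place.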

With Hue and Hadoop configured, we can verify that Hue is able to successfully connect to your S3 account by restarting Hue and checking the configuration page. You should not see any errors related to AWS, and you should notice an additional dropdown option in the filebrowser menu in the main navigation:

Hue S3 Configuration
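
If the dropdown does not appear, you can sanity-check the same credentials with a few lines of boto, the library that Hue’s S3 integration is built on. This is a minimal sketch, assuming boto is installed and the placeholder keys are replaced with your own:

import boto

# Connect with the same keys configured in hue.ini (placeholder values shown).
conn = boto.connect_s3(
    aws_access_key_id='s3accesskeyid',
    aws_secret_access_key='s3secretaccesskey',
)

# Listing buckets confirms that the credentials are valid.
for bucket in conn.get_all_buckets():
    print(bucket.name)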

Exploring S3 in Hue’s Filebrowser

Once Hue is successfully configured to connect to S3, we can view all accessible buckets within the account by clicking on the S3 root.

Users can also create new buckets or delete existing buckets from this view.

NOTE: Unique Bucket Names

❗️ S3 bucket names must be globally unique, across all accounts and regions. Hue will raise an error if you attempt to create or rename a bucket with a name that is already taken or otherwise reserved.

In most cases, however, users will be working directly with keys within a bucket. From the buckets view, users can click on a bucket to expand its contents. From here, we can view the existing keys (both directories and files) and create, rename, move, copy, or delete existing directories and files. Additionally, we can directly upload files to S3.

S3 in Filebrowser
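
If you need to automate any of these operations outside of the filebrowser, the same key manipulations can be scripted with boto. A sketch, where the bucket and key names are purely illustrative:

import boto

# With no arguments, boto falls back to environment or boto config credentials.
conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')

# Upload a local file as a key, mirroring the filebrowser's upload action.
key = bucket.new_key('data/sample.csv')
key.set_contents_from_filename('/tmp/sample.csv')

# List the keys under a directory-style prefix.
for k in bucket.list(prefix='data/'):
    print(k.name)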

Create Hive Tables Directly From S3

Hue’s Metastore Import Data Wizard can create external Hive tables directly from data directories in S3. This allows S3 data to be queried via SQL from Hive or Impala, without moving or copying the data into HDFS or the Hive Warehouse.

To create an external Hive table from S3, navigate to the Metastore app, select the desired database and then click the “Create a new table from a file” icon in the upper right.

Enter the table name and an optional description. In the “Input File or Directory” filepicker, select the S3A filesystem, navigate to the parent directory containing the desired data files, and click the “Select this folder” button. The “Load Data” dropdown should automatically select the “Create External Table” option, which indicates that this table will directly reference an external data directory.

Choose your input files’ delimiter and column definition options, then click “Create Table” when you’re ready to create the Hive table. Once created, you should see the new table’s details in the Metastore.

Hive Table from S3
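
Behind the scenes, the wizard generates DDL along the following lines. This is only a sketch: the table name, columns, and s3a location are illustrative, and the wizard derives the real ones from your selections:

CREATE EXTERNAL TABLE quakes (
  event_time STRING,
  latitude DOUBLE,
  longitude DOUBLE,
  magnitude DOUBLE,
  station_id STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3a://quakes/input/';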

Save Query Results to S3

Now that we have external Hive tables created from our S3 data, we can jump into either the Hive or Impala editor and seamlessly query the data directly from S3. These queries can join tables and objects backed by S3, HDFS, or both. Query results can then easily be saved back to S3.

Query S3 Data and Save
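
As a sketch of what such queries can look like (continuing the illustrative table from above, with a hypothetical HDFS-backed stations table), a query can join S3-backed and HDFS-backed tables, and results can be written straight back to S3:

-- Join the S3-backed external table with an HDFS-backed table.
SELECT q.event_time, q.magnitude, s.station_name
FROM quakes q
JOIN stations s ON q.station_id = s.station_id;

-- Save query results back to S3 as delimited text.
INSERT OVERWRITE DIRECTORY 's3a://quakes/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM quakes WHERE magnitude > 5.0;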

TIP: Impala and S3

💡 For more advanced use cases with Impala and S3, read: Analytics and BI on Amazon S3 with Apache Impala (Incubating).

Using Ceph

New endpoints have been added in https://issues.cloudera.org/browse/HUE-5420 to support S3-compatible object stores such as Ceph.
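
In practice, this means the [aws] account in hue.ini can be pointed at an S3-compatible endpoint such as a Ceph RADOS Gateway instead of Amazon. A rough sketch, where the host and keys are illustrative and the exact option names may vary between Hue versions:

[aws]
[[aws_accounts]]
[[[default]]]
access_key_id=cephaccesskeyid
secret_access_key=cephsecretaccesskey
host=rgw.example.com
is_secure=false
calling_format=boto.s3.connection.OrdinaryCallingFormat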

What’s Next

Hue 3.11’s seamless support for S3 as an additional filesystem is just the beginning of a long-term roadmap for greater data flexibility and portability in the cloud. Stay tuned for future enhancements, such as cross-filesystem transfers and the execution and scheduling of queries directly against the object store, that will provide a tighter integration between HDFS, S3, and additional filesystems.

As always, if you have any questions, feel free to comment here or on the hue-user list or @gethue!

29 Comments

  1. Miguel 10 months ago

    Hi

    After making changes in config files “hue.ini” and “core-site.xml”, how do I restart Hue?

    Thanks

    • Author
      Hue Team 10 months ago

      Which version do you use? If upstream, just kill its process; if using CM, just restart it in the UI; if using packages, use the restart command.

      • PLM 9 months ago

        What is the proper command to restart Hue on AWS EMR? I run “sudo stop/start hue” but the web admin site does not come back online. I’ve been scratching my head over this. Thanks

        • Author
          Hue Team 9 months ago

          Have you asked on the EMR forum too?

  2. Michael 10 months ago

    Can you make a video on S3 Configuration in Hue? It is not clear from the instructions.
    Is the only thing I change the access key and secret access key? If so, it is not working. Do I put my own account under aws_accounts?

    [aws]
    [[aws_accounts]]
    [[[default]]]
    access_key_id=s3accesskeyid
    secret_access_key=s3secretaccesskey
    allow_environment_credentials=false
    region=us-east-1

  3. Nick 9 months ago

    I have an AWS EMR cluster with Hue installed. I want to create a user in Hue that only has read access to my S3. Is this possible?

    • Author
      Hue Team 9 months ago

      Yes, you can use keys from an IAM user that has read-only S3 permissions.

  4. Ben 8 months ago

    I have edited hue.ini and core-site.xml as instructed above and restarted Hue, but the File Browser dropdown menu has not changed. There is no S3 Browser option. I have verified that Hue is using the updated hue.ini and I have checked that I entered my AWS keys etc correctly. How might I troubleshoot this, please?

    • Ben 8 months ago

      Nevermind. I’m using version 3.10, not 3.11.

  5. Saugata ghosh 7 months ago

    How can I install Hue 3.11 on an AWS EMR cluster? It comes with 3.10 by default.

    • Author
      Hue Team 7 months ago

      EMR ships Hue 3.10. If you want 3.11, you will have to install it yourself on the machines.

  6. Evan 6 months ago

    Thanks for the great work on HUE!

    I am using HUE. I have a question:

    I have 2 clusters. Cluster A is for computing, Cluster B is for storing data.

    I would like the file browser to use cluster B’s HDFS, while submitting jobs to cluster A (oozie’s workspace is in cluster A).

    How can I do it? Is there a config option in Hue, or do I need to make code changes?

    Thanks a lot!

  7. Hrishikesh Khatavkar 6 months ago

    I am using Hue version 3.8.1 and trying to connect to S3.

    I have also made the changes in the config files “hue.ini” and “core-site.xml” but am unable to connect to S3.

    Error-: FAILED: RuntimeException java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found Intercepting System.exit(40000) Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [40000]

    Thanks a lot!

    • Author
      Hue Team 6 months ago

      Did you configure Oozie too?

      <property>
        <name>oozie.service.HadoopAccessorService.supported.filesystems</name>
        <value>*</value>
        <description>
          Enlist the different filesystems supported for federation. If wildcard "*" is specified,
          then ALL file schemes will be allowed.
        </description>
      </property>
      
  8. Barrie Wheeler 5 months ago

    We’re using S3, but it’s a private datastore, via Ceph and the RADOS Gateway. Not using Amazon for the S3 account.

    Contents of the Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini:
    [aws]
    [[aws_accounts]]
    [[[default]]]
    http_url=http://192.168.72.137:80 <<-- this is a guess, but doesn't work. It's the RADOS gateway (RGW) IP addr.
    access_key_id=001
    secret_access_key=100

    The credentials for the RGW (as listed in a COSBench workload file) are:

    I can access the RGW and associated buckets from the HDP command line:
    sudo -u hdfs hadoop fs -ls s3a://quakes/input
    Found 1 items
    -rw-rw-rw- 1 hdfs 1658339 2017-01-27 14:40 s3a://quakes/input/all_month.csv

    But I can’t access S3 from the Hue GUI, although the entry for the S3 Browser shows up. When I click on it, it just says “waiting on localhost”, then times out. It would be very useful to find out how to connect the Hue web GUI to S3 datastores other than Amazon. I think it would open up use of the GUI to new customers (like us).

    Your thoughts appreciated!

    BG

  9. Barrie Wheeler 5 months ago

    Thanks for the inputs. I’ve synced my cluster files with your revised files:
    /opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/desktop/conf/hue.ini
    /opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/desktop/libs/aws/src/aws/conf.py
    /opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/desktop/libs/aws/src/aws/client.py
    /opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/desktop/conf/pseudo-distributed.ini.tmpl (I just added this; it wasn’t here originally and I wasn’t sure if it was needed.)

    When I start Hue, I see this error (stderr):
    File "/opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/build/env/bin/hue", line 12, in
    load_entry_point('desktop==3.9.0', 'console_scripts', 'hue')()
    File "/opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/desktop/core/src/desktop/manage_entry.py", line 59, in entry
    execute_from_command_line(sys.argv)
    File "/opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
    File "/opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
    File "/opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/management/__init__.py", line 261, in fetch_command
    commands = get_commands()
    File "/opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/management/__init__.py", line 107, in get_commands
    apps = settings.INSTALLED_APPS
    File "/opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/conf/__init__.py", line 54, in __getattr__
    self._setup(name)
    File "/opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/conf/__init__.py", line 49, in _setup
    self._wrapped = Settings(settings_module)
    File "/opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/conf/__init__.py", line 132, in __init__
    % (self.SETTINGS_MODULE, e)
    ImportError: Could not import settings 'desktop.settings' (Is it on sys.path? Is there an import error in the settings file?): cannot import name is_default_configured

    No luck yet on what’s causing it. What am I missing? Thanks for the assistance.

    • Author
      Hue Team 5 months ago

      “cannot import name is_default_configured” means you need to sync more files, as Hue is trying to import the is_default_configured function, which is not there.

  10. Michael Arnold 4 months ago

    Is it possible to configure S3 access without providing access_key_id/secret_access_key? I have Hue running on an AWS instance that has an associated IAM role, which should allow S3 connectivity without manually providing access keys to Hue.

  11. Nikhil Srinidhi 4 months ago

    When I set the region to ca-central-1, it throws an “unknown region” error. Is there a way around this?

  12. Barrie Wheeler 4 months ago

    I upgraded Hue to 3.12, which contains the updated code for accessing non-AWS S3 Storage.

    The Hue GUI works fine for HDFS, but I’m still getting “Cannot Access S3A:// Timed Out”.
    Server Logs:
    [21/Feb/2017 15:53:36 -0700] access WARNING 10.2.28.161 hdfs - "GET /logs HTTP/1.1"
    [21/Feb/2017 15:53:31 -0700] access DEBUG 127.0.0.1 -anon- - "HEAD /desktop/debug/is_alive HTTP/1.1"
    [21/Feb/2017 15:53:13 -0700] resource DEBUG GET Got response: {"apps":null}
    [21/Feb/2017 15:53:13 -0700] connectionpool DEBUG "sm101.lab7217.local:8088 GET /ws/v1/cluster/apps?user=hdfs&finalStatus=UNDEFINED&limit=1000&user.name=hue&doAs=hdfs HTTP/1.1" 200 None
    [21/Feb/2017 15:53:13 -0700] access INFO 10.2.28.161 hdfs - "POST /jobbrowser/jobs/ HTTP/1.1"

    Hue Service Advanced Configuration Snippet (Safety Valve) for hue_safety_valve.ini:
    [desktop]
    allowed_hosts=*

    [aws]
    [[aws_accounts]]
    [[[default]]]
    proxy_address=http://192.168.72.137
    proxy_port=80
    access_key_id=001
    secret_access_key=100
    is_secure=true
    calling_format=boto.s3.connection.S3Connection.DefaultCallingFormat

    If these settings are made in Cloudera Manager, I assume no direct editing of hue.ini, client.py, or conf.py is required, but the documentation is not clear on this.
    I can access S3A fine from the CLI, so the core-site S3 credentials are good, but my Hue configuration is still off.
    What can I check? Thanks.

  13. Barrie Wheeler 4 months ago

    Correction on the above:
    The error msg in the Hue GUI is: Cannot access: S3A://. [Errno 22] Unknown scheme s3a, available schemes: ['hdfs']
    Server Logs:
    [21/Feb/2017 16:46:44 -0700] access WARNING 10.2.28.161 hdfs - "GET /logs HTTP/1.1"
    [21/Feb/2017 16:46:34 -0700] access DEBUG 127.0.0.1 -anon- - "HEAD /desktop/debug/is_alive HTTP/1.1"
    [21/Feb/2017 16:46:33 -0700] resource DEBUG GET Got response: {"apps":null}
    [21/Feb/2017 16:46:33 -0700] connectionpool DEBUG "sm101.lab7217.local:8088 GET /ws/v1/cluster/apps?finalStatus=UNDEFINED&limit=1000&user.name=hue&user=hdfs&startedTimeBegin=1487115993000&doAs=hdfs HTTP/1.1" 200 None
    [21/Feb/2017 16:46:33 -0700] access INFO 10.2.28.161 hdfs - "POST /jobbrowser/jobs/ HTTP/1.1"
    [21/Feb/2017 16:46:27 -0700] resource DEBUG GET Got response: {"apps":null}
    [21/Feb/2017 16:46:27 -0700] connectionpool DEBUG "sm101.lab7217.local:8088 GET /ws/v1/cluster/apps?finalStatus=UNDEFINED&limit=1000&user.name=hue&user=hdfs&startedTimeBegin=1487115987000&doAs=hdfs HTTP/1.1" 200 None
    [21/Feb/2017 16:46:27 -0700] access INFO 10.2.28.161 hdfs - "POST /jobbrowser/jobs/ HTTP/1.1"

    Thanks.

    • Gwen 4 months ago
    • Barrie Wheeler 4 months ago

      Additional Detail on “S3A not accessible”:

      [22/Feb/2017 14:24:43 -0700] middleware INFO Processing exception: Cannot access: S3A://. : Traceback (most recent call last):
      File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/core/handlers/base.py", line 112, in get_response
      response = wrapped_callback(request, *callback_args, **callback_kwargs)
      File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hue/build/env/lib/python2.7/site-packages/Django-1.6.10-py2.7.egg/django/db/transaction.py", line 371, in inner
      return func(*args, **kwargs)
      File "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hue/apps/filebrowser/src/filebrowser/views.py", line 205, in view
      raise PopupException(msg , detail=e)
      PopupException: Cannot access: S3A://.

      [22/Feb/2017 14:24:43 -0700] exceptions_renderable ERROR Potential trace: [('/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hue/apps/filebrowser/src/filebrowser/views.py', 187, 'view', 'stats = request.fs.stats(path)'), ('/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hue/desktop/core/src/desktop/lib/fs/proxyfs.py', 117, 'stats', 'return self._get_fs(path).stats(path)'), ('/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hue/desktop/core/src/desktop/lib/fs/proxyfs.py', 77, '_get_fs', "raise IOError(errno.EINVAL, 'Unknown scheme %s, available schemes: %s' % (scheme, self._fs_dict.keys()))")]
      [22/Feb/2017 14:24:43 -0700] exceptions_renderable ERROR Potential detail: [Errno 22] Unknown scheme s3a, available schemes: ['hdfs']
      [22/Feb/2017 14:24:43 -0700] access INFO 10.2.28.161 hdfs - "GET /filebrowser/view=S3A:// HTTP/1.1"

      It’s unclear why S3A is not accessible or unknown, and I’m not sure how to interpret these log exceptions. I can access S3A fine from the CLI, but within the filebrowser in Hue, no luck.
      Again, this is S3 storage on a local Ceph datastore, not Amazon, so the Amazon config doesn’t help.
      Thoughts appreciated.

  14. Monica 1 month ago

    Hi,
    I am using Hue 3.12 with an EMR cluster. Without the edits to the files mentioned above, I can see my S3 buckets but can’t get into them; I get a bad request error. If I make the changes above, I can’t see my buckets at all. Any help would be greatly appreciated.
    Thanks!

    • Author
      Hue Team 1 month ago

      What is the error you see in /logs of Hue after trying to open a bucket?
      You might also need to specify the region in the [aws] config in the hue.ini.
