Query Editors

Assist and autocomplete

The autocomplete now supports complex types (structs, maps and arrays), both in the assist panel and in the autocomplete.

A longer, more detailed version is available here.

Complex in a simple way…

The assist supports complex types for Hive and Impala. Just click on the column and it’ll expand any of the structs, maps and arrays. The assist panel is resizable, which is very handy when you have long table or column names or deeply nested structures.

Assist_complex

If you find what you’re looking for in the panel you can double-click the item to insert it at the cursor in the active editor. You can also drag-and-drop it anywhere in the editor if you prefer another location. It knows about the structure and will insert the reference with all the parents up to and including the column name.
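For illustration, assuming a hypothetical customers table with an address struct, the inserted reference would look roughly like this:

-- Double-clicking 'street' under customers > address inserts the full
-- reference, parent fields included
SELECT customers.address.street FROM customers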

Ctrl+space is your friend!

You can use ctrl+space anywhere in the editor to activate the autocomplete. It will make suggestions based on the schema and it will also take the contents of the statement that you’re editing into account. Apart from the SQL keywords, tables and columns, it will even suggest user-defined aliases.

Autocomplete_join_alias

It knows about complex types for Hive and Impala and will make the appropriate suggestions based on the reference that you’re currently editing.

Autocomplete_nested_struct

I heard you like exploded views! The autocomplete will help you with these: it keeps track of the exploded views, as well as exploded views of exploded views of exploded views of…

Autocomplete_exploded
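As a quick sketch (table and column names below are made up), an exploded view in Hive looks like this, and each exploded alias can itself be referenced and exploded again:

-- 'orders' is a hypothetical table with an array<string> column 'items'
SELECT o.customer_id, item
FROM orders o
LATERAL VIEW explode(o.items) item_view AS item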

Values and HDFS paths

We’ve even added sample values to the mix. This is currently only available for Impala and it will suggest a subset of all the values, which is really nice when working with map keys, for instance, or to get an idea of the data.

Autocomplete_sample_values
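For example, with a hypothetical products table that has a map column, a suggested key can be used directly in the query:

-- 'attributes' is a hypothetical map<string,string> column; the sample
-- values make it easy to pick an existing key such as 'color'
SELECT attributes['color'] FROM products LIMIT 10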

Last but not least, the autocomplete can now suggest HDFS paths. Just type ‘/’ in one of the notebook snippets and it will automatically open the autocomplete panel with the list of folders and files at that location. Of course, you can also activate it with ctrl+space when editing an existing path.

Autocomplete_hdfs

 

Browse Metadata

The app is now a single page, offering more speed, more accessible statistics and a data preview.

New UI

Fresh restart

The front-end has been rewritten to be slicker and more user-friendly. More information is displayed and navigating across tabs is seamless as it no longer entails a page refresh.

blog-57-metastore-table

Table specific page after

blog-57-metastore-tables

Table list page

Table and column comments can be edited in a single click.

blog-edit-comments

 

Creation wizards

We also refreshed the wizard for creating tables and databases quickly, whether designing the schema manually or automatically from a sample file.

blog-57-create table

 

 

 

Data rich

Statistics

Stats are easier to access than before. Most of the database and table metadata are displayed on top. Top values, min and max can also be retrieved and refreshed directly from the UI. Links to HDFS locations are automatically added.

blog-metastore-stats

blog-col-stats
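For reference, the table and column statistics displayed here are typically the ones populated by the usual SQL commands (the table name below is just an example):

-- Impala
COMPUTE STATS web_logs;
-- Hive
ANALYZE TABLE web_logs COMPUTE STATISTICS FOR COLUMNS;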

Preview data

The first few rows of table data are displayed on the first page to provide users with a convenient quick preview.

blog-metastore-preview-data

 

Browse Partitions

Regardless of a table’s partition locations, Hue’s metastore now enables you to browse all the partitions in the table, by clicking the “Show Partitions” link from the table view. By default, the partitions view will sort the partitions in reverse order by name (or newest first, if partitioned by date) and display the first 250 partitions.

If you’re searching for a specific set of partitions however, you can now filter on a partition value by clicking the “Add a filter” link, selecting and specifying the filter parameter, then clicking the “Filter” button! Note that multiple partition filters can be added as needed to refine your query, and you can also disable the default sort order to retrieve partition results in alphabetically ascending order.
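For reference, the same information can be listed in plain HiveQL (the table and partition key below are just examples):

SHOW PARTITIONS web_logs;
SHOW PARTITIONS web_logs PARTITION (dt='2015-01-01');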

Filter, Sort and Browse Hive Partitions with Hue’s Metastore from The Hue Team on Vimeo.

Finally, you can view the data files in any partition by clicking the “View Partition Files” link which will take you to the filebrowser for that partition’s location.

Metastore Partition View

Partition File Browser

 

LDAP or PAM pass-through authentication with Hive or Impala and Impersonation

Hue is a server that sits between the users, logged in via their browsers, and the respective Hadoop services. Consequently, Hue is seen as a single ‘hue’ user by the other servers.

Impersonation is used in order to still apply the permissions of the real logged-in user. For example, when the user ‘bob’ submits a query, Hue also sends his username, and HiveServer2 will use ‘bob’ and not ‘hue’ as the owner of the query.

hue-auth-client

Hue supports multiple ways to authenticate with the other servers: Kerberos and LDAP are common, as well as PAM.

In the next version of Hue, it is possible to configure which authentication to use for Hive and Impala separately (it used to be a single common configuration). This, for example, lets you configure Hue to use LDAP to talk to HiveServer2 and Kerberos for Impala.

The usernames and passwords to use for LDAP or PAM are configurable in the main configuration section ([desktop]) and can be overridden in each respective app.

In order to provide better security, it is also now possible to provide a path to a file that contains the password to use (instead of putting it in plain text in the hue.ini). If the plain password is not set, the file will be used.

For example, here is how to configure a ‘hue’ user and a password stored in a file for all the apps:

[desktop]
auth_username=hue    
# auth_password=
auth_password_script=/path/to/ldap_password

If Hue needs to authenticate to HiveServer2 with some different username and password:

[beeswax]
auth_username=hue_hive
auth_password=hue_hive_pwd
# auth_password_script=

If Impala is not using LDAP authentication but Hive does, we disable it in [desktop] and do not specify anything in [impala]:

[desktop]
auth_username=hue    
# auth_password=
# auth_password_script=

[beeswax]
auth_username=hue_hive
auth_password=hue_hive_pwd

[impala]
# auth_username=
# auth_password=hue_impala
# auth_password_script=/

Note
Not setting any password will make the LDAP/PAM authentication inactive.

Note
SSL encryption between Hue and the other Hadoop services is also supported

Note

Use CM’s “HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml” to add the configuration overrides to hive-site.xml.
Add those configurations under: Hive > Configuration > Gateway > Advanced > Hive Client Advanced Configuration Snippet (Safety Valve) for hive-site.xml, then save and restart both Hive and Hue. This should allow Hue to pick up the hive-site.xml changes.

How to use Hue with Hive and Impala configured with LDAP authentication and SSL

We previously showed in detail how to use SSL encryption with the Impala or Hive Editors. Here is now a step-by-step guide on how to use LDAP authentication instead of no authentication or Kerberos.

Note: this requires Hue 3.7 or CDH5.2

1.
HiveServer2 had SSL enabled so the Hive Editor could not connect to HiveServer2. The HiveServer2 logs showed SSL errors indicating that it received plaintext (a good hint at the cause).

Solved by adding this to the Hue Safety Valve:

[beeswax]
  [[ssl]]
  ## SSL communication enabled for this server.
  enabled=true
  ## Path to Certificate Authority certificates.
  cacerts=/etc/hue/cacerts.pem
  ## Path to the private key file.
  key=/etc/hue/key.pem
  ## Path to the public certificate file.
  cert=/etc/hue/cert.pem
  ## Choose whether Hue should validate certificates received from the server.
  validate=false

(validate was false since their certificates used wildcards and this caused other errors)

Note: If not using SSL, you will hit this bug: HUE-2484

2.
The same Hue behavior occurred after making the change, but now the HiveServer2 log showed authentication failure due to err=49

So, we added the following to the Hue Safety Valve:

[desktop]
  ldap_username=
  ldap_password=

3.
Hue still showed the same behavior. HiveServer2 logs showed:

<HUE_LDAP_USERNAME> is not allowed to impersonate bob

We solved this by adding the following to HDFS > Service-Wide > Advanced > Safety Valve for core-site.xml:

<property>
  <name>hadoop.proxyuser.<HUE_LDAP_USERNAME>.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.<HUE_LDAP_USERNAME>.groups</name>
  <value>*</value>
</property>

4.
After this, the default database was displayed, but we could not do a show tables; or anything else. Beeline had the same behavior.

We did a grant for the group to which the user attempting the Hive actions belonged, and then that problem went away.
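The grant was along these lines (role and group names below are hypothetical, and the statements must be run by a user with Sentry admin rights):

CREATE ROLE analyst_role;
GRANT ALL ON DATABASE default TO ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;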

All queries were working and Hue was querying Hive/Impala and returning results!

 

 

 

SSL Encryption between Hue and Hive

 

Hue can communicate with Hive over a channel encrypted with SSL. Let’s take a look at the interface and the handshake mechanism first before trying to secure it.

The basic high-level concept behind the SSL protocol handshake mechanism is shown in the diagram below, where Hue is the SSL client and Hive is the SSL server:

huehive_img1

a. SSL Client (Hue) opens a socket connection and connects to Hive. This is then encapsulated with a wrapper that encrypts and decrypts the data going over the socket with SSL.

b. Once Hive receives an incoming connection, it shows a certificate to Hue (which is like a public key saying it can be trusted).

c. Hue can then verify the authenticity of this certificate with a trusted certificate-issuing authority, or it can be skipped for self-signed certificates.

d. Hue encrypts messages using this public key and sends data to Hive.

e. Hive decrypts the message with its private key.

The public/private keys always come in pairs and are used to encrypt/decrypt messages. These can be generated with the keytool command-line utility, whose output is understood by the Java keystore library, or with the OpenSSL utility, whose output is understood directly by the Python SSL library.

The Hive side uses Java keystore certificates and public/private keys, while Hue’s Python code calls the SSL library implemented in C. Much of the complication arises from not having one uniform format understood by all languages (Python, Java and C). For example, the SSL C library on the client side expects a private key from the SSL server, which is not a requirement in a pure Java SSL client implementation. Using the Java keytool command, you cannot export a private key directly into the PEM format understood by Python; you need an intermediate PKCS12 format.

Let’s step through the procedure to create certificates and keys:

1) Generate keystore.jks containing private key (used by Hive to decrypt messages received from Hue over SSL) and public certificate (used by Hue to encrypt messages over SSL)

keytool -genkeypair -alias certificatekey -keyalg RSA -validity 7 -keystore keystore.jks

2) Generate certificate from keystore

keytool -export -alias certificatekey -keystore keystore.jks -rfc -file cert.pem

3) Export private key and certificate with openSSL for Hue’s SSL library to ingest.

Exporting the private key from a jks file (Java keystore) needs an intermediate PKCS12:

a. Import the keystore from JKS to PKCS12

keytool -importkeystore -srckeystore keystore.jks -destkeystore keystore.p12 -srcstoretype JKS -deststoretype PKCS12 -srcstorepass mysecret -deststorepass mysecret -srcalias certificatekey -destalias certificatekey -srckeypass mykeypass -destkeypass mykeypass -noprompt

b. Convert pkcs12 to pem using OpenSSL

openssl pkcs12 -in keystore.p12 -out keystore.pem -passin pass:mysecret -passout pass:mysecret

c. Strip the pass phrase so Python doesn’t prompt for password while connecting to Hive

openssl rsa -in keystore.pem -out hue_private_keystore.pem

Then the following needs to be set up in Hue’s configuration file hue.ini under the [beeswax] section:

  [[ssl]]
    # SSL communication enabled for this server. (optional since Hue 3.8)
    enabled=true

    # Path to Certificate Authority certificates. (optional)
    ## cacerts=/etc/hue/cacerts.pem
    # Choose whether Hue should validate certificates received from the server.
    validate=false

Then make sure no custom authentication mechanism is turned on and configure your hive-site.xml with the following properties on Hive 0.13:

<property>
 <name>hive.server2.session.check.interval</name>
 <value>3000</value>
 <description>The check interval for session/operation timeout, which can be disabled by setting to zero or negative value.</description>
</property>

<property>
 <name>hive.server2.idle.session.timeout</name>
 <value>0</value>
 <description>Session will be closed when it's not accessed for this duration, which can be disabled by setting to zero or negative value.</description>
</property>

<property>
 <name>hive.server2.idle.operation.timeout</name>
 <value>0</value>
 <description>Operation will be closed when it's not accessed for this duration of time, which can be disabled by setting to zero value. With positive value, it's checked for operations in terminal state only (FINISHED, CANCELED, CLOSED, ERROR). With negative value, it's checked for all of the operations regardless of state.</description>
</property>

 

 

Hive and Impala queries life cycle

 

What happens to the query results? How long are they kept? Why do they disappear sometimes? Why are some Impala queries still “in flight” even though they are completed?

Each query uses some resources in Impala or HiveServer2. When users submit a lot of queries, these are going to add up and crash the servers if nothing is done. A lot of improvements came in Hue 3 and CDH4.6, as Hue started to automatically close all the metadata queries. Here are the latest settings that you can tweak:

Impala

Hue tries to close the query when the user navigates away from the result page (as queries are generally fast, it is ok to close them quickly). However, if the user never comes back to check the result of the query or never closes the page, the query is going to stay.

Starting in Hue 3.7 and C5.2 (with HUE-2251), Impala is going to automatically expire the queries idle for more than 10 minutes with the query_timeout_s property.

[impala]
# If QUERY_TIMEOUT_S > 0, the query will be timed out (i.e. cancelled) if Impala does not do any work (compute or send back results) for that query within QUERY_TIMEOUT_S seconds.
query_timeout_s=600

Before this version, the only workaround to close all the queries is to restart Hue (or Impala).

Note: Impala currently only cancels the query but does not close it. This will be improved in a future version with IMPALA-1575. In the meantime, specify -idle_session_timeout=20 in the Impala flags (“Command Line Argument Advanced Configuration Snippet (Safety Valve)”).

Hive

Hue never closes the Hive queries by default (as some queries can take hours of processing time). Also, if your query volume is low (e.g. fewer than a few hundred a day) and you restart HiveServer2 every week, you are probably not affected. To get the same behavior as Impala (and close the query when the user leaves the page), switch this on in the hue.ini:

[beeswax]
# Hue will try to close the Hive query when the user leaves the editor page.
# This will free all the query resources in HiveServer2, but also make its results inaccessible.
close_queries=true

Starting in CDH5 and CDH4.6 (with HiveServer2), some close_query and close_session commands were added to Hue.

build/env/bin/hue close_queries --help

Usage: build/env/bin/hue close_queries [options] <age_in_days>  (default is 7)

Closes the non-running queries older than 7 days. If <all> is specified, it closes the ones of any type.

To run these while using Cloudera Manager, be sure to export this environment variable (a second one is needed with Kerberos, see the note below):

export HUE_CONF_DIR="/var/run/cloudera-scm-agent/process/`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'`"

./build/env/bin/hue close_queries 0
Closing (all=False) HiveServer2 queries older than 0 days...
1 queries closed.

./build/env/bin/hue close_sessions 0 hive
Closing (all=False) HiveServer2 sessions older than 0 days...
1 sessions closed.

You can then add these commands to a crontab and expire the queries older than N days.

Note

When using Kerberos you also need:

export HIVE_CONF_DIR="/var/run/cloudera-scm-agent/process/`ls -alrt /var/run/cloudera-scm-agent/process | grep HUE | tail -1 | awk '{print $9}'`/hive-conf"

A cleaner solution comes with HIVE-5799 (available in Hive 0.14 or C5.2). Like Impala, HiveServer2 can now automatically expire queries. So tweak hive-site.xml with:

<property>
 <name>hive.server2.session.check.interval</name>
 <value>3000</value>
 <description>The check interval for session/operation timeout, which can be disabled by setting to zero or negative value.</description>
</property>

<property>
 <name>hive.server2.idle.session.timeout</name>
 <value>0</value>
 <description>Session will be closed when it's not accessed for this duration, which can be disabled by setting to zero or negative value.</description>
</property>

<property>
 <name>hive.server2.idle.operation.timeout</name>
 <value>0</value>
 <description>Operation will be closed when it's not accessed for this duration of time, which can be disabled by setting to zero value. With positive value, it's checked for operations in terminal state only (FINISHED, CANCELED, CLOSED, ERROR). With negative value, it's checked for all of the operations regardless of state.</description>
</property>

Note

This is the recommended solution for Hive. Users wishing to keep some results for longer can issue a CREATE TABLE AS SELECT … or export the results in Hue.
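For example, a result set can be persisted with a plain CTAS before it expires (table and column names below are made up):

CREATE TABLE top_countries AS
SELECT country_code, COUNT(*) AS hits
FROM web_logs
GROUP BY country_code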

Sum-up

The query servers are becoming much more stable with these changes as their resources no longer need to grow infinitely. One tradeoff, though, is that the user will lose his query results after a certain time. To make the experience better, several ideas are being explored, like automatically downloading N rows of the result set and keeping them for longer.

As usual feel free to comment and send feedback on the hue-user list or @gethue!

Create Hive tables with headers and load quoted CSV data

Hue makes it easy to create Hive tables.

With HUE-1746, Hue guesses the column names and types (int, string, float…) directly by looking at your data. If your data starts with a header, it will automatically be used and skipped while creating the table.

 

 

Quoted CSV fields are also compatible thanks to HUE-1747.

Here is the data file used:

http://www.fdic.gov/bank/individual/failed/banklist.html

This is the SerDe for reading quoted CSV:

https://github.com/ogrodnek/csv-serde

And the command to switch the SerDe used by the table:

ALTER TABLE banks SET SERDE 'com.bizo.hive.serde.csv.CSVSerde'
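Putting it together, a quick sketch of the full sequence in the Hive editor (the jar path below is just an example and must be readable by HiveServer2):

-- make the SerDe available, switch the table to it, then query
ADD JAR /tmp/csv-serde-1.1.2.jar;
ALTER TABLE banks SET SERDE 'com.bizo.hive.serde.csv.CSVSerde';
SELECT * FROM banks LIMIT 10;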

Now go analyze the data with the Hive, Impala or Pig editors!

Hive Query editor with HiveServer2 and Sentry

Hue provides a Web interface for submitting Hive queries. Hue had its own server to service Hive queries called Beeswax. The more sophisticated and robust service, Apache HiveServer2, is supported as of Hue 2.5.

 

 

Hive Editor

Thanks to the HiveServer2 integration, Hue gets the benefits of Sentry (How to configure Sentry Video). In addition to the security provided, Hue’s interface becomes more consistent. For example, a user without permissions on a database or table won’t see it in the query editor or in the Metastore app.

HiveServer2 also provides performant access to the Metastore.

On top of this, the Beeswax Hive UI is a Web editor for increasing productivity:

  • Syntax highlighting and auto completion
  • Submit several queries and check their progress later
  • UDF integration
  • Multiple queries execution
  • Select and send a fraction of a query
  • Download or save the query results
  • Navigate through the metadata

Hue 3 does not bundle Beeswaxd anymore, and is configured by default to use HiveServer2. If HiveServer2 is not on the same machine as Hue, update hue.ini with:

[beeswax]
hive_server_host=<FQDN of HiveServer2>

Other Hive-specific settings (e.g. security, impersonation) are read from a local /etc/hive/conf/hive-site.xml. We recommend keeping this one in exact sync with the original Hive one (or putting Hue and Hive on the same machine).


Note:

If you are using Hive 0.12 or later, Hue needs to have HUE-1561 (or use Hue 3.0 or later).

With Sentry

Hue will automatically work with a HiveServer2 configured with Sentry.

Notice that HiveServer2 impersonation (described below) should be turned off in the case of Sentry; otherwise the permissions of the impersonated user (e.g. ‘bob’) would be used instead of those of the ‘hue’ user. Also, the warehouse needs to be owned by hive:hive with 770 permissions so that only super users in the hive group can read and write.

HiveServer2 needs to be using strong authentication like Kerberos/LDAP for Sentry to work.

Troubleshooting without Sentry

org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: org.apache.hadoop.security.AccessControlException Permission denied: user=hive, access=WRITE, inode=”/user/test/data”:test:supergroup:drwxr-xr-x

By default HiveServer2 now owns the Hive warehouse (default ‘/user/hive/warehouse’), meaning the data files need to belong to the ‘hive’ user. If you get this error when creating a table, change the permission of the data directory (here /user/test/data) to ‘write’ for everybody or revert HiveServer2 to the old Beeswax behavior by authorizing ‘hive’ to impersonate the user.

Add ‘hive’ as a Hadoop proxy user and edit your hive-site.xml:

<property>
  <name>hive.server2.enable.doAs</name>
  <value>true</value>
</property>

Then restart HiveServer2:

sudo service hive-server2 restart

Another common error when using YARN is:

Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.

It means that the HADOOP_MAPRED_HOME environment variable is not set to:

export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

HADOOP_HOME could also be wrong.

TTransportException(‘Could not start SASL: Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found’,)

Hue is missing a SASL lib in your system.

HiveServer2 supports 3 authentication modes specified by the ‘hive.server2.authentication’ in hive-site.xml:

  • NOSASL
  • NONE (default)
  • KERBEROS

Only NOSASL does not require SASL, so you either switch to it or install the missing packages.

Hue will pick up the value from its local /etc/hive/conf/hive-site.xml, so make sure it is synced with the original hive-site.xml (manually or via the CM Beeswax safety valve).

e.g.

<property>
<name>hive.server2.authentication</name>
<value>NOSASL</value>
</property>

Error while compiling statement: FAILED: RuntimeException org.apache.hadoop.security.AccessControlException: Permission denied: user=admin, access=WRITE, inode=”/tmp/hive-hive”:hive:hdfs:drwxr-xr-x at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:234) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:214) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:158)

The Hive HDFS workspace ‘/tmp/hive-hive’ would need to be set to 1777 permissions.

Troubleshooting with Sentry

AuthorizationException: User ‘hue/test.com’ does not have privileges to execute ‘CREATE’ on: default.sample_08″

The user ‘hue’ is not configured in Sentry and does not have the CREATE table permission.

UDF

Apache Hive comes with a lot of built-in UDFs, but what happens when you need a “special one”? This post is about how to get started with a custom Hive UDF from compilation to execution in no time.

 

Let’s go!

Our goal is to create a UDF that transforms its input to upper case. All the code is available in our public repository of Hadoop examples and tutorials.

If you want to go even faster, the UDF is already precompiled here.

If not, check out the code:

git clone https://github.com/romainr/hadoop-tutorials-examples.git
cd hive-udf

And compile the UDF (Java and Hive need to be installed):

javac -cp $(ls /usr/lib/hive/lib/hive-exec*.jar):/usr/lib/hadoop/hadoop-common.jar org/hue/udf/MyUpper.java

jar -cf myudfs.jar -C . .

 

Or use Maven with our pom.xml, which will automatically pull the dependent jars:

mvn install

 

Register the UDF in the Hive Editor

Then open up Beeswax in the Hadoop UI Hue and click on the ‘Settings’ tab.

In File Resources, upload myudfs.jar, pick the jar file and point to it, e.g.:


/user/hue/myudf.jar

Make the UDF available by registering a UDF (User Defined Function):

Name: myUpper

Class: org.hue.udf.MyUpper

 

That’s it! Just test it on one of the Hue example tables:


select myUpper(description) FROM sample_07 limit 10
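If you prefer doing everything from the editor, the same registration can be sketched in plain HiveQL (the jar path below is an example and must be readable by HiveServer2):

ADD JAR /tmp/myudfs.jar;
CREATE TEMPORARY FUNCTION myUpper AS 'org.hue.udf.MyUpper';
SELECT myUpper(description) FROM sample_07 LIMIT 10;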

Summary

We are using the most common type of UDF. If you want to learn more in depth about the other ones, some great resources like the Hadoop Definitive guide are available. Notice that adding a jar loads it for the entirety of the session so you don’t need to load it again. Next time we will demo how to create a Python UDF for Hive!

 

Have any questions? Feel free to contact us on hue-user or @gethue!

 

Note:

If you did not register the UDF as explained above, you will get this error:


error while compiling statement: failed: parseexception line 1:0 cannot recognize input near 'myupper' '' ''
