Easy indexing of data into Solr with ETL operations

Creating Solr Collections from Data files in a few clicks

There are exciting new features coming in Hue 3.11 this week and later in CDH 5.9 this fall. One of them is Hue’s brand new tool to create Apache Solr Collections from file data. Hue’s Solr dashboards are great for visualizing and learning more about your data, so being able to easily load data into Solr collections can be really useful.

In the past, indexing data into Solr has been quite difficult. The task involved writing a Solr schema and a morphlines file, then submitting a job to YARN to do the indexing. Often, getting this correct for non-trivial imports could take a few days of work. Now with Hue’s new feature you can start your YARN indexing job in minutes. This tutorial offers a step-by-step guide on how to do it.




What you’ll need

First you’ll need to have a running Solr cluster that Hue is configured with.

Next you’ll need to install these required libraries. To do so, place them in a directory somewhere on HDFS and set config_indexer_libs_path under indexer in the Hue ini to match. By default, config_indexer_libs_path is set to /tmp/smart_indexer_lib. Additionally, under indexer in the Hue ini you’ll need to set enable_new_indexer to true.


[indexer]

# Flag to turn on the morphline based Solr indexer.
enable_new_indexer=true

# Oozie workspace template for indexing.
## config_indexer_libs_path=/tmp/smart_indexer_lib


If using Cloudera Manager, check how to add properties in the hue.ini safety valve and put the above settings there.

Selecting data

We are going to create a new Solr collection from business review data. To start, let’s put the data file somewhere on HDFS so we can access it.



Now we can get started! Under the search tab in the navbar select Index.



We’ll pick a name for our new collection and select our reviews data file from HDFS. Then we’ll click next.


Field selection and ETL

On this tab we can see all the fields the indexer has picked up from the file. Note that Hue has also made an educated guess on the field type. Generally, Hue does a good job inferring data type. However, we should do a quick check to confirm that the field types look correct.



For our data we’re going to perform 5 operations to make a very searchable Solr Collection.

  1. Convert Date
    This operation is implicit. By setting the field type to date we inform Hue that we want to convert this date to a Solr Date. Hue can convert most standard date formats automatically. If we had a unique date format we would have to define it for Hue by explicitly using the convert date operation.
  2. Translate star ratings to integer ratings
    Under the rating field we’ll change the field type from string to long and click add operation. We’ll then select the translate operation and set up a translation mapping from each star rating to its number (for example, “Five Stars” to “5”).
  3. Grok the city from the full address field
    We’ll add a grok operation to the full address field, fill in the regex .* (?<city>\w+),.* and set the number of expected fields to 1. In the new child field we’ll set the name to city. This new field will now contain the value matching the city capture group in the regex.
  4. Split the latitude/longitude field into two separate floating point fields
    Here we have an annoyance. Our data file contains the latitude and longitude of the place that’s being reviewed. Awesome! For some reason, though, they’ve been clumped into one field with a comma between the two numbers. We’ll use a split operation to grab each individually. Set the split value to ‘,’ and the number of output fields to 2. Then change the child fields’ types to double and give them logical names. In this case there’s not really much sense in keeping the parent field, so let’s uncheck the “keep in index” box.
  5. Perform a GeoIP lookup to find where the user was when they submitted the review
    Here we’ll add a geo ip operation and select iso_code as our output. This will give us the country code.
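
To make several of the operations above concrete, here is a minimal Python sketch of what they do to a single record. It is only an illustration, not how Hue or Morphlines implement them: the record, its field names and its values are hypothetical. Note that Python spells named groups as (?P<city>...) rather than the (?<city>...) form used by grok. The GeoIP lookup is left out because it needs the Maxmind database.

```python
import re

# Hypothetical review record; field names and values are made up for illustration.
record = {
    "rating": "Four Stars",
    "full_address": "123 Main St Springfield, IL 62701",
    "lat_long": "39.7817,-89.6501",
}

# Translate: replace hard-coded star ratings with integers.
STAR_RATINGS = {
    "Five Stars": 5,
    "Four Stars": 4,
    "Three Stars": 3,
    "Two Stars": 2,
    "One Star": 1,
}
record["rating"] = STAR_RATINGS[record["rating"]]

# Grok: capture the word immediately before the comma as the city.
match = re.fullmatch(r".* (?P<city>\w+),.*", record["full_address"])
if match:
    record["city"] = match.group("city")

# Split: break the combined field on "," into two doubles and drop the parent,
# which mirrors unchecking "keep in index".
latitude, longitude = record.pop("lat_long").split(",")
record["latitude"] = float(latitude)
record["longitude"] = float(longitude)

print(record)
```

Running a sketch like this against a handful of real rows can be a quick way to check a regex or mapping before filling in the form.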



Before we index, let’s make sure everything looks good with a quick scan of the preview. This can be handy to avoid any silly typos or anything like that.

Now that we’ve defined our ETL, Hue can do the rest. Click index and wait for Hue to index our data. At the bottom of this screen we can see a progress bar of the process. Yellow means our data is currently being indexed and green means it’s done. Feel free to close this window. The indexing will continue on your cluster.

Once our data has been indexed into a Solr Collection we have access to all of Hue’s search features and can make a nice analytics dashboard like this one for our data.




Assembling the lib directory yourself

The indexer libs path is where all required libraries for indexing should be. If you’d prefer, you can assemble this directory yourself. There are three main components to the libs directory:

1. JAR files required by the MapReduceIndexerTool

  • All required jar files should have shipped with CDH. Currently the list of required JARs is:
    • argparse4j-0.4.3.jar
    • readme.txt
    • httpmime-4.2.5.jar
    • search-mr-1.0.0-cdh5.8.0-job.jar
    • kite-morphlines-core-1.0.0-cdh5.8.0.jar
    • solr-core-4.10.3-cdh5.8.0.jar
    • kite-morphlines-solr-core-1.0.0-cdh5.8.0.jar
    • solr-solrj-4.10.3-cdh5.8.0.jar
    • noggit-0.5.jar
  • Should this change and you get a missing class error, you can find whatever jar may be missing by grepping all the jars packaged with CDH for the missing class.

2. Maxmind GeoLite2 database

3. Grok Dictionaries

  • Any grok commands can be defined in text files within the grok_dictionaries sub directory. A good starter set of grok dictionaries can be found here.
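
Going back to the missing-class search described under the JAR files item: one way to do it is with a short script. The sketch below is an assumption about how you might go about it, not a tool that ships with Hue or CDH. It walks a directory of jars and reports which ones contain a given class. The function name find_jars_with_class is hypothetical, and so is any directory you pass to it.

```python
import os
import zipfile


def find_jars_with_class(jar_dir, class_name):
    """Return paths of jars under jar_dir that contain class_name.

    class_name is a fully qualified name such as
    "org.apache.solr.hadoop.morphline.MorphlineMapper".
    """
    # Inside a jar, a class is stored as a path with slashes and a .class suffix.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for root, _dirs, files in os.walk(jar_dir):
        for name in files:
            if not name.endswith(".jar"):
                continue
            path = os.path.join(root, name)
            try:
                with zipfile.ZipFile(path) as jar:
                    if entry in jar.namelist():
                        matches.append(path)
            except (zipfile.BadZipFile, OSError):
                # Skip unreadable or corrupt archives and broken symlinks.
                continue
    return sorted(matches)
```

Call it with the directory where your CDH jars live and the class named in the error message, then copy the reported jar into the libs directory on HDFS.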



On top of the ease of use, the real power of Hue’s new indexer lies in its operations. Heavily leveraging Morphlines, operations let us easily transform our data into a more searchable format. Let’s quickly go over the operations that the indexer offers.

Operation list:

  • Split
    • With the split operation we can take a field and produce new fields by splitting the original field on a delimiting character
    • Input: “2.1,-3.5,7.1”
      Split Character: “,”
    • Outputs 3 fields:
      Field 1: “2.1”
      Field 2: “-3.5”
      Field 3: “7.1”
  • Grok
    • Grok is an extension of Regex and can be used to match specific subsections of a field and pull them out. You can read more about the Grok syntax here
    • Input: “Winnipeg (Canada)”
      Regular Expression: “\w+ \((?<country>\w+)\)”
    • Outputs 1 field:
      country: “Canada”
  • Convert Date
    • Generally the indexer converts dates automatically to Solr’s native format. However, if you have a very obscure date format, you can define it using a SimpleDateFormat here to ensure it is converted correctly
    • Input: “Aug (2016) 24”
      Date Format: “MMM (yyyy) dd”
    • Output: In place replacement: “2016-08-24T00:00:00Z”
  • Extract URI Components
    • Extract URI Components lets you grab specific parts of a URI and put it in its own field without having to write the Regex yourself.
    • The following components can be extracted:
      • Authority
      • Fragment
      • Host
      • Path
      • Port
      • Query
      • Scheme
      • Scheme Specific Path
      • User Info
    • Input: “https://www.google.com/#q=cloudera+hue”
      Selected: Host
    • Output: “www.google.com”
  • Geo IP
    • Geo IP performs a Maxmind GeoIP lookup to match public IP addresses with a location.
    • The following location information can be extracted with this operation:
      • ISO Code
      • Country Name
      • Subdivision Names
      • Subdivision ISO Code
      • City Name
      • Postal Code
      • Latitude
      • Longitude
    • Input: “”
      Selected: ISO Code, City Name, Latitude, Longitude
    • Output: “US”, “Palo Alto”, “37.3762”, “-122.1826”
  • Translate
    • Translate will take given hard coded values and replace them with set values in place.
    • Input: “Five Stars”
      Mapping:
      “Five Stars” -> “5”
      “Four Stars” -> “4”
      “Three Stars” -> “3”
      “Two Stars” -> “2”
      “One Star” -> “1”
    • Output: In place Replacement: “5”
  • Find and Replace
    • Find and Replace takes a Grok string as the find argument and will replace all matches in the field with the specified replace value in place.
    • Input: “Hello World”
      Find: “(?<word>\b\w+\b)”
      Replace: “${word}!”
    • Output: In place replacement: “Hello! World!”
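
For readers who think in code, here is a rough Python sketch of three of the operations above (Convert Date, Extract URI Components, and Find and Replace) using the same inputs. These are standard-library stand-ins, not the Morphlines commands Hue generates. The Python format string %b (%Y) %d is an assumed equivalent of the date format shown above, and it relies on English month abbreviations.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit

# Convert Date: parse an unusual format and emit the ISO 8601 form Solr uses.
parsed = datetime.strptime("Aug (2016) 24", "%b (%Y) %d")
solr_date = parsed.strftime("%Y-%m-%dT%H:%M:%SZ")

# Extract URI Components: pick individual parts of a URI.
parts = urlsplit("https://www.google.com/#q=cloudera+hue")
host = parts.hostname
scheme = parts.scheme
fragment = parts.fragment

# Find and Replace: append "!" to every word via a named capture group.
# Python writes the group as (?P<word>...) and the reference as \g<word>.
shouted = re.sub(r"(?P<word>\b\w+\b)", r"\g<word>!", "Hello World")

print(solr_date, host, scheme, fragment, shouted)
```

The syntax differs from grok and SimpleDateFormat, but trying an expression in a quick script like this is a cheap way to check it before starting an indexing job.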


Supported Input Data

Hue successfully recognized our file as a CSV. The indexer currently supports the following file types:

  • CSV Files
  • Hue Log Files
  • Combined Apache Log Files
  • Ruby Log Files
  • Syslog

Beyond files, metastore tables and Hive SQL queries are also supported. Read more about these in an upcoming 3.11 blog post.



During the indexing process, records can be dropped if they fail to match the Solr schema (e.g., trying to place a string into a long field). If some of your records are missing and you are unsure why, you can always check the mapper log for the indexing job to get a better idea of what’s going on.
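
Before submitting a large job, it can help to check locally which rows would fail a type conversion. The sketch below is a hypothetical pre-check, not part of Hue: the function find_bad_rows, the sample CSV and the field-to-type mapping are all assumptions made for illustration, and Solr’s own validation may be stricter or looser than Python’s int and float.

```python
import csv
import io

# Hypothetical mapping from field name to a Python parser standing in
# for the Solr field type (long -> int, double -> float, string -> str).
FIELD_PARSERS = {"name": str, "rating": int, "latitude": float}


def find_bad_rows(csv_text, parsers):
    """Return (line_number, field, value) for every value that fails to parse."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        for field, parse in parsers.items():
            value = row.get(field)
            try:
                parse(value)
            except (TypeError, ValueError):
                # Record the CSV line number so the row is easy to find.
                problems.append((reader.line_num, field, value))
    return problems


sample = "name,rating,latitude\nCafe A,5,37.4\nCafe B,Five Stars,37.5\n"
print(find_bad_rows(sample, FIELD_PARSERS))
```

Here the second data row is flagged because its rating is still a string, which is the kind of mismatch a translate operation is meant to fix.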



  1. Victor 3 years ago

    Hello there,

    This tutorial only works for the 3.11 version? Any possibility to use in 3.9?



    • Author
      Hue Team 3 years ago

      Yes, only 3.11 as it is new there

      • Victor 3 years ago

        This version is already available for download or update my hue version?

        I’m looking in Download Area but I couldn’t find it.

        Thanks for reply Team 🙂

  2. yujin 2 years ago

    I follow the instructions ,edit the hue.ini and put libs file on the hdfs


    hdfs also have the file

    +index page alredy exist ,but where i choose a file , the Add Operation can’t work , submit the job ,there alredy correct bulid the collection

    net.sourceforge.argparse4j.inf.ArgumentParserException: java.lang.IllegalArgumentException: Cannot find collection ‘etl5’ in ZooKeeper:,,
    org.apache.oozie.action.hadoop.JavaMainException: net.sourceforge.argparse4j.inf.ArgumentParserException: java.lang.IllegalArgumentException: Cannot find collection ‘etl5’ in ZooKeeper:,,
    at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:59)
    at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)
    at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:35)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:238)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:180)

    • Author
      Hue Team 2 years ago

      Which Solr are you using? The indexer was only tested with CDH as it requires Morphline.

      • yujin 2 years ago


        jars libs are Solr 4.10.3 -cdh but the real Solr is 6.2.0(Installed on the cluster)

        the other Operation just follow the instructions ,edit the hue.ini and put libs file on the hdfs

        set the index
        ” + Add Operation” can’t work, it clicks but no response

        there also have The following error
        org.apache.oozie.action.hadoop.JavaMainException: net.sourceforge.argparse4j.inf.ArgumentParserException: java.lang.IllegalArgumentException: Cannot find collection ‘etl5’ in ZooKeeper:
        at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:59)
        at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)
        at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:35)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

        • Author
          Hue Team 2 years ago

          Right now the indexer was tested only with Solr 4 (and it belongs to Cloudera Search which is compiled with Solr4), but we are going to port it to Solr 6 soon.

  3. Ashish Tyagi 2 years ago

    Hi Hue Team,

    I am running hue 3.11 with solr 4.10.2 and external zookeeper. I followed exact steps mentioned in this tutorial but when proceed to the 3rd step(Index it). I got below error. However, I installed the sample solr application like twitter_demo, yelp demo are working fine. Please help !!!

    [20/Jan/2017 13:01:29 -0800] conf ERROR failed to get zookeeper ensemble
    Traceback (most recent call last):
    File “/usr/local/hue/desktop/libs/indexer/src/indexer/conf.py”, line 48, in zkensemble
    clusters = CLUSTERS.get()
    AttributeError: ‘UnspecifiedConfigSection’ object has no attribute ‘get’

    [20/Jan/2017 13:01:31 -0800] api ERROR Could not create collection. Check response:
    “failure”: {
    “”: “org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore ‘test_online_rev_shard1_replica1’: Unable to create core [test_online_rev_shard1_replica1] Caused by: Specified config does not exist in ZooKeeper:test_online_rev”
    “responseHeader”: {
    “status”: 0,
    “QTime”: 1158

  4. Piotr 2 years ago

    Hello Hue Team!
    I want to try new indexer using Hue 3.11 on top of CDH-5.10(parcels).
    The thing is that CDH after each hue’s restart generates hue.ini from a template and I can’t enable ‘enable_new_indexer=true’.
    Could you guide me where is the source template located/how is it called?

    I see only this:
    $ sudo find /var/run/cloudera-scm-agent -name ‘hue.ini’

    $ sudo find /opt/cloudera/ -name ‘hue.ini’

    $ grep enable_new_indexer /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/etc/hue/conf.empty/hue.ini


    • Author
      Hue Team 2 years ago

      Added a note about how to configure with CM!

  5. Piotr 2 years ago

    Hi Hue Team!
    I’m playing around with Solr Search in Hue 3.11 on top of CDH-5.10(parcels).
    AFAIK I can’t create new field during indexing so I did a custom Solr query to calculate required values.
    My question: I see “query definitions” button in right upper corner in hue search dashboard – can I use it to add my custom Solr’s query? If yes – how?
    If no – is there a way to make my calculated values available during dashboard’s creation?
    The query I use:


    • Author
      Hue Team 2 years ago

      The query definition is to persist the state of the dashboard (which widgets you clicked, which query string you entered)

  6. Giridhar 2 years ago

    Hi , Is it possible to dowload/export the Solr query results from HUE ?

  7. Giridhar 2 years ago

    The download will give 1000 rows by default , what about getting the some rows in between as in Solr query

    • Author
      Hue Team 2 years ago

      Like 100 rows starting from row 10 000? What is the use case to about not needing the top rows?

  8. thomas 2 years ago

    In the Fields selection, why I do not have add opertation?

  9. JAC 2 years ago


    How could I make this as a real-time dash?

  10. chetan 2 years ago

    Hi All,

    Got below error when tying to index parquet table, could you please help on this.

    “Error: org.kitesdk.morphline.api.MorphlineRuntimeException: java.lang.IllegalArgumentException: INT96 not yet implemented. at org.kitesdk.morphline.base.FaultTolerance.handleException(FaultTolerance.java:73) at org.apache.solr.hadoop.morphline.MorphlineMapRunner.map(MorphlineMapRunner.java:220) at org.apache.solr.hadoop.morphline.MorphlineMapper.map(MorphlineMapper.java:86) at org.apache.solr.hadoop.morphline.MorphlineMapper.map(MorphlineMapper.java:54) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by: java.lang.IllegalArgumentException: INT96 not yet implemented. 
at parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:252) at parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:237) at parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:222) at parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:236) at parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:216) at parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:210) at parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:123) at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:178) at parquet.hadoop.ParquetReader.initReader(ParquetReader.java:152) at parquet.hadoop.ParquetReader.read(ParquetReader.java:128) at org.kitesdk.morphline.hadoop.parquet.avro.ReadAvroParquetFileBuilder$ReadAvroParquetFile.doProcess(ReadAvroParquetFileBuilder.java:172) at org.kitesdk.morphline.base.AbstractCommand.process(AbstractCommand.java:161) at org.kitesdk.morphline.base.Connector.process(Connector.java:64) at org.kitesdk.morphline.base.AbstractCommand.doProcess(AbstractCommand.java:186) at org.kitesdk.morphline.base.AbstractCommand.process(AbstractCommand.java:161) at org.kitesdk.morphline.base.AbstractCommand.doProcess(AbstractCommand.java:186) at org.kitesdk.morphline.base.AbstractCommand.process(AbstractCommand.java:161) at org.apache.solr.hadoop.morphline.MorphlineMapRunner.map(MorphlineMapRunner.java:208)”

  11. KAUSHIK 1 year ago

    Is the Cluster kerberized?

  12. Got below error while Indexing data in Kerberized Cluster. Could you please help on this
    ERROR org.apache.solr.hadoop.GoLive – Error sending live merge command HTTP Status 401 – Authentication required

    It worked when we tried to index data in non-Kerberized cluster.

    • Kedar baratam 1 year ago

      Even i got the same error as kaushik. Any solution when using kerberized Cluster.

  13. vira patel 9 months ago

    Is it possible to create index in all csv files contained in a folder? How can I do this using hue and solr?

    • Author
      Hue Team 9 months ago

      It is supported, pointing to the folder instead of an individual file should work.

  14. virenadar singh 3 months ago

    Hue 4.2, while create index everything works fine, but on 4.3 instead of index its creating Table. … ( using solr 7.3)

    • Author
      Hue Team 2 months ago

      We will check: and are you talking about the morphline indexer or the basic small file indexing?
