Easy indexing of data into Solr with ETL operations


Creating Solr Collections from Data files in a few clicks

There are exciting new features coming in Hue 3.11 and, later this fall, in CDH 5.9. One of them is Hue’s brand new tool for creating Apache Solr Collections from file data. Hue’s Solr dashboards are great for visualizing and learning more about your data, so being able to load data into Solr collections easily is really useful.

In the past, indexing data into Solr has been quite difficult. The task involved writing a Solr schema and a morphlines file, then submitting a job to YARN to do the indexing. Getting this right for non-trivial imports could often take a few days of work. Now, with Hue’s new feature, you can start your YARN indexing job in minutes. This tutorial offers a step-by-step guide on how to do it.

 

 

Tutorial

What you’ll need

First, you’ll need a running Solr cluster that Hue is configured to use.

Next, you’ll need to install these required libraries. To do so, place them in a directory somewhere on HDFS and set config_indexer_libs_path under indexer in the Hue ini to that directory. By default, config_indexer_libs_path is set to /tmp/smart_indexer_lib. Additionally, under indexer in the Hue ini, you’ll need to set enable_new_indexer to true.


[indexer]

# Flag to turn on the morphline based Solr indexer.
enable_new_indexer=true

# HDFS directory that holds the libraries required for indexing.
# Uncomment and change it if you use a location other than the default.
## config_indexer_libs_path=/tmp/smart_indexer_lib

Note:

If you are using Cloudera Manager, check how to add properties to the hue.ini safety valve and put the above settings there.
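
As a quick sanity check of the two settings above, here is a minimal Python sketch that reads a flat [indexer] fragment with the standard library's configparser. This is not how Hue itself loads its configuration, and a full hue.ini uses nested sections that configparser does not handle, so the fragment and the function name `read_indexer_settings` are purely illustrative.

```python
import configparser

# Hypothetical fragment mirroring the [indexer] settings described in this post.
HUE_INI_FRAGMENT = """
[indexer]
# Flag to turn on the morphline based Solr indexer.
enable_new_indexer=true

# Left commented out, so the default location applies.
## config_indexer_libs_path=/tmp/smart_indexer_lib
"""

# Default value stated in this post.
DEFAULT_LIBS_PATH = "/tmp/smart_indexer_lib"


def read_indexer_settings(ini_text):
    """Return (enabled, libs_path) from a flat [indexer] fragment."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    enabled = parser.getboolean("indexer", "enable_new_indexer", fallback=False)
    libs_path = parser.get(
        "indexer", "config_indexer_libs_path", fallback=DEFAULT_LIBS_PATH
    )
    return enabled, libs_path


enabled, libs_path = read_indexer_settings(HUE_INI_FRAGMENT)
print("new indexer enabled:", enabled)
print("libs path:", libs_path)
```

If the flag comes back as False, the Index entry will not be available in Hue, which is the first thing to check when the wizard does not show up.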

Selecting data

We are going to create a new Solr collection from business review data. To start, let’s put the data file somewhere on HDFS so that Hue can access it.
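
If you prefer the command line to the Hue file browser, the upload can be scripted. The sketch below only builds and prints the standard `hdfs dfs -mkdir -p` and `hdfs dfs -put` commands. The file name `reviews.csv` and the target directory are assumptions made for illustration, not paths used by Hue.

```python
import shlex


def hdfs_upload_commands(local_path, hdfs_dir):
    """Build the `hdfs dfs` commands that create a directory and copy a file into it."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", local_path, hdfs_dir],
    ]


# Hypothetical local file and HDFS directory.
commands = hdfs_upload_commands("reviews.csv", "/user/hue/indexer_data")
for command in commands:
    # Printed as a dry run; nothing is executed here.
    print(shlex.join(command))
```

On a machine with an HDFS client, each command list can be passed to `subprocess.run(command, check=True)` to actually perform the upload.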

data-file-indexer

 

Now we can get started! Under the Search tab in the navbar, select Index.

indexer-menu

 

We’ll pick a name for our new collection and select our reviews data file from HDFS. Then we’ll click next.

indexer-wizard

Field selection and ETL

On this tab we can see all the fields the indexer has picked up from the file. Note that Hue has also made an educated guess at each field’s type. Hue generally does a good job of inferring data types, but we should still do a quick check to confirm that the field types look correct.
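
To give a feel for what such an educated guess can look like, here is a minimal Python sketch that infers a field type from sample values. It is not Hue's actual inference logic. The type names, the list of date formats and the function names are assumptions made for illustration.

```python
from datetime import datetime

# Assumed set of "standard" date formats; Hue's real list may differ.
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ", "%m/%d/%Y")


def guess_value_type(value):
    """Guess the narrowest type that can hold a single string value."""
    value = value.strip()
    try:
        int(value)
        return "long"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            continue
    return "string"


def guess_field_type(samples):
    """Combine per-value guesses into one type for the whole field."""
    types = {guess_value_type(sample) for sample in samples}
    if len(types) == 1:
        return types.pop()
    if types == {"long", "double"}:
        return "double"  # mixed integers and decimals still fit in a double
    return "string"  # anything else falls back to the most general type


print(guess_field_type(["1", "2", "3"]))
print(guess_field_type(["43.65", "-79"]))
print(guess_field_type(["2016-08-24", "2016-09-01"]))
print(guess_field_type(["Five Stars", "4"]))
```

The last call shows why a manual check is worthwhile: a single unexpected value is enough to push a field back to string.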

indexer-wizard-fields

 

For our data we’re going to perform 5 operations to make a very searchable Solr Collection.

  1. Convert Date
    This operation is implicit. By setting the field type to date, we tell Hue that we want to convert this field to a Solr Date. Hue can convert most standard date formats automatically. If we had an unusual date format, we would have to define it for Hue explicitly using the convert date operation.
    indexer-op-date
  2. Translate star ratings to integer ratings
    Under the rating field we’ll change the field type from string to long and click add operation. We’ll then select the translate operation and set up the following translation mapping.
    indexer-translate-date
  3. Grok the city from the full address field
    We’ll add a grok operation to the full address field, fill in the regex .* (?<city>\w+),.* and set the number of expected fields to 1. In the new child field we’ll set the name to city. This new field will now contain the value matching the city capture group in the regex.
    indexer-op-grok
  4. Use a split operation to separate the latitude/longitude field into two separate floating point fields
    Here we have an annoyance. Our data file contains the latitude and longitude of the place being reviewed, which is great. For some reason, though, they have been clumped into one field with a comma between the two numbers. We’ll use a split operation to grab each one individually. Set the split value to ‘,’ and the number of output fields to 2. Then change the child fields’ types to double and give them logical names. In this case there is not much sense in keeping the parent field, so let’s uncheck the “keep in index” box.
    indexer-op-split
  5. Perform a GeoIP lookup to find where the user was when they submitted the review
    Here we’ll add a geo ip operation and select iso_code as our output. This will give us the country code.
    indexer-op-geoip
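
To make the effect of these operations concrete, here is a rough Python sketch of what the translate, grok and split steps do to a single record. It is not the Morphlines configuration Hue generates. The field names (`rating`, `full_address`, `lat_long`) and the sample record are hypothetical. Note that Python spells named groups `(?P<city>...)`, whereas the grok regex above uses `(?<city>...)`. The date conversion and GeoIP steps are left out because they depend on the indexer's own date handling and on the MaxMind database.

```python
import re

# Translation mapping for the star ratings.
RATING_TRANSLATION = {
    "Five Stars": 5,
    "Four Stars": 4,
    "Three Stars": 3,
    "Two Stars": 2,
    "One Star": 1,
}

# Same regex as in step 3, written with Python's named-group syntax.
CITY_PATTERN = re.compile(r".* (?P<city>\w+),.*")


def transform_review(record):
    """Apply the translate, grok and split steps to one review record."""
    transformed = {}
    # Translate: replace the star label with its integer rating.
    transformed["rating"] = RATING_TRANSLATION[record["rating"]]
    # Grok: pull the word right before the comma out of the address.
    match = CITY_PATTERN.match(record["full_address"])
    transformed["city"] = match.group("city") if match else None
    # Split: break "lat,long" into two floating point child fields.
    latitude, longitude = record["lat_long"].split(",")
    transformed["latitude"] = float(latitude)
    transformed["longitude"] = float(longitude)
    return transformed


# Hypothetical sample record.
sample = {
    "rating": "Four Stars",
    "full_address": "55 Market St Toronto, ON",
    "lat_long": "43.65,-79.38",
}
print(transform_review(sample))
```

Because the city group is `\w+`, this regex only captures the last word before the comma, so a multi-word city would be cut down to its final word. The preview in the wizard is a good place to spot such cases.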

 

Indexing

Before we index, let’s make sure everything looks good with a quick scan of the preview. This is a handy way to catch typos in field names, types, or operations before the job runs.

Now that we’ve defined our ETL, Hue can do the rest. Click index and wait for Hue to index our data. At the bottom of this screen we can see a progress bar for the process. Yellow means our data is currently being indexed and green means it’s done. Feel free to close this window; the indexing will continue on your cluster.

Once our data has been indexed into a Solr Collection we have access to all of Hue’s search features and can make a nice analytics dashboard like this one for our data.

indexer-dash

 

Documentation

Assembling the lib directory yourself

The indexer libs path is where all the libraries required for indexing should live. If you’d prefer, you can assemble this directory yourself. There are three main components to the libs directory:

1. JAR files required by the MapReduceIndexerTool

  • All required jar files should have shipped with CDH. Currently the list of required JARs is:
    • argparse4j-0.4.3.jar
    • readme.txt
    • httpmime-4.2.5.jar
    • search-mr-1.0.0-cdh5.8.0-job.jar
    • kite-morphlines-core-1.0.0-cdh5.8.0.jar
    • solr-core-4.10.3-cdh5.8.0.jar
    • kite-morphlines-solr-core-1.0.0-cdh5.8.0.jar
    • solr-solrj-4.10.3-cdh5.8.0.jar
    • noggit-0.5.jar
  • Should this list change and you get a missing class error, you can find the missing JAR by searching all the JARs packaged with CDH for that class.
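
One way to do that search is sketched below in Python: since a JAR is a zip archive, we can look for the class entry in every JAR under a directory. The function name `find_jars_containing` is ours, and the directory you search (for example your CDH parcel directory) is an assumption that depends on your installation.

```python
import zipfile
from pathlib import Path


def find_jars_containing(class_name, search_dir):
    """Return the paths of all JARs under search_dir that contain class_name.

    class_name is a fully qualified Java class name, for example
    "net.sourceforge.argparse4j.inf.ArgumentParserException".
    """
    # Inside a JAR, classes are stored as path/to/ClassName.class entries.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar_path in sorted(Path(search_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar_path) as jar:
                if entry in jar.namelist():
                    matches.append(str(jar_path))
        except zipfile.BadZipFile:
            # Skip files that have a .jar extension but are not valid archives.
            continue
    return matches
```

Calling `find_jars_containing("some.missing.ClassName", "/path/to/cdh/jars")` with the class from the error message and your own JAR directory returns the candidates to copy into the libs directory.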

2. MaxMind GeoLite2 database

3. Grok Dictionaries

  • Any grok commands can be defined in text files within the grok_dictionaries subdirectory. A good starter set of grok dictionaries can be found here.

 

Operations

Beyond ease of use, this is where the real power of Hue’s new indexer lies. Operations build heavily on Morphlines and let us easily transform our data into a more searchable format. Before we add some to our fields, let’s quickly go over the operations that the indexer offers.

Operation list:

  • Split
    • With the split operation we can take a field and produce new fields by splitting the original field on a delimiting character
    • Input: “2.1,-3.5,7.1”
      Split Character: “,”
    • Outputs 3 fields:
      Field 1: “2.1”
      Field 2: “-3.5”
      Field 3: “7.1”
  • Grok
    • Grok is an extension of Regex and can be used to match specific subsections of a field and pull them out. You can read more about the Grok syntax here
    • Input: “Winnipeg (Canada)”
      Regular Expression: “\w+ \((?<country>\w+)\)”
    • Outputs 1 field:
      country: “Canada”
  • Convert Date
    • Generally the indexer converts dates automatically to Solr’s native format. However, if you have a very obscure date format you can define it using a SimpleDateFormat here to ensure it is converted correctly
    • Input: “Aug (2016) 24”
      Date Format: “MMM (yyyy) dd”
    • Output: In place replacement: “2016-08-24T00:00:00Z”
  • Extract URI Components
    • Extract URI Components lets you grab specific parts of a URI and put each one in its own field without having to write the Regex yourself.
    • The following components can be extracted:
      • Authority
      • Fragment
      • Host
      • Path
      • Port
      • Query
      • Scheme
      • Scheme Specific Path
      • User Info
    • Input: “https://www.google.com/#q=cloudera+hue”
      Selected: Host
    • Output: “www.google.com”
  • Geo IP
    • Geo IP performs a MaxMind GeoIP lookup to match public IP addresses with a location.
    • The following location information can be extracted with this operation:
      • ISO Code
      • Country Name
      • Subdivision Names
      • Subdivision ISO Code
      • City Name
      • Postal Code
      • Latitude
      • Longitude
    • Input: “74.217.76.101”
      Selected: ISO Code, City Name, Latitude, Longitude
    • Output: “US”, “Palo Alto”, “37.3762”, “-122.1826”
  • Translate
    • Translate looks up the field’s value in a fixed mapping that you define and replaces it, in place, with the mapped value.
    • Input: “Five Stars”
      Mapping:
      “Five Stars” -> “5”
      “Four Stars” -> “4”
      “Three Stars” -> “3”
      “Two Stars” -> “2”
      “One Star” -> “1”
    • Output: In place Replacement: “5”
  • Find and Replace
    • Find and Replace takes a Grok string as the find argument and will replace all matches in the field with the specified replace value in place.
    • Input: “Hello World”
      Find: “(?<word>\b\w+\b)”
      Replace: “${word}!”
    • Output: In place replacement: “Hello! World!”
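
For readers who think in code, several of the operations above can be approximated with Python's standard library. This is only a sketch of the behavior shown in the examples, not the Morphlines commands the indexer runs. The function names are ours, Python's `%b (%Y) %d` stands in for the date pattern, `(?P<name>...)` replaces the grok-style `(?<name>...)`, and `\g<word>` replaces `${word}`. Geo IP is omitted because it needs the MaxMind database.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit


def split_field(value, separator):
    """Split: produce one child value per separated part."""
    return value.split(separator)


def grok_field(value, pattern, group):
    """Grok: return the named capture group, or None when nothing matches."""
    match = re.search(pattern, value)
    return match.group(group) if match else None


def convert_date(value, date_format):
    """Convert Date: parse a custom format and emit Solr's date format.

    Assumes the value is already in UTC and month names are in English.
    """
    return datetime.strptime(value, date_format).strftime("%Y-%m-%dT%H:%M:%SZ")


def extract_host(uri):
    """Extract URI Components: return only the host part."""
    return urlsplit(uri).hostname


def find_and_replace(value, pattern, replacement):
    """Find and Replace: replace every match of the pattern in place."""
    return re.sub(pattern, replacement, value)


print(split_field("2.1,-3.5,7.1", ","))  # ['2.1', '-3.5', '7.1']
print(grok_field("Winnipeg (Canada)", r"\w+ \((?P<country>\w+)\)", "country"))  # Canada
print(convert_date("Aug (2016) 24", "%b (%Y) %d"))  # 2016-08-24T00:00:00Z
print(extract_host("https://www.google.com/#q=cloudera+hue"))  # www.google.com
print(find_and_replace("Hello World", r"(?P<word>\b\w+\b)", r"\g<word>!"))  # Hello! World!
```

Trying a regex or date format on a few sample values like this before starting a job can save a round trip through the wizard.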

 

Supported Input Data

Hue successfully recognized our file as a CSV. The indexer currently supports the following file types:

  •  CSV Files
  •  Hue Log Files
  •  Combined Apache Log Files
  •  Ruby Log Files
  •  Syslog

Beyond files, metastore tables and Hive SQL queries are also supported. Read more about these in an upcoming 3.11 blog post.

 

Troubleshooting

During the indexing process, records can be dropped if they fail to match the Solr schema (e.g., when trying to place a string into a long field). If some of your records are missing and you are unsure why, check the mapper log of the indexing job to get a better idea of what is going on.
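
It can also help to look for such records before indexing. The Python sketch below scans CSV text and reports values that would not fit the chosen field types. It is an illustration only and does not reproduce Solr's own validation. The schema in `FIELD_PARSERS`, the column names and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical schema: field name -> Python type standing in for the Solr type.
FIELD_PARSERS = {"rating": int, "latitude": float, "longitude": float}

# Hypothetical sample data with two problematic records.
SAMPLE_CSV = """rating,latitude,longitude
5,43.65,-79.38
Five Stars,43.70,-79.40
4,not-a-number,-79.41
"""


def find_bad_records(csv_text, field_parsers):
    """Return (line_number, field, value) for every value that fails to parse."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data starts on line 2 because line 1 is the header
    # (this assumes no quoted values spanning several lines).
    for line_number, row in enumerate(reader, start=2):
        for field, parser in field_parsers.items():
            try:
                parser(row[field])
            except (KeyError, TypeError, ValueError):
                problems.append((line_number, field, row.get(field)))
    return problems


for line_number, field, value in find_bad_records(SAMPLE_CSV, FIELD_PARSERS):
    print(f"line {line_number}: {field}={value!r} does not match the expected type")
```

Here the untranslated “Five Stars” value shows up immediately, which is exactly the kind of record that would otherwise silently disappear from the collection.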

 

14 Comments

  1. Victor 7 months ago

    Hello there,

    This tutorial only works for the 3.11 version? Any possibility to use in 3.9?

    Thanks.

    Victor

    • Author
      Hue Team 7 months ago

      Yes, only 3.11, as the feature is new in that release.

  2. yujin 4 months ago

    hello
    I follow the instructions ,edit the hue.ini and put libs file on the hdfs

    hue.ini
    [indexer]
    config_indexer_libs_path=”/tmp/smart_indexer_lib”
    enable_new_indexer=true

    hdfs also have the file

    +index page alredy exist ,but where i choose a file , the Add Operation can’t work , submit the job ,there alredy correct bulid the collection

    net.sourceforge.argparse4j.inf.ArgumentParserException: java.lang.IllegalArgumentException: Cannot find collection ‘etl5’ in ZooKeeper: 8.5.185.5:24002,8.5.185.2:24002,8.5.185.1:24002/solr
    org.apache.oozie.action.hadoop.JavaMainException: net.sourceforge.argparse4j.inf.ArgumentParserException: java.lang.IllegalArgumentException: Cannot find collection ‘etl5’ in ZooKeeper: 8.5.185.5:24002,8.5.185.2:24002,8.5.185.1:24002/solr
    at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:59)
    at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)
    at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:35)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:238)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:180)

    • Author
      Hue Team 4 months ago

      Which Solr are you using? The indexer was only tested with CDH, as it requires Morphlines.

      • yujin 4 months ago

        hello.

        jars libs are Solr 4.10.3 -cdh but the real Solr is 6.2.0(Installed on the cluster)

        the other Operation just follow the instructions ,edit the hue.ini and put libs file on the hdfs

        set the index
        enable_new_indexer=true
        ” + Add Operation” can’t work, it clicks but no response

        there also have The following error
        org.apache.oozie.action.hadoop.JavaMainException: net.sourceforge.argparse4j.inf.ArgumentParserException: java.lang.IllegalArgumentException: Cannot find collection ‘etl5’ in ZooKeeper:
        at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:59)
        at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)
        at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:35)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

        • Author
          Hue Team 4 months ago

          So far the indexer has been tested only with Solr 4 (it belongs to Cloudera Search, which is compiled against Solr 4), but we are going to port it to Solr 6 soon.

  3. Ashish Tyagi 2 months ago

    Hi Hue Team,

    I am running hue 3.11 with solr 4.10.2 and external zookeeper. I followed exact steps mentioned in this tutorial but when proceed to the 3rd step(Index it). I got below error. However, I installed the sample solr application like twitter_demo, yelp demo are working fine. Please help !!!

    [20/Jan/2017 13:01:29 -0800] conf ERROR failed to get zookeeper ensemble
    Traceback (most recent call last):
    File “/usr/local/hue/desktop/libs/indexer/src/indexer/conf.py”, line 48, in zkensemble
    clusters = CLUSTERS.get()
    AttributeError: ‘UnspecifiedConfigSection’ object has no attribute ‘get’

    [20/Jan/2017 13:01:31 -0800] api ERROR Could not create collection. Check response:
    {
    “failure”: {
    “”: “org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore ‘test_online_rev_shard1_replica1’: Unable to create core [test_online_rev_shard1_replica1] Caused by: Specified config does not exist in ZooKeeper:test_online_rev”
    },
    “responseHeader”: {
    “status”: 0,
    “QTime”: 1158
    }
    }

  4. Piotr 4 weeks ago

    Hello Hue Team!
    I want to try new indexer using Hue 3.11 on top of CDH-5.10(parcels).
    The thing is that CDH after each hue’s restart generates hue.ini from a template and I can’t enable ‘enable_new_indexer=true’.
    Could you guide me where is the source template located/how is it called?

    I see only this:
    $ sudo find /var/run/cloudera-scm-agent -name ‘hue.ini’
    /var/run/cloudera-scm-agent/process/116-hue-HUE_SERVER/hue.ini
    /var/run/cloudera-scm-agent/process/115-hue-HUE_SERVER/hue.ini
    /var/run/cloudera-scm-agent/process/114-hue-HUE_SERVER/hue.ini

    $ sudo find /opt/cloudera/ -name ‘hue.ini’
    /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/etc/hue/conf.empty/hue.ini

    $ grep enable_new_indexer /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/etc/hue/conf.empty/hue.ini
    enable_new_indexer=true

    Thanks!

    • Author
      Hue Team 3 weeks ago

      Added a note about how to configure with CM!
