Creating Solr Collections from Data files in a few clicks
There are exciting new features coming in Hue 3.11 this week, and later in CDH 5.9 this fall. One of them is Hue’s brand-new tool to create Apache Solr collections from file data. Hue’s Solr dashboards are great for visualizing and learning more about your data, so being able to easily load data into Solr collections can be really useful.
In the past, indexing data into Solr has been quite difficult. The task involved writing a Solr schema and a morphlines file, then submitting a job to YARN to do the indexing. Getting this right for non-trivial imports could often take a few days of work. Now, with Hue’s new feature, you can start your YARN indexing job in minutes. This tutorial offers a step-by-step guide on how to do it.
Tutorial
What you’ll need
First you’ll need a running Solr cluster that Hue is configured to use.
Next you’ll need to install these required libraries. To do so, place them in a directory somewhere on HDFS and set config_indexer_libs_path under [indexer] in the Hue ini to match. By default, config_indexer_libs_path is set to /tmp/smart_indexer_lib. Additionally, under [indexer] you’ll need to set enable_new_indexer to true:
[indexer]
  # Flag to turn on the morphline based Solr indexer.
  enable_new_indexer=true
  # Path on HDFS to the libraries required by the indexer.
  config_indexer_libs_path=/tmp/smart_indexer_lib
Note:
If using Cloudera Manager, check how to add properties in the hue.ini safety valve and put the above configuration there.
Selecting data
We are going to create a new Solr collection from business review data. To start, let’s put the data file somewhere on HDFS so we can access it (e.g. with hadoop fs -put).
Now we can get started! Under the search tab in the navbar, select Index.
We’ll pick a name for our new collection and select our reviews data file from HDFS. Then we’ll click next.
Field selection and ETL
On this tab we can see all the fields the indexer has picked up from the file. Note that Hue has also made an educated guess about each field’s type. Generally, Hue does a good job inferring data types; however, we should do a quick check to confirm that the field types look correct.
For our data, we’re going to perform four operations to make a very searchable Solr collection; a rough code sketch of these transformations follows the list.
- Convert Date
This operation is implicit: by setting the field type to date, we inform Hue that we want to convert this field to a Solr date. Hue can convert most standard date formats automatically. If we had an unusual date format, we would have to define it explicitly using the Convert Date operation.
- Translate star ratings to integer ratings
Under the rating field we’ll change the field type from string to long and click Add Operation. We’ll then select the Translate operation and set up a mapping from the star-rating strings to numbers (the full mapping is shown under the Translate operation below).
- Grok the city from the full address field
We’ll add a Grok operation to the full address field, fill in the regex .* (?<city>\w+),.* and set the number of expected fields to 1. In the new child field we’ll set the name to city. This new field will now contain the value matching the city capture group in the regex.
- Use a split operation to separate the latitude/longitude field into two separate floating point fields.
Here we have an annoyance. Our data file contains the latitude and longitude of the place being reviewed – Awesome! For some reason, though, they’ve been clumped into one field with a comma between the two numbers. We’ll use a Split operation to grab each individually. Set the split character to ‘,’ and the number of output fields to 2. Then change the child fields’ types to double and give them logical names. In this case there’s not much sense in keeping the parent field, so let’s uncheck the “keep in index” box.
- Perform a GeoIP lookup to find where the user was when they submitted the review
Here we’ll add a Geo IP operation and select iso_code as our output. This will give us the country code.
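To make the ETL concrete, here is a rough Python sketch of what these operations do to a single record. The field names and sample values below are made up, and the real work is done by the Morphline job on the cluster, not by code like this:

import re

# A hypothetical review record, roughly as the indexer sees it.
record = {
    "rating": "Five Stars",
    "full_address": "123 Main St, Winnipeg, MB",
    "latlong": "37.3762,-122.1826",
}

# Translate: map the star-rating strings to integers.
stars = {"One Star": 1, "Two Stars": 2, "Three Stars": 3,
         "Four Stars": 4, "Five Stars": 5}
record["rating"] = stars[record["rating"]]

# Grok: pull the city out with a named capture group. Python spells
# the group (?P<city>...) where Grok uses (?<city>...).
match = re.match(r".* (?P<city>\w+),.*", record["full_address"])
if match:
    record["city"] = match.group("city")

# Split: break the "lat,long" field into two double fields, dropping
# the parent field (the "keep in index" box unchecked).
lat, lon = record.pop("latlong").split(",")
record["latitude"], record["longitude"] = float(lat), float(lon)

# Geo IP would additionally look up the reviewer's IP against the
# MaxMind GeoLite2 database to produce a country code field.
print(record)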
Indexing
Before we index, let’s make sure everything looks good with a quick scan of the preview. This can be handy to avoid any silly typos or anything like that.
Now that we’ve defined our ETL, Hue can do the rest. Click Index and wait for Hue to index our data. At the bottom of this screen we can see a progress bar: yellow means our data is currently being indexed, and green means it’s done. Feel free to close this window; the indexing will continue on your cluster.
Once our data has been indexed into a Solr Collection we have access to all of Hue’s search features and can make a nice analytics dashboard like this one for our data.
Documentation
Assembling the lib directory yourself
The indexer libs path is where all of the libraries required for indexing should live. If you’d prefer, you can assemble this directory yourself. There are three main components to the libs directory:
1. JAR files required by the MapReduceIndexerTool
- All required jar files should have shipped with CDH. Currently the list of required JARs is:
- argparse4j-0.4.3.jar
- readme.txt
- httpmime-4.2.5.jar
- search-mr-1.0.0-cdh5.8.0-job.jar
- kite-morphlines-core-1.0.0-cdh5.8.0.jar
- solr-core-4.10.3-cdh5.8.0.jar
- kite-morphlines-solr-core-1.0.0-cdh5.8.0.jar
- solr-solrj-4.10.3-cdh5.8.0.jar
- noggit-0.5.jar
- Should this change and you get a missing class error, you can find the jar that is missing by scanning all the jars packaged with CDH for the missing class (see the sketch after this list).
2. Maxmind GeoLite2 database
- This file is required for the GeoIP lookup command and can be found on MaxMind’s website.
3. Grok Dictionaries
- Any grok commands can be defined in text files within the grok_dictionaries sub directory. A good starter set of grok dictionaries can be found here.
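If you do hit a missing class error, one way to find the jar that provides the class is to scan the jars shipped with CDH. A minimal Python sketch, assuming a parcel install (the path and class name below are placeholders to adjust to your cluster and error message):

import glob
import zipfile

# Hypothetical inputs: the class from your "missing class" error (in
# path form) and a typical CDH parcel jar directory; adjust both.
missing = "org/kitesdk/morphline/api/Command.class"
jar_dir = "/opt/cloudera/parcels/CDH/jars"

for jar in sorted(glob.glob(jar_dir + "/*.jar")):
    try:
        with zipfile.ZipFile(jar) as zf:
            if missing in zf.namelist():
                print(jar)
    except zipfile.BadZipFile:
        pass  # skip anything that isn't a readable jar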
Operations
On top of the ease of use, this is where the real power of Hue’s new indexer lies. Heavily leveraging Morphlines, operations let us easily transform our data into a more searchable format. Before we add some to our fields, let’s quickly go over the operations that the indexer offers.
Operation list:
- Split
- With the split operation we can take a field and produce new fields by splitting the original field on a delimiting character
- Input: “2.1,-3.5,7.1”
  Split Character: “,”
- Outputs 3 fields:
  Field 1: “2.1”
  Field 2: “-3.5”
  Field 3: “7.1”
- Grok
- Grok is an extension of Regex and can be used to match specific subsections of a field and pull them out. You can read more about the Grok syntax here
- Input: “Winnipeg (Canada)”
  Regular Expression: “\w+ \((?<country>\w+)\)”
- Outputs 1 field:
  country: “Canada”
- Convert Date
- Generally the indexer converts dates automatically to Solr’s native format. However, if you have an obscure date format, you can define it with a SimpleDateFormat pattern here to ensure it is converted correctly.
- Input: “Aug (2016) 24”
  Date Format: “MMM (yyyy) dd”
- Output: in-place replacement: “2016-08-24T00:00:00Z”
- Extract URI Components
- Extract URI Components lets you grab specific parts of a URI and put it in its own field without having to write the Regex yourself.
- The following components can be extracted:
- Authority
- Fragment
- Host
- Path
- Port
- Query
- Scheme
- Scheme Specific Path
- User Info
- Input: “https://www.google.com/#q=cloudera+hue”
  Selected: Host
- Output: “www.google.com”
- Geo IP
- Geo IP performs a Maxmind GeoIP lookup to match public IP addresses with a location.
- The following location information can be extracted with this operation:
- ISO Code
- Country Name
- Subdivision Names
- Subdivision ISO Code
- City Name
- Postal Code
- Latitude
- Longitude
- Input: “74.217.76.101”
  Selected: ISO Code, City Name, Latitude, Longitude
- Output: “US”, “Palo Alto”, “37.3762”, “-122.1826”
- Translate
- Translate takes given hard-coded values and replaces them with set values in place.
- Input: “Five Stars”
  Mapping:
  “Five Stars” -> “5”
  “Four Stars” -> “4”
  “Three Stars” -> “3”
  “Two Stars” -> “2”
  “One Star” -> “1”
- Output: in-place replacement: “5”
- Find and Replace
- Find and Replace takes a Grok string as the find argument and will replace all matches in the field with the specified replace value in place.
- Input: “Hello World”
  Find: “(?<word>\b\w+\b)”
  Replace: “${word}!”
- Output: in-place replacement: “Hello! World!”
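For intuition, here is a rough Python analogue of a few of these operations, mirroring the examples above. This is not how the Morphline commands are implemented; it only shows the transformations they perform (strptime format strings stand in for SimpleDateFormat patterns):

import re
from datetime import datetime
from urllib.parse import urlparse

# Convert Date: the SimpleDateFormat pattern "MMM (yyyy) dd" maps
# roughly to strptime's "%b (%Y) %d".
parsed = datetime.strptime("Aug (2016) 24", "%b (%Y) %d")
print(parsed.strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2016-08-24T00:00:00Z

# Extract URI Components: urlparse exposes the same pieces
# (scheme, host, path, query, fragment, ...).
uri = urlparse("https://www.google.com/#q=cloudera+hue")
print(uri.hostname)  # www.google.com

# Find and Replace: the named group from the find pattern is reused
# in the replacement, appending "!" to every word in place.
print(re.sub(r"(?P<word>\b\w+\b)", r"\g<word>!", "Hello World"))
# -> Hello! World!

# Geo IP: with the geoip2 package and the GeoLite2 database from the
# lib directory, the lookup would go along these lines:
#   import geoip2.database
#   reader = geoip2.database.Reader("GeoLite2-City.mmdb")
#   response = reader.city("74.217.76.101")
#   print(response.country.iso_code, response.city.name)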
Supported Input Data
Hue successfully recognized our file as a CSV. The indexer currently supports the following file types:
- CSV Files
- Hue Log Files
- Combined Apache Log Files
- Ruby Log Files
- Syslog
Beyond files, metastore tables and Hive SQL queries are also supported. Read more about these in an upcoming 3.11 blog post.
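As an aside on format detection: Hue’s file-type recognition is internal to the indexer, but Python’s csv.Sniffer gives a feel for how a CSV dialect can be inferred from a sample (purely illustrative, not Hue’s code):

import csv

# Illustration only: infer the dialect of a delimited file
# from a small sample of its contents.
sample = 'name,city,rating\n"Bob\'s Diner",Winnipeg,5\n'
dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter))  # ','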
Troubleshooting
During the indexing process, records can be dropped if they fail to match the Solr schema (e.g., trying to place a string into a long field). If some of your records are missing and you are unsure why, you can always check the mapper log of the indexing job to get a better idea of what’s going on.
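Conceptually, a record is dropped when one of its values cannot be coerced to the schema’s field type, along these lines:

# A record is dropped when a value can't be coerced to the schema
# type, e.g. an untranslated "Five Stars" left in a long field:
try:
    value = int("Five Stars")
except ValueError as err:
    print("record dropped:", err)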
30 Comments
-
Hello there,
Does this tutorial only work with the 3.11 version? Any possibility to use it in 3.9?
Thanks.
Victor
-
Author
Yes, only 3.11 as it is new there
-
Is this version already available for download so I can update my Hue version?
I’m looking in the Download Area but I couldn’t find it.
Thanks for reply Team 🙂
-
Author
The URL is http://gethue.com/downloads/releases/3.11.0/hue-3.11.0.tgz, we are a bit lagging on the official blog post as there is a lot to talk about 🙂
-
Thank you!
One more question and I promise that’s all 🙂
Where can I find the tutorial to install this version?
Thanks again for your support.
-
Author
The release has a README about how to install it. If not, just use master and http://gethue.com/how-to-build-hue-on-ubuntu-14-04-trusty/
-
Hello,
I followed the instructions: edited the hue.ini and put the libs file on HDFS.
hue.ini:
[indexer]
config_indexer_libs_path=”/tmp/smart_indexer_lib”
enable_new_indexer=true
HDFS also has the file.
The +Index page already exists, but when I choose a file the Add Operation doesn’t work. When I submit the job, even though the collection was already correctly built, I get:
net.sourceforge.argparse4j.inf.ArgumentParserException: java.lang.IllegalArgumentException: Cannot find collection ‘etl5’ in ZooKeeper: 8.5.185.5:24002,8.5.185.2:24002,8.5.185.1:24002/solr
org.apache.oozie.action.hadoop.JavaMainException: net.sourceforge.argparse4j.inf.ArgumentParserException: java.lang.IllegalArgumentException: Cannot find collection ‘etl5’ in ZooKeeper: 8.5.185.5:24002,8.5.185.2:24002,8.5.185.1:24002/solr
at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:59)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)
at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:238)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:180)
-
Author
Which Solr are you using? The indexer was only tested with CDH as it requires Morphline.
-
hello.
The jar libs are Solr 4.10.3-cdh, but the real Solr is 6.2.0 (installed on the cluster).
Otherwise I just followed the instructions: edited the hue.ini, put the libs file on HDFS, and set
enable_new_indexer=true
“+ Add Operation” doesn’t work; it clicks but there is no response. There is also the following error:
org.apache.oozie.action.hadoop.JavaMainException: net.sourceforge.argparse4j.inf.ArgumentParserException: java.lang.IllegalArgumentException: Cannot find collection ‘etl5’ in ZooKeeper:
at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:59)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:47)
at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
-
Author
Right now the indexer was tested only with Solr 4 (and it belongs to Cloudera Search, which is compiled with Solr 4), but we are going to port it to Solr 6 soon.
-
Hi Hue Team,
I am running Hue 3.11 with Solr 4.10.2 and an external ZooKeeper. I followed the exact steps mentioned in this tutorial, but when I proceed to the 3rd step (Index it), I get the error below. However, the sample Solr applications I installed, like twitter_demo and yelp_demo, are working fine. Please help!!!
[20/Jan/2017 13:01:29 -0800] conf ERROR failed to get zookeeper ensemble
Traceback (most recent call last):
File “/usr/local/hue/desktop/libs/indexer/src/indexer/conf.py”, line 48, in zkensemble
clusters = CLUSTERS.get()
AttributeError: ‘UnspecifiedConfigSection’ object has no attribute ‘get’
[20/Jan/2017 13:01:31 -0800] api ERROR Could not create collection. Check response:
{
“failure”: {
“”: “org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore ‘test_online_rev_shard1_replica1’: Unable to create core [test_online_rev_shard1_replica1] Caused by: Specified config does not exist in ZooKeeper:test_online_rev”
},
“responseHeader”: {
“status”: 0,
“QTime”: 1158
}
}
-
Author
ensemble in [libzookeeper] might not be set properly? https://github.com/cloudera/hue/blob/master/desktop/conf.dist/hue.ini#L1340
-
Hello Hue Team!
I want to try the new indexer using Hue 3.11 on top of CDH 5.10 (parcels).
The thing is that CDH regenerates hue.ini from a template after each Hue restart, and I can’t enable ‘enable_new_indexer=true’.
Could you guide me on where the source template is located and what it is called? I see only this:
$ sudo find /var/run/cloudera-scm-agent -name ‘hue.ini’
/var/run/cloudera-scm-agent/process/116-hue-HUE_SERVER/hue.ini
/var/run/cloudera-scm-agent/process/115-hue-HUE_SERVER/hue.ini
/var/run/cloudera-scm-agent/process/114-hue-HUE_SERVER/hue.ini

$ sudo find /opt/cloudera/ -name ‘hue.ini’
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/etc/hue/conf.empty/hue.ini

$ grep enable_new_indexer /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/etc/hue/conf.empty/hue.ini
enable_new_indexer=true

Thanks!
-
Author
Added a note about how to configure with CM!
-
Hi Hue Team!
I’m playing around with Solr Search in Hue 3.11 on top of CDH-5.10(parcels).
AFAIK I can’t create a new field during indexing, so I did a custom Solr query to calculate the required values.
My question: I see a “query definitions” button in the upper right corner of the Hue search dashboard – can I use it to add my custom Solr query? If yes, how?
If not, is there a way to make my calculated values available during dashboard creation?
The query I use:
select?q=quantity:*&fq=fueling_type:full&fq=trip_odometer:[200+TO+*]&sort=div(quantity,div(trip_odometer,100))+desc&fl=div(quantity,div(trip_odometer,100))&wt=json&indent=true

Thanks!
-
Author
The query definition is to persist the state of the dashboard (which widgets you clicked, which query string you entered)
-
Hi, is it possible to download/export the Solr query results from Hue?
-
Author
Yes, there’s a download icon right next to the result grid: https://dl.dropbox.com/s/gprn5ej9m51az6h/Screenshot%202017-04-25%2014.44.04.png?dl=0
-
The download will give 1000 rows by default; what about getting some rows in between, as in a Solr query?
-
Author
Like 100 rows starting from row 10,000? What is the use case for not needing the top rows?
-
In the field selection, why do I not have Add Operation?
-
Author
The feature is not ready yet but it will come with https://issues.cloudera.org/browse/HUE-5304 in the next release.
-
Hi,
How could I make this a real-time dashboard?
-
Author
You can set the dashboard to be autorefreshed and rolling on a date field (http://gethue.com/dynamic-search-dashboard-improvements-3/)
-
Hi All,
I got the error below when trying to index a Parquet table; could you please help with this?
Error: org.kitesdk.morphline.api.MorphlineRuntimeException: java.lang.IllegalArgumentException: INT96 not yet implemented.
at org.kitesdk.morphline.base.FaultTolerance.handleException(FaultTolerance.java:73)
at org.apache.solr.hadoop.morphline.MorphlineMapRunner.map(MorphlineMapRunner.java:220)
at org.apache.solr.hadoop.morphline.MorphlineMapper.map(MorphlineMapper.java:86)
at org.apache.solr.hadoop.morphline.MorphlineMapper.map(MorphlineMapper.java:54)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.IllegalArgumentException: INT96 not yet implemented.
at parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:252)
at parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:237)
at parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:222)
at parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:236)
at parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:216)
at parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:210)
at parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:123)
at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:178)
at parquet.hadoop.ParquetReader.initReader(ParquetReader.java:152)
at parquet.hadoop.ParquetReader.read(ParquetReader.java:128)
at org.kitesdk.morphline.hadoop.parquet.avro.ReadAvroParquetFileBuilder$ReadAvroParquetFile.doProcess(ReadAvroParquetFileBuilder.java:172)
at org.kitesdk.morphline.base.AbstractCommand.process(AbstractCommand.java:161)
at org.kitesdk.morphline.base.Connector.process(Connector.java:64)
at org.kitesdk.morphline.base.AbstractCommand.doProcess(AbstractCommand.java:186)
at org.kitesdk.morphline.base.AbstractCommand.process(AbstractCommand.java:161)
at org.kitesdk.morphline.base.AbstractCommand.doProcess(AbstractCommand.java:186)
at org.kitesdk.morphline.base.AbstractCommand.process(AbstractCommand.java:161)
at org.apache.solr.hadoop.morphline.MorphlineMapRunner.map(MorphlineMapRunner.java:208)
-
Author
We would recommend to ask the question to the Morphline forum: http://community.cloudera.com/t5/Kite-SDK-includes-Morphlines/bd-p/DevKit
-
Is the Cluster kerberized?
-
Got below error while Indexing data in Kerberized Cluster. Could you please help on this
ERROR org.apache.solr.hadoop.GoLive – Error sending live merge command HTTP Status 401 – Authentication required
It worked when we tried to index data in a non-Kerberized cluster.
-
I got the same error as kaushik. Any solution when using a Kerberized cluster?