In the previous episode we saw how to use Pig and Hive with HBase. This time, let’s see how to make our Yelp data searchable by indexing it and building a customizable UI with the Hue Search app.
Indexing data into Solr
This tutorial is based on SolrCloud. Here is a step-by-step guide to installing it and a list of the required packages:
- solr-server
- solr-mapreduce
- search
The next step is to deploy and configure SolrCloud, following the documentation.
After this, we create a new collection and index named ‘reviews’. We use our predefined schema, which needs to be copied from the Hadoop tutorial GitHub repository.
cp solr_local/conf/schema.xml solr_configs/conf/schema.xml
solrctl instancedir --create reviews solr_local
solrctl collection --create reviews -s 1
We replace the field definitions in the schema with a mapping corresponding to our Yelp data. The schema lists every data field that will be available in the search index. You can read more about schema.xml in the Solr wiki.
<field name="business_id" type="text_en" indexed="true" stored="true" />
<field name="cool" type="tint" indexed="true" stored="true" />
<field name="date" type="text_en" indexed="true" stored="true" />
<field name="funny" type="tint" indexed="true" stored="true" />
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="stars" type="tint" indexed="true" stored="true" />
<field name="text" type="text_en" indexed="true" stored="true" />
<field name="type" type="text_en" indexed="true" stored="true" />
<field name="useful" type="tint" indexed="true" stored="true" />
<field name="user_id" type="text_en" indexed="true" stored="true" />
<field name="name" type="text_en" indexed="true" stored="true" />
<field name="full_address" type="text_en" indexed="true" stored="true" />
<field name="latitude" type="tfloat" indexed="true" stored="true" />
<field name="longitude" type="tfloat" indexed="true" stored="true" />
<field name="neighborhoods" type="text_en" indexed="true" stored="true" />
<field name="open" type="text_en" indexed="true" stored="true" />
<field name="review_count" type="tint" indexed="true" stored="true" />
<field name="state" type="text_en" indexed="true" stored="true" />
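Because every column of the exported CSV has to line up with a field declared in the schema, a quick sanity check before indexing can save a failed job. The sketch below is only an illustration and not part of the original tutorial: it parses a few of the `<field>` definitions with Python's standard library and reports any CSV header column that has no matching field. The helper names `schema_field_names` and `unknown_columns`, the `<fields>` wrapper element, and the sample CSV are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A few field definitions copied from the schema above, wrapped in a
# root element (an assumption) so the fragment parses as well-formed XML.
SCHEMA_FRAGMENT = """
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="stars" type="tint" indexed="true" stored="true" />
  <field name="text" type="text_en" indexed="true" stored="true" />
  <field name="name" type="text_en" indexed="true" stored="true" />
</fields>
"""


def schema_field_names(schema_xml):
    """Return the set of field names declared in a schema fragment."""
    root = ET.fromstring(schema_xml)
    return {field.get("name") for field in root.iter("field")}


def unknown_columns(csv_text, schema_xml):
    """Return CSV header columns that have no matching schema field."""
    header = next(csv.reader(io.StringIO(csv_text)))
    known = schema_field_names(schema_xml)
    return [column for column in header if column not in known]


# Hypothetical CSV export: "votes" is not declared in the fragment above.
sample_csv = "id,stars,text,votes\n1,5,Great tacos,3\n"
print(unknown_columns(sample_csv, SCHEMA_FRAGMENT))
```

An empty list means every header column has a field to land in; anything else points at a column to rename or drop before indexing.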
Then, we retrieve and clean a subset of our Yelp data with a Hive query, download it as CSV, and index it with the indexer tool using this command:
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j /usr/share/doc/search*/examples/solr-nrt/log4j.properties \
  --morphline-file solr_local/reviews.conf \
  --output-dir hdfs://localhost:8020/tmp/load \
  --verbose \
  --go-live \
  --zk-host localhost:2181/solr \
  --collection reviews \
  hdfs://localhost:8020/tmp/query_result.csv
The command will use our morphline file to map the Yelp data to the fields defined in our index schema.xml.
While debugging the morphline, the --dry-run option will save you some time.
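Once the job has gone live, one way to confirm that documents reached the index is to query the collection through Solr's standard `select` handler. This is a minimal sketch, assuming Solr listens on `localhost:8983` (the usual default port, adjust for your install) and that the collection is named `reviews` as above. The helper `build_select_url` and the sample query are hypothetical; the code only builds the URL and leaves the actual HTTP call commented out.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, rows=5):
    """Build a URL for Solr's /select handler that returns JSON."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"


# Assumed host and port; they are not taken from the tutorial.
url = build_select_url("http://localhost:8983/solr", "reviews", "text:pizza")
print(url)

# To run the query against a live Solr instance:
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     print(json.load(response)["response"]["numFound"])
```

A non-zero `numFound` for a term you know appears in the reviews suggests the morphline mapped the `text` column as expected.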
Customize the search result
The administration panel lets you tweak the look & feel and features of the search page. This is explained in the second part of the video.
Conclusion
Cloudera Search is great for opening Hadoop up to your user base and for quick data retrieval. Other articles describe some user use cases in detail, such as email or customer data search.
Cloudera Morphline is also an interesting tool for facilitating the indexing of your data. You can learn more about it on its project website.
As usual, feel free to comment on the hue-user list or @gethue!
Troubleshooting
1. If you see this error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:Error CREATEing SolrCore 'reviews_shard1_replica1': Unable to create core: reviews_shard1_replica1
Caused by: Could not find configName for collection reviews found:null
You might have forgotten to create the instance directory (the uploaded configuration) for the collection:
solrctl instancedir --create reviews solr_configs
2. If you see this error:
ERROR - 2013-10-10 20:01:21.383; org.apache.solr.servlet.SolrDispatchFilter; Could not start Solr. Check solr/home property and the logs
ERROR - 2013-10-10 20:01:21.409; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: solr.xml not found in ZooKeeper
at org.apache.solr.core.ConfigSolr.fromSolrHome(ConfigSolr.java:109)
Server is shutting down
You might need to force Solr to reload the configuration. Beware, this might break ZooKeeper and you might need to read error #3.
3. If you see this error:
KeeperErrorCode = NoNode for /overseer/collection-queue-work</str>
<str name="trace">
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /overseer/collection-queue-work
It probably comes from error #2. You might need to re-upload the config and recreate the collection.