Friday, September 3, 2010

Update the Nutch index to Solr

After crawling and indexing the site with Nutch, it is time to push the index to Solr so the content becomes available through Solr search.

Basic steps:
Download and install Solr.
Make sure http://localhost:8093/solr/admin is browsable (if you host Solr in Tomcat, the default port is 8080).
Edit the Solr schema.xml and add the additional fields needed to accommodate the Nutch index structure.

Then run the Nutch solrindex command:

bin/nutch solrindex http://127.0.0.1:8080/solr/ androidyou3/crawldb androidyou3/linkdb androidyou3/segments/*

Check the Tomcat or Jetty access log to see whether the POST succeeded.
If you see HTTP status code 400, the post failed. This almost always means something is wrong on the Solr side, so check the server log. (Most of the time, either a field that Nutch sends is missing from schema.xml, or a field that must be multiValued is not declared that way in your configuration.)

Here is one failed entry:

127.0.0.1 - - [03/Sep/2010:10:45:29 -0700] "POST /solr/update?wt=javabin&version=1 HTTP/1.1" 400 1055

You may then see the following error in the Tomcat server log (here, the anchor field needs to be multiValued):

SEVERE: org.apache.solr.common.SolrException: ERROR: [http://androidyou.blogspot.com/] multiple values encountered for non multiValued field anchor: [My Note on Solutions., Home, Newer Posts]
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:214)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:555)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109
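The error above says that anchor received several values but is not declared multiValued in schema.xml, so the usual fix is to add multiValued="true" to that field's declaration. As a rough sketch, the check below reports whether a given field is declared that way. The function name is_multivalued is made up for illustration, and it assumes each field declaration sits on a single line of the file.

```shell
# Hypothetical helper: succeeds if the named field in a Solr schema.xml
# is declared with multiValued="true".
# Assumption: each <field .../> declaration is on one line.
# Usage: is_multivalued <schema-file> <field-name>
is_multivalued() {
  # Find the declaration for the field, then look for the attribute on it.
  grep -E "<field[^>]*name=\"$2\"" "$1" | grep -q 'multiValued="true"'
}
```

For example, `is_multivalued conf/schema.xml anchor && echo "anchor is multiValued"`, where the path to schema.xml depends on your Solr installation. Restart Solr after changing the schema so the change takes effect.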

If the post succeeds, you will see an entry like this:
127.0.0.1 - - [03/Sep/2010:10:54:53 -0700] "POST /solr/update HTTP/1.1" 200 41
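Rather than reading the access log by eye, the status of each update POST can be pulled out with awk. This is only a sketch: the function name check_solr_posts is made up, and it assumes the log format shown in the entries above, where the status code is the second-to-last whitespace-separated field.

```shell
# Hypothetical helper: print OK or FAILED for every POST to /solr/update
# found in an access log (read from the given files, or from stdin).
# Assumption: the log lines look like the entries shown in this post.
check_solr_posts() {
  awk '$6 == "\"POST" && $7 ~ /^\/solr\/update/ {
         status = $(NF-1)                       # HTTP status code
         result = (status == 200) ? "OK" : "FAILED"
         print result, status, $7               # e.g. FAILED 400 /solr/update?...
       }' "$@"
}
```

Call it as `check_solr_posts <access-log-file>`, using whatever access log your Tomcat or Jetty instance writes. Any FAILED line is a cue to look at the server log for the matching exception.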

Then, in Solr, open the status page and look at the update section. Notice that the commit count and cumulative_adds have changed from 0 to some other value.

Then go to the Solr admin page, where you can run a query.


1 comment:

Bahadir said...

Did you realize that the default Nutch configuration is not able to retrieve non-full URLs?

If the a tag is like a href="/product/test.html", it is not added to the URL list. Do you know a way to fix this?

 