Tuesday, May 11, 2010

Client integration with Solr by using SolrJ

Solr is an open-source search server based on the Lucene Java search library. It provides up-to-date enterprise search and navigation features such as faceting and guided navigation. Because it is free and open source, you can use it to implement common features offered by commercial search products like Endeca or Microsoft FAST.

Lucene is the underlying indexing and querying foundation. On top of Lucene, Solr takes on the management role (exposing the index through an HTTP endpoint with XML messages) and data integration (pulling or pushing data from ad hoc data sources). Solr can be hosted in any Java servlet container, and it runs either as a standalone server or in a master-slave distributed layout. It has a built-in replication mechanism.
On the server side, here is a tutorial: http://lucene.apache.org/solr/tutorial.html

On the client side, you can use different libraries to integrate with the Solr server. Each library is basically a wrapper around the HTTP/XML handling. Here is a list of those libraries. You can also build your own using the HTTP client library of your programming language.

  • SolrJ: a Java client for Solr. It offers a Java interface to add, update, and query the Solr index.
  • SolrSharp: the C# library, as the name suggests. It provides an operational framework around the Lucene search engine and runs on both Windows and Linux. SolrSharp provides an easy way to interoperate with any Solr server implementation on any platform.
  • SolrRuby: a Ruby library for Solr access.
  • SolrPY: a Python client for Solr. solrpy lets you add documents to a Solr instance, then perform queries and gather search results from Python.
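For the build-your-own route, here is a minimal sketch using only the JDK's HttpURLConnection. It relies on Solr's XML update messages (`<add>` and `<commit/>`) posted to the update handler. The class name TinySolrClient, its helper methods, the URL, and the field names are my own assumptions for illustration; the fields must exist in your schema.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical hand-rolled client; not part of any Solr library.
public class TinySolrClient {

    // Escape the characters that are special in XML text and attribute values.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Build an <add> message holding one document with the given fields.
    static String buildAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            sb.append("<field name=\"").append(escapeXml(f.getKey())).append("\">")
              .append(escapeXml(f.getValue())).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    // POST an XML message to the update handler and return the HTTP status code.
    static int postXml(String updateUrl, String xml) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(updateUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        try {
            out.write(xml.getBytes("UTF-8"));
        } finally {
            out.close();
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        // Assumed location of the update handler; adjust host, port, and path.
        String updateUrl = "http://localhost:8983/solr/update";
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("id", "1001");
        fields.put("title", "Programming SOLR");
        System.out.println("add status: " + postXml(updateUrl, buildAddXml(fields)));
        System.out.println("commit status: " + postXml(updateUrl, "<commit/>"));
    }
}
```

The client libraries above save you from writing this plumbing yourself, but seeing it spelled out makes the captured traffic easier to read.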

For debugging Solr client code, there is a great tool called TCPTrace. You use it as a tunnel between your client and server. For example, if the Solr server listens on port 8983, you can start TCPTrace listening on port 8888 and forwarding all traffic to 8983. Point your client at port 8888, and you can then see all the round-trip HTTP traffic.

 image
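To make the tunnel idea concrete, here is a rough sketch of a logging forwarder in plain Java. It is not how TCPTrace itself is implemented, just an illustration of the same concept: listen on 8888, forward to 8983, and print every chunk that passes through. The class name LoggingTunnel and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical logging tunnel; an illustration only, not TCPTrace's code.
public class LoggingTunnel {

    // Copy bytes from in to out until EOF, printing each chunk with a direction label.
    static void pump(String label, InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            System.out.println(label + " " + new String(buf, 0, n, "ISO-8859-1"));
            out.write(buf, 0, n);
            out.flush();
        }
    }

    // Run pump on its own thread; when one direction ends, close both sockets.
    static void pumpAsync(final String label, final Socket from, final Socket to) {
        new Thread(new Runnable() {
            public void run() {
                try {
                    pump(label, from.getInputStream(), to.getOutputStream());
                } catch (IOException e) {
                    // Either side closed the connection; nothing more to forward.
                } finally {
                    try { from.close(); } catch (IOException ignored) { }
                    try { to.close(); } catch (IOException ignored) { }
                }
            }
        }).start();
    }

    public static void main(String[] args) throws IOException {
        int listenPort = 8888;   // port the Solr client connects to
        int targetPort = 8983;   // port the Solr server listens on
        ServerSocket listener = new ServerSocket(listenPort);
        while (true) {
            Socket client = listener.accept();
            Socket server = new Socket("localhost", targetPort);
            pumpAsync(">>> request", client, server);
            pumpAsync("<<< response", server, client);
        }
    }
}
```

This sketch closes both sockets as soon as either direction ends and has no error recovery if the Solr server is down, so for real debugging the dedicated tool is the better choice.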

In general, there are two ways to push data to Solr.
1. Document-style submit. Here is an example in Java that submits one document to the Solr server and commits the change.

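    import java.util.ArrayList;
    import java.util.Collection;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Port 8888 is the TCPTrace tunnel, which forwards to Solr on 8983
    SolrServer server1 = new CommonsHttpSolrServer("http://localhost:8888/solr");
    Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("title", "Programming SOLR");
    doc.addField("text", "blabla");
    doc.addField("id", "1001");
    docs.add(doc);
    server1.add(docs);      // first HTTP request: the document data

    server1.commit();       // second HTTP request: the commit directive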

Once it runs, you can see the captured traffic in TCPTrace. Notice that it sends two HTTP requests: one carries the data and the other is the commit directive.

image

2. Streaming submit. For large files or big chunks of data, use this approach instead.

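    import java.io.File;
    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8888/solr");
    // Send the file to the extracting request handler
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    File fileToIndex = new File("c:\\temp\\canon.txt");
    // literal.* parameters set field values directly on the indexed document
    req.setParam("literal.id", "1000");
    req.setParam("literal.category", "DC");

    try {
        req.addFile(fileToIndex);
        server.request(req);
        server.commit();
    } catch (SolrServerException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // reading the file or talking to the server failed
        e.printStackTrace();
    }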

Similarly, you can see the traffic. There is no XML document this time: the field values are passed in the HTTP query string, and the file content is transferred as a standard HTTP/MIME body.
image
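As a rough illustration of how the field values end up in the query string, the sketch below builds a request URL from the same parameters used in the streaming example. The class name ExtractUrlSketch and the buildUrl helper are hypothetical, and the real SolrJ request may carry additional parameters of its own that are not reproduced here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper showing how parameters map onto the query string.
public class ExtractUrlSketch {

    // Append URL-encoded name=value pairs to baseUrl + handlerPath.
    static String buildUrl(String baseUrl, String handlerPath, Map<String, String> params)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder(baseUrl).append(handlerPath);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            sb.append(separator)
              .append(URLEncoder.encode(p.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(p.getValue(), "UTF-8"));
            separator = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("literal.id", "1000");
        params.put("literal.category", "DC");
        // The file content itself would travel in the request body, not in the URL.
        System.out.println(buildUrl("http://localhost:8888/solr", "/update/extract", params));
    }
}
```

Comparing a URL like this against the one captured in the trace is a quick way to spot a misspelled parameter name.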

If you get an HTTP error code such as 400, click the request to inspect the traffic and check whether there is a typo or a required field is missing.

More solr information
