Thursday, May 13, 2010

Open-source Chinese Tokenizer Library PaoDing with Apache Solr 1.4

Several tokenizer libraries ship with Solr 1.4 by default. ChineseTokenizerFactory is the only one that can analyze and split Chinese text; however, it is very basic and a rather poor choice if you want to do anything serious with Chinese text.
The following screenshot shows the output of the standard ChineseTokenizerFactory analysis. "Hello World" in Chinese is "你好世界"; the correctly analyzed tokens should be 你好 and 世界, instead of four meaningless single characters.


[Screenshot: ChineseTokenizerFactory splits 你好世界 into four single-character tokens]

We expect the following result instead.

[Screenshot: paoding produces the expected tokens 你好 and 世界]
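The difference is easy to reproduce in plain Java. The per-character split below mimics what ChineseTokenizerFactory emits; the two-word list is what a dictionary-based tokenizer like paoding should produce (hard-coded here purely for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TokenCompare {

    // One token per Chinese character -- what the stock ChineseTokenizerFactory does.
    static List<String> perCharacter(String text) {
        return text.codePoints()
                   .mapToObj(cp -> new String(Character.toChars(cp)))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "你好世界"; // "Hello World"
        System.out.println(perCharacter(text));           // [你, 好, 世, 界]
        System.out.println(Arrays.asList("你好", "世界")); // the tokens we actually want
    }
}
```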

 

PaoDing is a popular open-source library that handles this splitting and analysis job well. Here are the steps to integrate it with the Solr server.

1. Download the paoding bits from Google Code. Unzip them and set up a system environment variable that points to the dic folder, e.g. PAODING_DIC_HOME=c:\solr\paoding\dic
2. Build a TokenizerFactory that uses the paoding analyzer. Basically two Java classes.

package test.solr.PaodingTokerFactory;

import java.io.IOException;
import java.io.Reader;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

public class PaoDingTokerLizer extends Tokenizer {

    // Delegate all real analysis work to the paoding analyzer.
    private final PaodingAnalyzer analyzer = new PaodingAnalyzer();
    private final TokenStream ts;

    public PaoDingTokerLizer(Reader reader) {
        super(reader);
        ts = analyzer.tokenStream("", reader);
    }

    @Override
    public Token next() {
        try {
            // Return the next token, or null when the stream is exhausted.
            return ts.next();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}

package test.solr.PaodingTokerFactory;

import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

public class PaoDingTokenizerFactory extends BaseTokenizerFactory {

    // Solr calls this once per field value being analyzed.
    public Tokenizer create(Reader reader) {
        return new PaoDingTokerLizer(reader);
    }
}
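The tokenizer above relies on Lucene 2.9's deprecated next()-until-null contract. A stand-in sketch in plain Java (FakeTokenStream is an illustration, not Lucene's class) shows how a consumer such as Solr drains the stream:

```java
import java.util.Arrays;
import java.util.Iterator;

public class NextUntilNull {

    // Minimal stand-in for the Lucene 2.9 TokenStream contract used above:
    // next() returns the following token, or null when the stream is exhausted.
    static class FakeTokenStream {
        private final Iterator<String> it;

        FakeTokenStream(String... tokens) {
            this.it = Arrays.asList(tokens).iterator();
        }

        String next() {
            return it.hasNext() ? it.next() : null;
        }
    }

    public static void main(String[] args) {
        FakeTokenStream ts = new FakeTokenStream("你好", "世界");
        // Consumers loop until next() returns null.
        for (String tok = ts.next(); tok != null; tok = ts.next()) {
            System.out.println(tok);
        }
    }
}
```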

3. Build/compile the code, and copy the compiled jar or classes into the Tomcat lib directory [tomcat\webapps\solr\WEB-INF\lib] where lucene*.jar lives. Also copy the compiled paoding-analysis.jar there.
4. Change the Solr schema.xml file to reference the new library

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <!-- <tokenizer class="solr.ChineseTokenizerFactory"/> -->
        <tokenizer class="test.solr.PaodingTokerFactory.PaoDingTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="test.solr.PaodingTokerFactory.PaoDingTokenizerFactory"/>
      </analyzer>
</fieldType>

5. Restart the Solr server and there you go!

Some references:
Lucene: http://lucene.apache.org/java/docs/
Solr: http://lucene.apache.org/solr/
Lucid Solr: enterprise-level support for Solr/Lucene http://www.lucidimagination.com/blog/
Lucene in Action: http://www.manning.com/hatcher2/
Paoding Library: http://code.google.com/p/paoding/downloads/list

Books:
Solr 1.4 Enterprise Search Server: a great start, and covers everything about Solr
Building Search Applications: Lucene, LingPipe, and Gate: the basis of Solr, and how indexing works

9 comments:

Anonymous said...

Hi, thanks for your helpful article. I found it easy to follow and set up my Lucene to use the Paoding Tokenizer.

I noticed though that when I search, no results are returned for most queries. I tried using the Paoding tokenizer as the query field tokenizer but that didn't help.

Any comments would be great. Thanks, Aly

androidyou said...

Make sure you pick the same tokenizer when you run the query.

Here is the thing:
at indexing time, the tokenizer splits the text into tokens, which are then stored in the index.

At query time, use the same tokenizer to split the query keywords into tokens, then use those tokens to query the index.

Hope it helps.

Also, you may turn on some tracing or logging to make sure your tokenizer library actually did get picked up.
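The mismatch can be illustrated without Lucene. The token lists below are hard-coded assumptions standing in for the two tokenizers' output; retrieval matches on exact token equality, so word tokens in the index never equal single-character query tokens:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenizerMismatch {

    // Count how many query tokens are found verbatim in the index.
    static long hits(List<String> indexTokens, List<String> queryTokens) {
        Set<String> indexed = new HashSet<>(indexTokens);
        return queryTokens.stream().filter(indexed::contains).count();
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList("你好", "世界");        // word tokens at index time
        List<String> sameTok = Arrays.asList("你好", "世界");        // same tokenizer at query time
        List<String> perChar = Arrays.asList("你", "好", "世", "界"); // per-character tokenizer at query time

        System.out.println(hits(indexed, sameTok)); // 2 -- matches
        System.out.println(hits(indexed, perChar)); // 0 -- no overlap, zero results
    }
}
```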

Anonymous said...

Hi thanks for your quick answer.
I use Paoding when indexing documents and when analyzing the queries. My Field analysis tool shows the same results as your screenshot above.
In the schema I have added:

[the XML snippet was stripped by the comment form]
I just tried indexing this document:
http://zh.wikipedia.org/zh/Ada
And then search for Hello world, i.e. 你好世界 - 0 results.

What am I doing wrong?
Thanks a lot in advance

Anonymous said...

Sorry, schema.xml contains test.solr.PaodingTokerFactory.PaoDingTokenizerFactory as the analyzer for both type="index" and type="query" in fieldType name="text".
Is there anywhere else I should add it?

androidyou said...

Hi, if that's the case, I would say it's an encoding issue.

Try this:
use the Luke toolkit to browse the index folder and make sure the index is correct. (Check this for reference: http://androidyou.blogspot.com/2010/09/hello-lucene-indexing-and-searching.html)

Also, if possible, change the Java server settings to make sure the URI encoding is UTF-8 or something else that covers Chinese characters.

For Tomcat, change server.xml and make sure the connector uses UTF-8 encoding for the URL, because Solr uses the URI to pass the query keyword.

Hope this helps.
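The decoding problem can be demonstrated with the JDK alone: percent-encode a Chinese query as UTF-8 (as a browser does) and decode it with the wrong charset, and the keyword no longer matches anything in the index:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UriEncodingDemo {

    public static void main(String[] args) throws Exception {
        String query = "你好世界";
        // What actually travels in the URI: %E4%BD%A0%E5%A5%BD...
        String onTheWire = URLEncoder.encode(query, "UTF-8");

        String decodedRight = URLDecoder.decode(onTheWire, "UTF-8");
        String decodedWrong = URLDecoder.decode(onTheWire, "ISO-8859-1");

        System.out.println(decodedRight.equals(query)); // true
        System.out.println(decodedWrong.equals(query)); // false -- mojibake, zero hits
    }
}
```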

androidyou said...

<Connector port="8888" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" URIEncoding="UTF-8" />

Anonymous said...

Hi.
I've used this to index some Chinese text and it worked great.

When it came to running search queries on the index, it could only search once; on the second query Solr threw an exception that the input reader was closed, when calling read on line 166 in the tokenizer.
Any ideas?

rajinimaski said...

Liked the blog:)

rajinimaski said...

Hi,

I was trying out this configuration. I have carefully followed every step and was able to create indexes in Solr with the fieldType mentioned above.

[the XML snippet was stripped by the comment form]

But when I search on this field, the search results are 0.

What might be the reason ? My search query was
P_CAT_NAME:"液晶电视机"


Thank you.
Awaiting reply

Regards
Rajani

 