Thursday, May 13, 2010

Using the open-source Chinese tokenizer library PaoDing with Apache Solr 1.4

Several tokenizer libraries come with Solr 1.4 by default. ChineseTokenizerFactory is the only one that can analyze and split Chinese text. However, it is very basic and a rather poor choice if you want to do serious work with Chinese text.
The following screenshot is the output of the standard ChineseTokenizerFactory analysis. "Hello World" in Chinese is "你好世界". The correctly analyzed tokens should be 你好 and 世界, instead of 4 meaningless single characters.


We expect the following result instead.
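
To make the difference concrete, here is a minimal sketch in plain Java (no Solr or Paoding required). It contrasts a one-token-per-character split, which is the kind of output described above for the standard factory, with a toy dictionary-based split. The two-word dictionary, the class name SegmentationDemo and the forward-maximum-matching rule are assumptions made for illustration only. This is not Paoding's actual algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class, not part of Solr, Lucene or Paoding.
class SegmentationDemo {

    // One token per character (fine for the BMP characters used here).
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Forward maximum matching: at each position take the longest
    // dictionary word, falling back to a single character if none matches.
    static List<String> maxMatch(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<String>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxLen);
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Toy dictionary; a real library ships a much larger one.
        Set<String> dict = new HashSet<String>(Arrays.asList("你好", "世界"));
        System.out.println(perCharacter("你好世界"));      // four tokens: 你, 好, 世, 界
        System.out.println(maxMatch("你好世界", dict, 4)); // two tokens: 你好, 世界
    }
}
```

With the per-character split, a search can only match on single characters, which is why a dictionary-aware tokenizer is the better fit for Chinese text.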



PaoDing is a popular open-source library built for exactly this splitting and analysis job. Here are the steps to integrate it with the Solr server.

1. Download the Paoding binaries from Google Code. Unzip them and set up a system environment variable that points to the dic folder, for example PAODING_DIC_HOME=c:\solr\paoding\dic
2. Build a TokenizerFactory that uses the Paoding analyzer. This takes basically two Java classes.

package test.solr.PaodingTokerFactory;

import java.io.IOException;
import java.io.Reader;

import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

// Tokenizer that delegates the actual splitting to the token stream
// produced by PaodingAnalyzer.
public class PaoDingTokerLizer extends Tokenizer {

    PaodingAnalyzer analyzer = new PaodingAnalyzer();
    TokenStream ts;

    public PaoDingTokerLizer(Reader reader) {
        super(reader);
        // Let Paoding build the token stream; an empty field name is passed.
        ts = analyzer.tokenStream("", reader);
    }

    // Returns the next token from the Paoding stream,
    // or null if reading from the stream fails.
    public Token next() {
        try {
            return ts.next();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}


package test.solr.PaodingTokerFactory;

import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

// Factory referenced from schema.xml; Solr calls create() to obtain
// a Paoding-backed tokenizer for each input.
public class PaoDingTokenizerFactory extends BaseTokenizerFactory {

    public Tokenizer create(Reader input) {
        return new PaoDingTokerLizer(input);
    }
}


3. Build/compile the code and copy the compiled jar or classes to the Tomcat lib directory [tomcat\webapps\solr\WEB-INF\lib], where the lucene*.jar files live. Also copy paoding-analysis.jar there.
4. Change the Solr schema.xml file to reference the new library:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- <tokenizer class="solr.ChineseTokenizerFactory"/> -->
    <tokenizer class="test.solr.PaodingTokerFactory.PaoDingTokenizerFactory"/>
  </analyzer>
</fieldType>

5. Restart the Solr server and you are good to go!
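
If the tokenizer does not seem to load after the restart, one thing worth ruling out is the environment variable from step 1. Below is a small sketch of such a check using only the JDK. The class name DicHomeCheck and its helper are hypothetical, and "points to an existing directory" is only my assumption of the minimum that has to hold. It does not look inside the dic folder.

```java
import java.io.File;

// Hypothetical helper, not part of Paoding or Solr.
class DicHomeCheck {

    // True when the path is non-blank and names an existing directory.
    static boolean isUsableDicHome(String path) {
        if (path == null || path.trim().isEmpty()) {
            return false;
        }
        return new File(path).isDirectory();
    }

    public static void main(String[] args) {
        // Variable name taken from step 1 above.
        String home = System.getenv("PAODING_DIC_HOME");
        if (isUsableDicHome(home)) {
            System.out.println("PAODING_DIC_HOME looks fine: " + home);
        } else {
            System.out.println("PAODING_DIC_HOME is missing or not a directory: " + home);
        }
    }
}
```

Run it from the same account and shell that start Tomcat, since a variable set elsewhere may not be visible to the server process.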

Some references:
Lucid Solr: enterprise-level support for Solr/Lucene
Lucene in Action:
Paoding Library:

Solr 1.4 Enterprise Search Server: a great start, covering everything about Solr.
Building Search Applications: Lucene, LingPipe, and Gate: covers the base of Solr and how indexing works.


Anonymous said...

Hi, thanks for your helpful article. I found it easy to follow and set up my Lucene to use the Paoding Tokenizer.

I noticed though that when I search, no results are returned for most queries. I tried using the Paoding tokenizer as the query field tokenizer but that didn't help.

Any comments would be great. Thanks, Aly

androidyou said...

Make sure you use the same tokenizer when you run the query.

Here is the thing:
At indexing time, the tokenizer splits the text into tokens, which are then stored in the index.

At query time, the same tokenizer has to split the query keyword into tokens, and those tokens are then used to query the index.

Hope it helps.

Also, you may want to turn on some tracing or logging to make sure your tokenizer library really gets picked up.
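
A toy illustration of why the two sides have to match, in plain Java with no Lucene involved. The class and method names are made up, and the two-character split is only a stand-in for a word-level tokenizer such as Paoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical demo class: a tiny inverted index keyed by token.
class AnalyzerMismatchDemo {

    // Stand-in for a word-level tokenizer: chunks of two characters.
    static List<String> twoCharWords(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < text.length(); i += 2) {
            tokens.add(text.substring(i, Math.min(text.length(), i + 2)));
        }
        return tokens;
    }

    // Stand-in for a per-character tokenizer.
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Maps each token to the ids of the documents that contain it.
    static Map<String, Set<Integer>> buildIndex(List<String> docs,
            Function<String, List<String>> tokenizer) {
        Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : tokenizer.apply(docs.get(id))) {
                index.computeIfAbsent(token, k -> new TreeSet<Integer>()).add(id);
            }
        }
        return index;
    }

    // Returns the documents that contain every token of the query.
    static Set<Integer> search(Map<String, Set<Integer>> index, String query,
            Function<String, List<String>> tokenizer) {
        Set<Integer> hits = null;
        for (String token : tokenizer.apply(query)) {
            Set<Integer> postings = index.getOrDefault(token, Collections.<Integer>emptySet());
            if (hits == null) {
                hits = new TreeSet<Integer>(postings);
            } else {
                hits.retainAll(postings);
            }
        }
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index =
                buildIndex(Arrays.asList("你好世界"), AnalyzerMismatchDemo::twoCharWords);
        // Same tokenizer on both sides: the index holds 世界, so document 0 matches.
        System.out.println(search(index, "世界", AnalyzerMismatchDemo::twoCharWords));
        // Different tokenizer at query time: 世 and 界 are not in the index, so no hits.
        System.out.println(search(index, "世界", AnalyzerMismatchDemo::perCharacter));
    }
}
```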

Anonymous said...

Hi thanks for your quick answer.
I use Paoding when indexing documents and when analyzing the queries. My Field analysis tool shows the same results as your screenshot above.
In the schema I have added:



I just tried indexing this document:
And then search for Hello world, i.e. 你好世界 - 0 results.

What am I doing wrong?
Thanks a lot in advance

Anonymous said...

Sorry, schema.xml contains test.solr.PaodingTokerFactory.PaoDingTokenizerFactory as analyzer for both type="index" and type="query" in fieldType name = "text".
Is there anywhere else I should add it?

androidyou said...

Hi, if that's the case, I would say it's an encoding issue.

Try this:
Use the Luke toolkit to browse the index folder and make sure the index is correct. (check this for reference,

Also, if possible, change the Java server settings to make sure the URI encoding is UTF-8 or another encoding that covers Chinese characters.

For Tomcat, change server.xml:
make sure the connector uses UTF-8 encoding for the URL, because Solr uses the URI to pass the query keyword.

Hope this helps.

androidyou said...

<Connector port="8888" protocol="HTTP/1.1"
           redirectPort="8443" URIEncoding="UTF-8" />
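
To see why this connector setting matters, here is a small sketch using only the JDK. It percent-encodes the keyword as UTF-8, the way a client would typically send it, and then decodes it once as UTF-8 and once as ISO-8859-1 (which, as far as I know, was the default URI encoding in Tomcat versions of that time). The class name UriEncodingDemo and its helper methods are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, independent of Tomcat and Solr.
class UriEncodingDemo {

    // Percent-encodes the keyword using UTF-8 bytes.
    static String encodeUtf8(String keyword) {
        try {
            return URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encodes as UTF-8, then decodes with the charset the server assumes.
    static String roundTrip(String keyword, String serverCharset) {
        try {
            return URLDecoder.decode(encodeUtf8(keyword), serverCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String keyword = "你好世界";
        // prints %E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C
        System.out.println(encodeUtf8(keyword));
        // prints true: the server sees the original keyword
        System.out.println(keyword.equals(roundTrip(keyword, "UTF-8")));
        // prints false: the server sees garbled characters instead
        System.out.println(keyword.equals(roundTrip(keyword, "ISO-8859-1")));
    }
}
```

When the decoding charset is wrong, the tokenizer receives garbage instead of 你好世界, so the query cannot match anything in the index even though indexing itself worked.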

Anonymous said...

I've used this to index some Chinese text and it worked great.

When it came to running search queries on the index, it could only search once. On the second query Solr threw an exception saying that the input reader was closed, when calling read on line 166 in the tokenizer.
Any ideas?

rajinimaski said...

Liked the blog:)

rajinimaski said...


I was trying out this configuration. I followed every step carefully and was able to create indexes in Solr with the fieldType mentioned above.

But when I search on this field, I get 0 results.

What might be the reason? My search query was

Thank you.
Awaiting reply

