There are several tokenizer libraries that come with Solr 1.4 by default. ChineseTokenizerFactory is the only one that can analyze and split Chinese text; however, it is very basic and a rather poor choice if you want to do anything serious with Chinese text.
The following screenshot is the output of the standard ChineseTokenizerFactory analysis. "Hello World" in Chinese is "你好世界"; the correctly analyzed tokens should be 你好 and 世界, instead of four meaningless single characters.
We expect the following result instead.
PaoDing is a popular open-source library that does the splitting and analysis job well. Here are the steps to integrate it with the Solr server.
1. Download the PaoDing bits from Google Code. Unzip them and set up one system environment variable which points to the dic folder, like PAODING_DIC_HOME=c:\solr\paoding\dic
2. Build a TokenizerFactory that uses the PaoDing analyzer. Basically two Java classes:
    package test.solr.PaodingTokerFactory;

    import java.io.IOException;
    import java.io.Reader;
    import net.paoding.analysis.analyzer.PaodingAnalyzer;
    import org.apache.lucene.analysis.Tokenizer;

    public class PaoDingTokerLizer extends Tokenizer {
        // Delegate the actual segmentation work to the PaoDing analyzer.
        PaodingAnalyzer analyzer = new PaodingAnalyzer();

        public PaoDingTokerLizer(Reader input) {
            super(input);
        }
    }
    package test.solr.PaodingTokerFactory;

    import java.io.Reader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.solr.analysis.BaseTokenizerFactory;

    public class PaoDingTokenizerFactory extends BaseTokenizerFactory {
        // Solr calls create() for each field value to be analyzed.
        public Tokenizer create(Reader input) {
            return new PaoDingTokerLizer(input);
        }
    }
3. Build/compile the code, and copy the compiled JAR or classes to the Tomcat lib directory [tomcat\webapps\solr\WEB-INF\lib] where the lucene*.jar files exist. Also copy the compiled paoding-analysis.jar there.
4. Change the Solr schema.xml file to reference the new library:
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="test.solr.PaodingTokerFactory.PaoDingTokenizerFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="test.solr.PaodingTokerFactory.PaoDingTokenizerFactory"/>
      </analyzer>
    </fieldType>
5. Restart the Solr server and there you go!
Some references:
Lucene: http://lucene.apache.org/java/docs/
Solr: http://lucene.apache.org/solr/
Lucid Solr: enterprise-level support for Solr/Lucene: http://www.lucidimagination.com/blog/
Lucene in Action: http://www.manning.com/hatcher2/
Paoding Library: http://code.google.com/p/paoding/downloads/list
Books:
Solr 1.4 Enterprise Search Server: a great start, and covers everything about Solr
Building Search Applications: Lucene, LingPipe, and Gate: the foundations of Solr and how indexing works
9 comments:
Hi, thanks for your helpful article. I found it easy to follow and set up my Lucene to use the Paoding Tokenizer.
I noticed though that when I search, no results are returned for most queries. I tried using the Paoding tokenizer as the query field tokenizer but that didn't help.
Any comments would be great. Thanks, Aly
Make sure you pick the same tokenizer when you run the query.
Here is the thing:
At indexing time, the tokenizer splits the text into pieces of tokens, which are then stored in the index.
At query time, use the same tokenizer to split the query keyword into tokens, then use those tokens to query the index.
Hope it helps.
Also, you may turn on some tracing or logging to make sure your tokenizer library actually did get picked up.
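The point above can be sketched with a toy example (plain Java only, no PaoDing; the word-level vs. per-character splits are illustrative, not PaoDing's actual output):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenizerMismatch {
    public static void main(String[] args) {
        // Index time: a word-aware tokenizer stored word tokens in the index.
        Set<String> indexedTokens = new HashSet<>(Arrays.asList("你好", "世界"));

        // Query time with a DIFFERENT, per-character tokenizer:
        List<String> queryTokens = Arrays.asList("你", "好", "世", "界");
        boolean hit = false;
        for (String t : queryTokens) {
            if (indexedTokens.contains(t)) {
                hit = true;
            }
        }
        // None of the single-character tokens exist in the index.
        System.out.println(hit ? "match" : "no match"); // prints "no match"

        // Query time with the SAME tokenizer as indexing:
        List<String> sameTokens = Arrays.asList("你好", "世界");
        System.out.println(indexedTokens.containsAll(sameTokens)
                ? "match" : "no match"); // prints "match"
    }
}
```

This is exactly why a query that looks identical on screen can return 0 results: the index and the query are compared token by token, not character by character.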
Hi thanks for your quick answer.
I use Paoding when indexing documents and when analyzing the queries. My Field analysis tool shows the same results as your screenshot above.
In the schema I have added:
...
...
I just tried indexing this document:
http://zh.wikipedia.org/zh/Ada
And then search for Hello world, i.e. 你好世界 - 0 results.
What am I doing wrong?
Thanks a lot in advance
Sorry, schema.xml contains test.solr.PaodingTokerFactory.PaoDingTokenizerFactory as the analyzer for both type="index" and type="query" in the fieldType name="text".
Is there anywhere else I should add it?
Hi, if that's the case, I would say it's an encoding issue.
Try this:
Use the Luke toolkit to browse the index folder and make sure the index is correct. (Check this for reference: http://androidyou.blogspot.com/2010/09/hello-lucene-indexing-and-searching.html)
Also, if possible, change the Java server settings to make sure the URI encoding is UTF-8 or something else which covers Chinese characters.
For Tomcat, change server.xml.
Make sure the connector uses UTF-8 encoding for the URL, because Solr uses the URI to pass the query keyword.
Hope this helps.
    <Connector port="8888" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443" URIEncoding="UTF-8" />
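A quick way to see why URIEncoding matters: the same four characters become twelve percent-escaped bytes on the wire, and if Tomcat decodes those bytes as ISO-8859-1 (its default when URIEncoding is not set) the query string no longer matches anything in the index. A self-contained check with the plain JDK, no Solr involved:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UriEncodingCheck {
    public static void main(String[] args) {
        String query = "你好世界";

        // What a UTF-8 page puts into the request URI:
        String utf8 = URLEncoder.encode(query, StandardCharsets.UTF_8);
        System.out.println(utf8); // %E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C

        // Misdecoding those UTF-8 bytes as ISO-8859-1 turns 4 Chinese
        // characters into 12 bogus ones, so the tokenizer never sees
        // the original text.
        byte[] raw = query.getBytes(StandardCharsets.UTF_8);
        String misdecoded = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(misdecoded.length()); // 12, instead of 4
    }
}
```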
Hi.
I've used this to index some chinese text and it worked great.
When it came to running search queries on the index, it could only search once: on the second query Solr caught an exception saying the input reader was closed, when calling read on line 166 in the tokenizer.
Any ideas?
Liked the blog:)
Hi,
I was trying out this configuration. I have carefully followed every step and was able to create indexes in Solr with the fieldType mentioned above.
But when I search on this field, the search results are 0.
What might be the reason? My search query was:
P_CAT_NAME:"液晶电视机"
Thank you.
Awaiting reply
Regards
Rajani