Wednesday, September 8, 2010

Hello Lucene, Indexing and searching

I am reading the book Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Chapter one contains a basic Java program that does the "101" of indexing and searching.

Here is a basic walkthrough of how to do that.

1. Only one core jar file is needed for the engine to run. You can download it from http://www.apache.org/dyn/closer.cgi/lucene/java/

2. Open Eclipse, create a Java project, and add a reference to the core jar file.

3. Create a text file, put some content in it (the example below searches for the word "apache"), and save it as test.txt.

4. Write some Java code to do the indexing and searching.
 

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.*;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

public class Program {

    public static void main(String[] args) {
        try {
            IndexFile("/Users/androidyou/Documents/lucence/data/test.txt",
                    "/Users/androidyou/Documents/lucence/index");

            Search("/Users/androidyou/Documents/lucence/index","nonexistedkeyworld");
            Search("/Users/androidyou/Documents/lucence/index","apache");

        } catch (Exception e) {
            // Report the failure instead of silently swallowing it.
            e.printStackTrace();
        }
        System.out.println("done");
    }

    private static void Search(String indexpath, String keyword) throws Exception {
        IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File(indexpath)));
        try {
            System.out.println("Search keyword " + keyword);
            // Parse the keyword against the "content" field with the same analyzer used for indexing.
            Query query = new QueryParser(Version.LUCENE_30, "content",
                    new StandardAnalyzer(Version.LUCENE_30)).parse(keyword);

            // Ask for at most the 10 best-scoring hits.
            TopDocs docs = searcher.search(query, 10);
            System.out.println("hits " + docs.totalHits);
            for (ScoreDoc doc : docs.scoreDocs) {
                System.out.println("doc id " + doc.doc + " doc filename"
                        + searcher.doc(doc.doc).get("filename"));
            }
        } finally {
            searcher.close(); // release the reader opened by the searcher
        }
    }

    private static void IndexFile(String datafile, String indexfolder) throws CorruptIndexException, LockObtainFailedException, IOException {
        Analyzer a = new StandardAnalyzer(Version.LUCENE_30);
        Directory d = FSDirectory.open(new File(indexfolder));
        // Index at most the first 4000 terms of each field.
        MaxFieldLength mfl = new MaxFieldLength(4000);
        IndexWriter iw = new IndexWriter(d, a, mfl);
        try {
            Document doc = new Document();
            // A Reader-based field is tokenized and indexed, but not stored.
            Fieldable contentfield = new Field("content", new FileReader(datafile));
            doc.add(contentfield);
            // The file path is stored and indexed as a single, un-analyzed term.
            Fieldable namefield = new Field("filename", datafile, Store.YES, Index.NOT_ANALYZED);
            doc.add(namefield);

            iw.addDocument(doc);
            iw.commit();
        } finally {
            iw.close(); // always close the writer so the index lock is released
        }
    }
}

Note that every run adds the file to the index again, so if you run the program three times, there will be three "Documents" in the index repository.
Here is the output I get after the third run:

Search keyword nonexistedkeyworld
hits 0
Search keyword apache
hits 3
doc id 0 doc filename/Users/androidyou/Documents/lucence/data/test.txt
doc id 1 doc filename/Users/androidyou/Documents/lucence/data/test.txt
doc id 2 doc filename/Users/androidyou/Documents/lucence/data/test.txt
done
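
The hit count grows because every run adds the same file again, and adding a document never checks whether one is already there. The sketch below is a toy model in plain Java, not Lucene code: the class ToyDocStore and its methods are hypothetical names made up for illustration. It contrasts an append-only add with a "delete by key, then add" update. As far as I know, Lucene's IndexWriter.updateDocument(Term, Document) is built on the same delete-then-add idea, but treat that mapping as an assumption and check the javadoc before relying on it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for an index: a list of documents, each a field-name -> value map.
// Hypothetical illustration only; this is not how Lucene stores data.
class ToyDocStore {
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Append-only: every call adds one more document, duplicates included.
    void add(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    // Remove every document whose key field matches, then add the new one.
    void update(String keyField, Map<String, String> doc) {
        String key = doc.get(keyField);
        docs.removeIf(d -> key != null && key.equals(d.get(keyField)));
        add(doc);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("filename", "test.txt");

        ToyDocStore appendOnly = new ToyDocStore();
        ToyDocStore keyed = new ToyDocStore();
        for (int run = 0; run < 3; run++) {
            appendOnly.add(doc);            // what the program above does on each run
            keyed.update("filename", doc);  // replaces the earlier copy instead
        }
        System.out.println("append-only: " + appendOnly.size());  // append-only: 3
        System.out.println("keyed update: " + keyed.size());      // keyed update: 1
    }
}
```

With the keyed variant, three runs leave one document instead of three, because each run replaces the previous copy that has the same filename.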

You can also download Luke, the Lucene index toolkit, and open the index directory with it.

[Screenshot: the index opened in Luke]

From the screenshot above, you can see that there are 3 documents in the index. Each document has two fields, and there are 58 + 1 = 59 terms in total.

For the content field: by default, the constructor Field(name, reader) only indexes the field and does not store it.
When you click the Documents tab, you can browse the documents individually. You can also verify that only filename is stored in the index. For the content field there are just terms (the indexed content).

[Screenshot: the Documents tab in Luke, 2010-09-08]
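
To make the difference between "indexed" and "stored" more concrete, here is a toy model in plain Java. It is only a sketch under my own assumptions and not Lucene's implementation: the class ToyFieldIndex, its methods, and the simple lowercase-and-split tokenizer are all hypothetical. An indexed field contributes terms to a lookup table so it can be searched, while a stored field keeps its original value so it can be read back from a hit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of indexed vs. stored fields. Hypothetical; not Lucene code.
class ToyFieldIndex {
    // "field:term" -> ids of the documents containing that term (the indexed side)
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    // doc id -> original values of the stored fields (the stored side)
    private final List<Map<String, String>> stored = new ArrayList<>();

    int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    // Every field is indexed; 'store' keeps the raw value, 'analyze' splits it into terms.
    void addField(int docId, String name, String value, boolean store, boolean analyze) {
        if (store) {
            stored.get(docId).put(name, value);
        }
        String[] terms = analyze ? value.toLowerCase().split("\\W+") : new String[] { value };
        for (String term : terms) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Exact term lookup; returns the ids of matching documents.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.<Integer>emptySet());
    }

    // Returns the stored value, or null if the field was not stored.
    String get(int docId, String field) {
        return stored.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyFieldIndex idx = new ToyFieldIndex();
        int id = idx.newDocument();
        idx.addField(id, "content", "Apache Lucene is a search library", false, true);
        idx.addField(id, "filename", "test.txt", true, false);

        System.out.println(idx.search("content", "apache")); // [0]
        System.out.println(idx.get(id, "filename"));         // test.txt
        System.out.println(idx.get(id, "content"));          // null
    }
}
```

The content field can be found by its terms, but its original text cannot be read back, which matches what Luke shows: only filename has a stored value.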
