Tuesday, August 31, 2010

Nutch Crawling and Searching Test on Mac/Linux

Nutch is open-source web-search software built on Lucene Java. It can crawl websites (intranet or internet) and exposes a search UI for the crawled and indexed content.
Within the Lucene ecosystem, Nutch is a great component for content generation. It can also work together with Solr to provide richer query functionality, such as faceting.
Let me run a quick tutorial: index my blog page, then turn on the query interface by deploying the webapp to Tomcat.

1. Download the Nutch bits from the Apache website. I'll pick apache-nutch-1.1-bin.tar.gz; download it and extract it to a local folder, like ~/apache-nutch.

2. Make sure JAVA_HOME is defined. On a Mac, JAVA_HOME will be

/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home/

If it is not set, you may run export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home/
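Before going further, it can help to confirm that JAVA_HOME really points at a Java installation. The following is only a minimal sketch: check_java_home is a hypothetical helper name of my own, and the check (an executable bin/java under the directory) is an assumption about what a usable Java home looks like.

```shell
# Hypothetical helper (not part of Nutch): reports whether a directory
# looks like a Java home, i.e. it contains an executable bin/java.
check_java_home() {
    if [ -n "$1" ] && [ -x "$1/bin/java" ]; then
        echo "ok: $1"
        return 0
    fi
    echo "not a Java home: ${1:-<unset>}"
    return 1
}

# Check the current setting; "|| true" keeps the script going either way.
check_java_home "${JAVA_HOME:-}" || true
```

If it reports that the directory is not a Java home, fix the export line above before starting the crawl.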

3. Create a directory named Output to hold the crawled and indexed content, and create a file like test.url that contains the list of root URLs. I will put http://androidyou.blogspot.com in the file.
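Step 3 can be done from the shell roughly like this. It is a small sketch that assumes you are already inside the extracted Nutch folder; the names Output and test.url are the ones used in this post.

```shell
# Directory that will hold the crawled and indexed content.
mkdir -p Output

# Seed file: one root URL per line.
printf '%s\n' "http://androidyou.blogspot.com" > test.url

# Show what was written.
cat test.url
```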

4. Edit conf/crawl-urlfilter.txt so that only URLs beginning with http://androidyou.blogspot.com (in my example) are indexed.

# accept hosts in MY.DOMAIN.NAME
+^http://androidyou.blogspot.com
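To get a feel for which URLs the accept rule lets through, you can try the same pattern with grep. This is only an approximation: Nutch evaluates the rule as a Java regular expression, while grep -E uses POSIX extended syntax, and matches_filter is a hypothetical helper of my own. The sample URLs are made up for illustration.

```shell
# The pattern from the "+" rule above, without the leading "+".
pattern='^http://androidyou.blogspot.com'

# Hypothetical helper: succeeds when the URL matches the pattern.
matches_filter() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Illustrative URLs only; "+" means accepted, "-" means not matched.
for url in \
    "http://androidyou.blogspot.com/" \
    "http://androidyou.blogspot.com/2010/08/some-post.html" \
    "http://example.com/"
do
    if matches_filter "$url"; then
        echo "+ $url"
    else
        echo "- $url"
    fi
done
```

Note that the dots in the pattern are unescaped, so they match any character. For a quick test that is harmless, but a stricter rule would escape them as \. in the filter file.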

5. Edit conf/nutch-default.xml and assign a default user agent name.

<property>
<name>http.agent.name</name>
<value>AndroidyouTestAgent</value>
</property>

6. Start the crawling and indexing process by entering

bin/nutch crawl test.url -dir Output -depth 1 >& log.log
You may view log.log to see if anything went wrong.
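Rather than reading the whole log, a quick scan for suspicious lines is often enough. The sketch below assumes that problems show up as lines containing words like "exception", "error" or "failed"; that keyword list is my guess, not something Nutch guarantees, and scan_log is a hypothetical helper name.

```shell
# Hypothetical helper: print numbered lines that look like problems.
# grep exits non-zero when nothing matches, so that case is reported explicitly.
scan_log() {
    grep -inE 'exception|error|failed' "$1" || echo "no suspicious lines in $1"
}

# Only scan if the crawl has actually produced a log.
if [ -f log.log ]; then
    scan_log log.log
fi
```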

7. Verify that the index was created: ls Output
You should be able to see 5 folders like

crawldb index indexes linkdb segments
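The same check can be scripted. This is a small sketch: check_crawl_dir is a hypothetical helper, and the five folder names are simply the ones listed above.

```shell
# Hypothetical helper: verify that the five expected folders exist
# under the given crawl directory. Returns the number of missing folders.
check_crawl_dir() {
    missing=0
    for sub in crawldb index indexes linkdb segments; do
        if [ -d "$1/$sub" ]; then
            echo "found   $sub"
        else
            echo "missing $sub"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

check_crawl_dir Output || echo "crawl output looks incomplete"
```

If anything is reported missing, look at log.log before moving on to the search setup.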

8. Download Luke, the Lucene index toolkit, to view the index and its documents: http://code.google.com/p/luke/
Run lukeall.jar and browse to the Output/index folder.

You will get an idea of how many documents were crawled and indexed, and what the top terms are.

Now the crawling and indexing process is done. Let's host the search function in Tomcat.

A. Download and install Tomcat, then start up the service.
B. Copy nutch-1.1.war to the Tomcat webapps folder. (This deploys the app to Tomcat.)
C. Edit webapps/nutch-1.1/WEB-INF/classes/nutch-site.xml to tell the runtime where the index folder is.
Here is my file.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>searcher.dir</name>
<value>/Users/androidyou/Documents/apache-nutch/Output</value>
</property>
</configuration>
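If you prefer not to edit the file by hand, the same content can be generated from the shell. This is just a sketch: nutch_site_xml is a hypothetical helper, and it assumes the Output folder sits in the current directory. It writes nothing itself and only prints to the terminal.

```shell
# Hypothetical helper: print a nutch-site.xml whose searcher.dir
# points at the directory given as the first argument.
nutch_site_xml() {
    cat <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>searcher.dir</name>
<value>$1</value>
</property>
</configuration>
EOF
}

# searcher.dir should be an absolute path, so build it from the current directory.
nutch_site_xml "$(pwd)/Output"
```

Redirect the output into webapps/nutch-1.1/WEB-INF/classes/nutch-site.xml under your Tomcat folder once it looks right.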

Then restart Tomcat.
Browse to the URL http://localhost:8080/nutch-1.1/en

Search for androidyou and you get one result.

You can click cache or explain to get more details, like the Lucene document fields.

Now you can push the index results to Solr.


