1. Download Nutch from the Apache website. I picked apache-nutch-1.1-bin.tar.gz. Download it and extract it to a local folder, such as ~/apache-nutch
2. Make sure JAVA_HOME is defined. On a Mac, JAVA_HOME will be
/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home/
You can set it by running export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home/
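Before moving on, it can help to confirm that the variable really points at a Java install. The following is only a minimal sketch: the helper name java_home_ok is made up for this example, and it assumes a usable Java home contains an executable bin/java.

```shell
# Hypothetical helper: succeeds if the given folder looks like a Java home,
# that is, it contains an executable bin/java.
java_home_ok() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Report on the current JAVA_HOME without changing it.
if java_home_ok "${JAVA_HOME:-}"; then
  echo "JAVA_HOME looks fine: $JAVA_HOME"
else
  echo "JAVA_HOME is not set or has no bin/java: ${JAVA_HOME:-<unset>}"
fi
```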
3. Create a directory named Output to hold the crawled and indexed content, and create a file such as test.url that contains the list of root URLs. I put http://androidyou.blogspot.com in the file.
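Step 3 can be sketched as the commands below, assuming they are run from the extracted Nutch folder. The names Output and test.url are the ones used in this walkthrough; the one-URL-per-line layout is my assumption for the seed file.

```shell
# Create the folder that will hold the crawl data and the index.
mkdir -p Output

# Write the seed list with a single root URL.
printf '%s\n' 'http://androidyou.blogspot.com' > test.url

# Show the seed list to confirm its content.
cat test.url
```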
4. Edit conf/crawl-urlfilter.txt so that only URLs beginning with http://androidyou.blogspot.com (in my example) are crawled and indexed.
# accept hosts in MY.DOMAIN.NAME
+^http://androidyou.blogspot.com
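To get a feel for which URLs this filter line accepts, the same pattern can be tried with grep. This is only a rough sketch: Nutch evaluates the pattern as a Java regular expression, not with grep, and the helper name accepts_url is made up for this example. Note that the unescaped dots match any character, so writing them as \. would make the rule stricter.

```shell
# Hypothetical helper: succeeds if the URL matches the pattern from
# crawl-urlfilter.txt (without the leading "+", which marks an accept rule).
accepts_url() {
  printf '%s\n' "$1" | grep -Eq '^http://androidyou.blogspot.com'
}

# A post on the blog matches the pattern; an unrelated site does not.
accepts_url 'http://androidyou.blogspot.com/2010/09/some-post.html' && echo accepted
accepts_url 'http://example.com/' || echo rejected
```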
5. Edit conf/nutch-default.xml and assign a default user agent name.
<property>
<name>http.agent.name</name>
<value>AndroidyouTestAgent</value>
</property>
6. Start the crawling and indexing process by entering
bin/nutch crawl test.url -dir Output -depth 1 >& log.log
You can check log.log to see whether anything went wrong.
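Instead of reading the whole log, you can search it for suspicious lines. This is a minimal sketch: the helper name scan_log is made up, and matching on the words "error" and "exception" is my assumption about what a failed run would print.

```shell
# Hypothetical helper: list log lines (with line numbers) that mention
# an error or an exception, ignoring case.
scan_log() {
  grep -inE 'error|exception' "$1"
}

# Only scan when the log file from the crawl step exists.
if [ -f log.log ]; then
  scan_log log.log || echo "no error or exception lines found"
fi
```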
7. Verify that the index was created by running ls Output
You should see 5 folders:
crawldb index indexes linkdb segments
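The same check can be scripted. This is a small sketch, not part of Nutch: the helper name check_crawl_output is made up, and the folder names are the five listed above.

```shell
# Hypothetical helper: report any expected crawl folder that is missing.
# Returns 0 when all five folders exist, 1 otherwise.
check_crawl_output() {
  dir="$1"
  missing=0
  for name in crawldb index indexes linkdb segments; do
    if [ ! -d "$dir/$name" ]; then
      echo "missing: $dir/$name"
      missing=1
    fi
  done
  return "$missing"
}

if check_crawl_output Output; then
  echo "all five folders are present"
fi
```

If a folder is reported as missing, log.log from the previous step is the first place to look.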
8. Download Luke, the Lucene index toolbox, to view the index and documents: http://code.google.com/p/luke/
Run lukeall.jar and browse to the Output/index folder.
You will see how many documents were crawled and indexed, and what the top terms are.
Now the crawling and indexing process is done. Let's host the search function in Tomcat.
A. Download and install Tomcat, then start the service.
B. Copy nutch-1.1.war to the Tomcat webapps folder. (This deploys the app to Tomcat.)
C. Edit webapps/nutch-1.1/WEB-INF/classes/nutch-site.xml to tell the runtime where the index folder is.
Here is my file.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>searcher.dir</name>
<value>/Users/androidyou/Documents/apache-nutch/Output</value>
</property>
</configuration>
Then restart Tomcat.
Browse to http://localhost:8080/nutch-1.1/en
Search for androidyou and you get one result.
You can click cache or explain to see more details, such as the Lucene document fields.
Now you can push the index to Solr.