Wednesday, November 16, 2011

How to: Install and Test Apache Mahout on Hadoop

Mahout and Hadoop are both Java libraries at heart; Mahout uses the Maven tool to build the source code and manage its dependencies.
So we need to make sure we have the following bits ready.

  • JDK
  • Maven
  • Hadoop
  • Mahout

 

I will start from a fresh CentOS install, then get all of those pieces ready step by step.

Install JDK.
Go to the Oracle JDK download site, http://www.oracle.com/technetwork/java/javase/downloads/index.html. I still prefer Java 6 over 7; pick one of the .bin links, download it, and run it directly. I will put Java under the /usr/lib/jdk6 folder.
Export the bin directory to PATH, and point the JAVA_HOME environment variable at the jdk6 folder.
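For example, appending something like this to ~/.bashrc (or /etc/profile) matches the jdk6 location used above:

export JAVA_HOME=/usr/lib/jdk6
export PATH=$JAVA_HOME/bin:$PATH
java -version    # should report a 1.6 JVM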

Install Maven
Download the binary package from http://maven.apache.org/download.html; here I chose version 2.2.1, which is more stable.

Extract the archive and link it to /usr/lib/maven, then export /usr/lib/maven/bin to the PATH. Now you can run mvn --version to make sure it works; at the very least we should get the version back.
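Roughly, assuming the 2.2.1 binary tarball is in the current directory (the archive name depends on the version you picked):

tar xzf apache-maven-2.2.1-bin.tar.gz -C /usr/lib
ln -s /usr/lib/apache-maven-2.2.1 /usr/lib/maven
export PATH=/usr/lib/maven/bin:$PATH
mvn --version    # prints the Maven and Java versions if everything is wired up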
For settings like proxies, see the Maven documentation: http://maven.apache.org/download.html#Maven_Documentation

Install Hadoop.
If you want to install Hadoop in fully distributed mode, check out: How to: install and config hadoop to run in fully distributed Mode, Centos.

Here we just have one VM, so let's keep it simple for the Mahout testing. I will use the Cloudera distribution.

Download the repo file for CentOS 5, http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo, and copy it to the yum repo directory.
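For instance, assuming the standard yum configuration layout:

cd /etc/yum.repos.d
wget http://archive.cloudera.com/redhat/cdh/cloudera-cdh3.repo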
Now just search for hadoop and you will see all the components; we will use the hadoop-0.20-conf-pseudo one:
yum install hadoop-0.20-conf-pseudo


Once done, go to the /usr/lib/hadoop/conf directory and change JAVA_HOME to /usr/lib/jdk6 in hadoop-env.sh.
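It is a one-line change; the relevant line in hadoop-env.sh ends up as:

# /usr/lib/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jdk6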


Then, as the hdfs user, format the namenode.
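Something along these lines, assuming the Cloudera packages created the hdfs system user:

sudo -u hdfs hadoop namenode -format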

Then start the daemons under /etc/init.d/hadoop-*; run jps and you should see all the Java processes there.
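A quick way to do that, relying on the init scripts installed by the CDH package (the exact script names may differ):

for svc in /etc/init.d/hadoop-*; do $svc start; done
jps    # expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (plus Jps itself)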

Now we can run a simple test: go to /usr/lib/hadoop and copy a file into HDFS, as sketched below.
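A minimal check, with a hypothetical HDFS directory and any small local file:

hadoop fs -mkdir /test             # hypothetical directory name
hadoop fs -put somefile.txt /test  # any local file will do
hadoop fs -ls /test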
Open a browser and go to http://localhost:50070; you can see the file we just uploaded is there.
Now HDFS is ready. We can run a MapReduce job to make sure Hadoop is fully working.
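For example, the pi estimator from the examples jar that ships with the Hadoop package (the jar name and location can differ between distributions):

cd /usr/lib/hadoop
hadoop jar hadoop-examples.jar pi 2 100    # 2 maps, 100 samples per map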

If there are no errors, we are all set; Hadoop is ready.

Install Mahout.

Download the source code; you can use svn to check out a copy of trunk:

svn co http://svn.apache.org/repos/asf/mahout/trunk

and copy the code to /usr/lib/mahout,

then run mvn install -DskipTests to compile the source; mvn will figure out the dependencies and fetch those jars for you automatically.
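Put together, the checkout and build look roughly like this (mahout-trunk is just a placeholder name for the working copy):

svn co http://svn.apache.org/repos/asf/mahout/trunk mahout-trunk
cp -r mahout-trunk /usr/lib/mahout
cd /usr/lib/mahout
mvn install -DskipTests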


It takes time to download all the jars; your mileage may vary. Mine took several minutes.

Now export /usr/lib/mahout/bin to PATH, and then we can run mahout from the shell.


If you cannot execute mahout, give the script execute permission.
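In other words, something like:

export PATH=/usr/lib/mahout/bin:$PATH
chmod +x /usr/lib/mahout/bin/mahout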


Running mahout with no arguments will list all the available algorithms and options.


Then go to the examples folder and run mvn compile.
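That is:

cd /usr/lib/mahout/examples
mvn compile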



Now you can run one of the examples, like the one that classifies the 20 newsgroups data set.
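The example scripts live under examples/bin in the source tree; the 20 newsgroups one is classify-20newsgroups.sh (the script name can differ between Mahout versions):

cd /usr/lib/mahout/examples/bin
./classify-20newsgroups.sh    # or: sh classify-20newsgroups.sh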



Here we did not specify HADOOP_HOME, so it will run locally. The shell script will download the data, prepare it, and then run the classifier.
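If you do want it to run against the pseudo-distributed cluster instead, the mahout driver script picks Hadoop up from the environment; roughly, with the CDH layout used above:

export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=/usr/lib/hadoop/conf
./classify-20newsgroups.sh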



When done, it will show the confusion matrix for the test data.


Comments:

shubhangi pagar said...

Very nice blog. Actually,
I have a few queries:
1. How do I export /usr/lib/mahout/bin to PATH?
2. I am done with the installation by following your blog. Almost everything went fine, but I get an error for the command "sh classify-20newsgroups.sh":
sh: classify-20newsgroups.sh: No such file or directory

I am new to this, so I am not sure what this means.

3. Is it necessary to run every command as root?

Thanks a lot :)

selvi said...

I am getting the following error:
[INFO] Reactor Summary:
[INFO]
[INFO] Mahout Build Tools ................................ SUCCESS [3.160s]
[INFO] Apache Mahout ..................................... SUCCESS [0.934s]
[INFO] Mahout Math ....................................... SUCCESS [48.081s]
[INFO] Mahout Core ....................................... SUCCESS [1:02.933s]
[INFO] Mahout Integration ................................ SUCCESS [2:49.722s]
[INFO] Mahout Examples ................................... SUCCESS [2:01.534s]
[INFO] Mahout Release Package ............................ SUCCESS [0.092s]
[INFO] Mahout Math/Scala wrappers ........................ FAILURE [2:03.532s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8:50.657s
[INFO] Finished at: Fri Sep 05 19:35:01 IST 2014
[INFO] Final Memory: 55M/312M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project mahout-math-scala: Could not resolve dependencies for project org.apache.mahout:mahout-math-scala:jar:0.9: Could not transfer artifact org.scalatest:scalatest_2.9.2:jar:1.9.1 from/to central (http://repo.maven.apache.org/maven2): GET request of: org/scalatest/scalatest_2.9.2/1.9.1/scalatest_2.9.2-1.9.1.jar from central failed: Connection reset -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :mahout-math-scala

Sanjaya said...

Akila Siriweera

Interesting, rare, and worthwhile article; it works fine.
Thanks to the creator.

Note: http://svn.apache.org/repos/asf/mahout/trunk

You need to check out the whole contents of that folder.
Thumbs up.

Niki Praha said...

Hi, when I tried to run the example script, the data gets downloaded into /tmp/mahout-work-root/20news-all,
but I am getting an error:
put: `/tmp/mahout-work-root/20news-all': No such file or directory

Kindly help me out.

Thank you so much in advance.
