Thursday, February 17, 2011

How to: install and configure Hadoop to run in fully distributed mode, CentOS

To run Hadoop in fully distributed mode, we need at least two VMs. I will configure a Hadoop cluster with two DataNodes, one NameNode, and a Secondary NameNode.

Machine Name             Role
Home (192.168.209.130)   Primary NameNode + Secondary NameNode, Slave (DataNode, TaskTracker), JobTracker
LA (192.168.209.132)     Slave (DataNode, TaskTracker)

There are a few prerequisites for running Hadoop:

  1. Create a hadoop user and group on each server.
  2. On the Home node, generate an SSH key pair and copy the public key to both servers, so the hadoop user on Home can log in to each server without a password (see the sketch after this list).
  3. Install Java 1.6.
  4. Make the configuration changes (hosts file, files in the hadoop conf directory).
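
For steps 2 and 4, a minimal sketch looks like the following; the IPs come from the table above, and ssh-copy-id is just one way to push the key, so adjust to your environment.

# /etc/hosts on BOTH servers, so Home and LA can resolve each other
192.168.209.130   Home
192.168.209.132   LA

# on Home, as the hadoop user: generate a key pair and push the public key
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id hadoop@Home
ssh-copy-id hadoop@LA
# verify passwordless login works
ssh hadoop@LA hostname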


For the configuration, there are basically three XML files, located in the hadoop/conf directory.

hdfs-site.xml holds the essential HDFS configuration. Since we only have two data nodes, change the replication factor to 2, and set the path of the NameNode metadata directory (dfs.name.dir); keep it inside the hadoop directory, or anywhere else where the hadoop user has write access.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
     <name>dfs.permissions</name>
     <value>false</value>
  </property>
  <property>
     <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
     <name>dfs.name.dir</name>
    <value>/usr/lib/hadoop/cache/hadoop/dfs/name</value>
  </property>
</configuration>


core-site.xml assigns the default file system, i.e. the NameNode address; I use the Home server here.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://Home:8020</value>
  </property>

  <property>
     <name>hadoop.tmp.dir</name>
     <value>/usr/lib/hadoop/cache/${user.name}</value>
  </property>
</configuration>


mapred-site.xml assigns the JobTracker address:


<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>home:8021</value>
  </property>
</configuration>

Then how do we tell the cluster who its members are? The magic is in the masters and slaves files under the conf directory.

masters
contains the secondary NameNode; I put Home here.

slaves
lists the slave (data) nodes; both servers, Home and LA.
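
With the host names above, the two files each contain one host name per line:

# conf/masters  (where the Secondary NameNode runs)
Home

# conf/slaves   (the data nodes)
Home
LA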


After that, copy the hadoop directory, preserving permissions, to the LA server.
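
A simple way to do that is scp with a recursive copy and preserved permissions; /usr/lib/hadoop is assumed from the JVM arguments shown later, so use your actual install path and make sure the hadoop user can write to the target directory on LA.

scp -rp /usr/lib/hadoop hadoop@LA:/usr/lib/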

Then on the name node (Home in this setup), run "bin/hadoop namenode -format".
Once it finishes, make sure there were no errors.
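
For reference, assuming hadoop is installed under /usr/lib/hadoop, the format step is roughly:

cd /usr/lib/hadoop
bin/hadoop namenode -format
ls cache/hadoop/dfs/name/current   # the dfs.name.dir from hdfs-site.xml should now be populated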

Time to start up the cluster.
Just run "bin/start-all.sh"; the script reads the masters and slaves files to determine each server's role and SSHes to each server to start the corresponding JVM.
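
After the script finishes, a quick way to verify which daemons came up is jps on each server:

bin/start-all.sh
jps           # on Home, expect: NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker
ssh LA jps    # on LA, expect: DataNode, TaskTracker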


Basically, as I mentioned in the beginning, the Home server runs five roles, each as its own JVM process.
Here are the five JVM command lines:

NameNode

java  -Dhadoop.log.dir=/usr/lib/hadoop/bin/../logs -Dhadoop.log.file=hadoop-hadoop-namenode-home.log          -Dhadoop.home.dir=/usr/lib/hadoop/bin/.. -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.library.path=/usr/lib/hadoop/bin/../lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath org.apache.hadoop.hdfs.server.namenode.NameNode

DataNode

java  -Dhadoop.log.dir=/usr/lib/hadoop/bin/../logs -Dhadoop.log.file=hadoop-hadoop-datanode-home.log          -Dhadoop.home.dir=/usr/lib/hadoop/bin/.. -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.library.path=/usr/lib/hadoop/bin/../lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath  org.apache.hadoop.hdfs.server.datanode.DataNode

Secondary NameNode

java  -Dhadoop.log.dir=/usr/lib/hadoop/bin/../logs -Dhadoop.log.file=hadoop-hadoop-secondarynamenode-home.log -Dhadoop.home.dir=/usr/lib/hadoop/bin/.. -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.library.path=/usr/lib/hadoop/bin/../lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode

JobTracker

java  -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/usr/lib/hadoop/bin/../logs -Dhadoop.log.file=hadoop-hadoop-jobtracker-home.log -Dhadoop.home.dir=/usr/lib/hadoop/bin/.. -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.library.path=/usr/lib/hadoop/bin/../lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath org.apache.hadoop.mapred.JobTracker

TaskTracker

java -Xmx1000m -Dhadoop.log.dir=/usr/lib/hadoop/bin/../logs -Dhadoop.log.file=hadoop-hadoop-tasktracker-home.log -Dhadoop.home.dir=/usr/lib/hadoop/bin/.. -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.library.path=/usr/lib/hadoop/bin/../lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath org.apache.hadoop.mapred.TaskTracker

Once everything is up, you can access http://home:50070 to see that the cluster is running with two live nodes.
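
If you prefer the command line, the dfsadmin report should show the same thing:

bin/hadoop dfsadmin -report   # should list two live datanodes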

More Hadoop Blogs

  1. How to: create a hadoop map/reduce job tutorial
  2. How to: install and config hadoop to run in fully distributed Mode, Centos
  3. How to : setup SSH authentication over keys, hadoop installation
