Monday, February 21, 2011

How to: create a hadoop map/reduce job tutorial

After we install and config hadoop to run in fully distributed Mode, Centos, it is time to write a hello-world program to run on top of the hadoop cluster in distributed mode.

I use a 1.8 MB text file of US ZIP-code data that I downloaded; here is the file layout.

"01","35010","AL","NEW SITE",85.951086,32.941445,19942,0.004935

We will write a map/reduce job to do a basic summary: count the number of cities in each state.
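Before writing any Hadoop code, the same summary can be prototyped on one machine: the map step extracts the third CSV field (the state), and the reduce step counts per key. A quick shell sketch on a few sample lines (the CA line is made up for illustration):

```shell
# map: emit the 3rd CSV field (the state) for every line;
# reduce: sort, then count occurrences per state
printf '%s\n' \
  '"01","35004","AL","ACMAR",86.51557,33.584132,6055,0.001499' \
  '"01","35010","AL","NEW SITE",85.951086,32.941445,19942,0.004935' \
  '"06","90001","CA","LOS ANGELES",118.24,33.97,50000,0.01' \
  | cut -d, -f3 | sort | uniq -c
```

The Hadoop job below does exactly this, only with the map and reduce halves running on different machines.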

First, a configuration change: since the file is only 1.8 MB and we have two datanodes (task nodes), we override the default HDFS block size down to 500 KB in hdfs-site.xml, so that both task nodes can work on the file at the same time.

  <property>
    <name>dfs.block.size</name>
    <value>512000</value>
    <description>The default block size for new files.</description>
  </property>

Then we use the hadoop command line to push the file to the cluster.
hadoop fs -mkdir raw
hadoop fs -copyFromLocal zip.txt raw

Then from the NameNode web interface, we can tell that zip.txt has 4 blocks (1.8 MB at a 500 KB block size).

Now it is time to write a simple map/reduce Java program. Use your favorite IDE (Eclipse here): create a Java project and add the hadoop-*.jar libraries to the project build path. Here is my basic program; I put some comments inline.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class statecount extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        if (args == null || args.length != 2) {
            System.out.println("statecount <input> <output>");
            return -1;
        }

        Configuration conf = this.getConf();
        JobConf jobconf = new JobConf(conf, statecount.class);
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);

        jobconf.setJobName("Hello world/City count by State");

        FileInputFormat.setInputPaths(jobconf, in);
        FileOutputFormat.setOutputPath(jobconf, out);

        // one class serves as both the mapper and the reducer here
        jobconf.setMapperClass(SMapClass.class);
        jobconf.setReducerClass(SMapClass.class);
        jobconf.setMapOutputKeyClass(Text.class);
        jobconf.setMapOutputValueClass(LongWritable.class);
        jobconf.setOutputKeyClass(Text.class);
        jobconf.setOutputValueClass(Text.class);

        JobClient.runJob(jobconf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = Configuration(), new statecount(), args);
        System.exit(res);
    }

    public static class SMapClass extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, LongWritable>,
            Reducer<Text, LongWritable, Text, Text> {

        public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output,
                Reporter reporter) throws IOException {
            // input line: "01","35004","AL","ACMAR",86.51557,33.584132,6055,0.001499
            String s = value.toString();
            // the third field is the state; collect ("AL", 1) here
            output.collect(new Text(s.split(",")[2]), new LongWritable(1));
        }

        public void reduce(Text key, Iterator<LongWritable> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            long i = 0;
            while (values.hasNext()) {
                i += 1;
      ; // I forgot this line once: the loop never advances and reduce never finishes :(
            }
            output.collect(key, new Text("#number of cities " + i));
        }
    }
}


Then click the export menu to export the classes to a jar file, and copy it to the name server.
Run "hadoop jar helloworld.jar statecount raw/zip.txt outputraw" on the name server (the job tracker server).

From the JobTracker web UI on port 50030, we can see two task nodes running.


The whole job statistics:

From the console where you submit the job, you can see the job output.


After the job is done, go to the filesystem browser on the :50070 web portal; we can see the final result.

If you want to make sure the reduce tasks are distributed, you can put the hostname into the reduce output and run it again:

output.collect(key, new Text("#number of cities " + i + " on host " + InetAddress.getLocalHost().getHostName()));

and assign the number of reduce tasks:

jobconf.setNumReduceTasks(4);

In the output hdfs folder, you will see four files, one per reduce task.

Click one file: this reduce task was executed on host Home.


Another one ran on host LA.


More Hadoop Blogs

  1. How to: create a hadoop map/reduce job tutorial
  2. How to: install and config hadoop to run in fully distributed Mode, Centos
  3. How to : setup SSH authentication over keys, hadoop installation

How to: eclipse win64, java.lang.UnsatisfiedLinkError: Can't load library: /C:/eclipse/eclipse64/eclipse/configuration/org.eclipse.osgi/bundles/292/1/.cp/lib/linux/x86/

I found this is a very tricky issue when you are using the jdeclipse add-in and switching between win32 and win64.

One day I got this error, as attached:
eclipse win64 , java.lang.UnsatisfiedLinkError: Can't load library: /C:/eclipse/eclipse64/eclipse/configuration/org.eclipse.osgi/bundles/292/1/.cp/lib/linux/x86/


How to fix it:

  1. Make sure you download the right version of Eclipse: on Windows 64, please download the 64-bit Eclipse and use a 64-bit JVM.
    (in eclipse.ini, put a -vm line pointing at your 64-bit jre/bin/java.exe)
  2. After you finish the Eclipse download, download jdeclipse and unzip it into the eclipse folder.
  3. When you open a workspace that was generated by 32-bit Eclipse, please choose a new workspace, or manually edit the prefs files under [workspace]\.metadata\.plugins\org.eclipse.core.runtime\.settings, replacing x86 with x86_64.
    Otherwise you will be in a situation where a 64-bit host tries to load a 32-bit dll (Eclipse figures out the wrong system architecture, cached by the previous editor).
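For step 1, the eclipse.ini entry looks like this (the JRE path is just a placeholder; point it at your own 64-bit JRE):

```
-vm
C:\Program Files\Java\jre6\bin\javaw.exe
```

Note that -vm and the path must be on two separate lines, and the -vm pair has to appear before any -vmargs line.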

Sunday, February 20, 2011

Mac OSX stacks on Windows 7

A stack is a Dock item that gives you fast access to a folder. When you click a Stack, the files within spring from the Dock in a fan or a grid, depending on the number of items (or the preference you set). Snow Leopard starts you off with three default Stacks: one for Applications, one for downloads, and the other for documents.  

Here is an Applications stack:


If you want the same feature in Windows 7, there is a free add-on called 7Stack; you can get an effect like this:


Steps: download 7Stack.

To create a stack: basically choose a folder, set the options, then click Generate; it creates a shortcut that you can drag and drop onto the taskbar.


Thursday, February 17, 2011

How to: install and config hadoop to run in fully distributed Mode, Centos

In order to run hadoop in fully distributed mode, we need at least two VMs. I will config a hadoop cluster with two data nodes, one namenode, and a secondary namenode.

MachineName  Role
Home         Primary + Secondary Namenode, Slave (Data Node), Job Tracker Node
LA           Slave Node (Data Node)

There are some prerequisites to run hadoop:

  1. Create the hadoop user and group on each server.
  2. On the Home node, generate an ssh key and export the public key to both servers, so the hadoop user on the Home node can log in to each server without a password.
  3. Install Java 1.6.
  4. Configuration changes (hosts file, hadoop conf directory).

For the configuration, there are basically 3 xml files located in the hadoop/conf directory.

hdfs-site.xml holds the essential HDFS configuration: since we only have two data nodes, change the replication factor to 2, and assign the path for the namenode data. You'd better keep that path inside the hadoop directory, or somewhere else hadoop is sure to be able to update.

  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name></name>
    <value>/usr/lib/hadoop/name</value>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
  </property>

core-site.xml assigns the cluster's default file system; I use the name server Home here:

  <property>
    <name></name>
    <value>hdfs://home:9000</value>
  </property>

Then how can we tell the cluster who its members are? The magic is the masters and slaves files under the conf directory.

In masters, just put the secondary name server; I will put Home here.

In slaves, put both servers: home and LA.
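For this cluster, then, the two files end up containing nothing but host names, one per line:

```
conf/masters:
home

conf/slaves:
home
la
```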


After that, copy the hadoop directory (preserving permissions) to the LA server.

Then on the name server, which is home here, run "bin/hadoop namenode -format".
Once done, make sure there were no errors.

Time to start up the cluster.
How? Just run "bin/"; the script reads the masters/slaves lists for the server roles and sshes to each server to start the JVMs.


Basically, as I mentioned at the beginning, the Home server runs 5 roles, each as its own JVM process.
Here are the 5 JVM command lines (classpaths trimmed for readability).

name node


java  -Dhadoop.log.dir=/usr/lib/hadoop/bin/../logs -Dhadoop.log.file=hadoop-hadoop-namenode-home.log          -Dhadoop.home.dir=/usr/lib/hadoop/bin/.. -Dhadoop.root.logger=INFO,DRFA -Djava.library.path=/usr/lib/hadoop/bin/../lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath org.apache.hadoop.hdfs.server.namenode.NameNode

data node

java  -Dhadoop.log.dir=/usr/lib/hadoop/bin/../logs -Dhadoop.log.file=hadoop-hadoop-datanode-home.log          -Dhadoop.home.dir=/usr/lib/hadoop/bin/.. -Dhadoop.root.logger=INFO,DRFA -Djava.library.path=/usr/lib/hadoop/bin/../lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath  org.apache.hadoop.hdfs.server.datanode.DataNode

secondary name node

java  -Dhadoop.log.dir=/usr/lib/hadoop/bin/../logs -Dhadoop.log.file=hadoop-hadoop-secondarynamenode-home.log -Dhadoop.home.dir=/usr/lib/hadoop/bin/.. -Dhadoop.root.logger=INFO,DRFA -Djava.library.path=/usr/lib/hadoop/bin/../lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode


job tracker

java -Dhadoop.log.dir=/usr/lib/hadoop/bin/../logs -Dhadoop.log.file=hadoop-hadoop-jobtracker-home.log -Dhadoop.home.dir=/usr/lib/hadoop/bin/.. -Dhadoop.root.logger=INFO,DRFA -Djava.library.path=/usr/lib/hadoop/bin/../lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath org.apache.hadoop.mapred.JobTracker


task tracker

java -Xmx1000m -Dhadoop.log.dir=/usr/lib/hadoop/bin/../logs -Dhadoop.log.file=hadoop-hadoop-tasktracker-home.log -Dhadoop.home.dir=/usr/lib/hadoop/bin/.. -Dhadoop.root.logger=INFO,DRFA -Djava.library.path=/usr/lib/hadoop/bin/../lib/native/Linux-amd64-64 -Dhadoop.policy.file=hadoop-policy.xml -classpath org.apache.hadoop.mapred.TaskTracker

Once done, you can access http://home:50070 to see the cluster running with two nodes.



Wednesday, February 16, 2011

How to : setup SSH authentication over keys, hadoop installation

When you install hadoop, you are required to create a hadoop user in a hadoop group, and to set up SSH keys so it can log in to the servers without entering a password.

#Create the group hadoop
groupadd hadoop
#Create the user hadoop and its home directory
useradd -g hadoop --create-home hadoop
#Log in as hadoop
su - hadoop
#Generate an ssh public/private key pair with an empty passphrase
ssh-keygen -t rsa -P ''
#Append the public key to authorized_keys
cat .ssh/ >> .ssh/authorized_keys
If you want to enable hadoop to access another server over ssh without logging in, append the .pub key to that server's authorized_keys file.
#Remember to set the permission on authorized_keys so that only the owner can read/write it (600). If you miss this step you will still be asked for a password, even though the pub key is trusted.
chmod 600 .ssh/authorized_keys
Now you can log in to localhost without a password.
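The key-handling steps can be rehearsed in a scratch directory first (nothing below touches your real ~/.ssh):

```shell
tmp=$(mktemp -d)                              # throwaway directory
ssh-keygen -t rsa -P '' -f "$tmp/id_rsa" -q   # empty-passphrase key pair
cat "$tmp/" >> "$tmp/authorized_keys" # trust our own public key
chmod 600 "$tmp/authorized_keys"              # owner read/write only
ls -l "$tmp/authorized_keys" | cut -c1-10     # shows -rw-------
rm -rf "$tmp"
```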

whole steps as captured.


Decode Oracle Coherence cache extend protocol, Tangosol.Net.RequestTimeoutException

One day, a C# application which uses coherence cache extend got a Tangosol.Net.RequestTimeoutException. I spent a lot of time figuring out what happened and what the potential reasons for the request timeout are. The whole stack trace:

Tangosol.Net.RequestTimeoutException: request timed out after 30000 millis

at Tangosol.Net.Messaging.Impl.Request.RequestStatus.get_Response()

at Tangosol.Net.Messaging.Impl.Request.RequestStatus.WaitForResponse(Int64 millis)

at Tangosol.Net.Messaging.Impl.Channel.Request(IRequest request, Int64 millis)

at Tangosol.Net.Messaging.Impl.Channel.Request(IRequest request)

at Tangosol.Net.Impl.RemoteNamedCache.BinaryNamedCache.get_Item(Object key)

at Tangosol.Util.ConverterCollections.ConverterDictionary.get_Item(Object key)

at Tangosol.Net.Impl.RemoteNamedCache.get_Item(Object key)

at Tangosol.Net.Impl.SafeNamedCache.get_Item(Object key)

at CacheClient.CustomerProfile.button12_Click(Object sende

Question 1: is it a network problem? The C# application talks to a proxy node (a JVM on another machine), and the proxy node talks to the storage nodes to get the data back.
So I captured the network traffic between the C# application and the proxy JVM node. (You can use tcpview to figure out which proxy node has the TCP connection established with the client.)

The traffic looks good: a request goes out and the response comes back immediately. (So no firewall blackout, no packet drop.)


Here, is the client running the C# application. It sends the request out to the proxy node (
Then after 0.024 seconds it gets the response back (the client sends its ack in packet 7).

All looks great at the network level. Then comes my second question: is the data in the response valid?


In this special case the server returns a 38-byte array; here it is:


What is the encoding mechanism? At first I could only figure out the 1st byte, 0x25 = 37, which means the packet data length is 38-1=37 bytes. What about the rest?

Then I used .NET Reflector to read Coherence.dll and figured out the raw format.


It uses a packed int32 format here.
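My reading of the packed layout, recovered from the reflected code (so treat the exact bit positions as an assumption): the first byte carries 6 value bits plus a sign bit (0x40), later bytes carry 7 bits each, and the high bit (0x80) of any byte means another byte follows. A quick bash sketch of that decoding:

```shell
decode_packed() {                 # args: hex bytes, e.g. 25 90 d0 92 cb 05
  local b=$((16#$1)); shift
  local neg=$((b & 0x40))         # sign bit lives in the first byte only
  local value=$((b & 0x3F))       # first byte contributes 6 value bits
  local bits=6
  while [ $((b & 0x80)) -ne 0 ]; do
    b=$((16#$1)); shift           # each later byte contributes 7 more bits
    value=$(( value | (b & 0x7F) << bits ))
    bits=$((bits + 7))
  done
  if [ "$neg" -ne 0 ]; then value=$(( ~value )); fi
  echo "$value"
}
decode_packed 25                  # the length byte: prints 37
```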
I wrote a simple C# program to decode the channel id and the type id/version id:

String s = "25:90:d0:92:cb:05:00:02:00:6b:03:4c:18:15:a4:37:01:00:4e:04:74:65:73:74:01:42:cf:ff:ce:91:b8:f7:b1:b2:ee:01:40:40";
byte[] data = new byte[(s.Length + 1) / 3];
for (int i = 0; i < data.Length; i++)
    // note: the lookup table must be lowercase to match the hex string
    data[i] = (byte)("0123456789abcdef".IndexOf(s[i * 3]) * 16
                   + "0123456789abcdef".IndexOf(s[i * 3 + 1]));

MemoryStream ms = new MemoryStream(data);

DataReader dr = new DataReader(ms);
string msg = string.Format("Package Length  {0} \n Channel ID {1} \n TypeID {2}\n Version id {3} ",
dr.ReadPackedInt32(), dr.ReadPackedInt32(), dr.ReadPackedInt32(), dr.ReadPackedInt32());

When you run the code, you will see the decoded values.

If you decode the outbound packets sent to the proxy, they have the same format.

The client sends a packet with a channel ID and receives the response carrying the same channel ID; if the channel state is maintained incorrectly on the client side, you will get the exception.
Here is the reflected client code that decodes a message.

If the channel is null or closed, then even though the server returned a response, it ends in the RequestTimeoutException.
Stupid code ;(

You can also do the inspection at the runtime level.
Define a filter:

public class MyFilterDump : IWrapperStreamFactory
    public MyFilterDump() { }

    public System.IO.Stream GetInputStream(System.IO.Stream stream)
        System.Diagnostics.Debug.WriteLine("Get Response " + stream.Length);

        DataReader dr = new DataReader(stream);
        System.Diagnostics.Debug.WriteLine("Channel ID " + dr.ReadPackedInt32());

        // rewind so the real consumer can read the stream from the beginning
        stream.Seek(0, SeekOrigin.Begin);
        return stream;

    public System.IO.Stream GetOutputStream(System.IO.Stream stream)
        System.Diagnostics.Debug.WriteLine("Send Request " + stream.Length);
        // the same inspection can be done on the request side
        return stream;

Then register it in client-coherence.xml and client-cache-control.xml:

<coherence xmlns="">
<filter-class>ExtendLib.MyFilterDump, ExtendLib</filter-class>




Have fun!

Monday, February 14, 2011

How to: resize/expand a partition on Centos hosted by VMware Player

When you get a limited-size warning, run "df -h": 80% is used on /. We will expand /dev/mapper/VolGroup00-LogVol00 here.


Display the LVM layout with "lvdisplay": there is one volume group, VolGroup00, with two logical volumes, LogVol00 and LogVol01.

There is just one 10G disk. Shut down the VM and use the VMware utility to expand the disk to 13G.


Run "fdisk /dev/hda" to add an LVM partition on the newly expanded disk.
p -> print the current partition table
n -> add a new primary partition
t -> toggle the partition type to 8e, which is LVM


Press w to save the partition table.

After that, run pvcreate to initialize the new partition as an LVM physical volume.


If you get an error like "device not found", just reboot the VM.

Extend VolGroup00 by adding /dev/hda3 to the group, then extend /dev/VolGroup00/LogVol00 onto the new physical volume:
vgextend VolGroup00 /dev/hda3
lvextend -l +100%FREE /dev/VolGroup00/LogVol00

Run resize2fs on the logical volume to grow the filesystem into the change. Run df -h again and you will see the free space has increased.


Thursday, February 10, 2011

how to install and configure Nginx + PHP FPM

Building PHP with FPM support:

# ./configure --enable-fpm --prefix=/opt/phpfpm

make && make install


Then you can run php-fpm by executing "php-fpm" (it is located in the sbin dir).
If you get an error that it cannot find the conf file, like the following:

just rename /opt/phpfpm/etc/php-fpm.conf.default to php-fpm.conf and change the min/max/startserver values.
Also copy /tmp/php-x/php.ini-production to /opt/phpfpm/php.ini.
Check that it is listening: netstat -an | grep 9000

Now install nginx following install and config Nginx/PhP fastcgi on Centos.
Change nginx.conf to use FPM as the backend for php processing; just uncomment the default conf settings:

location ~ \.php$ {
        root           html;
        fastcgi_pass;
        fastcgi_index  index.php;
        fastcgi_param  SCRIPT_FILENAME /opt/nginx/html/$fastcgi_script_name;
        include        fastcgi_params;
}

Then reload the config: nginx -s reload

Then create one info.php and put it under /opt/nginx/html:

<?php phpinfo();?>

Now, when you access http://localhost/info.php, you will get the php info page.


Here is the raw traffic between Nginx and PhpFPM

If you get a 404 error no matter what, check that nginx.conf points SCRIPT_FILENAME at the absolute path of the script:

fastcgi_param  SCRIPT_FILENAME /opt/nginx/html/$fastcgi_script_name;

Wednesday, February 9, 2011

How to : install and config Nginx/PhP fastcgi on Centos

To Install Nginx, setup the prerequisites as following.

yum install gcc     # C compiler
yum install pcre*   # perl-compatible regular expressions
yum install zlib zlib-devel
yum install openssl openssl-devel

Download the Nginx source tarball and extract it to a source folder; I will use /tmp here.

tar -zxvf nginx-0.8.54.tar.gz
cd /tmp/nginx-0.8.54

./configure  --prefix=/opt/nginx
make && make install

Once done, you can run ./nginx -? to show all available options.


By default, nginx reads the conf file at conf/nginx.conf.
To start nginx, just run ./nginx; browse to it and you will get the default welcome page.


Here I changed nginx.conf a little: a new port 88, and a basic proxy forward to Apache/PHP (for that setup, check the installation instructions in Install PHP to Apache web server).

server {
        listen       88;
        server_name  localhost;

        #charset koi8-r;

        #access_log  logs/host.access.log  main;

        location / {
            root   html;
            index  index.html index.htm;
        }

        #error_page  404              /404.html;

        # redirect server error pages to the static page /50x.html
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }

        # proxy the PHP scripts to Apache listening on localhost:80
        location ~ \.php$ {
            proxy_pass   http://localhost;
        }
}

If I access http://localhost:88/info.php, nginx forwards the request to http://localhost/info.php.

But the Server header is nginx now. Also, nginx has gzip support enabled now. (A great reverse proxy, offloading those cross-cutting features.)
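Speaking of offloading: gzip is toggled in the http block of nginx.conf, and the stock file ships with the directive commented out, so enabling it is a one-line change:

```
http {
    gzip  on;
}
```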


more how-tos:

How to: install apache 2.2 on centos
How To: Install PHP to Apache web server

Tuesday, February 8, 2011

How To: Install PHP to Apache web server

After installing apache 2.2 on centos, there is no PHP module built in. You have to do the same thing as when installing apache: download the source, compile it, and put the module into the folder where apache will pick it up. Here are the rough steps.

  • Download and extract the tarball:
    • tar -zxvf php-5.3.5.tar.gz
  • Go into the uncompressed php folder and run configure:
    • ./configure --with-apxs2=/opt/apache/bin/apxs --prefix=/opt/php
    • if needed: yum install libevent-devel libxml2-devel
  • Compile and test:
    • make && make test
  • Install the module into apache's modules folder:
    • make install
    • this will automatically update apache's conf/httpd.conf with
      LoadModule php5_module        modules/
      Add the handler for php manually:
      AddType application/x-httpd-php .php
  • Restart the apache daemon.

To test, create a single-line file htdocs/test.php:

<?php phpinfo();?>

When you access that url, you will get the php settings information; now you are good to go.

How to: Test the apache syslog with the free kiwi syslog server

Let's say you have two OSes: one Windows box which will host the syslog server, and one Centos box as the container for Apache. Here is the tutorial to test syslog logging, both on the server side (view the logs) and the client side (turn on logging to syslog).

Download and install the free version of Kiwi Syslog.
When you run it, you get a GUI to show and filter logs.


When the log server runs, it listens on UDP port 514.

On the Centos side, at the OS level, point one log facility to our kiwi log server.

vi /etc/syslog.conf
#Add one line

local6.*        @<ip-of-the-kiwi-server>

#restart the syslog daemon
service syslog restart # or killall -HUP syslogd

Now you can test the syslog setting by writing a log entry to syslog:

logger -p local6.err Hello,errorlog

From the kiwi server, you will see the log entry.

For Apache, change httpd.conf to point the error log and the access log at our syslog facility local6.

#ErrorLog "logs/error_log"
ErrorLog syslog:local6

#access log
CustomLog "|/usr/bin/logger -p"  combined

Restart httpd and all logs will be dispatched to our syslog server (both the error log and the access log).


Monday, February 7, 2011

How to: install apache 2.2 on centos

It's pretty straightforward to install apache 2.2 on centos: download the source, compile it, then make install.

cd /tmp
tar zxvf httpd-2.2.17.tar.gz
then switch to the decompressed folder
./configure --prefix=/opt/apache
make && make install

After that, start httpd; there is an apachectl utility to control the start/stop operations of httpd.

/opt/apache/bin/apachectl start
#start the httpd daemon

Then browse to localhost:80; you will get the "It works!" page (httpd is working and listening on port 80).

You can change the index.html located in /opt/apache/htdocs/ to something else, like a hello world page.

Run "ps aux | grep httpd" and you will see the processes.

To add httpd as a service:

ln -s /opt/apache/bin/apachectl  /etc/init.d/apache2

Add two lines to /etc/init.d/apache2:
# chkconfig: 3 80 80
# description: my apache web server

Add it to chkconfig:
chkconfig --add apache2

Now you can query it: chkconfig --list apache2


Sunday, February 6, 2011

linux putty SSH arrow keys not working, macbook, ubuntu, ^[[A

Unlike plain-and-simple Windows, there is a lot of ad-hockery on Linux. I got a strange behavior with ubuntu this morning: the arrow key to recall command history no longer works; it shows the escaped encoding instead.

Normally, pressing the up arrow recalls the last command in history. But now it returns ^[[A.


First, find out which shell you are using by running "echo $0".


I am in a bare sh-style shell (su's default). Switch to bash by running "/bin/bash".

all works now


is it tricky?

We can also make this the default by adding it to the login profile or .bashrc.
