Monday, October 13, 2008

Installing Hadoop on Ubuntu

I have a cluster of 4 machines and am about to install Hadoop on it.
ds16: master
ds11: slave
ds04: slave
ds14: slave

I am loosely following sangmi's blog and the following

Step 1: Check that ssh, rsync and java are installed on all machines.
For Java:
sudo apt-get install sun-java5-jdk
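
The check in Step 1 can be scripted. This is a minimal sketch, not something from the original setup: missing_tools is a hypothetical helper name, and it only inspects the machine it runs on.

```shell
# Print each required tool that is NOT on PATH; prints nothing if all are present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check this machine only. For the whole cluster, run the same check on
# each host (ds16, ds11, ds04 and ds14 in this setup).
missing=$(missing_tools ssh rsync java)
if [ -n "$missing" ]; then
    echo "missing on $(uname -n): $missing"
else
    echo "ssh, rsync and java are all installed on $(uname -n)"
fi
```

Anything it reports as missing can then be installed with apt-get as shown above.
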

Step 2:
Download Hadoop and, in conf/, set JAVA_HOME:

export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun
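
Before relying on that JAVA_HOME value it may be worth confirming that the directory really contains a JDK. A small sketch, assuming the path from the export line above; valid_java_home is a hypothetical helper name.

```shell
# Succeed if the given directory looks like a usable Java home,
# i.e. it contains an executable bin/java.
valid_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Path taken from the export line above; adjust if your JDK lives elsewhere.
candidate=/usr/lib/jvm/java-1.5.0-sun
if valid_java_home "$candidate"; then
    echo "JAVA_HOME candidate OK: $candidate"
else
    echo "no executable bin/java under $candidate" >&2
fi
```
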

Step 3:
Now you can run Hadoop in standalone mode as is.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*
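
To get a feel for what the grep example computes, here is a rough local equivalent using ordinary Unix tools. It is only a sketch of the idea (count every match of the regex, most frequent first), not how Hadoop implements it; count_matches and the sample file are made up for illustration.

```shell
# Count every match of a regex across the files in a directory and print
# "count<TAB>match", most frequent first.
count_matches() {
    dir=$1
    pattern=$2
    grep -ohE "$pattern" "$dir"/* | sort | uniq -c | sort -rn |
        awk '{ print $1 "\t" $2 }'
}

# Tiny sample input so the sketch runs without Hadoop installed.
sample=$(mktemp -d)
printf 'dfs.replication\ndfs.name.dir dfs.replication\n' > "$sample/a.xml"
count_matches "$sample" 'dfs[a-z.]+'
rm -rf "$sample"
```

Pointing count_matches at the input directory created above should give counts comparable to what ends up in output/.
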

[One problem I faced: when I ssh'd into my machines, I got the output

id: cannot find name for group ID 521

So I added a group with ID 521, which solved the problem:
sudo addgroup --gid 521 test

I also disabled StrictHostKeyChecking in /etc/ssh/ssh_config; it was causing problems since ~ is mounted on NFS.]
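
The host key checking change mentioned above can be scripted so that it is applied once and not duplicated. This sketch is narrower than what I actually did: it limits the setting to hosts matching ds*, which is an assumption based on the machine names here. disable_strict_host_checking is a hypothetical helper, and the demo writes to a scratch file, not the real config.

```shell
# Append a "StrictHostKeyChecking no" stanza for the given host pattern to an
# ssh client config file, unless a stanza for that exact pattern already exists.
disable_strict_host_checking() {
    config=$1
    hosts=$2
    if ! grep -qsxF "Host $hosts" "$config"; then
        printf 'Host %s\n    StrictHostKeyChecking no\n' "$hosts" >> "$config"
    fi
}

# Demo on a scratch file; calling it twice still leaves a single stanza.
scratch=$(mktemp)
disable_strict_host_checking "$scratch" 'ds*'
disable_strict_host_checking "$scratch" 'ds*'
cat "$scratch"
rm -f "$scratch"
```

To apply it for real, pass /etc/ssh/ssh_config as the first argument and run it as root. Keep in mind that turning off host key checking trades safety for convenience.
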
So far we have run Hadoop on a single node. Let's move to a cluster now.

Cluster Operations:
Step 4:
Configure conf/hadoop-site.xml on the namenode as follows

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<description>A base for other temporary directories.</description>
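
For reference, a fuller hadoop-site.xml for this layout might look like the sketch below, written out by a small shell function. The property names are the standard ones for Hadoop releases of this period. The ports 9000 and 9001 and the dfs/name and dfs/data sub-directories are assumptions, not values from my actual file, and write_hadoop_site is a hypothetical helper name.

```shell
# Write a minimal hadoop-site.xml into the given conf directory.
# Ports 9000/9001 and the dfs/name and dfs/data sub-directories are
# assumptions; the tmp path is the one created in Step 5.
write_hadoop_site() {
    conf_dir=$1
    mkdir -p "$conf_dir"
    # The quoted heredoc starts at column 0, so nothing precedes "<?xml".
    cat > "$conf_dir/hadoop-site.xml" <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ds16:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>ds16:9001</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/export/pathaka/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/export/pathaka/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/export/pathaka/hadoop/dfs/data</value>
  </property>
</configuration>
EOF
}
```

Calling write_hadoop_site conf from the Hadoop directory on ds16 would overwrite conf/hadoop-site.xml, so back up any existing file first.
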

Edit conf/masters and conf/slaves on ds16 as shown

$ cat conf/masters

$ cat conf/slaves
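
Both files are plain lists with one hostname per line. Going by the machine roles listed at the top of this post, they could be generated as in this sketch. The exact contents are an assumption (ds16 in masters, the three slaves in slaves), and write_host_lists is a hypothetical helper.

```shell
# Write masters and slaves files, one hostname per line.
# Host roles are taken from the list at the top of the post.
write_host_lists() {
    conf_dir=$1
    mkdir -p "$conf_dir"
    printf '%s\n' ds16 > "$conf_dir/masters"
    printf '%s\n' ds11 ds04 ds14 > "$conf_dir/slaves"
}

# Demo in a scratch directory so nothing real is overwritten.
scratch=$(mktemp -d)
write_host_lists "$scratch"
echo "masters:"; cat "$scratch/masters"
echo "slaves:";  cat "$scratch/slaves"
rm -rf "$scratch"
```
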


[Make sure there are no leading spaces in the file; otherwise you get a weird XML exception:
[Fatal Error] hadoop-site.xml:2:6: The processing instruction target matching "[xX][mM][lL]" is not allowed.]
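
A quick way to catch this before starting Hadoop is to check that the file begins with the XML declaration and nothing else. A minimal sketch; starts_with_xml_decl is a hypothetical helper, and it relies on head -c, which GNU and BSD head both provide.

```shell
# Succeed only if the file's very first bytes are "<?xml", i.e. nothing
# (not even a space or a blank line) precedes the XML declaration.
starts_with_xml_decl() {
    [ "$(head -c 5 "$1")" = "<?xml" ]
}

# Demo with one good and one bad scratch file.
good=$(mktemp)
bad=$(mktemp)
printf '<?xml version="1.0"?>\n' > "$good"
printf ' <?xml version="1.0"?>\n' > "$bad"
starts_with_xml_decl "$good" && echo "good file: OK"
starts_with_xml_decl "$bad" || echo "bad file: leading whitespace before <?xml"
rm -f "$good" "$bad"
```

Running starts_with_xml_decl conf/hadoop-site.xml on the namenode would flag the problem described above.
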

Step 5:

$ mkdir /export/pathaka/hadoop/dfs
# This is the dfs file tree

$ mkdir /export/pathaka/hadoop/tmp

On the master (ds16), run
$ bin/hadoop namenode -format
# This formats the dfs folder
# Do this only the first time, and only on the master
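
Since formatting must happen only once, a small guard can help avoid doing it again by accident. This sketch assumes that a formatted namenode directory contains current/VERSION and that the name directory is /export/pathaka/hadoop/dfs/name. Both are assumptions, and NAME_DIR and is_formatted are hypothetical names. It only prints what to do and never runs the format itself.

```shell
# Succeed if the given namenode directory already holds a formatted image.
# Assumes the layout in which formatting creates current/VERSION there.
is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# NAME_DIR is a hypothetical variable; point it at your dfs.name.dir.
NAME_DIR=${NAME_DIR:-/export/pathaka/hadoop/dfs/name}
if is_formatted "$NAME_DIR"; then
    echo "already formatted, skipping: $NAME_DIR"
else
    echo "not formatted yet; run: bin/hadoop namenode -format"
fi
```
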

On the master (ds16), run
$ bin/
$ bin/
# master and jobtracker are the same machine

$ bin/
$ bin/

