Thursday, February 2, 2012

Apache Hadoop Single Node Standalone Installation Tutorial


When you implement Apache Hadoop in a production environment, you’ll need multiple server nodes. If you are just exploring distributed computing, you might want to play around with Hadoop by installing it on a single node.
This article explains how to set up and configure a single node standalone Hadoop environment. Please note that you can also simulate a multi node Hadoop installation on a single server using a pseudo distributed Hadoop installation, which we’ll cover in detail in the next article of this series.

The standalone Hadoop environment is a good place to start, because it lets you make sure your server environment is set up properly with all the prerequisites to run Hadoop.

1. Create a Hadoop User

You can download and install Hadoop as root, but it is recommended to install it as a separate user. So, log in as root and create a user called hadoop.
# adduser hadoop
# passwd hadoop

2. Download Hadoop Common

Download Apache Hadoop Common and move it to the server where you want to install it.
You can also use wget to download it directly to your server.
# su - hadoop
$ wget http://mirror.nyi.net/apache//hadoop/common/stable/hadoop-0.20.203.0rc1.tar.gz
Make sure Java 1.6 is installed on your system.
$ java -version
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.7) (rhel-1.39.1.9.7.el6-x86_64)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
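The check above can also be scripted. The following is a minimal sketch, assuming the 1.x version string format shown above (newer Java releases print a different format and would need a different pattern). The function name java_minor_version is made up for illustration.

```shell
# Extract the minor version ("6" from "1.6.0_20") from the first line
# of `java -version` output. Prints nothing if the line does not match.
java_minor_version() {
    # $1: first line of the output, e.g.: java version "1.6.0_20"
    printf '%s\n' "$1" | sed -n 's/.*version "1\.\([0-9][0-9]*\)\..*/\1/p'
}

# java -version writes to stderr, hence the 2>&1 redirect.
if command -v java >/dev/null 2>&1; then
    first_line=$(java -version 2>&1 | head -n 1)
    minor=$(java_minor_version "$first_line")
    if [ "${minor:-0}" -ge 6 ]; then
        echo "Java 1.$minor found"
    else
        echo "Java 1.6 or later not detected: $first_line"
    fi
else
    echo "java not found on PATH"
fi
```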

3. Unpack under hadoop User

As the hadoop user, unpack this package.
$ tar xvfz hadoop-0.20.203.0rc1.tar.gz
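This creates a directory named after the release. The listings in this article use “hadoop-0.20.204.0”; if your directory name differs (for example, because you downloaded a different version), substitute it in the commands that follow.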
$ ls -l hadoop-0.20.204.0
total 6780
drwxr-xr-x.  2 hadoop hadoop    4096 Oct 12 08:50 bin
-rw-rw-r--.  1 hadoop hadoop  110797 Aug 25 16:28 build.xml
drwxr-xr-x.  4 hadoop hadoop    4096 Aug 25 16:38 c++
-rw-rw-r--.  1 hadoop hadoop  419532 Aug 25 16:28 CHANGES.txt
drwxr-xr-x.  2 hadoop hadoop    4096 Nov  2 05:29 conf
drwxr-xr-x. 14 hadoop hadoop    4096 Aug 25 16:28 contrib
drwxr-xr-x.  7 hadoop hadoop    4096 Oct 12 08:49 docs
drwxr-xr-x.  3 hadoop hadoop    4096 Aug 25 16:29 etc
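If you are unsure which directory a tarball unpacks into, you can ask tar before or after extracting. This is a minimal sketch, assuming the archive has a single top-level directory; the function name tar_top_dir is hypothetical.

```shell
# Print the top-level directory name stored in a .tar.gz archive.
tar_top_dir() {
    # $1: path to the archive
    # List the archive, take the first entry, keep the part before the first "/".
    tar tzf "$1" | head -n 1 | cut -d/ -f1
}

# Only runs if the tarball from step 2 is in the current directory.
if [ -f hadoop-0.20.203.0rc1.tar.gz ]; then
    tar_top_dir hadoop-0.20.203.0rc1.tar.gz
fi
```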
Edit the hadoop-0.20.204.0/conf/hadoop-env.sh file and make sure the JAVA_HOME environment variable points to the location where Java is installed on your system.
$ grep JAVA ~/hadoop-0.20.204.0/conf/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.6.0_27
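If you don’t know where Java is installed, one way to find a candidate for JAVA_HOME is to resolve the java binary on your PATH and strip the trailing /bin/java. The sketch below assumes GNU readlink (for the -f option) and a conventional bin/java layout; java_home_from_binary is a hypothetical helper name, and the printed line is only a suggestion to review before copying it into hadoop-env.sh.

```shell
# Derive a JAVA_HOME candidate from the full path of the java binary.
java_home_from_binary() {
    # $1: resolved path, e.g. /usr/java/jdk1.6.0_27/bin/java
    printf '%s\n' "${1%/bin/java}"   # drop the trailing /bin/java
}

if command -v java >/dev/null 2>&1; then
    # readlink -f follows symlinks such as /usr/bin/java -> /etc/alternatives/java
    java_bin=$(readlink -f "$(command -v java)")
    echo "export JAVA_HOME=$(java_home_from_binary "$java_bin")"
fi
```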

4. Test Sample Hadoop Program

In a single node standalone installation, you don’t need to start any Hadoop background processes. Instead, just call ~/hadoop-0.20.204.0/bin/hadoop, which runs Hadoop as a single Java process for testing purposes.
The example program used here ships with Hadoop, and the Hadoop documentation uses it as a simple example to check whether the setup works.
First, create an input directory to hold the input files. In a real Hadoop environment, this would be the location where incoming data files are stored.
$ cd ~/hadoop-0.20.204.0
$ mkdir input
For testing purposes, add some sample data files to the input directory. Let us just copy all the xml files from the conf directory to the input directory, so that these xml files serve as the data files for the example program.
$ cp conf/*.xml input
Run the sample Hadoop test program. It is a simple Hadoop program that simulates grep: it searches all the input/*.xml files for the regular expression “dfs[a-z.]+” and stores the results in the output directory.
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
When everything is set up properly, the sample program displays messages like the following while it runs.
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
12/01/14 23:38:46 INFO mapred.FileInputFormat: Total input paths to process : 6
12/01/14 23:38:46 INFO mapred.JobClient: Running job: job_local_0001
12/01/14 23:38:46 INFO mapred.MapTask: numReduceTasks: 1
12/01/14 23:38:46 INFO mapred.MapTask: io.sort.mb = 100
12/01/14 23:38:46 INFO mapred.MapTask: data buffer = 79691776/99614720
12/01/14 23:38:46 INFO mapred.MapTask: record buffer = 262144/327680
12/01/14 23:38:46 INFO mapred.MapTask: Starting flush of map output
12/01/14 23:38:46 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/01/14 23:38:47 INFO mapred.JobClient:  map 0% reduce 0%
...
The job creates the output directory with the results, as shown below.
$ ls -l output
total 4
-rwxrwxrwx. 1 root root 11 Aug 23 08:39 part-00000
-rwxrwxrwx. 1 root root  0 Aug 23 08:39 _SUCCESS

$ cat output/*
1       dfsadmin
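To cross-check the result without Hadoop, roughly the same counting can be done with standard tools. This is a sketch, assuming a grep that supports the -o option (GNU and BSD grep both do); count_matches is a made-up helper name, and it is not how the Hadoop example is implemented.

```shell
# Count occurrences of each string matching the pattern, most frequent first.
count_matches() {
    # $1: extended regular expression; remaining arguments: input files
    pattern=$1
    shift
    # -o prints only the matched text, -h omits file names, -E enables ERE
    grep -ohE "$pattern" "$@" | sort | uniq -c | sort -rn
}

# Only runs from the Hadoop directory, where the input directory exists.
if [ -d input ]; then
    count_matches 'dfs[a-z.]+' input/*.xml
fi
```

The counts should agree with output/part-00000, although the column layout differs. If you want to rerun the Hadoop job, remove the output directory first (rm -r output), because the job fails when the output directory already exists.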
The source code of the example programs is located under the src/examples/org/apache/hadoop/examples directory.
$ ls -l ~/hadoop-0.20.204.0/src/examples/org/apache/hadoop/examples
-rw-rw-r--. 1 hadoop hadoop  2395 Jan 14 23:28 WordCount.java
-rw-rw-r--. 1 hadoop hadoop  8040 Jan 14 23:28 Sort.java
-rw-rw-r--. 1 hadoop hadoop  9156 Jan 14 23:28 SleepJob.java
-rw-rw-r--. 1 hadoop hadoop  7809 Jan 14 23:28 SecondarySort.java
-rw-rw-r--. 1 hadoop hadoop 10190 Jan 14 23:28 RandomWriter.java
-rw-rw-r--. 1 hadoop hadoop 40350 Jan 14 23:28 RandomTextWriter.java
-rw-rw-r--. 1 hadoop hadoop 11914 Jan 14 23:28 PiEstimator.java
-rw-rw-r--. 1 hadoop hadoop   853 Jan 14 23:28 package.html
-rw-rw-r--. 1 hadoop hadoop  8276 Jan 14 23:28 MultiFileWordCount.java
-rw-rw-r--. 1 hadoop hadoop  6582 Jan 14 23:28 Join.java
-rw-rw-r--. 1 hadoop hadoop  3334 Jan 14 23:28 Grep.java
-rw-rw-r--. 1 hadoop hadoop  3751 Jan 14 23:28 ExampleDriver.java
-rw-rw-r--. 1 hadoop hadoop 13089 Jan 14 23:28 DBCountPageView.java
-rw-rw-r--. 1 hadoop hadoop  2879 Jan 14 23:28 AggregateWordHistogram.java
-rw-rw-r--. 1 hadoop hadoop  2797 Jan 14 23:28 AggregateWordCount.java
drwxr-xr-x. 2 hadoop hadoop  4096 Jan 14 08:49 dancing
drwxr-xr-x. 2 hadoop hadoop  4096 Jan 14 08:49 terasort

5. Troubleshooting Issues

Issue: “Temporary failure in name resolution”
While executing the sample hadoop program, you might get the following error message.
12/01/14 23:34:57 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-root/mapred/staging/root-1040516815/.staging/job_local_0001
java.net.UnknownHostException: hadoop: hadoop: Temporary failure in name resolution
        at java.net.InetAddress.getLocalHost(InetAddress.java:1438)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:815)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
        at java.security.AccessController.doPrivileged(Native Method)
Solution: Add an entry to the /etc/hosts file that contains the IP address, the fully qualified domain name (FQDN), and the host name of your server, for example:
192.168.1.10 hadoop.sureshkumarpakalapati.in hadoop
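To confirm that the entry is in place, you can check whether the machine’s host name appears in /etc/hosts. The following is a minimal sketch that only inspects the hosts file (it does not consult DNS); hosts_file_has is a hypothetical helper name.

```shell
# Succeed if the given name appears as a host name or alias in a hosts file.
hosts_file_has() {
    # $1: host name to look for; $2: hosts file (defaults to /etc/hosts)
    awk -v name="$1" '
        { sub(/#.*/, "") }    # strip comments
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit found ? 0 : 1 }
    ' "${2:-/etc/hosts}"
}

name=$(uname -n)    # the host name of this machine
if hosts_file_has "$name"; then
    echo "$name is listed in /etc/hosts"
else
    echo "$name is not listed in /etc/hosts"
fi
```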