Tuesday, March 13, 2012

Cloudera's Hadoop Demo VM with VirtualBox: Running WordCount.java on the VM, a Step-by-Step Tutorial for Beginners


NOTE: Work in progress, please report any errors while following this tutorial in the comments! (v2.0)

I found setting up and running Hadoop on the free Cloudera VM very frustrating. For a while, I was stuck on a problem where map and reduce both sat at 0%, and the VM would eventually crash. I found out that it was due to a heap-space error, but that didn't help, as I was running MapReduce on two text files, each consisting of one line.

After a lot of trial and error, and reading of misleading, meandering, and incomplete "tutorials" online, I've written my own, which I hope will help anyone run Hadoop on the Cloudera VM without any problems. I did learn a lot and got familiar with the terminal again, so it wasn't a waste of time... but I spent an embarrassing amount of time searching for the cause of my failure to run WordCount, when it was all due to incorrect setup.

All commands are to be typed AS IS, unless part of the command is a placeholder (e.g. a version number or a path) that you should replace with your own value.

I guess it's still a work in progress in terms of clarity and beginner-friendliness, and feedback is welcomed.


Tutorial 1.0
VirtualBox 4.1.8
Cloudera VM image download date: 17/02/2012, size: 3.64GB

Prerequisite knowledge
Background knowledge of MapReduce and Hadoop is needed. Basic knowledge of Java is also needed to understand the WordCount example.

It is assumed that you have basic familiarity with using Linux OS and the terminal, especially commands such as: cd, ls, man, rm, mv, cp, etc. Googling what you're trying to do or the commands mentioned will generally yield the command you're looking for. Using the Tab key for autocompletion will save a lot of time and effort.

If you get "stuck" inside a command or want to terminate any running command, use Ctrl + C.

Install either VirtualBox, KVM, or VMware (compatible with Mac), and download the appropriate image from the Cloudera Hadoop Demo VM page.


ADVANCED: You may use WinMD5Sum to check the downloaded image with the hashes provided.
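On Linux (or inside the VM itself), md5sum can do the same check from the terminal. This is only a sketch: the filename and the expected hash below are placeholders (the hash shown happens to be the MD5 of an empty file), so substitute the real image name and the hash published on the download page.

```shell
# Verify a downloaded image against a published MD5 hash.
# NOTE: both the filename and the hash are placeholders for illustration;
# use your real image file and the hash from the download page.
expected="d41d8cd98f00b204e9800998ecf8427e"  # placeholder: MD5 of an empty file
printf '' > cloudera-demo-vm.tar.gz          # stand-in for the real 3.64GB image
actual=$(md5sum cloudera-demo-vm.tar.gz | cut -d' ' -f1)
if [ "$actual" = "$expected" ]; then
    echo "Checksum OK"
else
    echo "Checksum MISMATCH"
fi
```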


◇ ◇

Start up VirtualBox: create a new virtual machine using the New button.
Type a name for the VM, and choose Linux and Ubuntu.
Set the amount of memory dedicated to the VM. Do NOT drag the slider below 512MB. Choose Use existing hard disk, select the image on your local disk, and finish.
DO NOT START IT YET. Go to Settings > System, tick Enable IO APIC.

VMware does not appear to need any additional setup; just run the image using VMware Player.

◇ ◇


Start the VM.
Once the VM has booted up, open up the terminal. You are using the CentOS distro (distribution) of Linux, with the Xfce desktop environment.



$ sudo -s

This gives root privileges for the rest of the session. If you do not do this, you must prepend most commands with sudo.


Input the following commands (yum is like apt-get). Prepend commands with sudo if they do not seem to work or require permissions (unless you've entered sudo -s):

$ yum update
$ yum install gcc
$ yum install kernel-devel

◇ ◇


Open a web browser and download the latest stable release of Hadoop from the Apache Hadoop releases page. For this tutorial, I downloaded hadoop-1.0.0/ (15-Dec-2011 16:51).

Save the tar.gz archive to your Desktop.

Move it to /usr/local and untar it using the commands below:

$ mv ~/Desktop/hadoop-1.0.0.tar.gz /usr/local
$ cd /usr/local
$ tar xzf hadoop-1.0.0.tar.gz

This command should now show the Hadoop commands:

$ hadoop

Check where Java is installed, and which version, with:
$ which java
$ java -version
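On CentOS, `which java` usually prints a symlink (often routed through /etc/alternatives) rather than the real JDK directory. As a sketch, resolving the symlink with readlink -f and stripping the trailing bin/java gives a candidate for the JAVA_HOME value used in the next step:

```shell
# Resolve the java binary on the PATH to its real location, then strip
# the trailing /bin/java to get a candidate JAVA_HOME directory.
java_bin=$(readlink -f "$(command -v java || echo /usr/bin/java)")
echo "JAVA_HOME candidate: $(dirname "$(dirname "$java_bin")")"
```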



vi is a text editor within the terminal. You can use emacs, vim, etc., but vi is used throughout this tutorial.

After the command

$ vi

is entered, you are in the vi editor.

Hit the I key to start typing (INSERT mode).
Hit Esc to get out of any mode (e.g. INSERT mode).
When not in any mode, type :wq to write (save) and quit. To quit without saving, type :q! to force quit (changes will be discarded).

See this page for a more comprehensive guide to vi commands.


Go to /usr/local/hadoop-1.0.0/conf (note: it is NOT the directory /bin/hadoop-1.0.0) and set the JAVA_HOME variable in hadoop-env.sh to the Java path you just displayed (remember to uncomment the line by removing the "#"). For example, at the time of writing:

$ cd /usr/local/hadoop-1.0.0/conf

Make sure to use sudo, since the file is set to read-only.

$ sudo vi hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.6.0_21
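If you prefer not to edit the file by hand, the same change can be scripted with sed. This is a sketch run against a scratch copy mirroring the stock commented-out line, so nothing in conf/ is touched; to apply it for real, point the sed command at hadoop-env.sh itself (with sudo), and substitute your own JDK path for the example one.

```shell
# Recreate the stock commented-out line in a scratch file, then uncomment
# it and set the JDK path in one step. The path is the example used in this
# tutorial; replace it with the one reported by `which java` on your VM.
printf '# export JAVA_HOME=/usr/lib/j2sdk1.5-sun\n' > hadoop-env.sh.example
sed -i 's|^# export JAVA_HOME=.*|export JAVA_HOME=/usr/java/jdk1.6.0_21|' hadoop-env.sh.example
cat hadoop-env.sh.example   # prints: export JAVA_HOME=/usr/java/jdk1.6.0_21
```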


Try this command; it should display information about Hadoop:

$ hadoop


Troubleshooting note: If there is ANY problem with the bin/hadoop command, i.e. it returns "No such file or directory", check the path of the Java JDK and check that you've uncommented the line in hadoop-env.sh. The Hadoop directory MUST be /usr/local/hadoop-1.0.0, because the directory /bin/hadoop-1.0.0 is NOT checked.


Open bashrc:

$ vi ~/.bashrc

Paste these lines into bashrc:

export HADOOP_HOME=/usr/lib/hadoop-0.20/
export HADOOP_VERSION=0.20.2-cdh3u3
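These two variables exist so the compile step below can locate the CDH core jar that ships with the VM. As a quick sanity check (a sketch; the jar path assumes the stock CDH3u3 layout), you can confirm what they expand to after reloading bashrc with `source ~/.bashrc`:

```shell
# Check what the two variables expand to; the result should be the path of
# the CDH core jar used in the javac -classpath argument below.
export HADOOP_HOME=/usr/lib/hadoop-0.20/   # as set in ~/.bashrc
export HADOOP_VERSION=0.20.2-cdh3u3
echo "${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar"
# prints: /usr/lib/hadoop-0.20//hadoop-0.20.2-cdh3u3-core.jar
# (the doubled slash is harmless)
```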


◇ ◇

Make these directories on the Desktop (they are used by the compile and jar commands below):

$ mkdir ~/Desktop/wordcount
$ mkdir ~/Desktop/wordcount/wordcount_classes

Inside wordcount, make WordCount.java, pasted from this page.

Compile WordCount.java, placing the compiled classes into wordcount_classes:
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java

ssh into localhost:
$ ssh localhost

cd to /usr/local/hadoop-1.0.0.
$ cd /usr/local/hadoop-1.0.0

Make the HDFS directory used in the tutorial:
$ bin/hadoop dfs -mkdir /usr/joe/wordcount/input/

Use this to check that the input folder is there:

$ bin/hadoop dfs -ls /usr/joe/wordcount/

Make the two input files used in the tutorial locally somewhere (e.g. in the LOCAL wordcount folder):

$ vi file01

Containing one line: "Hello World Bye World"

$ vi file02

Containing one line: "Hello Hadoop Goodbye Hadoop"
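As an alternative to vi, the two files can be created straight from the shell with echo. This sketch stages them in a relative wordcount/ directory; adjust the paths if you keep yours elsewhere, since the -put commands below assume they are on the Desktop.

```shell
# Create the two one-line input files without opening an editor.
mkdir -p wordcount
echo "Hello World Bye World" > wordcount/file01
echo "Hello Hadoop Goodbye Hadoop" > wordcount/file02
cat wordcount/file01 wordcount/file02
# prints:
# Hello World Bye World
# Hello Hadoop Goodbye Hadoop
```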

cd back to /usr/local/hadoop-1.0.0.

Put them on the HDFS:

$ bin/hadoop dfs -put /home/cloudera/Desktop/file01 /usr/joe/wordcount/input
$ bin/hadoop dfs -put /home/cloudera/Desktop/file02 /usr/joe/wordcount/input

◇ ◇


Go to ~/Desktop/wordcount/ and make a jar (locally) from the compiled classes:
$ jar -cvf wordcount.jar -C wordcount_classes/ .

View contents of jar:
$ jar tf wordcount.jar

Run the jar:
$ sudo bin/hadoop jar /home/cloudera/Desktop/wordcount/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
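To know what to expect from the job, the same count can be reproduced locally with a plain shell pipeline over the two input lines; the part-* file that the job writes to /usr/joe/wordcount/output should list the same word/count pairs.

```shell
# Local sanity check: count words in the two input lines the same way
# WordCount does (split on whitespace, count each distinct word).
printf 'Hello World Bye World\nHello Hadoop Goodbye Hadoop\n' \
  | tr ' ' '\n' | sort | uniq -c | awk '{print $2, $1}'
# prints:
# Bye 1
# Goodbye 1
# Hadoop 2
# Hello 2
# World 2
```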


Note: If you get a FileAlreadyExistsException relating to /usr/joe/wordcount/output, the HDFS output directory /usr/joe/wordcount/output already exists and must be deleted with the following command before running the program again:

$ sudo bin/hadoop dfs -rmr /usr/joe/wordcount/output

If any other error comes up while using bin/hadoop, try prepending sudo to the command, or give yourself root privileges for the session with sudo -s.

Also, check paths with:
$ pwd

as you may have chosen different locations for files; do not blindly follow this tutorial! Keep in mind that versions may also change with updates and new releases, e.g. Java, Hadoop, etc.


Check the logs using the web interface (for this generation of Hadoop, the JobTracker UI is typically at http://localhost:50030 and the NameNode UI at http://localhost:50070):


Other commands that may be useful:

$ bin/hadoop job -list-active-trackers

Scripts to restart everything:
$ /usr/lib/hadoop-0.20/bin/stop-all.sh
$ /usr/lib/hadoop-0.20/bin/start-all.sh


  1. I'm having some trouble.

    I followed successfully through installing Hadoop 0.20.2 in /bin using the .tar.gz download.

    The next part confuses me. You tell us to find /usr/local/hadoop-1.0.0/conf/hadoop-env.sh
    Unfortunately I don't have a hadoop-1.0.0 (or any other hadoop directory) in /usr/local.
    Instead I found usr/lib/hadoop-0.20/conf/hadoop-env.sh
    I tried changing the JAVA_HOME variable in this file to a variety of things, but none of them made the bin/hadoop command work. It tells me "no such file or directory."

    When I enter "which java" the result is:

    When I enter "java -version" the result is:
    java version "1.6.0_21"

    Any help is appreciated.

    1. Hi, thank you for using my tutorial and providing debug info.

      It appears that my steps may be wrong; I apologise.

      Instead of installing Hadoop 0.20, which is a legacy version, please download 1.0.0 which is a stable version, and untar in /usr/local. Then continue with the rest of the steps.

      I will amend my post, thanks again for pointing that out. Let me know again if it doesn't work. The reason I write "/usr/local/hadoop-1.0.0/conf/hadoop-env.sh" is because I downloaded Hadoop version 1.0.0 - tailor that to correspond with your version of Hadoop, if you had downloaded a different version.

    2. Okay thanks. I thought that might be the case with v0.20 and v1.0.0.

      In the tutorial it looks like you instruct us to untar the latest version in /bin instead of /usr/local
      Am I right that we should only install it in /usr/local and not /bin ?

      I'm also a little confused about the bin/hadoop command. For me it only works when I'm in the /usr/local/hadoop-1.0.0 directory. Your tutorial makes it sound like the bin/hadoop commands should work no matter what directory I'm in. Can you clarify? Thanks for all the help.

    3. Yes, install it in /usr/local. What happens when you enter the command "hadoop" instead of "bin/hadoop"?

      Apologies for the confusion, I will replicate the problem and follow the tutorial with a clean image today, amend it, and get back to you. I had written the tutorial after lots of hacking things together :)

    4. Hi, I have set up a new VM to replicate your problem, and made some changes to the tutorial. It should be 100% correct now! bin/hadoop will only work in /usr/local/hadoop-1.0.0, so you must remain in that location to run your MapReduce program. Good luck!