I started this project to be able to set up a simple, working Hadoop environment in minutes, and to recreate it without any hassle when I messed up. So I ended up with Vagrant:
Vagrant provides easy to configure, reproducible, and portable work environments built on top of industry-standard technology and controlled by a single consistent workflow to help maximize the productivity and flexibility of you and your team. - Why Vagrant?
You can find the code and README at the GitHub repo hadoop-single-node-vagrant.
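The only prerequisites are Vagrant itself and a provider such as VirtualBox on the host machine. A quick way to check that Vagrant is installed:
$ vagrant --version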
Setting up the single-node Hadoop environment is as easy as:
$ git clone https://github.com/baswenneker/hadoop-single-node-vagrant
$ cd hadoop-single-node-vagrant
$ vagrant up
The first command creates a folder hadoop-single-node-vagrant in the current directory and downloads the project files from the git repository. After moving into that folder, vagrant up provisions the Hadoop environment.
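The first vagrant up takes a while, since the base box and the Hadoop tarball have to be downloaded. You can always ask Vagrant whether the machine is up:
$ vagrant status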
The provisioning process creates a user called hduser, which we use to execute Hadoop commands. To use the box, we have to SSH into it:
$ vagrant ssh -- -l hduser
The password of hduser is hduser.
You're good to go!
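As a quick sanity check inside the box, ask Hadoop for its version (the exact output depends on the Hadoop release the provisioning script installed):
$ hadoop version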
Of course you want to see some action. We'll use the Hadoop wordcount example to show off to your friends.
First things first, let's start HDFS and Yarn.
$ start-dfs.sh
$ start-yarn.sh
To check whether all daemons are up and running, use jps and see if the output looks roughly like this:
$ jps
11261 NameNode
11842 Jps
11365 DataNode
11813 NodeManager
11708 ResourceManager
11542 SecondaryNameNode
You can also check the health of Hadoop by browsing to the NameNode web UI at http://192.168.33.10:50070/ on the host machine.
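If you prefer the command line, hdfs dfsadmin -report prints similar health information, including the state of the datanode:
$ hdfs dfsadmin -report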
Create the directory that will hold the input files whose words we are going to count (-p creates the full path, just like mkdir -p on a regular filesystem). For more information about the HDFS shell commands, see the Hadoop File System Shell Guide.
$ hdfs dfs -mkdir -p /tmp/testing/wordcount_in
Create a sample text file to count:
$ echo "Hello World <> Hello Hadoop" >> sample.txt
Copy the sample file to the wordcount_in folder we just created on the HDFS filesystem.
$ hdfs dfs -copyFromLocal sample.txt /tmp/testing/wordcount_in/
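By the way, hdfs dfs -put does exactly the same for local files and is a bit shorter:
$ hdfs dfs -put sample.txt /tmp/testing/wordcount_in/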
Just to make sure the file was copied, list the directory:
$ hdfs dfs -ls /tmp/testing/wordcount_in
Found 1 items
-rw-r--r-- 1 hduser supergroup 38 2014-07-09 09:48 /tmp/testing/wordcount_in/sample.txt
Now we're ready to let Hadoop take care of counting the words:
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /tmp/testing/wordcount_in /tmp/testing/wordcount_out
You should be able to beat Hadoop by doing the wordcount yourself, but hey, you're a geek and Hadoop is awesome, so we use Hadoop. While you're waiting for Hadoop to finish, why not check the cluster status at http://192.168.33.10:8088/cluster?
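If you'd rather stay in the terminal, you can list the applications YARN is currently running:
$ yarn application -list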
To check the results afterwards:
$ hdfs dfs -ls /tmp/testing/wordcount_out/
Found 2 items
-rw-r--r-- 1 hduser supergroup 0 2014-07-09 11:11 /tmp/testing/wordcount_out/_SUCCESS
-rw-r--r-- 1 hduser supergroup 84 2014-07-09 11:11 /tmp/testing/wordcount_out/part-r-00000
Now let's see if Hadoop came up with the right answer. The empty _SUCCESS file is just a completion marker; the actual counts are in part-r-00000.
$ hdfs dfs -cat /tmp/testing/wordcount_out/part-r-00000
<> 1
Hadoop 1
Hello 2
World 1
Wow!
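If you want the result as a local file instead of cat-ing it from HDFS, getmerge concatenates the output files into one local file (the empty _SUCCESS marker adds nothing):
$ hdfs dfs -getmerge /tmp/testing/wordcount_out wordcount_result.txt
Want to run the job again? Remove the output directory first, because Hadoop refuses to overwrite it:
$ hdfs dfs -rm -r /tmp/testing/wordcount_out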
If you messed up the box, you can destroy and recreate it by entering the following commands on the host machine:
$ vagrant destroy
$ vagrant up
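If you just want to stop the box without throwing away your work, vagrant halt shuts the VM down cleanly (ideally after running stop-yarn.sh and stop-dfs.sh inside the box), and a later vagrant up boots it again with everything intact:
$ vagrant halt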
To save bandwidth and time, the provisioning script downloads and stores the Hadoop tarball in the shared directory (the project folder on the host machine, /vagrant on the guest machine). If the download fails for some reason, delete the tarball and rerun vagrant provision.
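Assuming the tarball keeps its upstream file name (something like hadoop-2.x.y.tar.gz; the exact name depends on the Hadoop version the script downloads), that boils down to the following on the host machine, from the project folder:
$ rm hadoop-*.tar.gz  # assumes the tarball kept its upstream name
$ vagrant provision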
You might get the following warning message every now and then:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
This does no harm and can be ignored. For a fix, see http://stackoverflow.com/questions/19943766/hadoop-unable-to-load-native-hadoop-library-for-your-platform-error-on-centos.