I'm falling in love with Hadoop and Docker, and I want to use them to "containerize" a Hadoop cluster. I created this project to build and run Hadoop modules inside Docker containers. It is meant for practice and has not been tested in a production environment.
If you wish to customize the source code and push the images to your own Docker repo for future use, you will need to change `vietanh85` (my Docker account) to your account in the `docker-compose.*.yml` files; look for the `image` property of each service.
For this practice, I'm going to use two modules of the Hadoop system:
- Yarn for node/resource management
- HDFS for storage.
Both of them ship in the same Hadoop package; the only difference is the startup script. To save code and effort, I decided to create a base image for all modules, so the Docker engine does not have to download the Hadoop package every time it builds an image. The hierarchy of our images is simple: a shared base (onbuild) image that holds the Hadoop package, with one small image per module built on top of it.
You may know that `docker-compose` is a great tool to define your services with their dependencies and wire them up together with a single command, `docker-compose up`.
If I used the same `docker-compose.yml` file for both building and running, `docker-compose` would automatically wire up all the services, including the base (onbuild) image, which I don't want to start. So I decided to create `docker-compose.build.yml` just for the build.
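For reference, here is a minimal sketch of what `docker-compose.build.yml` could look like; the service names, build-context directories, and image tags are assumptions for illustration, so check the actual file in the repo:

```yaml
# docker-compose.build.yml (illustrative sketch, not the actual file)
version: "2"
services:
  # Base image: downloads the Hadoop package once; every module Dockerfile starts FROM it.
  hadoop-base:
    build: ./hadoop-base            # assumed directory with the base Dockerfile
    image: vietanh85/hadoop-base    # assumed tag; replace vietanh85 with your account

  # Module images: they only add their own startup script on top of the base image.
  hadoop-pseudo:
    build: ./hadoop-pseudo          # assumed directory
    image: vietanh85/hadoop-pseudo  # assumed tag

  hdfs:
    build: ./hdfs                   # assumed directory
    image: vietanh85/hdfs           # assumed tag

  yarn:
    build: ./yarn                   # assumed directory
    image: vietanh85/yarn           # assumed tag
```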
To build the images, simply run this command and `docker-compose` will take care of the rest:
docker-compose -f docker-compose.build.yml build
To check the build result, run `docker images`; you will see your new images listed there.
You can run your Hadoop system in Pseudo-Distributed Operation mode with this command:
docker-compose -f docker-compose.pseudo.yml up
Now your Pseudo-Distributed Hadoop system is ready, with HDFS and Yarn up and running inside a single container. If you run `docker ps`, you will see that one new container has been started. To access the HDFS Name Node web interface, go to http://localhost:50070; to access the Resource Manager, go to http://localhost:8088.
[IMG]
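For orientation, here is a rough sketch of what `docker-compose.pseudo.yml` might contain; the service name is inferred from the default container name mentioned below, the image tag is an assumption, and only the published ports match the addresses above:

```yaml
# docker-compose.pseudo.yml (illustrative sketch, not the actual file)
version: "2"
services:
  hadoop_pseudo:                    # inferred from the container name hadoopdocker_hadoop_pseudo_1
    image: vietanh85/hadoop-pseudo  # assumed image tag
    ports:
      - "50070:50070"               # HDFS Name Node web UI
      - "8088:8088"                 # Yarn Resource Manager web UI
```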
By default, docker-compose will name your container `hadoopdocker_hadoop_pseudo_1`; to confirm the name, run `docker ps`. Below are the steps to test your container:
# Make the HDFS directories required to execute MapReduce jobs
$ docker exec hadoopdocker_hadoop_pseudo_1 bash -c "hdfs dfs -mkdir -p /user/root/input"
# Copy the input files into the distributed filesystem
$ docker exec hadoopdocker_hadoop_pseudo_1 bash -c "hdfs dfs -put etc/hadoop/*.xml input"
# Run some of the examples provided:
$ docker exec hadoopdocker_hadoop_pseudo_1 bash -c "hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'"
# View the output files on the distributed filesystem:
$ docker exec hadoopdocker_hadoop_pseudo_1 bash -c "hdfs dfs -cat output/*"
Now it's time to run your Hadoop system in "fully" distributed mode. I put the word "fully" in double quotes because it is not a real fully distributed system running on multiple hosts; instead, we will run our cluster in multiple containers, all on a single host.
[IMG]
To start your Hadoop Cluster, you can run this command:
docker-compose -f docker-compose.cluster.yml up
docker-compose will start three containers (a sketch of the compose file follows this list):
- HDFS (name node)
- Yarn (both resource manager and node manager)
- HDFS (data node)
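Here is a minimal sketch of what `docker-compose.cluster.yml` might look like, with one service per container above; the `hdfs_data` name is taken from the scale command later in this post, but the other service names, image tags, and ports are assumptions:

```yaml
# docker-compose.cluster.yml (illustrative sketch, not the actual file)
version: "2"
services:
  hdfs_name:                  # assumed name for the HDFS name node service
    image: vietanh85/hdfs     # assumed image tag
    ports:
      - "50070:50070"         # Name Node web UI (assumed to be published)

  yarn:                       # assumed name for the resource/node manager service
    image: vietanh85/yarn     # assumed image tag
    ports:
      - "8088:8088"           # Resource Manager web UI (assumed to be published)

  hdfs_data:                  # name matches the scale command below
    image: vietanh85/hdfs     # assumed image tag
    depends_on:
      - hdfs_name             # data nodes register with the name node
```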
To see your containers, run `docker ps`.
You can easily scale your data nodes using docker-compose as well:
docker-compose -f docker-compose.cluster.yml scale hdfs_data=3
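Note that for scaling to work, the data-node service cannot set a `container_name` and should not publish a fixed host port, otherwise docker-compose cannot create more than one replica. A hedged sketch of the relevant fragment of the cluster file (image tag assumed):

```yaml
# Fragment sketch: a scalable data-node service. No container_name and no host
# port mapping, so docker-compose can create hadoopdocker_hdfs_data_1, _2, _3, ...
services:
  hdfs_data:
    image: vietanh85/hdfs   # assumed image tag
```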
Issue: moby/swarmkit#939
TBD
- Run Pseudo-Distributed Operation
- Run Cluster Operation
- Use Docker Swarm to deploy and run on multiple hosts