Multinode guide

Guide to multi-node training with Intel® Distribution of Caffe*

This is an introduction to multi-node training with Intel® Distribution of Caffe* framework. All other pages related to multi-node in this wiki are supplementary and they are referred to in this guide. By the end of it, you should understand how multi-node training was implemented in Intel® Distribution of Caffe* and be able to train any topology yourself on a simple cluster. Basic knowledge of BVLC Caffe usage might be necessary to understand it fully. Also be sure to check out the performance optimization guidelines.

Introduction

To make the practical part of this guide more comprehensible, the instructions assume you have configured from scratch a cluster comprising 4 nodes. You will learn how to configure such a cluster, how to compile Intel® Distribution of Caffe*, how to run a training of a particular model, and how to verify the network actually have trained.

How it works

In case you are not interested in how multi-node in Intel® Distribution of Caffe* works and just want to run the training, please skip to the practical part chapter of this Wiki.

Intel® Distribution of Caffe* is designed for both single-node and multi-node operation. Here, the multi-node part is explained.

There are two general approaches to parallelization. Data parallelism and model parallelism. The approach used in Intel® Distribution of Caffe* is the data parallelism.

Data parallelism

The data parallelization technique runs training on different batches of data on each of the nodes. The data is split among all nodes but the same model is used. It means that the total batch size in a single iteration is equal to the sum of individual batch sizes of all nodes. For example a network is trained on 8 nodes. All of them have batch size of 128. The (total) batch size in a single iteration of the Stochastic Gradient Descent algorithm is 8*128=1024.

Intel® Distribution of Caffe* with MLSL offers two approaches for multi-node training:

Default - Caffe does Allreduce operation for gradients and then each node is doing SGD locally, followed by Allgather for weights increments.
Distributed weights update - Caffe does Reduce-Scatter operation for gradients, then each node is doing SGD locally, followed by Allgather for weights increments.

Distribution of data

One approach is to divide your training data set into disjoint subsets of roughly equal size. Distribute each subset into each node used for training. Run the multinode training with data layer prepared accordingly, which means either preparing separate proto configurations or placing each subset in exactly the same path for each node.

An easier approach is to simply distribute the full data set on all nodes and configure data layer to draw different subset on each node. Remember to set shuffle:true for the training phase in prototxt. Since each node has its own unique randomizing seed, it will effectively draw unique image subset.

Communication

Intel® Distribution of Caffe* is utilizing Intel® Machine Learning Scaling Library (MLSL) which provides communication primitives for data parallelism and model parallelism, communication patterns for SGD and its variants (AdaGrad, Momentum, etc), distributed weight update. It is optimized for Intel® Xeon® and Intel® Xeon Phi (TM) processors and supports Intel® Omni-Path Architecture, Infiniband and Ethernet. Refer to MLSL Wiki or "MLSL Developer Guide and Reference" for more details on the library.

Snapshots

Snapshots are saved only by the node hosting the root process (rank number 0). In order to resume training from a snapshot the file has to be populated across all nodes participating in a training.

Test phase during training

If test phase is enabled in the solver’s protobuf file all the nodes are carrying out the tests and results are aggregated by Allreduce operation. The validation set needs to be present on every machine which have test phase specified in solver protobuf file. This is important because when you want to use the same solver file on all machines instead of working with multiple protobuf files you need to remember about that.

Configuring Cluster for Intel® Distribution of Caffe*

This chapter explains how to configure a cluster, and what components to install in order to build Intel® Distribution of Caffe* to start distributed training using Intel® Machine Learning Scaling Library.

Hardware and software configuration

Hardware assumptions for this guide: 4 machines with IP addresses in the range from 192.161.32.1 to 192.161.32.4 up the cluster Start from fresh installation of CentOS 7.2 64-bit. The OS image can be downloaded free of charge from the official website. Minimal ISO is enough. You should install the OS on each node (all 4 in our example). Next upgrade to the latest version of packages (do it on each node):

# yum upgrade

TIP: You can also execute yum -y upgrade to suppress the prompt asking for confirmation of the operation (unattended upgrade).

Preparing the system

Before installing Intel® Distribution of Caffe* you need to install prerequisites. Start by choosing the master machine (e.g. 192.161.32.1 in our example).

On each machine install “Extra Packages for Enterprise Linux”:

# yum install epel-release
# yum clean all

On master machine install "Development Tools" and ansible:

# yum groupinstall "Development Tools"
# yum install ansible

Configuring ansible and ssh

Configure ansible's inventory on master machine by adding sections ourmaster and ourcluster in /etc/ansible/ hosts and fill in slave IPs:

[ourmaster]
192.161.31.1
[ourcluster]
192.161.32.[2:4]

On each slave machine configure SSH authentication using master machine’s public key, so that you can log in with ssh without a password. Generate RSA key on master machine:

$ ssh-keygen -t rsa

And copy the public part of the key to slave machines:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub 192.161.32.2
$ ssh-copy-id -i ~/.ssh/id_rsa.pub 192.161.32.3
$ ssh-copy-id -i ~/.ssh/id_rsa.pub 192.161.32.4

Verify ansible works by running ping command from master machine. The slave machines should respond.

$ ansible ourcluster -m ping

Example output:

192.168.31.2 | SUCCESS => {
    “changed“: false,
    “ping“: “pong“
}
192.168.31.3 | SUCCESS => {
    “changed“: false,
    “ping“: “pong“
}
192.168.31.4 | SUCCESS => {
    “changed“: false,
    “ping“: “pong“
}

Master machine can also ping itself by ansible ourmaster -m ping and entire inventory by ansible all -m ping.

Installing tools

On master machine use ansible to install packages listed by running the command below for the entire cluster.

# ansible all -m shell -a 'yum -y install python-devel boost boost-devel cmake numpy \
  numpy-devel gflags gflags-devel glog glog-devel protobuf protobuf-devel hdf5 \
  hdf5-devel lmdb lmdb-devel leveldb leveldb-devel snappy-devel opencv opencv-devel'

Optionally you can install additional system tools you may find usefull.

# ansible all -m shell -a 'yum install -y mc cpuinfo htop tmux screen iftop iperf \
  vim wget'

You might be required to turn off the firewall on each node (refer to Firewalls and MPI for more information), too.

# ansible all -m shell -a 'systemctl stop firewalld.service'

The cluster is ready to deploy binaries of Intel® Distribution of Caffe*. Let’s build it now.

Building Intel® Distribution of Caffe*

This chapter explains how to build Intel® Distribution of Caffe* for multi-node (distributed) training of neural networks.

Getting Intel® Distribution of Caffe* Source Code

On master machine execute the following git command in order to obtain the latest snapshot of Intel® Distribution of Caffe* including multi-node support for distributed training.

$ git clone https://github.com/intel/caffe.git intelcaffe

Note: Build of Intel® Distribution of Caffe* will trigger Intel® Math Kernel Library for Machine Learning (MKLML) and Machine Learning Scaling Library (MLSL) to be downloaded to the intelcaffe/external/mkl/ and intelcaffe/external/mlsl directory and automatically configured.

Building from Makefile

This section covers only the portion required to build Intel® Distribution of Caffe* with multi-node support using Makefile. Please refer to Caffe documentation for general information on how to build Caffe using Makefile.

Start by changing work directory to the location where Intel® Distribution of Caffe* repository have been downloaded (e.g. ~/intelcaffe).

$ cd ~/intelcaffe

Make a copy of Makefile.config.example, and name it Makefile.config

$ cp Makefile.config.example Makefile.config

Open Makefile.config in your favorite editor and uncomment USE_MLSL variable.

# Intel(r) Machine Learning Scaling Library (uncomment to build with MLSL)
USE_MLSL := 1

Execute make command to build Intel® Distribution of Caffe* with multi-node support.

$ make -j <number_of_physical_cores> -k

Building from CMake

This section covers only the portion required to build Intel® Distribution of Caffe* with multi-node support using CMake. Please refer to Caffe documentation for general information on how to build Caffe using CMake. Start by changing work directory to the location where Intel® Distribution of Caffe* repository have been downloaded (e.g. ~/intelcaffe).

$ cd ~/intelcaffe

Create build directory and change work directory to build directory.

$ mkdir build
$ cd build

Execute the following CMake command in order to prepare the build

$ cmake .. -DBLAS=mkl -DUSE_MLSL=1 -DCPU_ONLY=1

Execute make command to build Intel® Distribution of Caffe* with multi-node support.

$ make -j <number_of_physical_cores> -k
$ cd ..

Populating Caffe Binaries across Cluster Nodes

After successful build synchronize intelcaffe directories on the slave machines.

$ ansible ourcluster -m synchronize -a ‘src=~/intelcaffe dest=~/’

Running Multi-node Training with Intel® Distribution of Caffe*

Instructions on how to train CIFAR10 and GoogLeNet are explained in more details in Multi-node CIFAR10 tutorial and Multi-node GoogLeNet tutorial. It is recommended to do CIFAR10 tutorial before you proceed. Here, the GoogLeNet will be trained on 4 node cluster. If you want to learn more about GoogLeNet training see the tutorial mentioned above as well.

Before you can train anything you need to prepare the dataset. It is assumed that you have already downloaded the ImageNet training and validation datasets, and they are stored on each node in /home/data/imagenet/train directory for training set and /home/data/imagenet/val for validation set. For details you can look at the Data Preparation section of BVLC Caffe examples at http://caffe.berkeleyvision.org/gathered/examples/imagenet.html. You can use your own data sets as well.

Next step is to create machine file ~/mpd.hosts on master node for controlling the placement of MPI process across the machines:

192.161.32.1
192.161.32.2
192.161.32.3
192.161.32.4

Update your model file models/bvlc_googlenet/train_val_client.prototxt:

 name: "GoogleNet"
 layer {
   name: "data"
   type: "ImageData"
   top: "data"
   top: "label"
   include {
   phase: TRAIN
   }
   transform_param {
   mirror: true
   crop_size: 224
   mean_value: 104
   mean_value: 117
   mean_value: 123
   }
   image_data_param {
   source: "/home/data/train.txt"
   batch_size: 256
   shuffle: true
   }
 }
 layer {
 name: "data"
 type: "ImageData"
 top: "data"
 top: "label"
 include {
 phase: TEST
   }
   transform_param {
   crop_size: 224
   mean_value: 104
   mean_value: 117
   mean_value: 123
   }
   image_data_param {
   source: "/home/data/val.txt"
   batch_size: 50
   new_width: 256
   new_height: 256
   }
 }

Synchronize the intelcaffe directories, change your working directory to intelcaffe.

Configure Intel® Machine Learning Scaling Library.

$ source external/mlsl/l_mlsl_2017.1.016/intel64/bin/mlslvars.sh

And start the training process with the following command:

$ mpirun -n 4 -ppn 1 -machinefile ~/mpd.hosts ./build/tools/caffe train \
  --solver=models/bvlc_googlenet/solver_client.prototxt --engine=MKL2017 2>&1 | tee -i ~/intelcaffe/multinode_train.out

Log from the training process will be written to multinode_train.out file.

Test the trained network

When the training is finished, you can test how your network has trained with the following command:

$ ./build/tools/caffe test --model=models/bvlc_googlenet/train_val_client.prototxt 
  --weights=multinode_googlenet_iter_100000.caffemodel --iterations=1000

Look at the bottom lines of output from the above command which contains loss3/top-1 and loss3/top-5. The values should be around loss3/top-1 = 0.69 and loss3/top-5 = 0.886.

For more information about caffe test visit Caffe interfaces website at http://caffe.berkeleyvision.org/tutorial/interfaces.html.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly