This repository has been archived by the owner on Aug 5, 2022. It is now read-only.

Recommendations to achieve best performance

Pawel Noga edited this page Oct 27, 2016 · 29 revisions

To achieve the best performance with Intel® Distribution of Caffe* on Intel CPUs, apply the following recommendations:

Hardware / BIOS configuration:

  • Make sure that your hardware configuration includes a fast SSD (M.2) drive. If you see "waiting for data" in the logs during training or scoring, install a faster SSD or reduce the batch size.
  • With the Intel Xeon Phi™ product family, enter the BIOS (MCDRAM section) and set the MCDRAM mode to cache.
  • Enable Hyper-Threading (HT) on your platform. This setting can be found in the BIOS (CPU section).
  • Optimize the hardware in the BIOS: set the CPU to its maximum frequency, set the fan speed to 100%, and check the cooling system.
  • For multinode setups with the Intel Xeon Phi™ product family over Intel® Omni-Path Architecture, use:
Processor C6 = Enabled
Snoop Holdoff Count = 9
Intel Turbo Boost Technology = Enabled
Uncore settings: Cluster Mode: All2All
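
The "waiting for data" check from the first recommendation above can be automated once the log output has been captured to a file. The following is a minimal sketch under stated assumptions: the helper name `count_data_stalls` and the file name `train.log` are hypothetical, and the match is case-insensitive because the exact capitalization of the message in the log is not confirmed here.

```sh
# Sketch only: count_data_stalls is an illustrative helper, not a Caffe tool.
# Prints the number of log lines that contain "waiting for data".
count_data_stalls() {
    # -c counts matching lines, -i ignores case. grep exits non-zero when the
    # count is 0, so "|| true" keeps the helper usable under "set -e".
    grep -ci "waiting for data" "$1" || true
}

# Example (train.log is a hypothetical file name):
#   count_data_stalls train.log
```

A count that keeps growing during a run suggests that the data source, not the CPU, is the bottleneck.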

Software / OS configuration:

  • With the Intel Xeon Phi™ product family, CentOS Linux 7.2 or newer is recommended.
  • Use the newest XPPSL software for the Intel Xeon Phi™ product family: https://software.intel.com/en-us/articles/xeon-phi-software#downloads
  • For multinode setups with the Intel Xeon Phi™ product family over Intel® Omni-Path Architecture, use:

irqbalance needs to be installed and configured with the --hintpolicy=exact option.

CPU frequency needs to be set via the intel_pstate driver:

```sh
echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo
cpupower frequency-set -g performance
```

  • Make sure that no unnecessary processes run during training and scoring. Intel Caffe uses all available resources, so other processes (monitoring tools, Java processes, network traffic, etc.) might reduce performance.
  • We recommend compiling Caffe with gcc 4.8.5 or newer.
  • We recommend compiling Caffe with Makefile.config set to:
```ini
CPU_ONLY := 1
USE_MKL2017_AS_DEFAULT_ENGINE := 1
BLAS := mkl
```

Caffe / Hyper-Parameters configuration:

  • Switch the prototxt file that defines the network topology to a version optimized for Intel MKL. Caffe includes versions of popular prototxt files optimized for Intel MKL2017. These files set a specific engine for each layer.

  • Use the LMDB data layer (using the 'Images' layer as the data source results in suboptimal performance). We recommend a 95% compression ratio for LMDB. To achieve maximum theoretical performance, don't use any data layer.

  • Change the batch size in the prototxt files. On some configurations, a larger batch size leads to better results.

  • The current implementation uses OpenMP threads. By default, the number of OpenMP threads is set to the number of CPU cores, and each thread is bound to a single core to achieve the best performance. You can override this with your own configuration through OpenMP environment variables such as KMP_AFFINITY, OMP_NUM_THREADS, or GOMP_CPU_AFFINITY. For a single-node Intel Xeon Phi™ product family system, we recommend OMP_NUM_THREADS = number_of_cores - 2.

  • Our recommended rules for setting hyperparameters for GoogLeNet v1:

  • Batch_per_node <= 128

  • Learning rate (LR) by total batch size: LR = 0.07-0.08 for 1024, LR = 0.03 for 256, LR = 0.005 for 32 (you can rescale it for any total batch size)

  • Number_of_iterations * number_of_nodes * batch_per_node = 32 * 2,400,000 (for slightly better accuracy, use 32 * 2,400,000 * 1.2)

  • Our multinode configuration for 8 nodes:

batch_size_per_node = 128
base_lr: 0.07
max_iter: 75000
lr_policy: "poly"
power: 0.5
momentum: 0.9
weight_decay: 0.0002
solver_mode: CPU
  • The best multi-node scaling is achieved with 2^n - 1 nodes, which guarantees a perfect binary tree.
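
The numeric rules in this section (thread count, iteration budget, and node count) can be sketched as a few shell helpers. This is a minimal sketch, not part of Caffe: the function names `omp_threads`, `max_iter`, and `is_perfect_tree` are hypothetical, and the core count of 68 in the usage line is only an example value.

```sh
# Sketch only: these helper names are illustrative and do not exist in Caffe.

# Single-node Intel Xeon Phi rule: OMP_NUM_THREADS = number of cores - 2.
# $1 = number of physical cores
omp_threads() {
    echo $(( $1 - 2 ))
}

# Solve  max_iter * nodes * batch_per_node = 32 * 2,400,000  for max_iter.
# $1 = number of nodes, $2 = batch size per node
max_iter() {
    echo $(( 32 * 2400000 / ($1 * $2) ))
}

# Succeeds when the node count has the form 2^n - 1 (1, 3, 7, 15, ...).
# $1 = number of nodes
is_perfect_tree() {
    [ "$1" -ge 1 ] && [ $(( ($1 + 1) & $1 )) -eq 0 ]
}

# Example: 68 cores is an assumed core count, not a recommendation.
export OMP_NUM_THREADS=$(omp_threads 68)

# The 8-node, batch-128 configuration listed above.
max_iter 8 128    # prints 75000
```

For the 8-node configuration listed above, `max_iter 8 128` reproduces the `max_iter: 75000` value. The division is integer division and truncates, so round up by hand when the total batch does not divide the budget evenly.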

*Other names and brands may be claimed as the property of others
