This repository has been archived by the owner on Aug 5, 2022. It is now read-only.

Recommendations to achieve best performance

Pawel Noga edited this page Dec 17, 2016 · 29 revisions

To achieve the best performance with Intel® Distribution of Caffe* on Intel CPUs, apply the following recommendations:

Hardware / BIOS configuration:

  • Make sure that your hardware configuration includes a fast SSD (M.2) drive. If you see "waiting for data" in the logs during training or scoring, install a faster SSD or reduce the batch size.
  • With the Intel Xeon Phi™ product family, enter the BIOS (MCDRAM section) and set the MCDRAM mode to cache.
  • Enable Hyper-Threading (HT) on your platform. This setting can be found in the BIOS (CPU section).
  • Optimize the hardware in the BIOS: set the CPU max frequency, set the fan speed to 100%, and check the cooling system.
  • For multinode Intel Xeon Phi™ product family over Intel® Omni-Path Architecture use:
Processor C6 = Enabled
Snoop Holdoff Count = 9
Intel Turbo Boost Technology = Enabled
Uncore settings: Cluster Mode: All2All

Software / OS configuration:

  • With the Intel Xeon Phi™ product family, CentOS Linux 7.2 or newer is recommended.
  • Use the newest XPPSL software for the Intel Xeon Phi™ product family: https://software.intel.com/en-us/articles/xeon-phi-software#downloads
  • For multinode Intel Xeon Phi™ product family over Intel® Omni-Path Architecture use:

irqbalance needs to be installed and configured with the --hintpolicy=exact option.

The CPU frequency needs to be set via the intel_pstate driver:

echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo
cpupower frequency-set -g performance
  • Make sure that no unnecessary processes are running during training and scoring. Intel® Distribution of Caffe* uses all available resources, so other processes (monitoring tools, Java processes, network traffic, etc.) might reduce performance.
  • We recommend compiling Intel® Distribution of Caffe* with gcc 4.8.5 or newer.
  • We recommend compiling Intel® Distribution of Caffe* with the following settings in Makefile.config:
CPU_ONLY := 1
USE_MKL2017_AS_DEFAULT_ENGINE := 1
BLAS := mkl

Intel® Distribution of Caffe* / Hyper-Parameters configuration:

  • We provide two sets of prototxt files with hyper-parameters and network topologies. The default set contains the standard topologies and configurations used by the community. The BKM (Best Known Method) set contains our internally developed solutions optimized for Intel MKL2017 and Intel CPUs.

  • When running performance tests and training, we recommend starting with the default set to establish a baseline.

  • Use the LMDB data layer (using the ‘Images’ layer as the data source results in suboptimal performance). We recommend a 95% compression ratio for LMDB. To measure maximum theoretical performance, don't use any data layer.

  • Change the batch size in the prototxt files. On some configurations a larger batch size leads to better results.

  • The current implementation uses OpenMP threads. By default, the number of OpenMP threads is set to the number of CPU cores, and each thread is bound to a single core to achieve the best performance. You can override this with OpenMP environment variables such as KMP_AFFINITY, OMP_NUM_THREADS, or GOMP_CPU_AFFINITY. For a single-node Intel Xeon Phi™ product family system, we recommend OMP_NUM_THREADS = number_of_cores - 2.

  • Our recommended rules for setting hyper-parameters for GoogLeNet v1:

  • Batch_per_node <= 128

  • Learning rate (LR) by total batch size: LR = 0.07-0.08 for 1024, LR = 0.03 for 256, LR = 0.005 for 32 (you can rescale it for any total batch size)

  • Number_of_iterations * number_of_nodes * batch_per_node = 32 * 2,400,000 (for slightly better accuracy, use 32 * 2,400,000 * 1.2)

  • Our multinode configuration for 8 nodes:

batch_size_per_node = 128
base_lr: 0.07
max_iter: 75000
lr_policy: "poly"
power: 0.5
momentum: 0.9
weight_decay: 0.0002
solver_mode: CPU
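
As a cross-check, the iteration rule above reproduces the max_iter value of the 8-node configuration. The following is a minimal Python sketch of that arithmetic. The function name `max_iter_for` and its parameters are hypothetical and are not part of Intel® Distribution of Caffe*; the code only restates the rule Number_of_iterations * number_of_nodes * batch_per_node = 32 * 2,400,000.

```python
# Total number of images processed over a full training run, per the rule:
# iterations * nodes * batch_per_node = 32 * 2,400,000.
TOTAL_IMAGES = 32 * 2_400_000


def max_iter_for(nodes, batch_per_node, accuracy_factor=1.0):
    """Return the iteration count implied by the rule for a given cluster size.

    accuracy_factor=1.2 corresponds to the "slightly better accuracy" variant.
    The name and signature are illustrative assumptions, not a Caffe API.
    """
    if nodes < 1 or batch_per_node < 1:
        raise ValueError("nodes and batch_per_node must be positive")
    if batch_per_node > 128:
        raise ValueError("the recommendation is batch_per_node <= 128")
    total_batch = nodes * batch_per_node
    return round(TOTAL_IMAGES * accuracy_factor / total_batch)


if __name__ == "__main__":
    # 8 nodes with 128 images per node give a total batch of 1024.
    print(max_iter_for(nodes=8, batch_per_node=128))  # prints 75000
```

With `accuracy_factor=1.2` the same call returns 90000. The learning rate is deliberately left out of the sketch, because the page lists values for only three total batch sizes and does not state a rescaling formula.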

Intel® Distribution of Caffe* benchmark:

Intel® Distribution of Caffe* lets you easily benchmark any topology and check its performance. To run the benchmark, enter the command:

caffe time --model=[path_to_prototxt_file_with_model] -iterations [number_of_iterations]

Example:

./build/tools/caffe time --model=models/bvlc_googlenet/train_val.prototxt -iterations 100

To get results in images/s, find the last section of the log:

Average Forward pass: xxx ms.
Average Backward pass: xxx ms.
Average Forward-Backward: xxx ms.

and use the equation:

[Images/s] = batchsize * 1000 / Average Forward-Backward [ms]
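
The equation above can also be applied to the benchmark output automatically. The sketch below is a minimal Python example, assuming the log contains a line of the form `Average Forward-Backward: xxx ms.` as quoted above. The function name `images_per_second` is hypothetical, and the batch size must be supplied by the caller because it comes from the prototxt file, not from the log.

```python
import re

# Matches the summary line quoted above, e.g. "Average Forward-Backward: 100 ms."
# An optional exponent is accepted in case the value is printed in scientific notation.
_FWD_BWD = re.compile(
    r"Average Forward-Backward:\s*([0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)\s*ms"
)


def images_per_second(log_text, batch_size):
    """Compute throughput from the text of a `caffe time` log.

    Implements [Images/s] = batchsize * 1000 / Average Forward-Backward [ms].
    """
    matches = _FWD_BWD.findall(log_text)
    if not matches:
        raise ValueError("no 'Average Forward-Backward' line found in the log")
    avg_ms = float(matches[-1])  # use the last reported value
    if avg_ms <= 0:
        raise ValueError("average time must be positive")
    return batch_size * 1000.0 / avg_ms


if __name__ == "__main__":
    # Made-up timings for illustration only; not measured results.
    sample_log = (
        "Average Forward pass: 40 ms.\n"
        "Average Backward pass: 60 ms.\n"
        "Average Forward-Backward: 100 ms.\n"
    )
    print(images_per_second(sample_log, batch_size=32))  # prints 320.0
```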


*Other names and brands may be claimed as the property of others
