This repository has been archived by the owner on Aug 5, 2022. It is now read-only.

Recommendations to achieve best performance

Daisy Deng edited this page Sep 27, 2017 · 29 revisions

To achieve the best performance with Intel® Distribution of Caffe* on Intel CPUs, try the following configurations. We strongly recommend tuning them for your specific machine.

Hardware / BIOS configuration:

  • Make sure that your hardware configuration includes a fast SSD (M.2) drive. If you see "waiting for data" in the logs during training or scoring, install a faster SSD or reduce the batch size.
  • BIOS configurations
    • Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz: Turbo Boost Technology: on; Hyper-Threading (HT): off; NUMA: off
    • Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz: Turbo Boost Technology: on; Hyper-Threading (HT): on; NUMA: off; Memory Mode: cache
    • Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz: Turbo Boost Technology: on; Hyper-Threading (HT): on; NUMA: on
  • Optimize hardware in BIOS: set the CPU to its maximum frequency, set fan speed to 100%, and check the cooling system.
  • For multinode Intel Xeon Phi™ product family over Intel® Omni-Path Architecture use: Processor C6 = Enabled; Snoop Holdoff Count = 9
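
Some of the BIOS settings above (Hyper-Threading, NUMA) can be cross-checked from a running Linux system. The following is a minimal sketch, assuming `lscpu` from util-linux is installed; the helper name `summarize_topology` is hypothetical and not part of Intel® Distribution of Caffe*. It reads `lscpu` output and reports whether Hyper-Threading appears to be enabled (more than one thread per core) and how many NUMA nodes the OS sees.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Intel Caffe): reads `lscpu` text on stdin
# and prints a two-line summary of Hyper-Threading and NUMA state.
summarize_topology() {
    awk -F: '
        # Allow leading whitespace: newer lscpu versions indent these lines.
        /^[ \t]*Thread\(s\) per core/ { gsub(/[ \t]/, "", $2); tpc = $2 }
        /^[ \t]*NUMA node\(s\)/       { gsub(/[ \t]/, "", $2); numa = $2 }
        END {
            if (tpc == "")          ht = "unknown"
            else if (tpc + 0 > 1)   ht = "on"
            else                    ht = "off"
            printf "hyper_threading=%s\n", ht
            printf "numa_nodes=%s\n", (numa == "" ? "unknown" : numa)
        }'
}

# Usage on a live system:
#   lscpu | summarize_topology
```

A single NUMA node reported by the OS usually corresponds to NUMA being off in BIOS on a multi-socket machine, but this is only an indication; the BIOS setup screen remains the authoritative place to confirm the settings.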

Software / OS configuration:

  • We recommend using CentOS Linux 7.2 or newer for Intel Caffe.
  • We recommend using the newest XPPSL software for the Intel Xeon Phi™ product family: https://software.intel.com/en-us/articles/xeon-phi-software#downloads
  • For multinode Intel Xeon and Intel Xeon Phi™ product family over Intel® Omni-Path Architecture use:

    • irqbalance needs to be installed and configured with the --hintpolicy=exact option.
    • CPU frequency needs to be set via the intel_pstate driver:

      echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
      echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo
      cpupower frequency-set -g performance

  • Make sure that no unnecessary processes are running during training and scoring. Intel® Distribution of Caffe* uses all available resources, so other processes (monitoring tools, Java processes, network traffic, etc.) might impact performance.

  • We recommend compiling Intel® Distribution of Caffe* with gcc 4.8.5 (or newer) or with the Intel Compiler; see Build Caffe with Intel Compiler.

  • We recommend compiling Intel® Distribution of Caffe* with the MKLDNN engine; see Installation Guide.

  • Install Intel Caffe and run a performance benchmark.

  • With Intel Xeon Scalable processors (Skylake), we recommend the following configuration:

    export OMP_NUM_THREADS=56
    export KMP_AFFINITY=granularity=fine,compact

    For example: numactl -l $TARGET_CAFFE_BUILD_DIR/tools/caffe time -iterations 100 -model <model.prototxt> --engine=MKLDNN
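
The intel_pstate values recommended earlier in this section (min_perf_pct = 100, no_turbo = 0) can be verified before a benchmark run. The following is a minimal sketch under stated assumptions: the function name `check_pstate` is hypothetical, the optional directory argument exists only so the check can be dry-run against a copy of the files, and the script does not inspect the cpupower governor.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Intel Caffe): verifies the two
# intel_pstate values recommended in this section.
# Optional argument: sysfs directory to inspect (defaults to the real one).
# Returns 0 if both values match, 1 on a mismatch, 2 if the directory is missing.
check_pstate() {
    local dir="${1:-/sys/devices/system/cpu/intel_pstate}"
    local status=0

    if [ ! -d "$dir" ]; then
        echo "intel_pstate directory not found: $dir"
        return 2
    fi
    if [ "$(cat "$dir/min_perf_pct")" != "100" ]; then
        echo "min_perf_pct is not 100"
        status=1
    fi
    if [ "$(cat "$dir/no_turbo")" != "0" ]; then
        echo "no_turbo is not 0 (Turbo Boost is disabled)"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "intel_pstate settings match the recommendation"
    fi
    return "$status"
}

# Usage on a live system:
#   check_pstate || echo "fix the settings before benchmarking"
```

Reading these files normally does not require root, whereas writing them with the echo commands shown earlier does.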

Intel® Distribution of Caffe* / Hyper-Parameters configuration:

  • We provide two sets of prototxt files with hyper-parameters and network topologies. The default set contains the standard topologies and configurations used by the community. The BKM (Best Known Method) set contains our internally developed solution, optimized for Intel MKLDNN and MKL2017 on Intel CPUs.

  • When measuring performance or running training, we recommend starting with the default set to establish a baseline.

  • Use the LMDB data layer (using the 'Images' layer as a data source results in suboptimal performance). We recommend a 95% compression ratio for LMDB. To achieve maximum theoretical performance, don't use any data layer.

  • Change the batch size in the prototxt files. On some configurations, a higher batch size leads to better results.

  • The current implementation uses OpenMP threads. By default, the number of OpenMP threads is set to the number of CPU cores, and each thread is bound to a single core to achieve the best performance. You can override this with OpenMP environment variables such as KMP_AFFINITY, OMP_NUM_THREADS, or GOMP_CPU_AFFINITY. For multi-node tests on Intel Xeon Phi (Knights Mill and Knights Landing), we recommend OMP_NUM_THREADS = number_of_cores - 4. For Intel Xeon Scalable processors (Skylake), we recommend OMP_NUM_THREADS = number_of_cores - 2.

  • For our recommended hyper-parameters, see models/intel_optimized_models.

  • It is possible to speed up training by initializing convolution weights with Gabor filters.
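
The thread-count recommendation in the list above (number of cores minus 4 for multi-node Intel Xeon Phi, minus 2 for Intel Xeon Scalable) can be computed rather than hard-coded. The following is a minimal sketch; the function names `recommended_omp_threads` and `physical_cores` are hypothetical, and `physical_cores` assumes `lscpu` from util-linux is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Intel Caffe).
# Arguments: <number_of_cores> <reserved_cores>
# Prints number_of_cores - reserved_cores, but never less than 1.
recommended_omp_threads() {
    local cores="$1" reserved="$2"
    local threads=$(( cores - reserved ))
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}

# Counts unique (socket, core) pairs so that Hyper-Threading siblings
# are not counted as separate cores.
physical_cores() {
    lscpu -p=SOCKET,CORE | grep -v '^#' | sort -u | wc -l
}

# Example for Intel Xeon Scalable (reserve 2 cores); pass 4 instead for
# the multi-node Intel Xeon Phi case:
#   export OMP_NUM_THREADS="$(recommended_omp_threads "$(physical_cores)" 2)"
#   export KMP_AFFINITY=granularity=fine,compact
```

Counting physical cores rather than logical CPUs keeps the result in line with the default of one thread per core; treat the computed value as a starting point and tune it on your machine.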
