DLRM v1 Training best known configurations with Intel® Extension for PyTorch.
Use Case | Framework | Model Repo | Branch/Commit/Tag | Optional Patch |
---|---|---|---|---|
Training | PyTorch | https://github.com/facebookresearch/dlrm | - | - |
- Installation of PyTorch and Intel Extension for PyTorch
- Build of PyTorch + IPEX + TorchVision, Jemalloc and TCMalloc
- Installation of oneccl-bind-pt (if running distributed)
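If you do run distributed, one possible install route is sketched below; the extra index URL follows Intel's published instructions for the CPU wheels and may change between releases, so cross-check it against the IPEX installation guide for your version:
```bash
# distributed runs only: oneCCL bindings for PyTorch.
# The extra index URL follows Intel's install docs and may change per release.
python -m pip install oneccl_bind_pt --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/
```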
- Set Jemalloc and TCMalloc preload for better performance. Jemalloc and TCMalloc should be built as described in the General setup section.
export LD_PRELOAD="<path to the jemalloc directory>/lib/libjemalloc.so":"path_to/tcmalloc/lib/libtcmalloc.so":$LD_PRELOAD export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000"
- Set IOMP preload for better performance
pip install packaging intel-openmp
export LD_PRELOAD=<path to the intel-openmp directory>/lib/libiomp5.so:$LD_PRELOAD
- Set ENV to use fp16 AMX if you are using a supported platform
export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16
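Whether the CPU actually exposes AMX (and FP16 AMX in particular) can be read from the CPU flags; a minimal check on Linux:
```bash
# list AMX-related CPU flags; amx_tile/amx_int8/amx_bf16 indicate base AMX,
# amx_fp16 indicates FP16 AMX support
grep -o 'amx[^ ]*' /proc/cpuinfo | sort -u
```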
The Criteo Terabyte Dataset is used to run DLRM. To download the dataset, you will need to visit the Criteo website and accept their terms of use: https://labs.criteo.com/2013/12/download-terabyte-click-logs/. Copy the download URL into the command below as the `<download url>`, and replace `<dir/to/save/dlrm_data>` with any path where you want to download and save the dataset.
export DATASET_DIR=<dir/to/save/dlrm_data>
mkdir ${DATASET_DIR} && cd ${DATASET_DIR}
curl -O <download url>/day_{$(seq -s , 0 23)}.gz
gunzip day_*.gz
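After extraction there should be 24 uncompressed files, day_0 through day_23; a quick check (a small sketch, assuming the file names above):
```bash
# verify all 24 extracted day files exist in $DATASET_DIR
for i in $(seq 0 23); do
  [ -f "${DATASET_DIR}/day_${i}" ] || echo "missing day_${i}"
done
```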
The raw data will be automatically preprocessed and saved as `day_*.npz` to the `DATASET_DIR` when DLRM is run for the first time. On subsequent runs, the scripts will automatically use the preprocessed data.
git clone https://github.com/IntelAI/models.git
cd models/models_v2/pytorch/dlrm/training/cpu
- Create a virtual environment `venv` and activate it:
python3 -m venv venv
. ./venv/bin/activate
- Install general model requirements
pip install -r requirements.txt
- Install the latest CPU versions of torch, torchvision and intel_extension_for_pytorch.
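The exact commands depend on the torch/IPEX version pair you target; as a sketch (verify against the official PyTorch and IPEX installation matrices), the CPU wheels can typically be installed with:
```bash
# CPU-only wheels; pin matching versions per the IPEX compatibility matrix
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install intel-extension-for-pytorch
```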
- Set up the required environment parameters
Parameter | export command |
---|---|
DISTRIBUTED (leave unset if training single-node) | export DISTRIBUTED=true |
NODE (leave unset if training single-node) | export NODE=2 |
NUM_CCL_WORKER (leave unset if training single-node) | export NUM_CCL_WORKER=4 |
HOSTFILE (leave unset if training single-node) | export HOSTFILE=<your host file> |
OUTPUT_DIR | export OUTPUT_DIR=$PWD |
DATASET_DIR | export DATASET_DIR=<path-to-dlrm_data> or <path-to-preprocessed-data> |
BATCH_SIZE | export BATCH_SIZE=10000 |
PRECISION | export PRECISION=fp32 <specify the precision to run: fp32, bf32 or bf16> |
NUM_BATCH | export NUM_BATCH=<10000 to test performance, 50000 to test the convergence trend> |
(optional) Compile model with PyTorch Inductor backend | export TORCH_INDUCTOR=1 |
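For example, a single-node run could be configured as follows (illustrative values taken from the table above; the distributed-only variables are left unset):
```bash
# illustrative single-node configuration; values follow the table above
export OUTPUT_DIR=$PWD
export DATASET_DIR=<path-to-dlrm_data>
export BATCH_SIZE=10000
export PRECISION=fp32
export NUM_BATCH=10000
```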
- Run `run_model.sh`
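With the environment variables above set, the script is launched from the model directory, e.g.:
```bash
# launch training from models/models_v2/pytorch/dlrm/training/cpu
bash ./run_model.sh
```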
Single-tile output will typically look like:
accuracy 76.215 %, best 76.215 %
dlrm_inf latency: 0.11193203926086426 s
dlrm_inf avg time: 0.007462135950724284 s, ant the time count is : 15
dlrm_inf throughput: 4391235.996821996 samples/s
Final results of the training run can be found in the `results.yaml` file.
results:
 - key: throughput
   value: 4391236.0
   unit: inst/s
 - key: latency
   value: 0.007462135950724283
   unit: s
 - key: accuracy
   value: 76.215
   unit: accuracy
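To pull a single metric out of the file programmatically, a minimal sketch (assuming `results.yaml` lands in `$OUTPUT_DIR` with the layout shown above):
```bash
# print the throughput entry (key, value, unit) from results.yaml;
# the $OUTPUT_DIR location is an assumption
grep -A 2 'key: throughput' "${OUTPUT_DIR}/results.yaml"
```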