
PyTorch LLaMA2 7B/13B inference (generation)

Description

This document has instructions for running LLaMA2 7B and LLaMA2 13B inference (generation) using Intel-optimized PyTorch.

Bare Metal

General setup

Follow the link to install and build PyTorch, IPEX, TorchVision, and TCMalloc.
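One possible way to build TCMalloc locally is sketched below (a hedged sketch only; the linked setup document is authoritative, and the gperftools source tree plus the $HOME/.local install prefix are assumptions):

    # Build TCMalloc (part of gperftools) into a local prefix
    git clone https://github.com/gperftools/gperftools.git
    cd gperftools
    ./autogen.sh                        # requires autoconf, automake, libtool
    ./configure --prefix=$HOME/.local
    make -j && make install             # libtcmalloc.so lands in $HOME/.local/lib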

Model Specific Setup

  • Install Intel OpenMP

    pip install packaging intel-openmp accelerate
    
  • Set IOMP and TCMalloc preload for better performance

    export LD_PRELOAD="<path_to>/tcmalloc/lib/libtcmalloc.so":"<path_to_iomp>/lib/libiomp5.so":$LD_PRELOAD
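    # Optional hedged helper, not part of the original instructions: locate the
    # preload libraries instead of filling in <path_to> by hand. The locations
    # assumed below ($VIRTUAL_ENV for libiomp5.so from "pip install intel-openmp",
    # $HOME/.local for a locally built libtcmalloc.so) may differ on your system.
    IOMP_LIB=$(find "$VIRTUAL_ENV" -name "libiomp5.so" 2>/dev/null | head -n 1)
    TCMALLOC_LIB=$(find "$HOME/.local" -name "libtcmalloc.so" 2>/dev/null | head -n 1)
    export LD_PRELOAD="${TCMALLOC_LIB}:${IOMP_LIB}:${LD_PRELOAD}"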
    
  • Set the environment variable below to use fp16 AMX if your platform supports it

    export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16
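    # Optional hedged check, not part of the original instructions: keep the
    # AMX FP16 ISA override only if the CPU actually reports the amx_fp16 flag.
    if ! grep -q amx_fp16 /proc/cpuinfo; then
        unset DNNL_MAX_CPU_ISA
    fi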
    

Inference

  1. git clone https://github.com/IntelAI/models.git

  2. cd models/models_v2/pytorch/llama/inference/cpu

  3. Create virtual environment venv and activate it:

    python3 -m venv venv
    . ./venv/bin/activate
    
  4. Run setup.sh

    ./setup.sh
    
  5. Install the latest CPU versions of torch, torchvision and intel_extension_for_pytorch
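    One possible way to do this (a hedged sketch; the CPU wheel index URL and version matching are assumptions, so check the PyTorch and IPEX documentation for the combination that matches your setup):

    # Install CPU-only torch/torchvision wheels, then the IPEX CPU package
    pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
    pip install intel_extension_for_pytorch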

  6. Set INPUT_TOKEN before running the model

    export INPUT_TOKEN=32
    (choose from [32, 64, 128, 256, 512, 1024, 2016]; 32 and 2016 are the preferred benchmarking lengths)
    

    Set OUTPUT_TOKEN before running the model

    export OUTPUT_TOKEN=32
    (32 is preferred, but any other length can be set)
    

    Set FINETUNED_MODEL to LLaMA2 7B or LLaMA2 13B before running

    #Test llama2 7b
    export FINETUNED_MODEL="meta-llama/Llama-2-7b-hf"
    #Test llama2 13b
    export FINETUNED_MODEL="meta-llama/Llama-2-13b-hf"
    

    About BATCH_SIZE in the scripts

    Use BATCH_SIZE=1 for realtime mode.
    Use BATCH_SIZE=N for throughput mode (N can be tuned to the test host; by default N=1).
    

    About BEAM_SIZE in the scripts

    BEAM_SIZE=4 is used by default.
    
  • Run calibration to generate "qconfig.json" before running INT8.
    #optional: qconfig.json is provided in this repo; you can also run calibration yourself to regenerate it
    bash do_quantization.sh calibration sq #smooth quant is used by default

    #unzip qconfig.zip to get qconfig.json; if this uploaded qconfig.zip causes errors, regenerate it as above
    unzip qconfig.zip
    
  1. Set up the required environment parameters

| Parameter | export command |
|---|---|
| TEST_MODE (THROUGHPUT, ACCURACY, REALTIME) | export TEST_MODE=THROUGHPUT |
| OUTPUT_DIR | export OUTPUT_DIR=<path to an output directory> |
| FINETUNED_MODEL | Test LLaMA2 7B: export FINETUNED_MODEL="meta-llama/Llama-2-7b-hf"; Test LLaMA2 13B: export FINETUNED_MODEL="meta-llama/Llama-2-13b-hf" |
| PRECISION | export PRECISION=bf16 (fp32, bf32, bf16, fp16, int8-fp32, int8-bf16) |
| INPUT_TOKEN | export INPUT_TOKEN=32 (choose from [32, 64, 128, 256, 512, 1024, 2016]; 32 and 2016 are the preferred benchmarking lengths) |
| OUTPUT_TOKEN | export OUTPUT_TOKEN=32 (32 is preferred, but any other length can be set) |
| MODEL_DIR | export MODEL_DIR=$(pwd) |
| BATCH_SIZE (optional) | export BATCH_SIZE=256 |
| CORE_PER_INSTANCE (required for REALTIME) | export CORE_PER_INSTANCE=4 |
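As a worked example (illustrative only; the token lengths and batch size below are assumptions chosen for demonstration, and the launch command itself is not covered here), a BF16 throughput run of LLaMA2 7B with 1024 input and 128 output tokens could be configured as:

    export TEST_MODE=THROUGHPUT
    export OUTPUT_DIR=$(pwd)/output
    export FINETUNED_MODEL="meta-llama/Llama-2-7b-hf"
    export PRECISION=bf16
    export INPUT_TOKEN=1024
    export OUTPUT_TOKEN=128
    export MODEL_DIR=$(pwd)
    export BATCH_SIZE=256    # optional; omit to use the script default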

Output

Single-tile output will typically look like:

2024-05-17 22:35:31,097 - root - INFO - ---------- Summary: ----------
2024-05-17 22:35:31,097 - root - INFO - inference-latency: 18.211 sec.
2024-05-17 22:35:31,097 - root - INFO - first-token-latency: 4.227 sec.
2024-05-17 22:35:31,097 - root - INFO - rest-token-latency: 0.110 sec.
2024-05-17 22:35:31,097 - root - INFO - P90-rest-token-latency: 0.111 sec.
2024-05-17 22:35:36,648 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;total-latency;bf16;1; 18.179000
2024-05-17 22:35:36,655 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;first-token-latency;bf16;1; 4.238500
2024-05-17 22:35:36,664 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;rest-token-latency;bf16;1; 0.110000
2024-05-17 22:35:36,671 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;P90-rest-token-latency;bf16;1; 0.110500
2024-05-17 22:35:36,678 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;token_per_sec;bf16;1; 9.110
2024-05-17 22:35:36,686 - root - INFO - meta-llama/Llama-2-7b-hf;Input/Output Token;1024/128;latency;first_token_thp;bf16;1; 0.236

Final results of the inference run can be found in the results.yaml file.

results:
- key: first token throughput
  value: 15.648000
- key: rest token throughput
  value: 0.284250
- key: first token latency
  value: 4.238500
- key: rest_token_latency
  value: 0.110000
- key: accuracy
  value: 93.17

License

LICENSE