How to run
Changqing Li edited this page May 31, 2024
·
5 revisions
How to use hybrid data type with custom weight location:
- Change the data type to "bf16_fp16" in demo.py, which uses BF16 weights for the first token and FP16 weights for subsequent tokens.
- Set the environment variables "FIRST_TOKEN_WEIGHT_LOCATION" and "NEXT_TOKEN_WEIGHT_LOCATION", using a NUMA node ID as the value of each.
Example: FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=1 SINGLE_INSTANCE=1 OMP_NUM_THREADS=32 taskset -c 56-87 python demo.py
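The same launch can be scripted when several node combinations need to be tried. The following is a minimal sketch, not part of the project: the helper name `hybrid_launch` and its parameters are hypothetical, and it only assembles the environment variables and the `taskset` command shown in the example above. Whether the chosen NUMA node IDs exist on the machine is not checked here.

```python
import os


def hybrid_launch(first_node, next_node, threads, cores, script="demo.py"):
    """Build (env, argv) for a single-instance run with split weight placement.

    Hypothetical helper: it mirrors the shell example above and does not
    call into xFasterTransformer itself.
    """
    for node in (first_node, next_node):
        if not isinstance(node, int) or node < 0:
            raise ValueError(f"NUMA node ID must be a non-negative int, got {node!r}")

    # Copy the current environment and add the variables from the example.
    env = dict(os.environ)
    env.update({
        "FIRST_TOKEN_WEIGHT_LOCATION": str(first_node),  # NUMA node for BF16 weights
        "NEXT_TOKEN_WEIGHT_LOCATION": str(next_node),    # NUMA node for FP16 weights
        "SINGLE_INSTANCE": "1",
        "OMP_NUM_THREADS": str(threads),
    })

    # Pin the process to the given cores, as in the example command.
    argv = ["taskset", "-c", cores, "python", script]
    return env, argv


env, argv = hybrid_launch(first_node=0, next_node=1, threads=32, cores="56-87")
# To actually start the run:
#   import subprocess; subprocess.run(argv, env=env, check=True)
```

Returning the environment and argument list separately, rather than launching directly, makes it easy to print or log the command before running it.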
Performance on Intel(R) Xeon(R) CPU Max 9468 with the command:
FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=8 OMP_NUM_THREADS=8 numactl -C 0-7 -m 8 ./example /data/chatglm2-6b-cpu 3 1024
First token latency: Next token latency:
For distributed performance testing with mixed precision:
FIRST_TOKEN_WEIGHT_LOCATION=1 NEXT_TOKEN_WEIGHT_LOCATION=3 OMP_NUM_THREADS=20 mpirun \
-n 1 numactl -N 1 -m 3 python demo.py --dtype=bf16_fp16 --token_path /data/chatglm2-6b-hf/ --model_path /data/chatglm2-6b-cpu/ --streaming False : \
-n 1 numactl -N 1 -m 3 python demo.py --dtype=bf16_fp16 --token_path /data/chatglm2-6b-hf/ --model_path /data/chatglm2-6b-cpu/ --streaming False
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')