How to run

Hybrid data type with custom weight location

How to use hybrid data type with custom weight location:

Change the data type to "bf16_fp16" in demo.py, which uses BF16 weights for the first token and FP16 weights for subsequent tokens.
Set the environment variables "FIRST_TOKEN_WEIGHT_LOCATION" and "NEXT_TOKEN_WEIGHT_LOCATION", using a NUMA node ID as the value of each.

Example: FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=1 SINGLE_INSTANCE=1 OMP_NUM_THREADS=32 taskset -c 56-87 python demo.py
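When the run is launched from a script rather than typed by hand, the same settings can be assembled in Python. The sketch below is only an illustration: the helper names hybrid_env and pinned_command are hypothetical and not part of xFasterTransformer, and the node IDs, thread count and core range are the ones from the example above. Check the actual NUMA layout of the machine (for instance with numactl -H) before reusing them.

```python
import os
import shlex


def hybrid_env(first_node, next_node, threads, base=None):
    """Return a copy of the environment with the weight-location variables set.

    hybrid_env is a hypothetical helper; only the variable names come from
    the wiki text above.
    """
    if first_node < 0 or next_node < 0:
        raise ValueError("NUMA node IDs must be non-negative")
    env = dict(os.environ if base is None else base)
    env["FIRST_TOKEN_WEIGHT_LOCATION"] = str(first_node)  # node holding the BF16 weights
    env["NEXT_TOKEN_WEIGHT_LOCATION"] = str(next_node)    # node holding the FP16 weights
    env["OMP_NUM_THREADS"] = str(threads)
    return env


def pinned_command(cores, script="demo.py"):
    """Build the taskset-pinned command as an argument list (hypothetical helper)."""
    return ["taskset", "-c", cores, "python", script]


if __name__ == "__main__":
    env = hybrid_env(0, 1, 32)
    env["SINGLE_INSTANCE"] = "1"
    cmd = pinned_command("56-87")
    print(shlex.join(cmd))
    # To actually launch the demo, uncomment the next two lines:
    # import subprocess
    # subprocess.run(cmd, env=env, check=True)
```

Returning a new dictionary instead of mutating os.environ keeps the settings local to the launched process, which is convenient when several configurations are compared in one script.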

Performance on an Intel(R) Xeon(R) CPU Max 9468 with the command: FIRST_TOKEN_WEIGHT_LOCATION=0 NEXT_TOKEN_WEIGHT_LOCATION=8 OMP_NUM_THREADS=8 numactl -C 0-7 -m 8 ./example /data/chatglm2-6b-cpu 3 1024

First token latency: Next token latency:

For distributed performance testing with the hybrid data type:

FIRST_TOKEN_WEIGHT_LOCATION=1 NEXT_TOKEN_WEIGHT_LOCATION=3 OMP_NUM_THREADS=20 mpirun \
    -n 1 numactl -N 1 -m 3 python demo.py --dtype=bf16_fp16 --token_path /data/chatglm2-6b-hf/ --model_path /data/chatglm2-6b-cpu/ --streaming False : \
    -n 1 numactl -N 1 -m 3 python demo.py --dtype=bf16_fp16 --token_path /data/chatglm2-6b-hf/ --model_path /data/chatglm2-6b-cpu/ --streaming False
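The command above uses the colon-separated form of mpirun, with one segment per rank. For more than two ranks it can be easier to generate the line than to edit it by hand. The following is a minimal sketch under that assumption; mpirun_mpmd is a hypothetical helper, and the per-rank arguments are copied from the command above rather than recommended values.

```python
import shlex


def mpirun_mpmd(per_rank_commands):
    """Join one command per rank into a single mpirun argument list.

    Each rank becomes "-n 1 <command>", and segments are separated by ":".
    mpirun_mpmd is a hypothetical helper, not part of any MPI distribution.
    """
    if not per_rank_commands:
        raise ValueError("at least one rank command is required")
    argv = ["mpirun"]
    for index, command in enumerate(per_rank_commands):
        if index > 0:
            argv.append(":")  # segment separator between ranks
        argv += ["-n", "1", *command]
    return argv


if __name__ == "__main__":
    # Per-rank command taken from the wiki example; adjust -N / -m per rank as needed.
    rank = [
        "numactl", "-N", "1", "-m", "3",
        "python", "demo.py", "--dtype=bf16_fp16",
        "--token_path", "/data/chatglm2-6b-hf/",
        "--model_path", "/data/chatglm2-6b-cpu/",
        "--streaming", "False",
    ]
    print(shlex.join(mpirun_mpmd([rank, rank])))
```

The environment variables (FIRST_TOKEN_WEIGHT_LOCATION, NEXT_TOKEN_WEIGHT_LOCATION, OMP_NUM_THREADS) still have to be set for the mpirun process itself, as in the shell command above.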

export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
