diff --git a/.gitignore b/.gitignore index 8dbad9d..eabb560 100644 --- a/.gitignore +++ b/.gitignore @@ -2,4 +2,5 @@ docs/_build/ docs/build/ docs/**/*.mo -.vscode \ No newline at end of file +.vscode +.idea diff --git a/docs/source/benchmark/speed_benchmark.rst b/docs/source/benchmark/speed_benchmark.rst index 7555ff1..297cf71 100644 --- a/docs/source/benchmark/speed_benchmark.rst +++ b/docs/source/benchmark/speed_benchmark.rst @@ -1,455 +1,682 @@ -Speed Benchmark +Qwen2.5 Speed Benchmark ========================= -.. attention:: - To be updated for Qwen2.5. This section reports the speed performance of bf16 models, quantized models -(including GPTQ-Int4, GPTQ-Int8 and AWQ) of the Qwen2 series. Specifically, we +(including GPTQ-Int4, GPTQ-Int8 and AWQ) of the Qwen2.5 series. Specifically, we report the inference speed (tokens/s) as well as memory footprint (GB) under the conditions of different context lengths. The environment of the evaluation with huggingface transformers is: - NVIDIA A100 80GB -- CUDA 11.8 -- Pytorch 2.1.2+cu118 -- Flash Attention 2.3.3 -- Transformers 4.38.2 -- AutoGPTQ 0.7.1 -- AutoAWQ 0.2.4 +- CUDA 12.1 +- torch==2.3.1 +- flash_attn==2.5.8 +- transformers==4.46.0 +- auto_gptq==0.7.1+cu1210 (compiled from source) +- autoawq==0.2.6 + The environment of the evaluation with vLLM is: - NVIDIA A100 80GB -- CUDA 11.8 -- Pytorch 2.3.0+cu118 -- Flash Attention 2.5.6 -- Transformers 4.40.1 -- vLLM 0.4.2 +- CUDA 12.1 +- vllm==0.6.3 +- torch==2.4.0 +- flash_attn==2.6.3 +- transformers==4.46.0 + -Note: +Notes: - We use the batch size of 1 and the least number of GPUs as - possible for the evalution. + possible for the evaluation. - We test the speed and memory of generating 2048 tokens with the input lengths of 1, 6144, 14336, 30720, 63488, and 129024 - tokens (\>32k is only avaliable for Qwen2-72B-Instuct and Qwen2-7B-Instuct). + tokens. - For vLLM, the memory usage is not reported because it pre-allocates all GPU memory.
We use ``gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False`` by default. + - 0.5B (Transformer) -+---------------------+--------------+--------------+---------+-----------------+----------------+ -| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | -+=====================+==============+==============+=========+=================+================+ -| Qwen2-0.5B-Instruct | 1 | BF16 | 1 | 49.94 | 1.17 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 36.35 | 0.85 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 49.56 | 0.68 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 38.78 | 0.68 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 6144 | BF16 | 1 | 50.83 | 6.42 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 36.56 | 6.09 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 49.63 | 5.93 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 38.73 | 5.92 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 14336 | BF16 | 1 | 49.56 | 13.48 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 36.23 | 13.15 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 48.68 | 12.97 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 38.94 | 12.99 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 30720 | BF16 | 1 | 49.25 | 27.61 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 34.61 | 27.28 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 48.18 | 27.12 | -+ + 
+--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 38.19 | 27.11 | -+---------------------+--------------+--------------+---------+-----------------+----------------+ ++-------------------------+--------------+--------------+---------+-----------------+----------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | ++=========================+==============+==============+=========+=================+================+ +| Qwen2.5-0.5B-Instruct | 1 | BF16 | 1 | 47.40 | 0.97 | ++ + +--------------+---------+-----------------+----------------+ +| | | GPTQ-Int8 | 1 | 35.17 | 0.64 | ++ + +--------------+---------+-----------------+----------------+ +| | | GPTQ-Int4 | 1 | 50.60 | 0.48 | ++ + +--------------+---------+-----------------+----------------+ +| | | AWQ | 1 | 37.09 | 0.68 | ++ +--------------+--------------+---------+-----------------+----------------+ +| | 6144 | BF16 | 1 | 47.45 | 1.23 | ++ + +--------------+---------+-----------------+----------------+ +| | | GPTQ-Int8 | 1 | 36.47 | 0.90 | ++ + +--------------+---------+-----------------+----------------+ +| | | GPTQ-Int4 | 1 | 48.89 | 0.73 | ++ + +--------------+---------+-----------------+----------------+ +| | | AWQ | 1 | 37.04 | 0.72 | ++ +--------------+--------------+---------+-----------------+----------------+ +| | 14336 | BF16 | 1 | 47.11 | 1.60 | ++ + +--------------+---------+-----------------+----------------+ +| | | GPTQ-Int8 | 1 | 35.44 | 1.26 | ++ + +--------------+---------+-----------------+----------------+ +| | | GPTQ-Int4 | 1 | 48.26 | 1.10 | ++ + +--------------+---------+-----------------+----------------+ +| | | AWQ | 1 | 37.14 | 1.10 | ++ +--------------+--------------+---------+-----------------+----------------+ +| | 30720 | BF16 | 1 | 47.16 | 2.34 | ++ + +--------------+---------+-----------------+----------------+ +| | | GPTQ-Int8 | 1 | 36.25 | 2.01 | ++ + 
+--------------+---------+-----------------+----------------+ +| | | GPTQ-Int4 | 1 | 49.22 | 1.85 | ++ + +--------------+---------+-----------------+----------------+ +| | | AWQ | 1 | 36.90 | 1.84 | ++-------------------------+--------------+--------------+---------+-----------------+----------------+ -- 0.5B (vLLM) -+---------------------+--------------+--------------+---------+-----------------+ -| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | -+=====================+==============+==============+=========+=================+ -| Qwen2-0.5B-Instruct | 1 | BF16 | 1 | 270.49 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 235.95 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 240.07 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 233.31 | -+ +--------------+--------------+---------+-----------------+ -| | 6144 | BF16 | 1 | 256.16 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 224.30 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 226.41 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 222.83 | -+ +--------------+--------------+---------+-----------------+ -| | 14336 | BF16 | 1 | 108.89 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 108.10 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 106.51 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 104.16 | -+ +--------------+--------------+---------+-----------------+ -| | 30720 | BF16 | 1 | 97.20 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 94.49 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 93.94 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 92.23 | -+---------------------+--------------+--------------+---------+-----------------+ +- 0.5B (vLLM) -- 1.5B (Transformer) 
++-------------------------+--------------+--------------+---------+-----------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | ++=========================+==============+==============+=========+=================+ +| Qwen2.5-0.5B-Instruct | 1 | BF16 | 1 | 311.55 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 257.07 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int4 | 1 | 260.93 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 261.95 | ++ +--------------+--------------+---------+-----------------+ +| | 6144 | BF16 | 1 | 304.79 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 254.10 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int4 | 1 | 257.33 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 259.80 | ++ +--------------+--------------+---------+-----------------+ +| | 14336 | BF16 | 1 | 290.28 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 243.69 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int4 | 1 | 247.01 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 249.58 | ++ +--------------+--------------+---------+-----------------+ +| | 30720 | BF16 | 1 | 264.51 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 223.86 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int4 | 1 | 226.50 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 229.84 | ++-------------------------+--------------+--------------+---------+-----------------+ -+---------------------+--------------+--------------+---------+-----------------+----------------+ -| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | -+=====================+==============+==============+=========+=================+================+ -| Qwen2-1.5B-Instruct | 1 | BF16 | 1 | 40.89 | 3.44 | -+ + 
+--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 31.51 | 2.31 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 42.47 | 1.67 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 33.62 | 1.64 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 6144 | BF16 | 1 | 40.86 | 8.74 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 31.31 | 7.59 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 42.78 | 6.95 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 32.90 | 6.92 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 14336 | BF16 | 1 | 40.08 | 15.92 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 31.19 | 14.79 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 42.25 | 14.14 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 33.24 | 14.12 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 30720 | BF16 | 1 | 34.09 | 30.31 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 28.52 | 29.18 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 31.30 | 28.54 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 32.16 | 28.51 | -+---------------------+--------------+--------------+---------+-----------------+----------------+ -- 1.5B (vLLM) -+---------------------+--------------+--------------+---------+-----------------+ -| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | -+=====================+==============+==============+=========+=================+ -| Qwen2-1.5B-Instruct | 1 | BF16 | 1 | 175.55 | -+ + 
+--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 172.28 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 184.58 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 170.87 | -+ +--------------+--------------+---------+-----------------+ -| | 6144 | BF16 | 1 | 166.23 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 164.32 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 174.04 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 162.81 | -+ +--------------+--------------+---------+-----------------+ -| | 14336 | BF16 | 1 | 83.67 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 98.63 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 97.65 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 92.48 | -+ +--------------+--------------+---------+-----------------+ -| | 30720 | BF16 | 1 | 77.69 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 86.42 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 87.49 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 82.88 | -+---------------------+--------------+--------------+---------+-----------------+ +- 1.5B (Transformer) ++--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note | ++==========================+==============+==============+=========+=================+================+=========================+ +| Qwen2.5-1.5B-Instruct | 1 | BF16 | 1 | 39.68 | 2.95 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 32.62 | 1.82 | auto_gptq==0.6.0+cu1210 | ++ + 
+--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 43.33 | 1.18 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 31.70 | 1.51 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 6144 | BF16 | 1 | 40.88 | 3.43 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 31.46 | 2.30 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 43.96 | 1.66 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 32.30 | 1.63 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 14336 | BF16 | 1 | 40.43 | 4.16 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 31.06 | 3.03 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 43.66 | 2.39 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 32.39 | 2.36 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 30720 | BF16 | 1 | 38.59 | 5.62 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 31.04 | 4.49 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 35.68 | 3.85 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 31.95 | 3.82 | | 
++--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+ -- 7B (Transformer) -+-------------------+--------------+--------------+---------+-----------------+----------------+ -| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | -+===================+==============+==============+=========+=================+================+ -| Qwen2-7B-Instruct | 1 | BF16 | 1 | 37.97 | 14.92 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 30.85 | 8.97 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 36.17 | 6.06 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 33.08 | 5.93 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 6144 | BF16 | 1 | 34.74 | 20.26 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 31.13 | 14.31 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 33.34 | 11.40 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 30.86 | 11.27 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 14336 | BF16 | 1 | 26.63 | 27.71 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 24.58 | 21.76 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 25.81 | 18.86 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 27.61 | 18.72 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 30720 | BF16 | 1 | 17.49 | 42.62 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 1 | 16.69 | 36.67 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 17.17 | 33.76 | -+ + 
+--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 17.87 | 33.63 | -+-------------------+--------------+--------------+---------+-----------------+----------------+ +- 1.5B (vLLM) ++--------------------------+--------------+--------------+---------+-----------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | ++==========================+==============+==============+=========+=================+ +| Qwen2.5-1.5B-Instruct | 1 | BF16 | 1 | 183.33 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 201.67 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int4 | 1 | 217.03 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 213.74 | ++ +--------------+--------------+---------+-----------------+ +| | 6144 | BF16 | 1 | 176.68 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 192.83 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int4 | 1 | 206.63 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 203.64 | ++ +--------------+--------------+---------+-----------------+ +| | 14336 | BF16 | 1 | 168.69 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 183.69 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int4 | 1 | 195.88 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 192.64 | ++ +--------------+--------------+---------+-----------------+ +| | 30720 | BF16 | 1 | 152.04 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 162.82 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int4 | 1 | 173.57 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 170.20 | ++--------------------------+--------------+--------------+---------+-----------------+ -- 7B (vLLM) -+-------------------+--------------+--------------+---------+-----------------+ -| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) 
| -+===================+==============+==============+=========+=================+ -| Qwen2-7B-Instruct | 1 | BF16 | 1 | 80.45 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 114.32 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 143.40 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 96.65 | -+ +--------------+--------------+---------+-----------------+ -| | 6144 | BF16 | 1 | 76.41 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 107.02 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 131.55 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 91.38 | -+ +--------------+--------------+---------+-----------------+ -| | 14336 | BF16 | 1 | 66.54 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 89.72 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 97.93 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 76.87 | -+ +--------------+--------------+---------+-----------------+ -| | 30720 | BF16 | 1 | 55.83 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 71.58 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 81.48 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 63.62 | -+ +--------------+--------------+---------+-----------------+ -| | 63488 | BF16 | 1 | 41.20 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 49.37 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 54.12 | -+ + +--------------+---------+-----------------+ -| | | AWQ | 1 | 45.89 | -+ +--------------+--------------+---------+-----------------+ -| | 129024 | BF16 | 1 | 25.01 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int8 | 1 | 27.73 | -+ + +--------------+---------+-----------------+ -| | | GPTQ-Int4 | 1 | 29.39 | -+ + 
+--------------+---------+-----------------+ -| | | AWQ | 1 | 27.13 | -+-------------------+--------------+--------------+---------+-----------------+ - - -- 57B-A14B (Transformer) - -+--------------------------+--------------+--------------+---------+-----------------+----------------+ -| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | -+==========================+==============+==============+=========+=================+================+ -| Qwen2-57B-A14B-Instruct | 1 | BF16 | 2 | 4.76 | 110.29 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 5.55 | 30.38 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 6144 | BF16 | 2 | 4.90 | 117.80 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 5.44 | 35.67 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 14336 | BF16 | 2 | 4.58 | 128.17 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 5.31 | 43.11 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 30720 | BF16 | 2 | 4.12 | 163.77 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 4.72 | 58.01 | -+--------------------------+--------------+--------------+---------+-----------------+----------------+ - -- 57B-A14B (vLLM) + +- 3B (Transformer) + ++--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note | ++==========================+==============+==============+=========+=================+================+=========================+ +| Qwen2.5-3B-Instruct | 1 | BF16 | 1 | 30.80 | 5.95 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 
25.69 | 3.38 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 35.21 | 2.06 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 25.29 | 2.50 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 6144 | BF16 | 1 | 32.20 | 6.59 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 24.69 | 3.98 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 34.47 | 2.67 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 24.86 | 2.62 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 14336 | BF16 | 1 | 31.72 | 7.47 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 24.70 | 4.89 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 34.36 | 3.58 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 25.19 | 3.54 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 30720 | BF16 | 1 | 25.37 | 9.30 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 21.67 | 6.72 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 23.60 | 5.41 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 24.56 | 5.37 | | 
++--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+ + + +- 3B (vLLM) +--------------------------+--------------+--------------+---------+-----------------+ | Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | +==========================+==============+==============+=========+=================+ -| Qwen2-57B-A14B-Instruct | 1 | BF16 | 2 | 31.44 | -+--------------------------+--------------+--------------+---------+-----------------+ -| | 6144 | BF16 | 2 | 31.77 | -+--------------------------+--------------+--------------+---------+-----------------+ -| | 14336 | BF16 | 2 | 21.25 | -+--------------------------+--------------+--------------+---------+-----------------+ -| | 30720 | BF16 | 2 | 20.24 | +| Qwen2.5-3B-Instruct | 1 | BF16 | 1 | 127.61 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 150.02 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int4 | 1 | 168.20 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 165.50 | ++ +--------------+--------------+---------+-----------------+ +| | 6144 | BF16 | 1 | 123.15 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 143.09 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int4 | 1 | 159.85 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 156.38 | ++ +--------------+--------------+---------+-----------------+ +| | 14336 | BF16 | 1 | 117.35 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 135.50 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int4 | 1 | 149.35 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 147.75 | ++ +--------------+--------------+---------+-----------------+ +| | 30720 | BF16 | 1 | 105.88 | ++ + +--------------+---------+-----------------+ +| | | GPTQ-Int8 | 1 | 118.38 | ++ + +--------------+---------+-----------------+ +| | | 
GPTQ-Int4 | 1 | 129.28 | ++ + +--------------+---------+-----------------+ +| | | AWQ | 1 | 127.19 | +--------------------------+--------------+--------------+---------+-----------------+ -Note: Compared with dense models, MOE models have larger throughput when batch size is large, which is shown as follows: -+--------------------------+--------------+-------------+------+----------+ -| Model | Quantization | # Prompts | QPS | Tokens/s | -+==========================+==============+=============+======+==========+ -| Qwen1.5-32B-Chat | BF16 | 100 | 6.68 | 7343.56 | -+--------------------------+--------------+-------------+------+----------+ -| Qwen2-57B-A14B-Instruct | BF16 | 100 | 4.81 | 5291.15 | -+--------------------------+--------------+-------------+------+----------+ -| Qwen1.5-32B-Chat | BF16 | 1000 | 7.99 | 8791.35 | -+--------------------------+--------------+-------------+------+----------+ -| Qwen2-57B-A14B-Instruct | BF16 | 1000 | 5.18 | 5698.37 | -+--------------------------+--------------+-------------+------+----------+ -The results are obtained from vLLM throughput benchmarking scripts, which can be reproduced by: +- 7B (Transformer) + ++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note | ++=============================+==============+==============+=========+=================+================+=========================+ +| Qwen2.5-7B-Instruct | 1 | BF16 | 1 | 40.38 | 14.38 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 31.55 | 8.42 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 43.10 | 5.52 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 32.03 | 
5.39 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 6144 | BF16 | 1 | 38.76 | 15.38 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 31.26 | 9.43 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 38.27 | 6.52 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 32.37 | 6.39 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 14336 | BF16 | 1 | 29.78 | 16.91 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 26.86 | 10.96 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 28.70 | 8.05 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 30.23 | 7.92 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 30720 | BF16 | 1 | 18.83 | 19.97 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 17.59 | 14.01 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 18.45 | 11.11 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 19.11 | 10.98 | | ++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+ + + + +- 7B (vLLM) + 
++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note | ++=============================+==============+==============+=========+=================+================+===========================================+ +| Qwen2.5-7B-Instruct | 1 | BF16 | 1 | 84.28 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 122.01 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 154.05 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 148.10 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 6144 | BF16 | 1 | 80.70 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 112.38 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 141.98 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 137.64 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 14336 | BF16 | 1 | 77.69 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 105.25 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 129.35 | | | ++ + 
+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 124.91 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 30720 | BF16 | 1 | 70.33 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 90.71 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 108.30 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 104.66 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 63488 | BF16 | 1 | 50.86 | | setting-64k | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 60.52 | | setting-64k | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 67.97 | | setting-64k | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 66.42 | | setting-64k | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 129024 | BF16 | 1 | 28.94 | | vllm==0.6.2, new sample config | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 25.97 | | vllm==0.6.2, new sample config | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 26.37 | | vllm==0.6.2, new sample config | ++ + 
+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 26.57 | | vllm==0.6.2, new sample config | ++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ + * [Setting-64k]=(gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False) + +- 14B (Transformer) + ++--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note | ++==========================+==============+==============+=========+=================+================+=========================+ +| Qwen2.5-14B-Instruct | 1 | BF16 | 1 | 24.74 | 28.08 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 18.84 | 16.11 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 25.89 | 9.94 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 19.23 | 9.79 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 6144 | BF16 | 1 | 20.51 | 29.50 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 17.80 | 17.61 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 20.06 | 11.36 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 19.21 | 11.22 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 14336 | BF16 | 1 | 13.92 | 31.95 | | ++ + 
+--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 12.66 | 19.98 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 13.79 | 13.81 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 14.17 | 13.67 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------+ +| | 30720 | BF16 | 1 | 8.20 | 36.85 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int8 | 1 | 7.77 | 24.88 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | GPTQ-Int4 | 1 | 8.14 | 18.71 | | ++ + +--------------+---------+-----------------+----------------+-------------------------+ +| | | AWQ | 1 | 8.31 | 18.57 | | ++--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+ + + +- 14B (vLLM) + ++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note | ++=============================+==============+==============+=========+=================+================+===========================================+ +| Qwen2.5-14B-Instruct | 1 | BF16 | 1 | 46.30 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 70.40 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 98.02 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 |
92.66 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 6144 | BF16 | 1 | 43.83 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 64.33 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 86.10 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 83.11 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 14336 | BF16 | 1 | 41.91 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 59.21 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 76.85 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 74.03 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 30720 | BF16 | 1 | 37.18 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 49.23 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 60.91 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 59.01 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 63488 | BF16 | 1 | 26.85 | | setting-64k | ++ + 
+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 32.83 | | setting-64k | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 37.67 | | setting-64k | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 36.71 | | setting-64k | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 129024 | BF16 | 1 | 14.53 | | vllm==0.6.2, new sample config | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 15.10 | | vllm==0.6.2, new sample config | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 15.13 | | vllm==0.6.2, new sample config | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 15.25 | | vllm==0.6.2, new sample config | ++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ + * [Setting-64k]=(gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False) + + + +- 32B (Transformer) + ++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note | ++=============================+==============+==============+=========+=================+================+===========================================+ +| Qwen2.5-32B-Instruct | 1 | BF16 | 1 | 17.54 | 61.58 | | ++ + 
+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 14.52 | 33.56 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 19.20 | 18.94 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 14.60 | 18.67 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 6144 | BF16 | 1 | 12.49 | 63.72 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 11.61 | 35.86 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 13.42 | 21.09 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 13.81 | 20.81 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 14336 | BF16 | 1 | 8.95 | 67.31 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 8.53 | 39.28 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 9.48 | 24.67 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 9.71 | 24.39 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 30720 | BF16 | 1 | 5.59 | 74.47 | | ++ + 
+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 5.42 | 46.45 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 5.79 | 31.84 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 5.85 | 31.56 | | ++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ + + + + + +- 32B (vLLM) + ++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note | ++=============================+==============+==============+=========+=================+================+===========================================+ +| Qwen2.5-32B-Instruct | 1 | BF16 | 1 | 22.13 | | setting1 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 37.57 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 55.83 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 51.92 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 6144 | BF16 | 1 | 21.05 | | setting1 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 34.67 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 49.96 | | | ++ + 
+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 46.68 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 14336 | BF16 | 1 | 19.91 | | setting1 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 31.89 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 44.79 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 41.83 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 30720 | BF16 | 2 | 31.82 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 26.88 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 35.66 | | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 33.75 | | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 63488 | BF16 | 2 | 24.45 | | setting-64k | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 18.60 | | setting-64k | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 22.72 | | setting-64k | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 21.79 | | setting-64k | ++ 
+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 129024 | BF16 | 2 | 14.31 | | vllm==0.6.2, new sample config | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 1 | 9.77 | | vllm==0.6.2, new sample config | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 10.39 | | vllm==0.6.2, new sample config | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 2 | 10.34 | | vllm==0.6.2, new sample config | ++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ + + * For context length 129024, the model must be run with ``"model_max_length"=131072`` in its config. + * [Default Setting]=(gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False) + * [Setting 1]=(gpu_memory_utilization=1.0 max_model_len=32768 enforce_eager=True) + * [Setting-64k]=(gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False) + -``python vllm/benchmarks/benchmark_throughput.py --input-len 1000 --output-len 100 --model --num-prompts --enforce-eager -tp 2`` - 72B (Transformer) -+--------------------+--------------+--------------+---------+-----------------+----------------+ -| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | -+====================+==============+==============+=========+=================+================+ -| Qwen2-72B-Instruct | 1 | BF16 | 2 | 7.45 | 134.74 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 2 | 7.30 | 71.00 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 9.05 | 41.80 | -+ +
+--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 9.96 | 41.31 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 6144 | BF16 | 2 | 5.99 | 144.38 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 2 | 5.93 | 80.60 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 6.79 | 47.90 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 7.49 | 47.42 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 14336 | BF16 | 3 | 4.12 | 169.93 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 2 | 4.43 | 95.14 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 4.87 | 57.79 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 1 | 5.23 | 57.30 | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 30720 | BF16 | 3 | 2.86 | 209.03 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 2 | 2.83 | 124.20 | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 2 | 3.02 | 107.94 | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 2 | 1.85 | 88.60 | -+--------------------+--------------+--------------+---------+-----------------+----------------+ ++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note | ++=============================+==============+==============+=========+=================+================+===========================================+ +| Qwen2.5-72B-Instruct | 1 | BF16 | 2 | 8.73 | 136.20 | | ++ + 
+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 2 | 8.66 | 72.61 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 11.07 | 39.91 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 11.50 | 39.44 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 6144 | BF16 | 2 | 6.39 | 140.00 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 2 | 6.39 | 77.81 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 7.56 | 42.50 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 8.17 | 42.13 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 14336 | BF16 | 3 | 4.25 | 149.14 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 2 | 4.66 | 82.55 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 5.27 | 46.86 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 1 | 5.57 | 46.38 | | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 30720 | BF16 | 3 | 2.94 | 164.79 | | ++ + 
+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 2 | 2.94 | 94.75 | auto_gptq==0.6.0+cu1210 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 2 | 3.14 | 62.57 | | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 2 | 3.23 | 61.64 | | ++-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ + + - 72B (vLLM) -+--------------------+--------------+--------------+---------+-----------------+----------------+ -| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | Setting | -+====================+==============+==============+=========+=================+================+ -| Qwen2-72B-Instruct | 1 | BF16 | 2 | 17.68 | [Setting 1] | -+ + +--------------+---------+-----------------+----------------+ -| | | BF16 | 4 | 30.01 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 2 | 27.56 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 29.60 | [Setting 2] | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 2 | 42.82 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 2 | 27.73 | - | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 6144 | BF16 | 4 | 27.98 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 2 | 25.46 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 1 | 25.16 | [Setting 3] | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 2 | 38.23 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 2 | 
25.77 | - | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 14336 | BF16 | 4 | 21.81 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 2 | 22.71 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 2 | 26.54 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 2 | 21.50 | - | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 30720 | BF16 | 4 | 19.43 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 2 | 18.69 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 2 | 23.12 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 2 | 18.09 | - | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 30720 | BF16 | 4 | 19.43 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 2 | 18.69 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 2 | 23.12 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 2 | 18.09 | - | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 63488 | BF16 | 4 | 17.46 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 2 | 15.30 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 2 | 13.23 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | AWQ | 2 | 13.14 | - | -+ +--------------+--------------+---------+-----------------+----------------+ -| | 129024 | BF16 | 4 | 11.70 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int8 | 4 | 12.94 | - | -+ + +--------------+---------+-----------------+----------------+ -| | | GPTQ-Int4 | 2 | 8.33 | - | -+ 
+ +--------------+---------+-----------------+----------------+ -| | | AWQ | 2 | 7.78 | - | -+--------------------+--------------+--------------+---------+-----------------+----------------+ ++------------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note | ++==============================+==============+==============+=========+=================+================+===========================================+ +| Qwen2.5-72B-Instruct | 1 | BF16 | 2 | 18.19 | | Setting 1 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | BF16 | 4 | 31.37 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 2 | 31.40 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 16.47 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 2 | 46.30 | | Setting 2 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 2 | 44.30 | | Default | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 6144 | BF16 | 4 | 29.90 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 2 | 29.37 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 1 | 13.88 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ 
+| | | GPTQ-Int4 | 2 | 42.50 | | Setting 3 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 2 | 40.67 | | Default | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 14336 | BF16 | 4 | 30.10 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 2 | 27.20 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 2 | 38.10 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 2 | 36.63 | | Default | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 30720 | BF16 | 4 | 27.53 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 2 | 23.32 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 2 | 30.98 | | Default | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 2 | 30.02 | | Default | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 63488 | BF16 | 4 | 20.74 | | Setting 4 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 2 | 16.27 | | Setting 4 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 2 | 19.84 | | Setting 4 | ++ + 
+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 2 | 19.32 | | Setting 4 | ++ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ +| | 129024 | BF16 | 4 | 12.68 | | Setting 5 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int8 | 4 | 14.11 | | Setting 5 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | GPTQ-Int4 | 2 | 10.11 | | Setting 5 | ++ + +--------------+---------+-----------------+----------------+-------------------------------------------+ +| | | AWQ | 2 | 9.88 | | Setting 5 | ++------------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+ * [Default Setting]=(gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False) * [Setting 1]=(gpu_memory_utilization=0.98 max_model_len=4096 enforce_eager=True) * [Setting 2]=(gpu_memory_utilization=1.0 max_model_len=4096 enforce_eager=True) * [Setting 3]=(gpu_memory_utilization=1.0 max_model_len=8192 enforce_eager=True) + * [Setting 4]=(gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False) + * [Setting 5]=(gpu_memory_utilization=0.9 max_model_len=131072 enforce_eager=False)
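
The named settings in the footnotes map directly onto arguments of the ``vllm.LLM`` constructor: ``gpu_memory_utilization``, ``max_model_len``, ``enforce_eager``, and ``tensor_parallel_size`` are all real vLLM engine parameters. The sketch below collects the 72B footnote values (settings differ per model size; other sections define their own) into one lookup; the dictionary keys, helper function, and example model name are illustrative and not part of the benchmark scripts:

```python
# Sketch only: vLLM engine kwargs for the named 72B benchmark settings above.
# The values are copied from the footnotes; the names and helper are ours.
VLLM_SETTINGS = {
    "default":   dict(gpu_memory_utilization=0.9,  max_model_len=32768,  enforce_eager=False),
    "setting-1": dict(gpu_memory_utilization=0.98, max_model_len=4096,   enforce_eager=True),
    "setting-2": dict(gpu_memory_utilization=1.0,  max_model_len=4096,   enforce_eager=True),
    "setting-3": dict(gpu_memory_utilization=1.0,  max_model_len=8192,   enforce_eager=True),
    "setting-4": dict(gpu_memory_utilization=0.9,  max_model_len=65536,  enforce_eager=False),
    "setting-5": dict(gpu_memory_utilization=0.9,  max_model_len=131072, enforce_eager=False),
}

def engine_kwargs(setting: str, tensor_parallel_size: int = 1) -> dict:
    """Combine a named setting with the GPU count from the tables above."""
    kwargs = dict(VLLM_SETTINGS[setting])
    kwargs["tensor_parallel_size"] = tensor_parallel_size
    return kwargs

# Usage on a GPU machine with vllm installed (kept commented out here):
# from vllm import LLM
# llm = LLM(model="Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
#           **engine_kwargs("setting-4", tensor_parallel_size=2))
```

``enforce_eager=True`` disables CUDA graph capture, trading some decoding speed for lower GPU memory overhead; the long-context settings keep it ``False`` and instead raise ``max_model_len``.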