Aphrodite Benchmark Tests in two 2x3090 rigs #147
-
Thanks for the benchmarks. As pointed out in #115, Ray pins all processes to only 2 cores. I recommend running the command gargamel provided once the engine has launched properly. As for single GPU, please start the engine with . If you're willing to get your hands dirty, you can try the dev branch. We have some improvements there, so it should be even faster than main now.
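The exact affinity-fix command from #115 isn't quoted here. As a rough sketch only (the core list and the process-name patterns are assumptions, adjust for your machine), re-widening the CPU affinity of the Ray workers after launch could look like:
# Hypothetical sketch: widen CPU affinity for Ray/Aphrodite worker processes after the engine is up.
# The core list 0-31 and the pgrep patterns are assumptions; adjust for your CPU and process names.
for pid in $(pgrep -f "ray::|aphrodite"); do
    taskset -a -cp 0-31 "$pid"
done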
-
@AlpinDale thanks for the heads up on that bug. Benchmark tests using SLI:
branch=dev
7B=OpenHermes-2.5-Mistral-7B
13B=LLaMA2-13B-Tiefighter, MythoMax-L2-13B-GPTQ
20B=Nethena-20B,MLewd-ReMM-L2-Chat-20B-GPTQ
70B=lzlv_70B-GPTQ
# Example Commands
CUDA_VISIBLE_DEVICES=1,2 ./runtime.sh python ./tests/throughput.py --model ../models/Nethena-20B --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 -tp 2
CUDA_VISIBLE_DEVICES=1,2 ./runtime.sh python ./tests/latency.py --model ../models/Nethena-20B --input-len 512 --output-len 200 -tp 2
# SLI NVLINK speed
# https://asia.evga.com/products/product.aspx?pn=100-2W-0130-LR
Link 0: 14.062 GB/s
Link 1: 14.062 GB/s
Link 2: 14.062 GB/s
Link 3: 14.062 GB/s
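For reference, per-link speeds like the ones listed above can be queried directly from the driver (GPU index 0 is just an example; change -i for the other card):
# Query per-link NVLink speed for GPU 0
nvidia-smi nvlink --status -i 0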
-
When I try this there is an extremely long delay, maybe 15 minutes, between "INFO: Graph capturing finished in 466 secs." and "Processed prompts: " before it starts printing progress on the first prompt. If I run it with more than 10 prompts it takes forever, and even with just 10 prompts it is very slow. I am doing this on a Tesla P40. It seems much slower than what I get when I run the API server and send it requests over the port. Any idea why this happens? Here's what the session looks like:
Also, it says "Minimum concurrency: 14.13x"; should that be maximum?
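One thing worth ruling out, as a guess rather than a confirmed fix: CUDA graph capture itself. If the benchmark script passes engine flags through, running in eager mode should show whether the delay is graph-related. The flag name and model path below are assumptions based on the commands earlier in this thread, not confirmed for this script or version:
# Hypothetical check: run the same benchmark in eager mode to isolate graph capture as the cause
# (--enforce-eager and the model path are assumptions)
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python ./tests/throughput.py --model ../models/your-model --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --enforce-eager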
-
I'm testing on 3x3090 with tp 1 pp 3. I sent 100 requests at once, each with roughly 2048 input tokens, using Marlin AWQ Qwen 2.5 32B with fp8 KV cache, and got an average of 900 tok/s of output generation. No NVLink, PCIe 4.0 at 4x on an ASUS Prime motherboard. Very nice throughput; prompt ingestion took on the order of 1 minute for those 100 requests.
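The exact launch command isn't given; below is a hedged reconstruction in the style of the other benchmark commands in this thread. The -pp, --kv-cache-dtype, and -q flag names and the model path are assumptions and may differ by Aphrodite version:
# Hypothetical reconstruction of the 3x3090 run described above; flags and model path are assumed, not confirmed
CUDA_VISIBLE_DEVICES=0,1,2 ./runtime.sh python ./tests/throughput.py --model ../models/Qwen2.5-32B-Instruct-AWQ -q awq_marlin -tp 1 -pp 3 --kv-cache-dtype fp8 --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100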
-
Aphrodite Benchmark Tests in two 2x3090 rigs
Throughput with 100 PROMPTS | Latency with 4096 tokens input-len and 200 tokens output-len
Running commit: d4ff350
Conclusion
# Commands run in rig1
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.throughput --model ../models/lzlv_70B-GPTQ -q gptq -tp 2 --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.latency --model ../models/lzlv_70B-GPTQ -q gptq -tp 2 --input-len 512 --output-len 200
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.throughput --model ../models/MLewd-ReMM-L2-Chat-20B-GPTQ -q gptq --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.latency --model ../models/MLewd-ReMM-L2-Chat-20B-GPTQ -q gptq --input-len 512 --output-len 200
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.throughput --model ../models/MythoMax-L2-13B-GPTQ -q gptq --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.latency --model ../models/MythoMax-L2-13B-GPTQ -q gptq --input-len 512 --output-len 200
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.throughput --model ../models/LLaMA2-13B-Tiefighter -tp 2 --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.latency --model ../models/LLaMA2-13B-Tiefighter -tp 2 --input-len 512 --output-len 200
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.throughput --model ../models/OpenHermes-2.5-Mistral-7B --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.latency --model ../models/OpenHermes-2.5-Mistral-7B --input-len 4096 --output-len 200
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.throughput --model ../models/Nethena-20B -tp 2 --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.latency --model ../models/Nethena-20B -tp 2 --input-len 512 --output-len 200
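Since each model gets the same throughput/latency pair, a small wrapper keeps the runs consistent. This is only a convenience sketch; it mirrors the single-GPU GPTQ commands above:
# Optional wrapper: run the same throughput + latency pair for each single-GPU GPTQ model (mirrors the rig1 commands above)
for MODEL in MLewd-ReMM-L2-Chat-20B-GPTQ MythoMax-L2-13B-GPTQ; do
    CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.throughput --model ../models/$MODEL -q gptq --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
    CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.latency --model ../models/$MODEL -q gptq --input-len 512 --output-len 200
done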