Aphrodite Benchmark Tests in two 2x3090 rigs #147
-
Thanks for the benchmarks. As pointed out in #115, Ray pins all processes to only 2 cores. I recommend running the command gargamel provided once the engine has launched properly. As for single GPU, please start the engine with . If you're willing to get your hands dirty, you can try the dev branch. We have some improvements there, so it should be even faster than main now.
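The exact affinity-fix command from #115 isn't quoted here. As a rough sketch only (the core list and the process-name patterns are assumptions, adjust for your machine), re-widening the CPU affinity of the Ray workers after launch could look like:
# Hypothetical sketch: widen CPU affinity for Ray/Aphrodite worker processes after the engine is up.
# The core list 0-31 and the pgrep patterns are assumptions; adjust for your CPU and process names.
for pid in $(pgrep -f "ray::|aphrodite"); do
    taskset -a -cp 0-31 "$pid"
done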
-
@AlpinDale thanks for the heads up on that bug. Benchmark tests using SLI:
branch=dev
7B=OpenHermes-2.5-Mistral-7B
13B=LLaMA2-13B-Tiefighter, MythoMax-L2-13B-GPTQ
20B=Nethena-20B,MLewd-ReMM-L2-Chat-20B-GPTQ
70B=lzlv_70B-GPTQ
# Example Commands
CUDA_VISIBLE_DEVICES=1,2 ./runtime.sh python ./tests/throughput.py --model ../models/Nethena-20B --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 -tp 2
CUDA_VISIBLE_DEVICES=1,2 ./runtime.sh python ./tests/latency.py --model ../models/Nethena-20B --input-len 512 --output-len 200 -tp 2
# SLI NVLINK speed
# https://asia.evga.com/products/product.aspx?pn=100-2W-0130-LR
Link 0: 14.062 GB/s
Link 1: 14.062 GB/s
Link 2: 14.062 GB/s
Link 3: 14.062 GB/s
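For reference, per-link speeds like the ones listed above can be queried directly from the driver (GPU index 0 is just an example; change -i for the other card):
# Query per-link NVLink speed for GPU 0
nvidia-smi nvlink --status -i 0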
-
When I try this there is an extremely long delay, maybe 15 minutes, between "INFO: Graph capturing finished in 466 secs." and "Processed prompts: " before it starts printing progress on the first prompt. If I run it with more than 10 prompts it takes forever, and even with just 10 prompts it is very slow. I am doing this on a Tesla P40. It seems much slower than what I get when I run the API server and send it requests over the port. Any idea why this happens? Here's what the session looks like:
Also, it says "Minimum concurrency: 14.13x"; should that be maximum?
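One thing worth ruling out, as a guess rather than a confirmed fix: CUDA graph capture itself. If the benchmark script passes engine flags through, running in eager mode should show whether the delay is graph-related. The flag name and model path below are assumptions based on the commands earlier in this thread, not confirmed for this script or version:
# Hypothetical check: run the same benchmark in eager mode to isolate graph capture as the cause
# (--enforce-eager and the model path are assumptions)
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python ./tests/throughput.py --model ../models/your-model --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --enforce-eager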
-
I'm testing on 3x3090 with tp 1 pp 3. I sent 100 requests at once, each with roughly 2048 input tokens, using Marlin AWQ Qwen 2.5 32B with fp8 KV cache, and got an average of 900 tok/s of output generation. No NVLink, PCIe 4.0 at 4x on an ASUS Prime motherboard. Very nice throughput; prompt ingestion took on the order of 1 minute for those 100 requests.
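The exact launch command isn't given; below is a hedged reconstruction in the style of the other benchmark commands in this thread. The -pp, --kv-cache-dtype, and -q flag names and the model path are assumptions and may differ by Aphrodite version:
# Hypothetical reconstruction of the 3x3090 run described above; flags and model path are assumed, not confirmed
CUDA_VISIBLE_DEVICES=0,1,2 ./runtime.sh python ./tests/throughput.py --model ../models/Qwen2.5-32B-Instruct-AWQ -q awq_marlin -tp 1 -pp 3 --kv-cache-dtype fp8 --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100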
-
Aphrodite Benchmark Tests in two 2x3090 rigs
Throughput with 100 PROMPTS | Latency with 4096 tokens input-len and 200 tokens output-len
Running commit: d4ff350
Conclusion
# Commands run in rig1
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.throughput --model ../models/lzlv_70B-GPTQ -q gptq -tp 2 --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.latency --model ../models/lzlv_70B-GPTQ -q gptq -tp 2 --input-len 512 --output-len 200
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.throughput --model ../models/MLewd-ReMM-L2-Chat-20B-GPTQ -q gptq --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.latency --model ../models/MLewd-ReMM-L2-Chat-20B-GPTQ -q gptq --input-len 512 --output-len 200
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.throughput --model ../models/MythoMax-L2-13B-GPTQ -q gptq --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.latency --model ../models/MythoMax-L2-13B-GPTQ -q gptq --input-len 512 --output-len 200
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.throughput --model ../models/LLaMA2-13B-Tiefighter -tp 2 --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.latency --model ../models/LLaMA2-13B-Tiefighter -tp 2 --input-len 512 --output-len 200
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.throughput --model ../models/OpenHermes-2.5-Mistral-7B --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.latency --model ../models/OpenHermes-2.5-Mistral-7B --input-len 4096 --output-len 200
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.throughput --model ../models/Nethena-20B -tp 2 --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
CUDA_VISIBLE_DEVICES=0,1 ./runtime.sh python -m tests.latency --model ../models/Nethena-20B -tp 2 --input-len 512 --output-len 200
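Since each model gets the same throughput/latency pair, a small wrapper keeps the runs consistent. This is only a convenience sketch; it mirrors the single-GPU GPTQ commands above:
# Optional wrapper: run the same throughput + latency pair for each single-GPU GPTQ model (mirrors the rig1 commands above)
for MODEL in MLewd-ReMM-L2-Chat-20B-GPTQ MythoMax-L2-13B-GPTQ; do
    CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.throughput --model ../models/$MODEL -q gptq --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 100
    CUDA_VISIBLE_DEVICES=0 ./runtime.sh python -m tests.latency --model ../models/$MODEL -q gptq --input-len 512 --output-len 200
done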