-
Is it possible to get some benchmarks? I know you said that torch is 4x faster than ctranslate2, but I was curious whether int8/float16 was used, whether torch.compile was enabled, stuff like that. It would be good to know for my projects and others.
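For context, here's the kind of minimal timing harness I have in mind (a sketch only -- the model name and texts are placeholders, and the float16/torch.compile lines are just guesses at settings that might have been used, not anything confirmed by the maintainers):

```python
import time

import torch
from sentence_transformers import SentenceTransformer

# Placeholder model and data -- swap in whatever you actually benchmark.
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
texts = ["an example chunk of text to embed"] * 1024


def average_encode_seconds(model, batch_size=32, runs=3):
    # Warm-up pass so one-time CUDA/compile overhead doesn't skew the numbers.
    model.encode(texts[:batch_size], batch_size=batch_size)
    timings = []
    for _ in range(runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.encode(texts, batch_size=batch_size)
        torch.cuda.synchronize()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


model = SentenceTransformer(MODEL_NAME, device="cuda")
print("float32:", average_encode_seconds(model))

model.half()  # cast the weights to float16; usually fine for inference
print("float16:", average_encode_seconds(model))

# Optionally wrap the underlying transformer with torch.compile (PyTorch 2.x):
# model[0].auto_model = torch.compile(model[0].auto_model)
```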
-
Is this a good starting point? https://michaelfeil.eu/infinity/0.0.28/benchmarking/
-
At the end of this post are the graphs from my completed testing. Note: I've only included the models that have 768 dimensions, because including smaller or larger ones skews the graph such that you can't see the minor differences. However, I have that additional data if anyone's interested.

Testing procedure: I ran each model 3x for each of the batch sizes indicated on the graphs. Each metric is an average of those three runs. I used an RTX 4090. The models all processed 4095 identical text "chunks" (size of 800) created by the recursive character text splitter from LangChain.

Takeaways: NONE of the models were hindered by VRAM. In other words, VRAM was never maxed. Rather, all of the speed gains flatlined at a certain batch size where "CUDA usage" was maxed (per Windows Task Manager), which indicates that the compute power (i.e. the CUDA cores) was the bottleneck. Essentially, at a certain point throwing more information at the CUDA cores (i.e. a larger batch size) did nothing, and it actually hurt performance (discussed below).

Using the […] It's crucial to understand that the time savings was NOT - I repeat NOT - due to […]. Again, I only say all this because through my testing I discovered that the […].

Knowing this, the "optimal" batch size varied slightly between the models, but generally speaking it was 4 for normal, and 8 for […]. For example, with […]. The conclusion to draw from this is that when running the full size […]. So WHY IS sentence-transformers setting a default batch size of 32?

To further complicate matters... again, because CUDA cores are the bottleneck, the optimal batch size would change on a different GPU. For example, the RTX 4090 has 16384 CUDA cores while the RTX 4080 has 10240 CUDA cores. Therefore, logically, to find the optimal batch size for an RTX 4080 you would multiply 4 by (10240/16384), resulting in an ideal batch size of 2.5 (which you'd have to round down to 2). The number of CUDA cores for the various models is here: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#RTX_40_series

Enough explanation, here are the graphs:
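For anyone who wants to rerun something like this on their own GPU, here is a rough approximation of the procedure described above -- not my exact script; the corpus path and model name are placeholders:

```python
import time

from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

# Placeholder corpus and model -- substitute your own.
with open("corpus.txt") as f:
    raw_text = f.read()

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=0)
chunks = splitter.split_text(raw_text)


def average_seconds(model, chunks, batch_size, runs=3):
    # Average wall-clock encode time over several runs for one batch size.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        model.encode(chunks, batch_size=batch_size)
        times.append(time.perf_counter() - start)
    return sum(times) / runs


model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="cuda")
for batch_size in (1, 2, 4, 8, 16, 32, 64):
    print(f"batch_size={batch_size}: {average_seconds(model, chunks, batch_size):.2f}s")
```

While it runs, you can watch CUDA utilization and VRAM in Task Manager or nvidia-smi to see where the gains flatline on your card.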
-
Here's the graph just showing the large models, plus instructor-xl, same settings as above. You can see that anything other than a batch size of 1 actually decreases performance for instructor-xl but not the others. However, these are all NON-float16. As I mentioned above, you can get away with instructor-xl on a batch size of 2 for significant improvement, granted, my test was on an RTX 4090... And here's the same comparison but for VRAM usage... Again, why is sentence-transformers setting a default batch size of 32?
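On that last point: 32 is simply the default value of the batch_size argument to SentenceTransformer.encode, so it can be overridden per call. A minimal example (the model name and texts are placeholders):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="cuda")  # placeholder
chunks = ["an example text chunk"] * 100

# encode() defaults to batch_size=32; pass whatever value your own benchmarking favors.
embeddings = model.encode(chunks, batch_size=1)
```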