-
Is it possible to get some benchmarks? I know you said that torch is 4x faster than ctranslate2, but I was curious whether int8/float16 was used, whether torch.compile was enabled, stuff like that. It would be good to know for my projects and others.
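For context, here's the kind of minimal timing harness I have in mind (a sketch only -- the model name and texts are placeholders, and the float16/torch.compile lines are just guesses at settings that might have been used, not anything confirmed by the maintainers):

```python
import time

import torch
from sentence_transformers import SentenceTransformer

# Placeholder model and data -- swap in whatever you actually benchmark.
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
texts = ["an example chunk of text to embed"] * 1024


def average_encode_seconds(model, batch_size=32, runs=3):
    # Warm-up pass so one-time CUDA/compile overhead doesn't skew the numbers.
    model.encode(texts[:batch_size], batch_size=batch_size)
    timings = []
    for _ in range(runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.encode(texts, batch_size=batch_size)
        torch.cuda.synchronize()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


model = SentenceTransformer(MODEL_NAME, device="cuda")
print("float32:", average_encode_seconds(model))

model.half()  # cast the weights to float16; usually fine for inference
print("float16:", average_encode_seconds(model))

# Optionally wrap the underlying transformer with torch.compile (PyTorch 2.x):
# model[0].auto_model = torch.compile(model[0].auto_model)
```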
-
Is this a good starting point? https://michaelfeil.eu/infinity/0.0.28/benchmarking/
-
At the end of this post are the graphs from my completed testing. Note: I've only included the models that have 768 dimensions, because including smaller or larger ones skews the graph such that you can't see the minor differences. However, I have that additional data if anyone's interested.

Testing procedure: I ran each model 3x for each of the batch sizes indicated on the graphs. Each metric is an average of those three runs. I used an RTX 4090. The models all processed 4095 identical text "chunks" (size of 800) created by the recursive character text splitter from LangChain.

Takeaways: NONE of the models were hindered by VRAM. In other words, VRAM was never maxed. Rather, all of the speed gains flatlined at a certain batch size where "CUDA usage" was maxed (per Windows Task Manager), which indicates that the compute power (i.e. the CUDA cores) was the bottleneck. Essentially, at a certain point throwing more information at the CUDA cores (i.e. a larger batch size) did nothing, and it actually hurt performance (discussed below).

Using the […] It's crucial to understand that the time savings was NOT - I repeat NOT - due to […]. Again, I only say all this because through my testing I discovered that the […].

Knowing this, the "optimal" batch size varied slightly between the models, but generally speaking it was 4 for normal, and 8 for […]. For example, with […]. The conclusion to draw from this is that when running the full size […]. So WHY IS sentence-transformers setting a default batch size of 32?

To further complicate matters... again, because CUDA cores are the bottleneck, the optimal batch size would change on a different GPU. For example, the RTX 4090 has 16384 CUDA cores while the RTX 4080 has 10240 CUDA cores. Therefore, logically, to find the optimal batch size for an RTX 4080 you would multiply 4 by (10240/16384), resulting in an ideal batch size of 2.5 (which you'd have to round down to 2). The number of CUDA cores for the various models is here: https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#RTX_40_series

Enough explanation, here are the graphs:
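For anyone who wants to rerun something like this on their own GPU, here is a rough approximation of the procedure described above -- not my exact script; the corpus path and model name are placeholders:

```python
import time

from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

# Placeholder corpus and model -- substitute your own.
with open("corpus.txt") as f:
    raw_text = f.read()

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=0)
chunks = splitter.split_text(raw_text)


def average_seconds(model, chunks, batch_size, runs=3):
    # Average wall-clock encode time over several runs for one batch size.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        model.encode(chunks, batch_size=batch_size)
        times.append(time.perf_counter() - start)
    return sum(times) / runs


model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="cuda")
for batch_size in (1, 2, 4, 8, 16, 32, 64):
    print(f"batch_size={batch_size}: {average_seconds(model, chunks, batch_size):.2f}s")
```

While it runs, you can watch CUDA utilization and VRAM in Task Manager or nvidia-smi to see where the gains flatline on your card.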
-
Here's the graph just showing the large models, plus instructor-xl, same settings as above. You can see that anything other than a batch size of 1 actually decreases performance for instructor-xl but not the others. However, these are all NON-float16. As I mentioned above, you can get away with instructor-xl on a batch size of 2 for significant improvement, granted, my test was on an RTX 4090... And here's the same comparison but for VRAM usage... Again, why is sentence-transformers setting a default batch size of 32?
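On that last point: 32 is simply the default value of the batch_size argument to SentenceTransformer.encode, so it can be overridden per call. A minimal example (the model name and texts are placeholders):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="cuda")  # placeholder
chunks = ["an example text chunk"] * 100

# encode() defaults to batch_size=32; pass whatever value your own benchmarking favors.
embeddings = model.encode(chunks, batch_size=1)
```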