Here's my batching/caching API I made over the weekend. 200+ tk/s with Mistral 5.0bpw.exl2 on an RTX 3090 with concurrent requests. It was for a personal project, and it's not complete, but it's very fast. #247
epolewski started this conversation in Show and tell
Replies: 1 comment
-
Very cool!
-
Seems like the kind of crowd who'd enjoy some open source code showing how to implement a batching API with exllamav2:
https://github.com/epolewski/EricLLM
I made it mostly as a drop-in replacement for vLLM while they fix a bug that I haven't been able to work around or solve.
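For anyone wondering what "batching concurrent requests" means in practice, here is a minimal sketch of the general idea. It is not taken from EricLLM and makes no claim about how that project or exllamav2 does it. The `MicroBatcher` class, the `generate_batch` callable, and the `max_batch_size` / `max_wait_s` parameters are all hypothetical names made up for illustration. The assumption is that the model can process a list of prompts in one call faster than it could process them one at a time.

```python
import asyncio
from typing import Callable, List


class MicroBatcher:
    """Collects concurrent requests and runs them through the model in batches.

    Create the batcher inside a running event loop (e.g. inside a coroutine).
    """

    def __init__(self, generate_batch: Callable[[List[str]], List[str]],
                 max_batch_size: int = 8, max_wait_s: float = 0.01):
        # Hypothetical blocking call: takes N prompts, returns N completions.
        self.generate_batch = generate_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        """Called once per incoming request; resolves when its batch finishes."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def run(self) -> None:
        """Background loop: wait for one request, then briefly fill the batch."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self.queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
            prompts = [prompt for prompt, _ in batch]
            # Run the blocking model call in a thread so the loop stays responsive.
            outputs = await loop.run_in_executor(None, self.generate_batch, prompts)
            for (_, future), output in zip(batch, outputs):
                if not future.cancelled():
                    future.set_result(output)
```

To use it, start `run()` as a background task with `asyncio.create_task(...)` and have each request handler `await batcher.submit(prompt)`. The trade-off is the usual one: a longer `max_wait_s` gives fuller batches and higher total throughput, at the cost of a little extra latency for the first request in each batch.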