
Allow parallel inference #724

Open
1 task done
petrm opened this issue Dec 9, 2024 · 0 comments
Labels
feature request

Comments


petrm commented Dec 9, 2024

Describe the feature you'd like

I would like the inference queue to allow parallel execution of jobs.
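For illustration, here is a minimal sketch (not Hoarder's actual code) of what a configurable concurrency limit on the queue could look like. The job shape, the in-memory queue, and the `INFERENCE_NUM_WORKERS` variable are assumptions for this example:

```typescript
// Hypothetical sketch: drain an inference queue with N workers instead of
// one. The job shape and in-memory queue stand in for the real queue tables.
interface InferenceJob {
  bookmarkId: string;
  prompt: string;
}

const queue: InferenceJob[] = [
  { bookmarkId: "a", prompt: "Tag this bookmark" },
  { bookmarkId: "b", prompt: "Summarize this page" },
];

// Assumed knob; not an existing Hoarder setting.
const CONCURRENCY = Number(process.env.INFERENCE_NUM_WORKERS ?? "2");

async function runInference(job: InferenceJob): Promise<void> {
  // Placeholder for the real model call.
  console.log(`inferring for bookmark ${job.bookmarkId}`);
}

async function worker(): Promise<void> {
  // Each worker pulls jobs until the queue is empty, so up to
  // CONCURRENCY inference jobs are in flight at any moment.
  for (let job = queue.shift(); job !== undefined; job = queue.shift()) {
    await runInference(job);
  }
}

await Promise.all(Array.from({ length: CONCURRENCY }, () => worker()));
```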

Describe the benefits this would bring to existing Hoarder users

The use case is having multiple load-balanced Ollama backends available, which would speed up processing.
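As a hedged sketch of what dispatch across several Ollama servers could look like: the plural `OLLAMA_BASE_URLS` variable is an assumption for this example (today a single base URL is configured), while `POST /api/generate` with `stream: false` is Ollama's standard non-streaming generate endpoint:

```typescript
// Round-robin across a comma-separated list of Ollama base URLs.
// OLLAMA_BASE_URLS is a hypothetical setting for this sketch.
const backends = (process.env.OLLAMA_BASE_URLS ?? "http://localhost:11434")
  .split(",")
  .map((url) => url.trim());

let next = 0;
function pickBackend(): string {
  // Naive round-robin; a real balancer could also track in-flight
  // requests or skip unhealthy backends.
  const backend = backends[next % backends.length];
  next += 1;
  return backend;
}

async function generate(model: string, prompt: string): Promise<string> {
  const res = await fetch(`${pickBackend()}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!res.ok) {
    throw new Error(`Ollama request failed: ${res.status}`);
  }
  const data = (await res.json()) as { response: string };
  return data.response;
}
```

Combined with a worker pool like the one sketched above, each in-flight job would land on a different backend, which is where the speedup comes from.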

Can the goal of this request already be achieved via other means?

Not easily.

  • a faster GPU with more RAM
  • multiple GPUs on the same host
  • something other than Ollama that can run jobs distributed across multiple machines (like https://github.com/exo-explore/exo, maybe?)

Have you searched for an existing open/closed issue?

  • I have searched for existing issues and none cover my fundamental request

Additional context

No response

@kamtschatka added the feature request label on Dec 13, 2024