LLM instances need lifecycle management. Two things we are looking for here:

- If a user has not interacted with an LLM instance in some amount of time, we kill it to reclaim GPU resources.
- If a user tries to spin up a new LLM instance, we first check whether we have room; if we don't, we either fall back to CPU or kick an older model off the GPU (rough sketch after this list).
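Something like the sketch below is roughly what I have in mind. This is a minimal sketch, not a real implementation: `LLMLifecycleManager`, `IDLE_TIMEOUT_S`, and `MAX_GPU_INSTANCES` are made-up names, the "do we have room" check is a crude instance count rather than actual VRAM accounting, and the actual model load/unload is left out.

```python
import time
from collections import OrderedDict

# Hypothetical knobs -- real values should come from config / measured VRAM.
IDLE_TIMEOUT_S = 15 * 60     # how long an instance may sit idle before being reclaimed
MAX_GPU_INSTANCES = 4        # crude stand-in for "do we have room on the GPU?"


class LLMInstance:
    def __init__(self, model_name, device):
        self.model_name = model_name
        self.device = device                 # "cuda" or "cpu"
        self.last_used = time.monotonic()

    def touch(self):
        self.last_used = time.monotonic()


class LLMLifecycleManager:
    def __init__(self):
        # Ordered by recency of use, so the least-recently-used instance comes first.
        self.instances = OrderedDict()       # (user_id, model_name) -> LLMInstance

    def acquire(self, user_id, model_name):
        """Get (or create) an instance for this user/model, preferring GPU when there's room."""
        key = (user_id, model_name)
        inst = self.instances.get(key)
        if inst is None:
            if self._gpu_count() < MAX_GPU_INSTANCES:
                device = "cuda"
            else:
                device = self._make_room()   # evict an older model, or fall back to CPU
            inst = LLMInstance(model_name, device)
            self.instances[key] = inst
        inst.touch()
        self.instances.move_to_end(key)      # mark as most recently used
        return inst

    def reap_idle(self):
        """Kill instances nobody has talked to in IDLE_TIMEOUT_S to free GPU memory."""
        now = time.monotonic()
        for key, inst in list(self.instances.items()):
            if now - inst.last_used > IDLE_TIMEOUT_S:
                del self.instances[key]      # real code would also unload the model here

    def _gpu_count(self):
        return sum(1 for i in self.instances.values() if i.device == "cuda")

    def _make_room(self):
        """Evict the least-recently-used GPU instance; if there is none, use the CPU."""
        for key, inst in self.instances.items():   # iterates oldest-first
            if inst.device == "cuda":
                del self.instances[key]
                return "cuda"
        return "cpu"
```

The LRU ordering is just one option; we could also evict by model size or by how expensive the model is to reload.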
Now that I'm writing this, maybe we should demote older LLM instances to CPU before/instead of garbage collecting them. That way, when/if someone starts talking to them again, we don't need to go through a cold start, but we also aren't hogging the GPU.
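If we go the demotion route, the reaper above could move instances to CPU instead of deleting them. Again, this is just a hypothetical extension of the sketch above, assuming the underlying model can actually be moved between devices (e.g. a PyTorch-style `.to("cpu")`):

```python
    def demote_idle(self):
        """Demote idle GPU instances to CPU instead of killing them, so a returning
        user skips the cold start while we still hand the VRAM back."""
        now = time.monotonic()
        for inst in self.instances.values():
            if inst.device == "cuda" and now - inst.last_used > IDLE_TIMEOUT_S:
                inst.device = "cpu"   # real code would call something like model.to("cpu") here
```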
Anyway, this deserves some attention - as it stands now whenever a user wants to talk to a new type of model, we just keep jamming them onto the GPUs until we inevitably OOM. Not good.