You can dynamically load models in explicit model control mode: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_management.html#model-control-mode-explicit As for dynamically loading the image, you should be able to do it via a Python model. You can even use an ensemble model if you want to break the Python model out into its own step.
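A minimal sketch of the load/unload pattern described above, assuming the server was started with `--model-control-mode=explicit`. The `ModelCache` class and its LRU policy are illustrative, not part of Triton; `client` is anything exposing `load_model(name)` / `unload_model(name)`, such as `tritonclient.http.InferenceServerClient`.

```python
from collections import OrderedDict

class ModelCache:
    """Client-side LRU cache of loaded models (illustrative, not a Triton API).

    `client` must provide load_model(name) / unload_model(name), e.g.
    tritonclient.http.InferenceServerClient talking to a server that was
    started with --model-control-mode=explicit.
    """

    def __init__(self, client, capacity):
        self.client = client
        self.capacity = capacity          # max models resident at once
        self.loaded = OrderedDict()       # model name -> None, in LRU order

    def ensure_loaded(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return
        if len(self.loaded) >= self.capacity:
            # Evict the least recently used model to free GPU memory.
            victim, _ = self.loaded.popitem(last=False)
            self.client.unload_model(victim)
        self.client.load_model(name)
        self.loaded[name] = None
```

In real use you would call `cache.ensure_loaded("my_model")` before each inference request, and derive the capacity from GPU memory rather than a flat model count.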
-
I’m implementing Triton for my computer vision infrastructure. There are a couple of details I want to understand before proceeding.
First of all, I wonder whether I can dynamically load models. The models I have don’t all fit on the GPU I will be using, and that is not a problem: model load time is not crucial, because when I need a model I’ll need it for a while, so I’m not worried about that load time. In my head it would work something like unloading the least frequently used models to make room for the most recently requested ones. Is this something Triton can do? If not, is there a way to implement it with API calls?
The second thing I want to sort out is the image loading process. Currently, the worker that analyses images requests them from an external tile engine over HTTP GET and then has to send them on to the Triton server. Is there a way to send Triton the image URL so that it fetches the image from the tile engine directly, saving some redundant transfer time in the process?
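One way the URL-forwarding idea could look inside a Triton Python-backend model: the client sends the tile URL as a string tensor, and the model fetches the bytes itself inside `execute`. The helper below is just that fetch step in isolation; the surrounding `TritonPythonModel` class, tensor names, and decode step are assumptions, not anything Triton provides out of the box.

```python
import urllib.request

def fetch_image_bytes(url, timeout=10.0):
    """Download raw image bytes from a tile-engine URL.

    In a Triton Python-backend model this would run inside execute():
    the request would carry a string tensor holding `url`, and the bytes
    returned here would then be decoded (e.g. with cv2 or PIL) into the
    input tensor expected by the downstream vision model.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

Whether this actually saves time depends on the network path between the Triton host and the tile engine; if they are co-located, it removes one image hop through the worker.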