How do I optimize a Python BLS model orchestrating ONNX models? #7388
Comments
hi @JamesBowerXanda there is also a newer version available:
Description
I am using the SageMaker Triton Inference Server containers to run a multi-model endpoint. One of the models is an MT5 model. I am trying to optimise for latency, and I think I am losing time to data transfer: on an equivalent instance type in a notebook, the generation pipeline with onnxruntime takes 0.5 seconds, but a request through the Triton Inference Server endpoint (with no other models loaded) takes around 2.5 seconds.
The model is split into an encoder_model.onnx, decoder_model.onnx and decoder_with_past_model.onnx.
What is the best way to optimise this?
Happy to restructure if there is a better way of doing it, but I am running multiple models on the SageMaker GPU instance.
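For intuition about why the gap is so large (the numbers below are hypothetical, chosen only to be consistent with the timings reported above): autoregressive generation issues one decoder call per generated token, so any fixed per-call overhead in the BLS round trip gets multiplied by the output length.

```python
# Illustrative (made-up) numbers: per-call overhead in a BLS pipeline
# multiplies with the number of autoregressive decoder steps.
def pipeline_latency(num_tokens, compute_per_step_s, overhead_per_call_s):
    """Total latency = one encoder pass + one decoder call per generated token."""
    encoder = compute_per_step_s + overhead_per_call_s
    decoder = num_tokens * (compute_per_step_s + overhead_per_call_s)
    return encoder + decoder

# In-process onnxruntime: effectively zero orchestration overhead.
in_process = pipeline_latency(50, 0.01, 0.0)
# Through a BLS that copies tensors between host and device on every
# decoder step, even ~40 ms of per-call overhead dominates total latency.
via_bls = pipeline_latency(50, 0.01, 0.04)
print(f"in-process: {in_process:.2f}s, via BLS: {via_bls:.2f}s")
# → in-process: 0.51s, via BLS: 2.55s
```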
Triton Information
23.08
Are you using the Triton container or did you build it yourself?
Sagemaker container as mentioned here
To Reproduce
Take a T5 or MT5 model and use Optimum to export the constituent ONNX models:
optimum-cli export onnx --model google/mt5-small onnx-model --device cuda --optimize O4
Take the encoder_model.onnx, decoder_model.onnx and decoder_with_past_model.onnx files and add them to a Triton Inference Server model repository as ONNX models running on GPU. I will put the config.pbtxt files at the bottom.
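For reference, a minimal encoder config along the lines of what I am using (tensor names as produced by the Optimum T5/MT5 export; 512 is mt5-small's hidden size — adjust for other checkpoints) looks like:

```
name: "encoder_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  { name: "input_ids" data_type: TYPE_INT64 dims: [ -1 ] },
  { name: "attention_mask" data_type: TYPE_INT64 dims: [ -1 ] }
]
output [
  { name: "last_hidden_state" data_type: TYPE_FP32 dims: [ -1, 512 ] }
]
instance_group [ { kind: KIND_GPU } ]
```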
Create a Python BLS model with the model.py file:
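The model.py is essentially a greedy-decoding loop over the three models. An illustrative skeleton of that orchestration (each per-model pb_utils.InferenceRequest call is abstracted here as an injected callable so the loop structure is visible; the decoder start token of 0 is the T5/MT5 convention and an assumption):

```python
import numpy as np

def greedy_generate(run_encoder, run_decoder, run_decoder_with_past,
                    input_ids, max_new_tokens=20, eos_token_id=1):
    """Orchestration skeleton of the three-model T5/MT5 pipeline.

    run_* are callables standing in for the per-model inference calls;
    inside a real Triton BLS model.py each one would build and execute a
    pb_utils.InferenceRequest against the corresponding ONNX model.
    """
    encoder_states = run_encoder(input_ids)
    decoder_ids = np.array([[0]])  # decoder_start_token_id (0 for T5/MT5)
    # First step has no KV cache, so it goes through decoder_model.
    logits, past = run_decoder(decoder_ids, encoder_states)
    for _ in range(max_new_tokens):
        next_id = int(logits[0, -1].argmax())
        if next_id == eos_token_id:
            break
        decoder_ids = np.concatenate([decoder_ids, [[next_id]]], axis=1)
        # decoder_with_past_model only consumes the newest token plus the
        # cached keys/values, which is the latency win over re-running the
        # full decoder on the whole prefix each step.
        logits, past = run_decoder_with_past(
            np.array([[next_id]]), encoder_states, past)
    return decoder_ids
```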
Run the inference with the model. Below are the relevant config.pbtxt files.
Expected behavior
I expected a generation request through the endpoint to take approximately the same amount of time as the notebook pipeline, rather than 5 times longer.