This crate is designed to run any Machine Learning model on any architecture with ease and efficiency.
It leverages the Triton Inference Server (specifically the Triton C library) and provides a similar API with comparable advantages. However, Tritonserver-rs allows you to build the inference server locally, offering significant performance benefits. Check the benchmark for more details.
Run inference in three simple steps:
Step 1: Prepare the model repository. Organize your model files in the following structure:
models/
├── yolov8/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   ├── 2/
│   │   └── model.onnx
│   └── <other versions of yolov8>/
└── <other models>/
Rules:
- All models must be stored in the same root directory (`models/` in this example).
- Each model resides in its own folder containing:
  - A `config.pbtxt` configuration file (a minimal sketch is shown after this list).
  - One or more subdirectories, each representing a version of the model and containing the model file (e.g., `model.onnx`).
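The `config.pbtxt` file tells Triton how to serve the model: its name, backend, and input/output tensors. Below is a minimal sketch for the `yolov8` ONNX model used in this guide; the tensor names, data types, and dimensions are illustrative assumptions and must be adjusted to match your actual model (see the model configuration guide linked below).

```protobuf
name: "yolov8"
backend: "onnxruntime"
max_batch_size: 0

# Hypothetical tensors; replace with the real names/shapes of your model.
input [
  {
    name: "IMAGE"
    data_type: TYPE_UINT8
    dims: [ 640, 640, 3 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ 84, 8400 ]
  }
]
```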
Step 2: Write the code. Add Tritonserver-rs to your `Cargo.toml`:
[dependencies]
tritonserver-rs = "0.1"
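Note that the quick-start code below also uses the `image` crate to decode the input picture and an async runtime to drive the futures. Here is a fuller `[dependencies]` sketch; the extra crates, versions, and feature flags are assumptions made for this example, not requirements of Tritonserver-rs:

```toml
[dependencies]
tritonserver-rs = "0.1"

# Assumed extras used only by the quick-start example:
image = "0.24"                                                       # decodes /data/cats.jpg
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }  # async runtime for `.await`
```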
Then write your application code:
use tritonserver_rs::{Buffer, options::Options, Server};
use std::time::Duration;
// Configure server options.
let mut opts = Options::new("models/")?;
opts.exit_timeout(Duration::from_secs(5))?
.backend_directory("/opt/tritonserver/backends")?;
// Create the server.
let server = Server::new(opts).await?;
// Input data.
let image = image::open("/data/cats.jpg")?;
let image = image.as_flat_samples_u8();
// Create a request (specify the model name and version).
let mut request = server.create_request("yolov8", 2)?;
// Add input data and an allocator.
request
.add_default_allocator()
.add_input("IMAGE", Buffer::from(image))?;
// Run inference.
let fut = request.infer_async()?;
// Obtain results.
let response = fut.await?;
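Because the snippet above uses `?` and `.await`, it has to run inside an async function that returns a `Result`. Here is a minimal end-to-end sketch built from the same calls, assuming the Tokio runtime and boxed errors (neither is mandated by Tritonserver-rs):

```rust
use std::{error::Error, time::Duration};
use tritonserver_rs::{options::Options, Buffer, Server};

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // Configure and start the server against the model repository.
    let mut opts = Options::new("models/")?;
    opts.exit_timeout(Duration::from_secs(5))?
        .backend_directory("/opt/tritonserver/backends")?;
    let server = Server::new(opts).await?;

    // Load the input image as a flat u8 sample buffer.
    let image = image::open("/data/cats.jpg")?;
    let image = image.as_flat_samples_u8();

    // Build the request for version 2 of `yolov8` and run inference.
    let mut request = server.create_request("yolov8", 2)?;
    request
        .add_default_allocator()
        .add_input("IMAGE", Buffer::from(image))?;
    let _response = request.infer_async()?.await?;

    Ok(())
}
```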
Step 3: Deploy. Here is an example of how to deploy using `docker-compose.yml`:
my_app:
  image: {DEV_IMAGE}
  volumes:
    - ./Cargo.toml:/project/Cargo.toml
    - ./src:/project/src
    - ../models:/models
    - ../cats.jpg:/data/cats.jpg
  entrypoint: ["cargo", "run", "--manifest-path=/project/Cargo.toml"]
We recommend using `Dockerfile.dev` as `{DEV_IMAGE}`. For more details on suitable images and deployment instructions, see DEPLOY.md.
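As a rough idea of what such a dev image can look like, here is a hypothetical sketch that starts from the NVIDIA Triton base image (the tag is only an example) and adds a Rust toolchain on top; refer to DEPLOY.md and `Dockerfile.dev` for the actual recommended setup:

```dockerfile
# Hypothetical dev image; see DEPLOY.md / Dockerfile.dev for the recommended one.
FROM nvcr.io/nvidia/tritonserver:24.08-py3

# Install a Rust toolchain so `cargo run` works inside the container.
RUN apt-get update && apt-get install -y --no-install-recommends curl build-essential \
    && curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

WORKDIR /project
```

Once the image is built and substituted for `{DEV_IMAGE}`, `docker compose up my_app` builds and runs the example application.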
For further details, check out the following resources:
- Examples: Learn how to run various ML models using Tritonserver-rs, configure inference, prepare models, and deploy.
- Model configuration guide.
- Build and deployment instructions.
- Benchmark results.
- Triton Inference Server guides.
- Documentation on docs.rs.
Key advantages of Tritonserver-rs:
- Versatility: Extensive configuration options for models and servers.
- High performance: Optimized for maximum efficiency.
- Broad backend support: Run PyTorch, ONNX, TensorFlow, TensorRT, OpenVINO, model pipelines, and custom backends out of the box.
- Compatibility: Supports most GPUs and architectures.
- Multi-model handling: Handle multiple models simultaneously.
- Prometheus integration: Built-in support for monitoring.
- CUDA-optimized: Directly handle model inputs and outputs on GPU memory.
- Dynamic server management: Advanced runtime control features.
- Rust-based: Enjoy the safety, speed, and concurrency benefits of Rust.