This is a wrapper for Triton Inference Server that makes it easy to use with sentence transformers and OpenCLIP models.
To use:
docker run --name ingrain_server -p 8686:8686 -p 8687:8687 --gpus all owenpelliott/ingrain-server:latest
To run without a GPU, remove the --gpus all flag.
This server handles all the model loading, ONNX conversion, memory management, parallelisation, dynamic batching, input pre-processing, image handling, and other complexities of running a model in production. The API is very simple but lets you serve models in a performant manner.
OpenCLIP models and sentence transformers are both converted to ONNX and served by Triton. The server can handle multiple models at once.
It retains all the performance of Triton. On 12 cores at 4.3 GHz with a 2080 SUPER 8GB card, running in Docker using WSL2, it can serve intfloat/e5-small-v2 to 500 clients at ~1050 QPS, or intfloat/e5-base-v2 to 500 clients at ~860 QPS.
Most models work out of the box. It is intractable to test every sentence transformers model and every CLIP model, but most mainstream architectures are tested and work. If you have a model that doesn't work, please open an issue.
The easiest way to get started is to use the optimised Python client:
pip install ingrain
import ingrain
ingrn = ingrain.Client()
model = ingrn.load_sentence_transformer_model(name="intfloat/e5-small-v2")
response = model.infer_text(text=["I am a sentence.", "I am another sentence.", "I am a third sentence."])
print(f"Processing Time (ms): {response['processingTimeMs']}")
print(f"Text Embeddings: {response['embeddings']}")
You can also have the embeddings returned automatically as a NumPy array:
import ingrain
ingrn = ingrain.Client(return_numpy=True)
model = ingrn.load_sentence_transformer_model(name="intfloat/e5-small-v2")
response = model.infer_text(text=["I am a sentence.", "I am another sentence.", "I am a third sentence."])
print(type(response['embeddings']))
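Because the embeddings come back as NumPy arrays, downstream vector maths needs no conversion. For example, cosine similarity between the first two sentence embeddings (plain NumPy, using only the response shown above):

import numpy as np

embeddings = response["embeddings"]  # NumPy array when return_numpy=True

# Cosine similarity between the first two sentence embeddings.
a, b = np.asarray(embeddings[0]), np.asarray(embeddings[1])
cosine_similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {cosine_similarity:.4f}")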
Example request body for loading a sentence transformer model:
{
    "name": "intfloat/e5-small-v2"
}
Example request body for loading an OpenCLIP model:
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k"
}
Example inference request with both text and images:
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k",
    "text": ["I am a sentence.", "I am another sentence.", "I am a third sentence."],
    "image": ["https://example.com/image1.jpg", "https://example.com/image2.jpg"]
}
Response:
{
    "textEmbeddings": [
        [0.1, ..., 0.4],
        [0.5, ..., 0.8],
        [-0.2, ..., 0.3]
    ],
    "imageEmbeddings": [
        [0.1, ..., 0.4],
        [0.5, ..., 0.8]
    ],
    "processingTimeMs": 24.34
}
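The same inference request can be made against the HTTP API directly. Below is a minimal sketch using the requests library against the inference server on port 8686; the endpoint path ("/infer") is an assumption for illustration and may differ from the actual route.

import requests

# Request body matching the example above; "/infer" is an assumed endpoint path.
payload = {
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k",
    "text": ["I am a sentence.", "I am another sentence.", "I am a third sentence."],
    "image": ["https://example.com/image1.jpg", "https://example.com/image2.jpg"],
}

resp = requests.post("http://localhost:8686/infer", json=payload, timeout=30)
resp.raise_for_status()
body = resp.json()

print(body["processingTimeMs"])
print(len(body["textEmbeddings"]), len(body["imageEmbeddings"]))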
Example model statistics response (details omitted for brevity):
{
    "modelStats": [
        {
            "name": "ViT-B-32_laion2b_s34b_b79k_image_encoder",
            "version": "1",
            "inference_stats": {
                "success": {...},
                "fail": {...},
                "queue": {...},
                "compute_input": {...},
                "compute_infer": {...},
                "compute_output": {...},
                "cache_hit": {...},
                "cache_miss": {...}
            }
        },
        {
            "name": "ViT-B-32_laion2b_s34b_b79k_text_encoder",
            "version": "1",
            "inference_stats": {
                ...
            }
        },
        {
            "name": "intfloat_e5-small-v2",
            "version": "1",
            "inference_stats": {
                ...
            }
        }
    ]
}
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k"
}
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k"
}
Build the Docker image:
docker build -t ingrain-server .
Run the Docker container:
docker run --name ingrain_server -p 8686:8686 -p 8687:8687 --gpus all ingrain-server
You can run the benchmark script to test the performance of the server:
Install the Python client:
pip install ingrain
Run the benchmark script:
python benchmark.py
It will output some metrics about the inference speed of the server.
{"message":"Model intfloat/e5-small-v2 is already loaded."}
Benchmark results:
Concurrent threads: 500
Requests per thread: 20
Total requests: 10000
Total benchmark time: 9.31 seconds
QPS: 1074.66
Mean response time: 0.3595 seconds
Median response time: 0.3495 seconds
Standard deviation of response times: 0.1174 seconds
Mean inference time: 235.5968 ms
Median inference time: 227.6743 ms
Standard deviation of inference times: 84.8669 ms
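For reference, the headline QPS figure follows directly from the totals above:

# QPS is simply total requests divided by total benchmark wall-clock time.
total_requests = 500 * 20       # concurrent threads * requests per thread = 10000
total_benchmark_time = 9.31     # seconds (rounded in the output above)
print(f"QPS: {total_requests / total_benchmark_time:.2f}")  # ~1074, matching the reported 1074.66 up to rounding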
Requires Docker and Python to be installed.
Create a virtual environment and install the dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Start the Triton server for development:
bash run_triton_server_dev.sh
Start the inference server and the model server:
uvicorn inference_server:app --host 127.0.0.1 --port 8686 --reload
uvicorn model_server:app --host 127.0.0.1 --port 8687 --reload
Install pytest:
pip install pytest
Run the tests:
pytest
Run the tests, including the integration tests:
pytest --integration