
Ingrain Server

Ingrain Server is a wrapper around Triton Inference Server that makes it easy to serve sentence transformers and open CLIP models.

To use:

docker run --name ingrain_server -p 8686:8686 -p 8687:8687 --gpus all owenpelliott/ingrain-server:latest

To run without a GPU, remove the --gpus all flag.

What does it do?

This server handles all the model loading, ONNX conversion, memory management, parallelisation, dynamic batching, input pre-processing, image handling, and other complexities of running a model in production. The API is very simple but lets you serve models in a performant manner.

Open CLIP models and sentence transformers are both converted to ONNX and served by Triton. The server can handle multiple models at once.
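
As an example of multi-model serving, the two documented load endpoints can be called back to back to register a sentence transformer and a CLIP model at the same time. The sketch below uses the requests library; the host and port are assumptions (the Docker image exposes 8686 and 8687, and the model management endpoints are assumed here to live on 8687).

import requests

# Assumption: model management endpoints are served on port 8687
# (the Docker image exposes both 8686 and 8687).
BASE = "http://localhost:8687"

# Register a sentence transformer and a CLIP model side by side.
requests.post(
    f"{BASE}/load_sentence_transformer_model",
    json={"name": "intfloat/e5-small-v2"},
).raise_for_status()

requests.post(
    f"{BASE}/load_clip_model",
    json={"name": "ViT-B-32", "pretrained": "laion2b_s34b_b79k"},
).raise_for_status()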

How does it perform?

It retains all the performance of Triton. On 12 CPU cores at 4.3 GHz with an RTX 2080 SUPER 8 GB card, running in Docker under WSL2, it can serve intfloat/e5-small-v2 to 500 clients at ~1050 QPS, or intfloat/e5-base-v2 to 500 clients at ~860 QPS.

How compatible is it?

Most models work out of the box. It is impractical to test every sentence transformers model and every CLIP model, but the main architectures are tested and work. If you have a model that doesn't work, please open an issue.

Usage

The easiest way to get started is to use the optimised Python client:

pip install ingrain
import ingrain

ingrn = ingrain.Client()

model = ingrn.load_sentence_transformer_model(name="intfloat/e5-small-v2")

response = model.infer_text(text=["I am a sentence.", "I am another sentence.", "I am a third sentence."])

print(f"Processing Time (ms): {response['processingTimeMs']}")
print(f"Text Embeddings: {response['embeddings']}")

You can also have the embeddings automatically returned as a NumPy array:

import ingrain

ingrn = ingrain.Client(return_numpy=True)

model = ingrn.load_sentence_transformer_model(name="intfloat/e5-small-v2")

response = model.infer_text(text=["I am a sentence.", "I am another sentence.", "I am a third sentence."])

print(type(response['embeddings']))
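
CLIP models can be used through the same client. The method names in the sketch below (load_clip_model and infer_image) are not shown above and are assumptions for illustration; check the client documentation for the exact API.

import ingrain

ingrn = ingrain.Client()

# load_clip_model and infer_image are assumed method names for illustration;
# only load_sentence_transformer_model and infer_text are documented above.
model = ingrn.load_clip_model(name="ViT-B-32", pretrained="laion2b_s34b_b79k")

text_response = model.infer_text(text=["a photo of a cat", "a photo of a dog"])
image_response = model.infer_image(image=["https://example.com/image1.jpg"])

print(f"Processing Time (ms): {text_response['processingTimeMs']}")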

Example Requests and Responses

Loading a sentence transformer model POST /load_sentence_transformer_model:
{
    "name": "intfloat/e5-small-v2"
}
Loading a CLIP model POST /load_clip_model:
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k"
}
Inference request POST /infer:
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k",
    "text": ["I am a sentence.", "I am another sentence.", "I am a third sentence."],
    "image": ["https://example.com/image1.jpg", "https://example.com/image2.jpg"]
}

Response

{
    "textEmbeddings": [
        [0.1, ..., 0.4],
        [0.5, ..., 0.8],
        [-0.2, ..., 0.3]
    ],
    "imageEmbeddings": [
        [0.1, ..., 0.4],
        [0.5, ..., 0.8]
    ],
    "processingTimeMs": 24.34
}
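
The same request can be made from Python with the requests library. This is a minimal sketch; the port (8686) is an assumption based on the ports the Docker image exposes.

import requests

# Assumption: the inference API is served on port 8686.
response = requests.post(
    "http://localhost:8686/infer",
    json={
        "name": "ViT-B-32",
        "pretrained": "laion2b_s34b_b79k",
        "text": ["I am a sentence.", "I am another sentence."],
        "image": ["https://example.com/image1.jpg"],
    },
)
response.raise_for_status()
result = response.json()

print(result["processingTimeMs"])
print(len(result["textEmbeddings"]), len(result["imageEmbeddings"]))
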
Get inference metrics GET /metrics:

Details omitted for brevity.

{
  "modelStats": [
    {
      "name": "ViT-B-32_laion2b_s34b_b79k_image_encoder",
      "version": "1",
      "inference_stats": {
        "success": {...},
        "fail": {...},
        "queue": {...},
        "compute_input": {...},
        "compute_infer": {...},
        "compute_output": {...},
        "cache_hit": {...},
        "cache_miss": {...}
      }
    },
    {
      "name": "ViT-B-32_laion2b_s34b_b79k_text_encoder",
      "version": "1",
      "inference_stats": {
        ...
      }
    },
    {
      "name": "intfloat_e5-small-v2",
      "version": "1",
      "inference_stats": {
        ...
      }
    }
  ]
}
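
Metrics can be pulled programmatically in the same way; again, the port is an assumption.

import requests

# Assumption: /metrics is served by the inference API on port 8686.
stats = requests.get("http://localhost:8686/metrics").json()
for model_stats in stats["modelStats"]:
    print(model_stats["name"], model_stats["version"])
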
Unloading a model POST /unload_model:
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k"
}
Delete a model POST /delete_model:
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k"
}
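
Both endpoints take the same payload as the load request, so unloading and deleting a model from Python looks like the sketch below. The port is an assumption, and the exact semantics of unload versus delete (in memory versus on disk) are assumptions too.

import requests

# Assumption: model management endpoints are served on port 8687.
BASE = "http://localhost:8687"
payload = {"name": "ViT-B-32", "pretrained": "laion2b_s34b_b79k"}

# Assumption: unload removes the model from the running server, while
# delete also removes the converted model artefacts.
requests.post(f"{BASE}/unload_model", json=payload).raise_for_status()
requests.post(f"{BASE}/delete_model", json=payload).raise_for_status()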

Build Container Locally

Build the Docker image:

docker build -t ingrain-server .

Run the Docker container:

docker run --name ingrain_server -p 8686:8686 -p 8687:8687 --gpus all ingrain-server

Performance test

You can run the benchmark script to test the performance of the server:

Install the Python client:

pip install ingrain

Run the benchmark script:

python benchmark.py

It will output some metrics about the inference speed of the server.

{"message":"Model intfloat/e5-small-v2 is already loaded."}
Benchmark results:
Concurrent threads: 500
Requests per thread: 20
Total requests: 10000
Total benchmark time: 9.31 seconds
QPS: 1074.66
Mean response time: 0.3595 seconds
Median response time: 0.3495 seconds
Standard deviation of response times: 0.1174 seconds
Mean inference time: 235.5968 ms
Median inference time: 227.6743 ms
Standard deviation of inference times: 84.8669 ms

Development Setup

Requires Docker and Python to be installed.

Create a virtual environment

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Run the Triton Inference Server

bash run_triton_server_dev.sh

Run the FastAPI server

uvicorn inference_server:app --host 127.0.0.1 --port 8686 --reload
uvicorn model_server:app --host 127.0.0.1 --port 8687 --reload

Testing

Install pytest:

pip install pytest

Unit tests

pytest

Integration tests and unit tests

pytest --integration
