
Ingrain Server

Ingrain Server is a wrapper around Triton Inference Server that makes it easy to serve sentence transformers and open CLIP models.

To use:

docker run --name ingrain_server -p 8686:8686 -p 8687:8687 --gpus all owenpelliott/ingrain-server:latest

To run without a GPU, remove the --gpus all flag.

What does it do?

This server handles all the model loading, ONNX conversion, memory management, parallelisation, dynamic batching, input pre-processing, image handling, and other complexities of running a model in production. The API is very simple but lets you serve models in a performant manner.

Open CLIP models and sentence transformers are both converted to ONNX and served by Triton. The server can handle multiple models at once.
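
As an example of multi-model serving, the two documented load endpoints can be called back to back to register a sentence transformer and a CLIP model at the same time. The sketch below uses the requests library; the host and port are assumptions (the Docker image exposes 8686 and 8687, and the model management endpoints are assumed here to live on 8687).

import requests

# Assumption: model management endpoints are served on port 8687
# (the Docker image exposes both 8686 and 8687).
BASE = "http://localhost:8687"

# Register a sentence transformer and a CLIP model side by side.
requests.post(
    f"{BASE}/load_sentence_transformer_model",
    json={"name": "intfloat/e5-small-v2"},
).raise_for_status()

requests.post(
    f"{BASE}/load_clip_model",
    json={"name": "ViT-B-32", "pretrained": "laion2b_s34b_b79k"},
).raise_for_status()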

How does it perform?

It retains all the performance of Triton. On 12 CPU cores at 4.3 GHz with an RTX 2080 SUPER 8 GB card, running in Docker under WSL2, it can serve intfloat/e5-small-v2 to 500 clients at ~1050 QPS, or intfloat/e5-base-v2 to 500 clients at ~860 QPS.

How compatible is it?

Most models work out of the box. It is impractical to test every sentence transformers model and every CLIP model, but the main architectures are tested and work. If you have a model that doesn't work, please open an issue.

Usage

The easiest way to get started is to use the optimised Python client:

pip install ingrain
import ingrain

ingrn = ingrain.Client()

model = ingrn.load_sentence_transformer_model(name="intfloat/e5-small-v2")

response = model.infer_text(text=["I am a sentence.", "I am another sentence.", "I am a third sentence."])

print(f"Processing Time (ms): {response['processingTimeMs']}")
print(f"Text Embeddings: {response['embeddings']}")

You can also have the embeddings automatically returned as a NumPy array:

import ingrain

ingrn = ingrain.Client(return_numpy=True)

model = ingrn.load_sentence_transformer_model(name="intfloat/e5-small-v2")

response = model.infer_text(text=["I am a sentence.", "I am another sentence.", "I am a third sentence."])

print(type(response['embeddings']))
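
CLIP models can be used through the same client. The method names in the sketch below (load_clip_model and infer_image) are not shown above and are assumptions for illustration; check the client documentation for the exact API.

import ingrain

ingrn = ingrain.Client()

# load_clip_model and infer_image are assumed method names for illustration;
# only load_sentence_transformer_model and infer_text are documented above.
model = ingrn.load_clip_model(name="ViT-B-32", pretrained="laion2b_s34b_b79k")

text_response = model.infer_text(text=["a photo of a cat", "a photo of a dog"])
image_response = model.infer_image(image=["https://example.com/image1.jpg"])

print(f"Processing Time (ms): {text_response['processingTimeMs']}")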

Example Requests and Responses

Loading a sentence transformer model POST /load_sentence_transformer_model:
{
    "name": "intfloat/e5-small-v2"
}
Loading a CLIP model POST /load_clip_model:
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k"
}
Inference request POST /infer:
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k",
    "text": ["I am a sentence.", "I am another sentence.", "I am a third sentence."],
    "image": ["https://example.com/image1.jpg", "https://example.com/image2.jpg"]
}

Response

{
    "textEmbeddings": [
        [0.1, ..., 0.4],
        [0.5, ..., 0.8],
        [-0.2, ..., 0.3]
    ],
    "imageEmbeddings": [
        [0.1, ..., 0.4],
        [0.5, ..., 0.8]
    ],
    "processingTimeMs": 24.34
}
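
The same request can be made from Python with the requests library. This is a minimal sketch; the port (8686) is an assumption based on the ports the Docker image exposes.

import requests

# Assumption: the inference API is served on port 8686.
response = requests.post(
    "http://localhost:8686/infer",
    json={
        "name": "ViT-B-32",
        "pretrained": "laion2b_s34b_b79k",
        "text": ["I am a sentence.", "I am another sentence."],
        "image": ["https://example.com/image1.jpg"],
    },
)
response.raise_for_status()
result = response.json()

print(result["processingTimeMs"])
print(len(result["textEmbeddings"]), len(result["imageEmbeddings"]))
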
Get inference metrics GET /metrics:

Details omitted for brevity.

{
  "modelStats": [
    {
      "name": "ViT-B-32_laion2b_s34b_b79k_image_encoder",
      "version": "1",
      "inference_stats": {
        "success": {...},
        "fail": {...},
        "queue": {...},
        "compute_input": {...},
        "compute_infer": {...},
        "compute_output": {...},
        "cache_hit": {...},
        "cache_miss": {...}
      }
    },
    {
      "name": "ViT-B-32_laion2b_s34b_b79k_text_encoder",
      "version": "1",
      "inference_stats": {
        ...
      }
    },
    {
      "name": "intfloat_e5-small-v2",
      "version": "1",
      "inference_stats": {
        ...
      }
    }
  ]
}
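
Metrics can be pulled programmatically in the same way; again, the port is an assumption.

import requests

# Assumption: /metrics is served by the inference API on port 8686.
stats = requests.get("http://localhost:8686/metrics").json()
for model_stats in stats["modelStats"]:
    print(model_stats["name"], model_stats["version"])
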
Unloading a model POST /unload_model:
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k"
}
Delete a model POST /delete_model:
{
    "name": "ViT-B-32",
    "pretrained": "laion2b_s34b_b79k"
}
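
Both endpoints take the same payload as the load request, so unloading and deleting a model from Python looks like the sketch below. The port is an assumption, and the exact semantics of unload versus delete (in memory versus on disk) are assumptions too.

import requests

# Assumption: model management endpoints are served on port 8687.
BASE = "http://localhost:8687"
payload = {"name": "ViT-B-32", "pretrained": "laion2b_s34b_b79k"}

# Assumption: unload removes the model from the running server, while
# delete also removes the converted model artefacts.
requests.post(f"{BASE}/unload_model", json=payload).raise_for_status()
requests.post(f"{BASE}/delete_model", json=payload).raise_for_status()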

Build Container Locally

Build the Docker image:

docker build -t ingrain-server .

Run the Docker container:

docker run --name ingrain_server -p 8686:8686 -p 8687:8687 --gpus all ingrain-server

Performance test

You can run the benchmark script to test the performance of the server:

Install the Python client:

pip install ingrain

Run the benchmark script:

python benchmark.py

It will output some metrics about the inference speed of the server.

{"message":"Model intfloat/e5-small-v2 is already loaded."}
Benchmark results:
Concurrent threads: 500
Requests per thread: 20
Total requests: 10000
Total benchmark time: 9.31 seconds
QPS: 1074.66
Mean response time: 0.3595 seconds
Median response time: 0.3495 seconds
Standard deviation of response times: 0.1174 seconds
Mean inference time: 235.5968 ms
Median inference time: 227.6743 ms
Standard deviation of inference times: 84.8669 ms

Development Setup

Requires Docker and Python to be installed.

Create a virtual environment

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Run the Triton Inference Server

bash run_triton_server_dev.sh

Run the FastAPI server

uvicorn inference_server:app --host 127.0.0.1 --port 8686 --reload
uvicorn model_server:app --host 127.0.0.1 --port 8687 --reload

Testing

Install pytest:

pip install pytest

Unit tests

pytest

Integration tests and unit tests

pytest --integration
