Low-latency image rationalization and search with contrastive language-image pre-training
CLIP | SigLIP | GroundingDINO | X-CLIP
Explore the docs »
*GroundingDINO ONNX prompted with: "spaceman. spacecraft. water. clouds. space helmet. glove"*
Report Bug · Request Feature
So what's going on here? There have been a lot of developments in deployable foundation models over the last year or so - keeping up is difficult, so the idea is to have a one-stop shop for a few things:
A unified class - and eventually a Python package - allowing for the deployment of ONNX-accelerated representations of CLIP and its latest improved variants - like SigLIP - in conjunction with SAM (Segment Anything) as a multi-modal localisation and labelling tool.
You'll find that all models and pipelines are available to you as separate tools too - should you wish to classify with SigLIP alone, all good; SAM by itself, no problem.
Last of all, the aim here is to keep up with the latest optimised foundation models as we go. This includes optimised postprocessing and test-time augmentations that can help with inference quality (a quick sketch follows the list below). Most importantly, the aim is to ensure that ONNX and TensorRT representations are available for use. So far we have:
- OpenAI's original CLIP - ViT-B/32 based, converted to ONNX with a full inference class
- SigLIP ONNX - FP16, with a quantized variant around the corner; TensorRT is in our future scope too
- GroundingDINO - zero-shot object detection - Swin-T based with a BERT (uncased) text encoder, converted to ONNX, FP32 and mixed precision (dynamic quantization shortly), with a full inference API
- Segment Anything ONNX - TensorRT on its way
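As an illustration of the kind of test-time augmentation mentioned above, here is a minimal sketch that averages image embeddings over a couple of simple views before scoring. The helper name, the augmentation choices and the assumption that `get_image_embeddings` returns an `(n_images, dim)` array are illustrative, not part of the package API:

```python
import numpy as np
from PIL import Image, ImageOps

def tta_image_embedding(onnx_model, image):
    """Average image embeddings over simple augmentations (identity + horizontal flip).

    Assumes onnx_model.get_image_embeddings returns an array-like of shape (n_images, dim).
    """
    views = [image, ImageOps.mirror(image)]
    embeddings = np.asarray(onnx_model.get_image_embeddings(views))
    pooled = embeddings.mean(axis=0)
    # Re-normalise so the pooled vector stays unit length for cosine-similarity scoring
    return pooled / np.linalg.norm(pooled)
```

Averaging unit-normalised embeddings and re-normalising keeps the result compatible with the cosine-similarity scoring used elsewhere in the examples below.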
Right now installation is as simple as the commands below, run in a virtual environment from the root of this project - see the notebook referenced below for a live demo:
```bash
git clone https://github.com/rhysdg/sam-at-a-clip.git
cd sam-at-a-clip
pip install -r requirements.txt
```
SigLIP is available and recommended by default, given the innovation made with its loss function leading to better inference. Model types, however, can be changed at instantiation with:

```python
onnx_model = OnnxLip(batch_size=16, type='siglip_full')
```
Notice also that cosine similarity at `get_similarity_scores` is adjusted to handle multiple contexts - in other words, a handful of text embeddings can be grouped as 'contexts' and sent to the function to be evaluated against a single image or a batch of images.
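Conceptually, evaluating one image embedding against several contexts of text embeddings is just a cosine similarity per context. Here is a minimal numpy sketch, assuming the embeddings are already L2-normalised (the real `get_similarity_scores` signature may differ):

```python
import numpy as np

def cosine_scores_per_context(image_embedding, contexts):
    """image_embedding: (dim,) unit vector; contexts: dict of name -> (n_texts, dim) unit vectors.

    Returns a dict of name -> (n_texts,) cosine similarities against the image.
    """
    return {name: np.asarray(text_embeddings) @ image_embedding
            for name, text_embeddings in contexts.items()}
```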
Hidden states are also available at `onnx_model.hidden_image` and `onnx_model.hidden_text` when using `type='siglip'` for extraction only - allowing for analysis, attention plotting and multi-point processing as input to SAM. Watch this space for more on this.
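As a minimal sketch of the extraction-only path - assuming the hidden-state attributes above are populated after an embedding pass with `type='siglip'`:

```python
from PIL import Image
from clip.model import OnnxLip

# 'siglip' (extraction only) exposes hidden states alongside the pooled embeddings
onnx_model = OnnxLip(batch_size=16, type='siglip')

images = [Image.open("images/dog.jpg").convert("RGB")]
image_embeddings = onnx_model.get_image_embeddings(images)
text_embeddings = onnx_model.get_text_embeddings(["a photo of a dog"])

# Hidden states for downstream analysis, attention plotting or SAM point prompts
image_hidden = onnx_model.hidden_image
text_hidden = onnx_model.hidden_text
```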
Note also that an `OnnxSAM` class is available with the same instantiation pattern and automatic model download - further examples are on their way, along with SigLIP integration.
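A minimal sketch of instantiation only - constructor arguments and the prompt/inference interface aren't covered here, so treat anything beyond the defaults as an assumption:

```python
from sam.model import OnnxSAM

# Instantiation mirrors OnnxLip; the relevant .onnx weights are downloaded automatically
sam_model = OnnxSAM()
```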
For the full 384 SigLIP model, go ahead and use the `.inference` method as follows - noting that CLIP is available via the same method, and either model will switch between softmax and sigmoid accordingly:

```python
from PIL import Image

from sam.model import OnnxSAM
from clip.model import OnnxLip, softmax, get_probabilities

images = [Image.open("images/dog.jpg").convert("RGB")]

texts = {"classification": ["a photo of space",
                            "a photo of a dog",
                            "a photo of a dog with flowers laying on grass",
                            "a photo of a brown and white dog with blue flowers laying on grass",
                            "a photo of a brown and white dog with yellow flowers laying on grass"],
         }

# type='clip' is also available with this usage
onnx_model = OnnxLip(batch_size=16, type='siglip_full')

probs, _ = onnx_model.inference(images, texts)

for k, v in texts.items():
    print(f'\ncontext: {k}\n')
    for text, p in zip(texts[k], probs[k]):
        print(f"Probability that the image is '{text}': {p:.3f}")
```
For cosine-similarity based models, manual extraction as a precursor can be used as follows (noting that the SigLIP text and image encoders are available despite a different loss):

```python
from PIL import Image

from sam.model import OnnxSAM
from clip.model import OnnxLip, softmax, get_probabilities

images = [Image.open("images/dog.jpg").convert("RGB")]

texts = {"classification": ["a photo of a man",
                            "a photo of a woman",
                            "a photo of a dog"],
         "situational": ["a dog standing up",
                         "a dog running",
                         "a dog laying on grass"],
         }

onnx_model = OnnxLip(batch_size=16, type='clip')

image_embeddings = onnx_model.get_image_embeddings(images)
text_embeddings_class = onnx_model.get_text_embeddings(texts['classification'])
text_embeddings_situational = onnx_model.get_text_embeddings(texts['situational'])

contexts = {"classification": text_embeddings_class,
            "situational": text_embeddings_situational,
            }

probs, logits = get_probabilities(image_embeddings, contexts)

for k, v in contexts.items():
    print(f'\ncontext: {k}\n')
    for text, p in zip(texts[k], probs[k]):
        print(f"Probability that the image is '{text}': {p:.3f}")
```
For zero-shot object detection, go ahead and build from the following example:

```python
import os
import time
import logging

import torch
import numpy as np

from gdino.model import OnnxGDINO
from utils.gdino_utils import load_image, viz

logging.basicConfig(level=logging.INFO)

output_dir = 'output'

# Modest speedup with TensorRT 10.0.1.6-1 and fp16 on Ampere hardware currently
# torch with amp autocast and matmul enhancements at 'high' is still faster currently
ogd = OnnxGDINO(type='gdino_fp32', trt=True)

payload = ogd.preprocess_query("spaceman. spacecraft. water. clouds. space helmet. glove")

img, img_transformed = load_image('images/wave_planet.webp')
img.save(os.path.join(output_dir, "pred.jpg"))

filtered_boxes, predicted_phrases = ogd.inference(img_transformed.astype(np.float32),
                                                  payload,
                                                  text_threshold=0.25,
                                                  box_threshold=0.35)

size = img.size
pred_dict = {
    "boxes": filtered_boxes,
    "size": [size[1], size[0]],
    "labels": predicted_phrases,
}

predictions = viz(img, pred_dict, label_size=25, bbox_thickness=6)[0]
predictions.save(os.path.join(output_dir, "pred.jpg"))
```
- Coming soon
- SigLIP ONNX - a simple guided overview of the usage above - noting that the Huggingface AutoTokenizer is pretty verbose and cost-heavy; work is underway to bring a numpy-only solution in shortly
- Coming soon
- CI/CD will be expanded as we go - all general instantiation tests pass so far.
**All downloadable models are in `.onnx` format - noting that these are automatically downloaded by `OnnxLip` and `OnnxSAM` too.**
| model | CLIP Score | Deployment | speed (ms) | TensorRT FP16 status | ms (FP16) | FPS (quantized) |
|---|---|---|---|---|---|---|
| SigLIP 384 FP16 - text | pending | RTX 3080<br>AGX Xavier | pending<br>pending | pending<br>pass | pending<br>pending | pending<br>pending |
| SigLIP 384 FP16 - image | pending | RTX 3080<br>AGX Xavier | pending<br>pending | pending<br>pass | pending<br>pending | pending<br>pending |
| CLIP vitb32 - text | pending | RTX 3080<br>AGX Xavier | pending<br>pending | pending<br>pass | pending<br>pending | pending<br>pending |
| CLIP vitb32 - image | pending | RTX 3080<br>AGX Xavier | pending<br>pending | pending<br>pass | pending<br>pending | pending<br>pending |
| CLIP Surgery vitb32 | pending | RTX 3080<br>AGX Xavier | pending<br>pending | pending<br>pass | pending<br>pending | pending<br>pending |
| CLIP Surgery vitb32 | pending | RTX 3080<br>AGX Xavier | pending<br>pending | pending<br>pass | pending<br>pending | pending<br>pending |
| model | Score | Deployment | speed (ms) | TensorRT FP16 status | ms (FP16) | FPS (quantized) |
|---|---|---|---|---|---|---|
| SAM ViT-L ONNX - encoder | pending | RTX 3080<br>AGX Xavier | pending<br>pending | pending<br>pass | pending<br>pending | pending<br>pending |
| SAM ViT-L ONNX - decoder | pending | RTX 3080<br>AGX Xavier | pending<br>pending | pending<br>pass | pending<br>pending | pending<br>pending |
- Pending
Added a Gradio example app. Ignore the percentages for now - or rather, think of each as the pairwise confidence between an image and a particular prompt; sigmoid outputs don't sum to 1 in concert. The results, however, are pretty impressive! Simply run `python3 app.py` from your root and head to `http://127.0.0.1:7860/`.
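For reference, here is a minimal sketch of what such an app can look like - this is not necessarily the repo's `app.py`, and the comma-separated prompt handling is an assumption:

```python
import gradio as gr
from clip.model import OnnxLip

onnx_model = OnnxLip(batch_size=16, type='siglip_full')

def classify(image, prompts):
    # Comma-separated prompts -> a single 'classification' context
    labels = [p.strip() for p in prompts.split(",") if p.strip()]
    probs, _ = onnx_model.inference([image.convert("RGB")], {"classification": labels})
    # Sigmoid scores are pairwise confidences - they are not expected to sum to 1
    return {label: float(p) for label, p in zip(labels, probs["classification"])}

demo = gr.Interface(
    fn=classify,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Comma-separated prompts")],
    outputs=gr.Label(label="Pairwise confidence"),
)

if __name__ == "__main__":
    demo.launch()  # serves on http://127.0.0.1:7860/ by default
```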
For manual conversion of SigLIP variants, see the following issue
- Example Gradio app - done
- Deprecating the Huggingface dependency - standalone SigLIP tokenization for lightweight deployments - done
- GroundingDINO ONNX - possibly a better solution than SAM here for localisation - prompts are built in too
- Python packaging - scheduled
- TensorRT - pending
- CUDA-accelerated SigLIP-based vector search with chromadb - pending (see the sketch below)
- ollama support - pending
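As a preview of the chromadb item above, here is a minimal sketch of SigLIP-embedding based image search, assuming the embeddings are extracted with `OnnxLip`. The collection name, cosine-space setting and example paths are illustrative only; the actual integration is still pending:

```python
import chromadb
from PIL import Image
from clip.model import OnnxLip

onnx_model = OnnxLip(batch_size=16, type='siglip')

# Index a handful of images by their SigLIP image embeddings
image_paths = ["images/dog.jpg", "images/wave_planet.webp"]
images = [Image.open(p).convert("RGB") for p in image_paths]
image_embeddings = onnx_model.get_image_embeddings(images)

client = chromadb.Client()
collection = client.create_collection(name="siglip_images",
                                      metadata={"hnsw:space": "cosine"})
collection.add(ids=image_paths,
               embeddings=[list(map(float, e)) for e in image_embeddings])

# Query with a text embedding - nearest images by cosine distance
query_embedding = onnx_model.get_text_embeddings(["a dog laying on grass"])[0]
results = collection.query(query_embeddings=[list(map(float, query_embedding))],
                           n_results=2)
print(results["ids"])
```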
- Project link: https://github.com/rhysdg/sam-at-a-clip
- Email: Rhys