
Contributors Apache LinkedIn


Vision at a CLIP: Multimodal Vision at Speed

Low-latency image rationalization and search with contrastive language-image pre-training

CLIP | SigLIP | GroundingDINO | X-CLIP

Explore the docs »


*GroundingDINO ONNX prompted with: "spaceman. spacecraft. water. clouds. space helmet. glove"

Report Bug · Request Feature

Table of Contents

About The Project

Built With

The Story So Far

So what's going on here? There have been a lot of developments in the last year or so with deployable foundation models - keeping up is difficult, so the idea is to have a one-stop shop for a few things:

A consolidated class - and eventually a Python package - allowing for the deployment of ONNX-accelerated representations of CLIP and its latest improved variants - like SigLIP - in conjunction with SAM (Segment Anything) as a multi-modal localisation and labelling tool.

You'll find that all models and pipelines are available to you as separate tools too - should you wish to classify with SigLIP alone, all good; SAM by itself, no problem.

Last of all, the aim here is to keep up with the latest optimised foundation models as we go. This includes optimised postprocessing and test-time augmentations that can help with inference quality (a minimal sketch follows the list below). Most importantly, the aim is to ensure that ONNX and TensorRT representations are available for use. So far we have:

  • OpenAI's original CLIP - ViT-B/32 based, converted to ONNX with a full inference class
  • SigLIP ONNX - FP16, with a quantized variant around the corner; TensorRT is in our future scope too
  • GroundingDINO - zero-shot object detection - Swin-T based with a BERT uncased text encoder, converted to ONNX, FP32 and mixed precision (dynamic quantisation shortly), with a full inference API
  • Segment Anything ONNX - TensorRT on its way
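
As a rough illustration of the test-time augmentation idea mentioned above, the sketch below embeds an image and its horizontal flip and averages the two embeddings before scoring. This is purely illustrative and assumes get_image_embeddings returns an (n, dim) numpy array - the augmentations actually used in this project may differ:

    import numpy as np
    from PIL import Image, ImageOps
    from clip.model import OnnxLip

    onnx_model = OnnxLip(batch_size=16, type='siglip_full')

    # Embed the original image and a horizontally flipped copy
    image = Image.open("images/dog.jpg").convert("RGB")
    views = [image, ImageOps.mirror(image)]
    embeddings = onnx_model.get_image_embeddings(views)

    # Average the two views and re-normalise to unit length before
    # comparing against text embeddings
    averaged = embeddings.mean(axis=0, keepdims=True)
    averaged /= np.linalg.norm(averaged, axis=-1, keepdims=True)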

Getting Started:

  • Right now installation is as simple as the commands below in a virtual environment from the root of this project - see the notebook referenced below for a live demo:

    git clone https://github.com/rhysdg/sam-at-a-clip.git
    pip install -r requirements.txt
    
  • SigLIP is available and recommended by default, given the innovation made with its loss function leading to better inference quality. Model types, however, can be changed at instantiation with:

    onnx_model = OnnxLip(batch_size=16, type='siglip_full')
  • Notice also that cosine similarity at get_similarity_scores is adjusted to handle multiple contexts - in other words, a handful of text embeddings can be sent as 'contexts' to the function and evaluated against a single image or a batch of images (see the sketch at the end of this list).

  • Hidden states are also available at onnx_model.hidden_image and onnx_model.hidden_text when using type=siglip for extraction only - allowing for analysis, attention plotting and multi-point processing as input to SAM. Watch this space for more on this.

  • Note also that an OnnxSAM class is available with the same instantiation and automatic model download - further examples are on their way along with SigLIP integration.
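
  • As a rough sketch of the multi-context behaviour described above - assuming here that get_similarity_scores is exposed as a method on OnnxLip and takes the image embeddings plus a dict of per-context text embeddings, mirroring get_probabilities further below - usage might look like the following:

    from PIL import Image
    from clip.model import OnnxLip

    onnx_model = OnnxLip(batch_size=16, type='clip')

    images = [Image.open("images/dog.jpg").convert("RGB")]
    image_embeddings = onnx_model.get_image_embeddings(images)

    # Each key is a separate 'context' evaluated independently against the image(s)
    contexts = {
        "classification": onnx_model.get_text_embeddings(["a photo of a dog",
                                                          "a photo of a cat"]),
        "situational": onnx_model.get_text_embeddings(["a dog running",
                                                       "a dog laying on grass"]),
    }

    # Assumed signature - returns cosine similarities per context, per image
    scores = onnx_model.get_similarity_scores(image_embeddings, contexts)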

Example usage:

CLIP/SigLIP

  • For the full 384 SigLIP model go ahead and use the .inference method as follows, noting that CLIP is available via the same method. Either model will switch between softmax and sigmoid accordingly (see the short sketch after these examples):

    from PIL import Image
    from sam.model import OnnxSAM
    from clip.model import OnnxLip, softmax, get_probabilities
    
    images = [Image.open("images/dog.jpg").convert("RGB")]
    
    texts = {"classification":  ["a photo of space",
                                "a photo of a dog",
                                "a photo of a dog with flowers laying on grass",
                                "a photo of a brown and white dog with blue flowers laying on grass",
                                "a photo of a brown and white dog with yellow flowers laying on grass"],
        }
    
    # type='clip' is also available with this usage
    onnx_model = OnnxLip(batch_size=16, type='siglip_full')
    probs, _ = onnx_model.inference(images, texts)
    
    for k,v in texts.items():
        print(f'\ncontext: {k}\n')
        for text, p in zip(texts[k], probs[k]):
            print(f"Probability that the image is '{text}': {p:.3f}")
  • For cosine-similarity based models, manual extraction as a precursor can be used as follows (noting that the SigLIP text and image encoders are available despite a different loss):

    from PIL import Image
    from sam.model import OnnxSAM
    from clip.model import OnnxLip, softmax, get_probabilities
    
    
    images = [Image.open("images/dog.jpg").convert("RGB")]
    
    texts = {"classification": ["a photo of a man", "a photo of a woman", "s photo of a dog"],
            "situational": ["a dog standing up", "a dog running", "a dog laying on grass"],
        }
    
    
    onnx_model = OnnxLip(batch_size=16, type='clip')
    
    image_embeddings = onnx_model.get_image_embeddings(images)
    text_embeddings_class = onnx_model.get_text_embeddings(texts['classification'])
    text_embeddings_situational = onnx_model.get_text_embeddings(texts['situational'])
    
    
    contexts = {"classification": text_embeddings_class,
                "situational": text_embeddings_situational,
              }
    
    probs, logits = get_probabilities(image_embeddings, contexts)
    
    for k,v in contexts.items():
        print(f'\ncontext: {k}\n')
        for text, p in zip(texts[k], probs[k]):
            print(f"Probability that the image is '{text}': {p:.3f}")

Grounding DINO

  • For zero-shot object detection go ahead and build from the following example:

    import os
    import time
    import logging
    import torch
    import numpy as np
    from gdino.model import OnnxGDINO
    from utils.gdino_utils import load_image, viz
    
    logging.basicConfig(level=logging.INFO)
    
    output_dir = 'output'
    
    # modest speedup with TensorRT 10.0.1.6-1 and fp16 on Ampere hardware currently
    # torch with amp autocast and matmul precision at 'high' is still faster currently
    ogd = OnnxGDINO(type='gdino_fp32', trt=True)
    
    payload = ogd.preprocess_query("spaceman. spacecraft. water. clouds. space helmet. glove")
    img, img_transformed = load_image('images/wave_planet.webp')
    
    
    filtered_boxes, predicted_phrases = ogd.inference(img_transformed.astype(np.float32), 
                                                      payload,
                                                      text_threshold=0.25, 
                                                      box_threshold=0.35,)
    
    size = img.size
    pred_dict = {
        "boxes": filtered_boxes,
        "size": [size[1], size[0]],
        "labels": predicted_phrases,
    }
    
    predictions = viz(img, 
                      pred_dict,
                      label_size=25,
                      bbox_thickness=6
                      )[0]
    
    predictions.save(os.path.join(output_dir, "pred.jpg"))
      

Customisation:

  • Coming soon

Notebooks

  1. SigLIP Onnx - a simple, on-rails overview of the usage above - noting that the Huggingface AutoTokenizer is pretty verbose and cost-heavy; work is underway to bring in a numpy-only solution shortly

Tools and Scripts

  1. Coming soon

Testing

  • CI/CD will be expanded as we go - all general instantiation tests pass so far.
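
  • As a rough sketch of what a general instantiation test can look like (the actual test layout and names in this repository may differ - everything below is illustrative):

    import pytest
    from clip.model import OnnxLip
    from gdino.model import OnnxGDINO

    # Hypothetical smoke tests - just check each wrapper builds its ONNX session
    @pytest.mark.parametrize("model_type", ["clip", "siglip_full"])
    def test_onnxlip_instantiation(model_type):
        assert OnnxLip(batch_size=1, type=model_type) is not None

    def test_gdino_instantiation():
        assert OnnxGDINO(type="gdino_fp32") is not None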

Models & Latency benchmarks

**All downloadable models are in .onnx format - note that these are automatically downloaded by OnnxClip and OnnxSAM too

| model | CLIP Score | Deployment | speed (ms) | TensorRT FP16 status | ms (FP16) | FPS (quantized) |
|---|---|---|---|---|---|---|
| SigLIP 384 FP16 - text | pending | RTX 3080 <br> AGX Xavier | pending <br> pending | pending <br> pass | pending <br> pending | pending <br> pending |
| SigLIP 384 FP16 - image | pending | RTX 3080 <br> AGX Xavier | pending <br> pending | pending <br> pass | pending <br> pending | pending <br> pending |
| CLIP vitb32 - text | pending | RTX 3080 <br> AGX Xavier | pending <br> pending | pending <br> pass | pending <br> pending | pending <br> pending |
| CLIP vitb32 - image | pending | RTX 3080 <br> AGX Xavier | pending <br> pending | pending <br> pass | pending <br> pending | pending <br> pending |
| CLIP Surgery vitb32 | pending | RTX 3080 <br> AGX Xavier | pending <br> pending | pending <br> pass | pending <br> pending | pending <br> pending |
| CLIP Surgery vitb32 | pending | RTX 3080 <br> AGX Xavier | pending <br> pending | pending <br> pass | pending <br> pending | pending <br> pending |

| model | Score | Deployment | speed (ms) | TensorRT FP16 status | ms (FP16) | FPS (quantized) |
|---|---|---|---|---|---|---|
| SAM ViT-L ONNX - encoder | pending | RTX 3080 <br> AGX Xavier | pending <br> pending | pending <br> pass | pending <br> pending | pending <br> pending |
| SAM ViT-L ONNX - decoder | pending | RTX 3080 <br> AGX Xavier | pending <br> pending | pending <br> pass | pending <br> pending | pending <br> pending |

Similar projects

  • Pending

Latest Updates

  • Added a Gradio example app. Ignore the percentages for now, or rather think of each as the pairwise confidence for an image and a particular prompt - the results of sigmoid outputs don't sum to 1 in concert. The results, however, are pretty impressive! Simply run python3 app.py from your root and head to http://127.0.0.1:7860/ (a stripped-down sketch of the wiring follows at the end of this list).

    *Gradio example app screenshot*

  • For manual conversion of SigLIP variants, see the following issue
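
  • For reference, a stripped-down sketch of what such a Gradio wiring can look like - the bundled app.py is the source of truth here, and the classify function and interface below are purely illustrative:

    import gradio as gr
    from PIL import Image
    from clip.model import OnnxLip

    onnx_model = OnnxLip(batch_size=16, type='siglip_full')

    def classify(image: Image.Image, prompts: str):
        # Comma-separated prompts -> pairwise sigmoid confidence per prompt
        texts = {"classification": [p.strip() for p in prompts.split(",") if p.strip()]}
        probs, _ = onnx_model.inference([image.convert("RGB")], texts)
        return {text: float(p) for text, p in zip(texts["classification"],
                                                  probs["classification"])}

    demo = gr.Interface(fn=classify,
                        inputs=[gr.Image(type="pil"), gr.Textbox(label="prompts")],
                        outputs=gr.Label())
    demo.launch()  # serves on http://127.0.0.1:7860/ by default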

Future updates

  • Example Gradio app - done
  • Deprecating Huggingface dependency - standalone siglip tokenization for lightweight deployments - done
  • Grounding DINO ONNX - possibly a better solution than SAM here for localisation - prompts are built in too
  • Python packaging - scheduled
  • TensorRT - pending
  • CUDA accelerated SigLIP based vector search with chromadb - pending
  • ollama support - pending

Contact
