PhotoPrism: Computer Vision Models

This repository provides web services with advanced computer vision models that can be used with PhotoPrism and other applications.

Models

The currently integrated models, each with its own endpoint, are kosmos-2, vit-gpt2-image-captioning, and blip-image-captioning-large:

Kosmos-2

Kosmos-2 is the most accurate model of the three. It was developed by Microsoft, and this application uses the transformers implementation of the original model, as described on its Hugging Face page. The model was released in June 2023 and offers object detection and spatial reasoning. Kosmos-2 produces very accurate image captions (a .04-.1 increase in CLIP score compared to the other two models offered) and is the default model.

VIT-GPT2

This model was released by nlpconnect. It combines ViT and GPT-2 to create a multi-modal image captioning model. I have found it to be the weakest performer of the three, but your mileage may vary.

BLIP

This model was released by Salesforce in 2022. Its primary purpose was to improve both image understanding and text generation using novel techniques. It achieved a +2.8% improvement in CIDEr on image captioning, and I've found it to perform better than VIT-GPT2, with Kosmos-2 slightly better still (a .4 increase in CLIP score).

Dependencies

Flask

Flask is the framework that is used for the API. It allows for API creation with Python, which is key for this application as it utilizes ML.

PyTorch

PyTorch is key for working with the ML models to generate the outputs. It also enables GPU processing, speeding up the image processing with the models. PyTorch primarily creates and handles tensors, which are crucial for the function of the models.

Transformers

Transformers is used to download and load the models. It also provides the processors and tokenizers that prepare images for the models and decode their output.

Pillow

Pillow is used to open the image fetched from the supplied URL and convert it into the format the models expect as input.

Hardware Acceleration Libraries

NumPy could be used for further acceleration of array operations. It isn't included in the application by default to save space and avoid installing unnecessary dependencies. NumPy itself runs on the CPU; GPU computation requires a NumPy-compatible library such as CuPy. PyTorch already enables GPU processing, so NumPy may not make a significant difference.

Build Setup

Before installing the Python dependencies, please make sure that you have Git and Python 3.12+ (incl. pip) installed on your system, e.g. by running the following command on Ubuntu/Debian Linux:

sudo apt-get install -y git python3 python3-pip python3-venv python3-wheel

You can then install the required libraries in a virtual environment by either using the Makefiles we provide (i.e. run make in the main project directory or a subdirectory) or by manually running the following commands in a service directory, for example:

git clone git@github.com:photoprism/photoprism-vision.git
cd photoprism-vision/describe
python3 -m venv ./venv
. ./venv/bin/activate
./venv/bin/pip install --disable-pip-version-check --upgrade pip
./venv/bin/pip install --disable-pip-version-check -r requirements.txt

Usage

Run the Python file app.py in the describe subdirectory to start the describe service after you have installed the dependencies (more services, e.g. for OCR and tag generation, may follow):

./venv/bin/python app.py

The service then listens on port 5000 by default and its API endpoints for generating captions support both GET and POST requests. It can be tested with the curl command (curl.exe on Windows) as shown in the example below:

curl -v -H "Content-Type: application/json" \
  --data '{"url":"https://dl.photoprism.app/img/team/avatar.jpg"}' \
  -X POST http://localhost:5000/api/v1/vision/describe

At a minimum, a valid image URL must be specified. In addition, a model name and an arbitrary id can be passed. The API returns the same id in the response. If no id is passed, a randomly generated UUID is returned instead.

If your client submits POST requests, the request body must be JSON-encoded, e.g.:

{
    "id": "3487da77-246e-4b4c-9437-67507177bcd7",
    "url": "https://dl.photoprism.app/img/team/avatar.jpg"
}
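The same POST request can also be sent from Python. The following is a minimal sketch using only the standard library; the names `API_URL`, `build_describe_request`, and `describe` are hypothetical helpers introduced for illustration, and the address assumes the service runs locally on its default port:

```python
import json
import urllib.request

# Assumed address of a locally running describe service (default port 5000).
API_URL = "http://localhost:5000/api/v1/vision/describe"


def build_describe_request(image_url, request_id=None, model=None):
    """Build (but do not send) a JSON POST request for the describe endpoint."""
    payload = {"url": image_url}
    if request_id is not None:
        payload["id"] = request_id
    if model is not None:
        payload["model"] = model
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def describe(image_url, **kwargs):
    """Send the request and return the decoded JSON body (needs a running service)."""
    with urllib.request.urlopen(build_describe_request(image_url, **kwargs)) as resp:
        return json.load(resp)


# Building the request does not touch the network, so it can be inspected offline.
request = build_describe_request(
    "https://dl.photoprism.app/img/team/avatar.jpg",
    request_id="3487da77-246e-4b4c-9437-67507177bcd7",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `describe(...)` only works while the service is running; building the request separately makes it easy to check the payload first.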

Alternatively, you can perform GET requests with URL-encoded query parameters, which is easier to test without an HTTP client:

http://localhost:5000/api/v1/vision/describe?url=https%3A%2F%2Fdl.photoprism.app%2Fimg%2Fteam%2Favatar.jpg&id=3487da77-246e-4b4c-9437-67507177bcd7
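If you prefer not to encode the query string by hand, it can be generated with Python's standard library. This is a small sketch; `BASE_URL` assumes the default local address and `get_url` is just an illustrative variable name:

```python
from urllib.parse import urlencode

# Assumed address of a locally running describe service.
BASE_URL = "http://localhost:5000/api/v1/vision/describe"

params = {
    "url": "https://dl.photoprism.app/img/team/avatar.jpg",
    "id": "3487da77-246e-4b4c-9437-67507177bcd7",
}

# urlencode percent-encodes reserved characters such as ":" and "/".
get_url = f"{BASE_URL}?{urlencode(params)}"
print(get_url)
```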

API Endpoints

`/api/v1/vision/describe`

This is the default endpoint of the API. An image URL must be passed in with the key "url", and optionally a "model" and/or "id" value can be passed in. The "model" key lets the user specify which of the three models to use. If no model is given, the application defaults to the kosmos-2 model.

`/api/v1/vision/describe/kosmos-2/patch14-224`

This is the endpoint for the Kosmos-2 model. An image URL must be passed in with the key "url", and optionally an "id" value can be passed in.

`/api/v1/vision/describe/vit-gpt2-image-captioning`

This is the endpoint for the VIT GPT-2 model. An image URL must be passed in with the key "url", and optionally an "id" value can be passed in.

`/api/v1/vision/describe/blip-image-captioning-large`

This is the endpoint for the BLIP model. An image URL must be passed in with the key "url", and optionally an "id" value can be passed in.
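A client that wants to address a model directly could keep the paths listed above in a small lookup table. The sketch below is illustrative only: `BASE_URL` assumes the default local address, and `MODEL_ENDPOINTS` and `endpoint_for` are hypothetical names that are not part of the service:

```python
BASE_URL = "http://localhost:5000"  # assumed default host and port

DEFAULT_ENDPOINT = "/api/v1/vision/describe"

# Model name (as accepted by the "model" key) mapped to its dedicated path.
MODEL_ENDPOINTS = {
    "kosmos-2": "/api/v1/vision/describe/kosmos-2/patch14-224",
    "vit-gpt2-image-captioning": "/api/v1/vision/describe/vit-gpt2-image-captioning",
    "blip-image-captioning-large": "/api/v1/vision/describe/blip-image-captioning-large",
}


def endpoint_for(model=None):
    """Return the full URL for a model, or the default endpoint if none is given."""
    if model is None:
        return BASE_URL + DEFAULT_ENDPOINT
    try:
        return BASE_URL + MODEL_ENDPOINTS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None


print(endpoint_for())
print(endpoint_for("blip-image-captioning-large"))
```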

Example Request

POST /api/v1/vision/describe

{
    "id": "b0db2187-7a09-438c-8649-a9c6c0f7b8a1",
    "model": "kosmos-2",
    "url": "https://dl.photoprism.app/img/team/avatar.jpg"
}

Example Response

{
    "id": "b0db2187-7a09-438c-8649-a9c6c0f7b8a1",
    "model": {
        "name": "kosmos-2",
        "version": "patch14-224"
    },
    "result": {
        "caption": "An image of a man in a suit smiling."
    }
}
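On the client side, the caption and model information can be read from the decoded response. The following sketch parses the example response shown above; the variable names are illustrative:

```python
import json

# The example response from above, as it would arrive over HTTP.
response_text = """
{
    "id": "b0db2187-7a09-438c-8649-a9c6c0f7b8a1",
    "model": {
        "name": "kosmos-2",
        "version": "patch14-224"
    },
    "result": {
        "caption": "An image of a man in a suit smiling."
    }
}
"""

body = json.loads(response_text)
caption = body["result"]["caption"]
model_label = f'{body["model"]["name"]}/{body["model"]["version"]}'

print(f"[{model_label}] {caption}")
```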

Code Structure

Model Loading and Initialization

MODEL_DIR = "models"
KOSMOS_MODEL_PATH = os.path.join(MODEL_DIR, "kosmos-2-patch14-224")
VIT_MODEL_PATH = os.path.join(MODEL_DIR, "vit-gpt2-image-captioning")
BLIP_MODEL_PATH = os.path.join(MODEL_DIR, "blip-image-captioning-large")

This code block defines the local paths for the models, which are used when downloading and loading them. It uses os.path.join to assemble the correct path on both Windows and UNIX-like systems.
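Because each local directory name matches the last segment of the corresponding Hugging Face model ID, the paths could also be derived instead of being written out one by one. This is only a sketch of that idea; `HF_MODEL_IDS`, `local_model_path`, and `MODEL_PATHS` are hypothetical names that do not appear in the application:

```python
import os

MODEL_DIR = "models"

# Model IDs as they are passed to download_model in the application.
HF_MODEL_IDS = [
    "microsoft/kosmos-2-patch14-224",
    "nlpconnect/vit-gpt2-image-captioning",
    "Salesforce/blip-image-captioning-large",
]


def local_model_path(model_id):
    """Map 'owner/name' to MODEL_DIR/name using the platform's path separator."""
    return os.path.join(MODEL_DIR, model_id.split("/")[-1])


MODEL_PATHS = {model_id: local_model_path(model_id) for model_id in HF_MODEL_IDS}

for model_id, path in MODEL_PATHS.items():
    print(f"{model_id} -> {path}")
```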

Downloading Models

def download_model(model_name, save_path):
    if not os.path.exists(save_path):
        print(f"Downloading {model_name}...")
        if model_name == "microsoft/kosmos-2-patch14-224":
            AutoModelForVision2Seq.from_pretrained(model_name).save_pretrained(save_path)
            AutoProcessor.from_pretrained(model_name).save_pretrained(save_path)
        elif model_name == "nlpconnect/vit-gpt2-image-captioning":
            VisionEncoderDecoderModel.from_pretrained(model_name).save_pretrained(save_path)
            ViTImageProcessor.from_pretrained(model_name).save_pretrained(save_path)
            AutoTokenizer.from_pretrained(model_name).save_pretrained(save_path)
        elif model_name == "Salesforce/blip-image-captioning-large":
            BlipForConditionalGeneration.from_pretrained(model_name).save_pretrained(save_path)
            BlipProcessor.from_pretrained(model_name).save_pretrained(save_path)
        print(f"{model_name} downloaded and saved to {save_path}")
    else:
        print(f"{model_name} already exists at {save_path}")

This function checks whether a model already exists at the given path. If it doesn't, the model is downloaded and saved there; if it does, the download is skipped.
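The check-then-download pattern can be tried out without fetching any real model. In the sketch below, `ensure_downloaded` and `fake_fetch` are hypothetical stand-ins: `fake_fetch` only creates the target directory, where the real code would call `from_pretrained(...).save_pretrained(...)`:

```python
import os
import tempfile


def ensure_downloaded(model_name, save_path, fetch):
    """Call fetch(model_name, save_path) only if save_path does not exist yet.

    Returns True if a download was triggered and False if it was skipped.
    """
    if os.path.exists(save_path):
        print(f"{model_name} already exists at {save_path}")
        return False
    print(f"Downloading {model_name}...")
    fetch(model_name, save_path)
    return True


def fake_fetch(model_name, save_path):
    # Stand-in for the real download: just create the target directory.
    os.makedirs(save_path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example-model")
    first = ensure_downloaded("example/model", path, fake_fetch)   # triggers fetch
    second = ensure_downloaded("example/model", path, fake_fetch)  # skipped

print(first, second)
```

Note that the check only tests whether the path exists, so a partially written directory from an interrupted download would also be treated as present.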

os.makedirs(MODEL_DIR, exist_ok=True)
download_model("microsoft/kosmos-2-patch14-224", KOSMOS_MODEL_PATH)
download_model("nlpconnect/vit-gpt2-image-captioning", VIT_MODEL_PATH)
download_model("Salesforce/blip-image-captioning-large", BLIP_MODEL_PATH)

Here the models are downloaded by calling the function from the previous block, after making sure the model directory exists.

Loading Models

print("Loading models...")
kosmosModel = AutoModelForVision2Seq.from_pretrained(KOSMOS_MODEL_PATH)
kosmosProcessor = AutoProcessor.from_pretrained(KOSMOS_MODEL_PATH)

vitModel = VisionEncoderDecoderModel.from_pretrained(VIT_MODEL_PATH)
vitFeature_extractor = ViTImageProcessor.from_pretrained(VIT_MODEL_PATH)
vitTokenizer = AutoTokenizer.from_pretrained(VIT_MODEL_PATH)

blipProcessor = BlipProcessor.from_pretrained(BLIP_MODEL_PATH)
blipModel = BlipForConditionalGeneration.from_pretrained(BLIP_MODEL_PATH)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vitModel.to(device)

Here the models are loaded from the local paths they were saved to. Note that only the VIT-GPT2 model is moved to the GPU when one is available.

Services

def kosmosGenerateResponse(url):
    try:
        image = Image.open(requests.get(url, stream=True).raw)
    except Exception as e:
        return "fetchError", f"Unable to fetch image: {str(e)}"

    prompt = "<grounding>An image of"

    try:
        inputs = kosmosProcessor(text=prompt, images=image, return_tensors="pt")
        generated_ids = kosmosModel.generate(
            pixel_values=inputs["pixel_values"],
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            image_embeds=None,
            image_embeds_position_mask=inputs["image_embeds_position_mask"],
            use_cache=True,
            max_new_tokens=128,
        )

        generated_text = kosmosProcessor.batch_decode(generated_ids, skip_special_tokens=True)[0]
        processed_text, entities = kosmosProcessor.post_process_generation(generated_text)
    except Exception as e:
        return "processingError", f"Error during processing: {str(e)}"

    return "ok", processed_text

def vitGenerateResponse(url):
    vitModel.to(device)    

    max_length = 16
    num_beams = 4
    gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

    def predict_step(url):
        image = Image.open(requests.get(url, stream=True).raw)
        images = []

        if image.mode != "RGB":
            image = image.convert(mode="RGB")

        images.append(image)

        pixel_values = vitFeature_extractor(images=images, return_tensors="pt").pixel_values
        pixel_values = pixel_values.to(device)

        output_ids = vitModel.generate(pixel_values, **gen_kwargs)

        preds = vitTokenizer.batch_decode(output_ids, skip_special_tokens=True)
        preds = [pred.strip() for pred in preds]
        return preds

    # predict_step returns a list with one caption per image; take the only one
    processed_text = predict_step(url)[0]

    return "ok", processed_text

def blipGenerateResponse(url):
    img_url = url
    raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

    inputs = blipProcessor(raw_image, return_tensors="pt")

    out = blipModel.generate(**inputs)
    processed_text = blipProcessor.decode(out[0], skip_special_tokens=True)

    return "ok", processed_text

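These are the services that generate the captions, one function per model. Each returns a status string together with the caption or an error message. Only the Kosmos-2 service catches fetch and processing errors; the other two let exceptions propagate.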
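The status-tuple convention used by the Kosmos-2 service could be applied uniformly to all three models. The sketch below shows the idea with stand-in callables instead of real models; `generate_caption`, `fetch_image`, and `caption_image` are hypothetical names, not functions from the application:

```python
def generate_caption(url, fetch_image, caption_image):
    """Run fetch and captioning, returning a (status, text) tuple.

    status is "ok", "fetchError", or "processingError", matching the
    strings returned by kosmosGenerateResponse.
    """
    try:
        image = fetch_image(url)
    except Exception as e:
        return "fetchError", f"Unable to fetch image: {e}"

    try:
        text = caption_image(image)
    except Exception as e:
        return "processingError", f"Error during processing: {e}"

    return "ok", text


def failing_fetch(url):
    # Simulates a network failure while downloading the image.
    raise ConnectionError("host unreachable")


ok = generate_caption(
    "http://example.invalid/a.jpg",
    lambda url: object(),              # pretend image
    lambda image: "a man in a suit",   # pretend model output
)
bad = generate_caption("http://example.invalid/a.jpg", failing_fetch, lambda image: "unused")

print(ok)
print(bad)
```

With a wrapper like this, the request handlers could treat every model the same way when deciding which HTTP status to return.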

Request Handlers

Default Endpoint

@app.route('/api/v1/vision/describe', methods=['POST', 'GET'])
def generateResponse():
    if request.method == 'POST':
        if not request.is_json:
            return jsonify({"error": "Request must be JSON"}), 400
        data = request.get_json()
    elif request.method == 'GET':
        data = request.args

    url = data.get('url')
    model = data.get('model')
    id = data.get('id')

    if not url:
        return jsonify({"error": "URL is required"}), 400
    
    if model == "kosmos-2" or not model:
        status, result = kosmosGenerateResponse(url)
        if status == "fetchError":
            return jsonify({"error": result}), 500
        elif status == "processingError":
            return jsonify({"error": result}), 500
        elif status == "ok":
            if id:
                return jsonify({"id": id, "result": {"caption": result}, "model": {"name": "kosmos-2", "version": "patch14-224"}}), 200
            return jsonify({"id": uuid.uuid4(), "result": {"caption": result}, "model": {"name": "kosmos-2", "version": "patch14-224"}}), 200
    elif model == "vit-gpt2-image-captioning":
        status, result = vitGenerateResponse(url)
        if status == "ok":
            if id:
                return jsonify({"id": id, "result": {"caption": result}, "model": {"name": model, "version": "latest"}}), 200
            return jsonify({"id": uuid.uuid4(), "result": {"caption": result}, "model": {"name": model, "version": "latest"}}), 200
        return jsonify({"error": "Error during processing"}), 500
    elif model == "blip-image-captioning-large":
        status, result = blipGenerateResponse(url)
        if status =='ok':
            if id:
                return jsonify({"id": id, "result": {"caption": result}, "model": {"name": model, "version": "latest"}}), 200
            return jsonify({"id": uuid.uuid4(), "result": {"caption": result}, "model": {"name": model, "version": "latest"}}), 200
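        return jsonify({"error": "Error during processing"}), 500

    # Reject model names that match none of the branches above
    return jsonify({"error": f"Unsupported model: {model}"}), 400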

This is the default endpoint. It checks whether a model is specified and, if so, calls the service associated with that model and returns the response with the data. If no model is specified, it uses kosmos-2.
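The per-model branches repeat the same response structure, so the selection logic can also be expressed as a lookup table. The following framework-independent sketch uses stand-in services; `make_handlers` and `describe` are hypothetical names, and answering an unknown model name with a 400 status is an assumption rather than documented behavior:

```python
import uuid


def make_handlers(kosmos, vit, blip):
    """Map each model name to its (service function, version label) pair."""
    return {
        "kosmos-2": (kosmos, "patch14-224"),
        "vit-gpt2-image-captioning": (vit, "latest"),
        "blip-image-captioning-large": (blip, "latest"),
    }


def describe(data, handlers, default_model="kosmos-2"):
    """Return (body, http_status) for already-parsed request parameters."""
    url = data.get("url")
    if not url:
        return {"error": "URL is required"}, 400

    name = data.get("model") or default_model
    if name not in handlers:
        return {"error": f"Unsupported model: {name}"}, 400  # assumed behavior

    service, version = handlers[name]
    status, result = service(url)
    if status != "ok":
        return {"error": result}, 500

    body = {
        "id": data.get("id") or str(uuid.uuid4()),  # echo the id or generate one
        "result": {"caption": result},
        "model": {"name": name, "version": version},
    }
    return body, 200


# Stand-in services that return fixed captions instead of running a model.
handlers = make_handlers(
    kosmos=lambda url: ("ok", "An image of a man in a suit smiling."),
    vit=lambda url: ("ok", "a man in a suit"),
    blip=lambda url: ("processingError", "Error during processing"),
)

print(describe({"url": "https://example.invalid/a.jpg", "id": "abc"}, handlers))
```

In a Flask handler, the returned pair could be passed on as `jsonify(body), status`.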

Specific Endpoints

@app.route('/api/v1/vision/describe/kosmos-2/patch14-224', methods=['POST', 'GET'])
def kosmosController():
    if request.method == 'POST':
        if not request.is_json:
            return jsonify({"error": "Request must be JSON"}), 400
        data = request.get_json()
    elif request.method == 'GET':
        data = request.args

    url = data.get('url')
    id = data.get('id')

    if not url:
        return jsonify({"error": "URL is required"}), 400
    
    status, result = kosmosGenerateResponse(url)

    if status == "fetchError":
        return jsonify({"error": result}), 500
    elif status == "processingError":
        return jsonify({"error": result}), 500
    elif status == "ok":
        if id:
            return jsonify({"id": id, "result": {"caption": result}, "model": {"name": "kosmos-2", "version": "patch14-224"}}), 200
        return jsonify({"id": uuid.uuid4(), "result": {"caption": result}, "model": {"name": "kosmos-2", "version": "patch14-224"}}), 200

    


@app.route('/api/v1/vision/describe/vit-gpt2-image-captioning', methods=['POST', 'GET'])
def vitController():
    if request.method == 'POST':
        if not request.is_json:
            return jsonify({"error": "Request must be JSON"}), 400
        data = request.get_json()
    elif request.method == 'GET':
        data = request.args
    
    url = data.get('url')
    id = data.get('id')

    if not url:
        return jsonify({"error": "URL is required"}), 400
    
    status, result = vitGenerateResponse(url)

    if status == "ok":
        if id:
            return jsonify({"id": id, "result": {"caption": result}, "model": {"name": "vit-gpt2-image-captioning", "version": "latest"}}), 200
        return jsonify({"id": uuid.uuid4(), "result": {"caption": result}, "model": {"name": "vit-gpt2-image-captioning", "version": "latest"}}), 200
    
    return jsonify({"error": "Error during processing"}), 500



@app.route('/api/v1/vision/describe/blip-image-captioning-large', methods=['POST', 'GET'])
def blipController():
    if request.method == 'POST':
        if not request.is_json:
            return jsonify({"error": "Request must be JSON"}), 400
        data = request.get_json()
    elif request.method == 'GET':
        data = request.args

    url = data.get('url')
    id = data.get('id')

    if not url:
        return jsonify({"error": "URL is required"}), 400
    
    status, result = blipGenerateResponse(url)

    if status == "ok":
        if id:
            return jsonify({"id": id, "result": {"caption": result}, "model": {"name": "blip-image-captioning-large", "version": "latest"}}), 200
        return jsonify({"id": uuid.uuid4(), "result": {"caption": result}, "model": {"name": "blip-image-captioning-large", "version": "latest"}}), 200
    
    return jsonify({"error": "Error during processing"}), 500

These are the endpoints for each model. They validate the request, run the corresponding service, and return the response.
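All handlers start with the same steps: read the parameters from the JSON body or the query string, then require a URL. That shared part could be factored into one helper. The sketch below is framework-independent so it can run on its own; `read_params` is a hypothetical function, and its arguments stand in for the corresponding attributes of Flask's request object:

```python
def read_params(method, is_json, json_body, query_args):
    """Return (params, error); exactly one of the two is None.

    error is a (body, http_status) pair mirroring the handlers' 400 responses.
    """
    if method == "POST":
        if not is_json:
            return None, ({"error": "Request must be JSON"}, 400)
        data = json_body or {}
    else:
        data = query_args

    if not data.get("url"):
        return None, ({"error": "URL is required"}, 400)

    return {"url": data["url"], "id": data.get("id")}, None


# GET request with query parameters
print(read_params("GET", False, None, {"url": "https://example.invalid/a.jpg"}))
# POST request that is not JSON
print(read_params("POST", False, None, {}))
```

Inside a Flask view, this might be called as `read_params(request.method, request.is_json, request.get_json(silent=True), request.args)`.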

Contributors

We would like to thank everyone involved, especially Aatif Dawawala, who got things rolling and contributed much of the initial code.

Learn more ›

Submitting Pull Requests

Follow our step-by-step guide to learn how to submit new features, bug fixes, and documentation enhancements.

Learn more ›

License and Disclaimer

The files in this repository are licensed under the Apache License, Version 2.0 (the “License”).

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Learn more ›
