Image Captioning with Vision Transformer and LLMs

This repository implements an image captioning model that pairs a Vision Transformer (ViT) with GPT-J to generate descriptive captions for images. The model is built with the Hugging Face Transformers library and trained on the COCO dataset.

Project Overview

The project explores how combining a strong pretrained vision encoder with a large language model decoder can produce accurate, contextually relevant image descriptions. The Hugging Face VisionEncoderDecoder framework fuses the ViT model as the encoder and GPT-J as the decoder.
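
As a rough illustration of that fusion, the sketch below uses the Transformers VisionEncoderDecoder API. It is not this repository's training code: the checkpoint names are assumptions, and GPT-2 stands in for GPT-J here because stock Transformers is known to provide the decoder cross-attention layers for GPT-2, while GPT-J support for cross-attention can vary by library version.

```python
# Minimal sketch of fusing a pretrained ViT encoder with a causal-LM
# decoder via VisionEncoderDecoderModel. Checkpoint names are assumptions,
# not taken from this repository; "gpt2" stands in for GPT-J because its
# Transformers implementation is known to support the cross-attention
# layers the framework adds to the decoder.
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

encoder_ckpt = "google/vit-base-patch16-224-in21k"  # assumed ViT encoder
decoder_ckpt = "gpt2"  # stand-in; the repository pairs ViT with GPT-J

# Load both pretrained halves and wire decoder cross-attention to the
# encoder's patch embeddings.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_ckpt, decoder_ckpt
)

image_processor = ViTImageProcessor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)

# GPT-style decoders ship without a pad token; reuse EOS so caption
# targets can be padded in batches during training.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```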

Getting Started

Prerequisites

  • Python 3.8 or above
  • PyTorch 1.8 or above
  • Transformers 4.0 or above
  • Datasets (Hugging Face)
  • Pillow (PIL)
  • Pandas
  • NumPy
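
With the prerequisites installed (for example, `pip install torch transformers datasets pillow pandas numpy`), caption generation follows the standard VisionEncoderDecoder pattern. A minimal sketch, not this repository's model: the checkpoint below is a public ViT+GPT-2 captioning model used as a stand-in, and the image path is a placeholder.

```python
# End-to-end caption generation sketch. The checkpoint is a public
# ViT+GPT-2 captioning model used purely as a stand-in for this
# repository's ViT+GPT-J model; "example.jpg" is a placeholder path.
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

ckpt = "nlpconnect/vit-gpt2-image-captioning"  # assumed public checkpoint
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
image_processor = ViTImageProcessor.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

image = Image.open("example.jpg").convert("RGB")  # placeholder image
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Beam search usually yields more fluent captions than greedy decoding.
output_ids = model.generate(pixel_values, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```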
