With the growing volume of multimedia content in everyday life, advances in automatic media understanding have become essential. Captioning is a natural language processing task that can transform this space: web browsing, for instance, relies heavily on the tags and titles attached to media, and image and video captioning would allow such annotations to be generated automatically. This would not only enable efficient search algorithms that index media by content but would also help in designing better recommendation systems for users. The goal of this project is to build image captioning and video captioning models. In the first part of the project, an image captioning model is designed using deep learning and an encoder-decoder architecture; in the second part, the model is extended to video captioning. The Flickr8k and TRECVID-VTT datasets are used for the two tasks. BLEU scores are calculated to evaluate the models, and greedy and beam search algorithms are used for real-time testing.
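As a concrete illustration of the encoder-decoder design, the sketch below builds a merge-style captioning model in Keras: a precomputed CNN image feature vector (e.g., 2048-d InceptionV3 features) is combined with an LSTM encoding of the partial caption to predict the next word. The vocabulary size, caption length, and layer widths are illustrative assumptions, not values taken from this repository.

```python
# Minimal sketch of an encoder-decoder captioning model. Assumes
# precomputed 2048-d CNN image features (e.g., InceptionV3) and a
# word-level LSTM decoder; sizes below are illustrative placeholders.
from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                     LSTM, add)
from tensorflow.keras.models import Model

vocab_size = 8000   # hypothetical vocabulary size
max_length = 34     # hypothetical maximum caption length

# Encoder branch: project the image feature vector into the decoder space.
img_in = Input(shape=(2048,))
img_feat = Dense(256, activation='relu')(Dropout(0.5)(img_in))

# Decoder branch: embed the partial caption and run it through an LSTM.
seq_in = Input(shape=(max_length,))
seq_emb = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
seq_feat = LSTM(256)(Dropout(0.5)(seq_emb))

# Merge both branches and predict the next word of the caption.
merged = Dense(256, activation='relu')(add([img_feat, seq_feat]))
out = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[img_in, seq_in], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```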
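For inference, greedy search takes the single most probable word at each step, while beam search keeps the top-k partial captions and expands each. The function below is a hedged sketch of beam search over a model like the one above; `model`, `tokenizer`, `max_length`, and the `startseq`/`endseq` markers are assumed names from a typical Keras captioning pipeline, not this project's exact API. Setting `beam_width=1` reduces it to greedy search.

```python
# Hedged sketch of beam-search caption decoding for a trained model.
# `model` and `tokenizer` are placeholders for objects produced by a
# training pipeline such as the one sketched above.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search_caption(model, tokenizer, image_feat, max_length, beam_width=3):
    """Expand the `beam_width` most probable partial captions at each step."""
    start = [tokenizer.word_index['startseq']]
    beams = [(start, 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if seq[-1] == tokenizer.word_index.get('endseq'):
                candidates.append((seq, score))  # finished caption, keep as-is
                continue
            padded = pad_sequences([seq], maxlen=max_length)
            probs = model.predict([image_feat, padded], verbose=0)[0]
            # Extend this beam with its `beam_width` most probable next words.
            for idx in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(idx)], score + np.log(probs[idx])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    words = [tokenizer.index_word[i] for i in beams[0][0]]
    return ' '.join(w for w in words if w not in ('startseq', 'endseq'))
```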
Clone the repository:

```
git clone https://github.com/Sapphirine/video_caption_generation.git
```
A video explaining the project can be found here.
| Video | Predicted Caption |
|---|---|
| *(clip not shown)* | a boy sitting on a car seat of a car |
| *(clip not shown)* | a man is talking to a crowd |
| *(clip not shown)* | a boy in a bathtub |
| *(clip not shown)* | a cat in a room |
| *(clip not shown)* | a video of a video of a person |
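Predictions like those in the table are scored against the human reference captions with BLEU. The snippet below is a minimal sketch using NLTK's `corpus_bleu`; the reference and hypothesis lists are toy examples for illustration, not results from this project.

```python
# Hedged sketch of BLEU-1 through BLEU-4 evaluation with NLTK, assuming
# several tokenized reference captions per sample and one model prediction.
# The smoothing function guards against zero n-gram overlaps on short captions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[['a', 'boy', 'in', 'a', 'bathtub'],
               ['a', 'child', 'sitting', 'in', 'a', 'bathtub']]]  # per sample
hypotheses = [['a', 'boy', 'in', 'a', 'bathtub']]                 # predictions

smooth = SmoothingFunction().method1
for n, weights in enumerate([(1, 0, 0, 0), (0.5, 0.5, 0, 0),
                             (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)], 1):
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f'BLEU-{n}: {score:.3f}')
```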