Merge pull request #872 from Anushka-Pote/main
Image Caption Generation with Audio Outputs (Issue #869)
Showing 5 changed files with 277 additions and 0 deletions.
91 changes: 91 additions & 0 deletions
Deep_Learning/Image Caption Generation with Audio Output/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
# Image Caption Generator with TTS | ||
|
||
This project is a web application that allows users to upload images and generate captions using a pre-trained model. The generated captions can also be converted to speech using Google Text-to-Speech (gTTS), which can be played or downloaded directly from the webpage. | ||
|
||
## Features | ||
- Upload an image file and generate a caption using the `Salesforce/blip-image-captioning-base` model. | ||
- Convert the generated caption into audio using Google Text-to-Speech (gTTS). | ||
- Display the uploaded image along with the generated caption and an audio player for listening to it. | ||
|
||
|
||
## Project Structure | ||
|
||
``` | ||
project/ | ||
│ | ||
├── app.py # Main Flask app | ||
├── static/ # Static files (uploads and audio) | ||
│ ├── uploads/ # Folder for uploaded images | ||
│ └── audio/ # Folder for audio files generated by gTTS | ||
├── templates/ | ||
│ └── index.html # HTML file for rendering the webpage | ||
├── requirements.txt # Python dependencies | ||
└── README.md # Project documentation | ||
``` | ||
|
||
## Installation and Setup | ||
|
||
1. **Clone the repository:** | ||
|
||
```bash | ||
git clone https://github.com/payal83/image-caption-generator.git | ||
cd image-caption-generator | ||
``` | ||
|
||
2. **Create a virtual environment:** | ||
|
||
```bash | ||
python3 -m venv venv | ||
source venv/bin/activate # On Windows: venv\Scripts\activate | ||
``` | ||
|
||
3. **Install dependencies:** | ||
|
||
```bash | ||
pip install -r requirements.txt | ||
``` | ||
|
||
4. **Run the Flask application:** | ||
|
||
```bash | ||
python app.py | ||
``` | ||
|
||
5. **Open your browser and navigate to:** | ||
|
||
``` | ||
http://127.0.0.1:5000/ | ||
``` | ||
|
||
## Dependencies | ||
|
||
This project relies on the following libraries: | ||
|
||
- **Flask**: Web framework used to create the application. | ||
- **Pillow**: For image processing. | ||
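- **transformers**: Hugging Face library used to load the `Salesforce/blip-image-captioning-base` captioning model. | ||
- **torch**: PyTorch, pinned in `requirements.txt` as the backend that runs the model. | ||
- **gTTS**: Google Text-to-Speech library for converting text into audio. | ||
- **Werkzeug**: Provides `secure_filename`, which sanitizes uploaded file names before they are saved. | ||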
|
||
To install the dependencies, use: | ||
|
||
```bash | ||
pip install -r requirements.txt | ||
``` | ||
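After installation, a quick importability check can catch a missing package before the first launch. The following is only a sketch: `missing_modules` and `REQUIRED_MODULES` are hypothetical names that do not appear in this commit, and the list uses import names, which differ from the pip names for two libraries (`PIL` for Pillow, `gtts` for gTTS).

```python
from importlib.util import find_spec

# Import names for the libraries listed above. Two differ from the names
# used with pip: Pillow is imported as PIL, and gTTS as gtts.
REQUIRED_MODULES = ["flask", "PIL", "transformers", "torch", "gtts", "werkzeug"]


def missing_modules(module_names):
    """Return the subset of module_names that cannot be located for import."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

The check only confirms that each module can be found; it does not verify that the installed versions match the pins in `requirements.txt`.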
|
||
## Usage | ||
|
||
1. **Upload an Image**: | ||
Upload an image file (e.g., `.jpg`, `.png`) of up to 16 MB through the web interface. | ||
|
||
2. **Generate Caption**: | ||
Once uploaded, the model will generate a caption based on the content of the image. | ||
|
||
3. **Play Caption as Audio**: | ||
The caption will also be converted to speech using Google Text-to-Speech (gTTS). An audio player will appear, allowing you to listen to the caption. | ||
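The three steps above can be sketched as one function with the model and the speech engine passed in as plain callables. This is an illustrative outline, not the code in `app.py`: `describe_image`, `captioner`, and `synthesizer` are hypothetical names, and in the real application the captioner would wrap the `transformers` pipeline while the synthesizer would wrap gTTS.

```python
from pathlib import Path


def describe_image(image_path, audio_dir, captioner, synthesizer):
    """Caption an image and, if a caption was produced, render it as audio.

    captioner(image_path) must return the caption text.
    synthesizer(text, audio_path) must write the spoken caption to audio_path.
    Returns (caption, audio_path); audio_path is None when the caption is empty.
    """
    caption = captioner(image_path).strip()
    if not caption:
        # Nothing to speak, so no audio file is produced.
        return caption, None

    # Mirror the upload's base name, swapping the extension for .mp3.
    audio_path = Path(audio_dir) / (Path(image_path).stem + ".mp3")
    synthesizer(caption, audio_path)
    return caption, audio_path


if __name__ == "__main__":
    # Stand-ins for the real model and speech engine.
    caption, audio = describe_image(
        Path("static/uploads/dog.jpg"),
        Path("static/audio"),
        captioner=lambda path: "a dog running on a beach",
        synthesizer=lambda text, out: print(f"would write {out}"),
    )
    print(caption, audio)
```

Keeping the two heavy dependencies behind callables makes the flow testable without downloading the model or calling the text-to-speech service.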
|
||
|
||
## License | ||
|
||
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. | ||
|
56 changes: 56 additions & 0 deletions
Deep_Learning/Image Caption Generation with Audio Output/app.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
from flask import Flask, render_template, request, url_for | ||
from werkzeug.utils import secure_filename | ||
import os | ||
from PIL import Image | ||
from transformers import pipeline | ||
from gtts import gTTS | ||
|
||
app = Flask(__name__) | ||
|
||
# Configure upload folder | ||
app.config['UPLOAD_FOLDER'] = 'static/uploads' | ||
app.config['AUDIO_FOLDER'] = 'static/audio' | ||
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024 # Limit to 16 MB | ||
|
||
# Create uploads and audio directories if they don't exist | ||
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True) | ||
os.makedirs(app.config['AUDIO_FOLDER'], exist_ok=True) | ||
|
||
# Initialize the image-to-text pipeline | ||
image_to_text = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base") | ||
|
||
@app.route('/', methods=['GET', 'POST']) | ||
def index(): | ||
    caption = '' | ||
    image_url = '' | ||
    audio_url = '' | ||
|
||
    if request.method == 'POST' and 'photo' in request.files: | ||
        # Process the uploaded photo | ||
        photo = request.files['photo'] | ||
        filename = secure_filename(photo.filename) | ||
        filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename) | ||
        photo.save(filepath) | ||
|
||
        # Convert the image to RGB and process | ||
        image = Image.open(filepath).convert('RGB') | ||
|
||
        # Generate caption | ||
        captions = image_to_text(image) | ||
        caption = captions[0]['generated_text'] | ||
|
||
        # Set image URL for display | ||
        image_url = url_for('static', filename=f'uploads/{filename}') | ||
|
||
        # Convert caption to audio using gtts | ||
        if caption: | ||
            tts = gTTS(text=caption, lang='en') | ||
            audio_filename = f"{filename.rsplit('.', 1)[0]}.mp3" # Same name but with .mp3 extension | ||
            audio_filepath = os.path.join(app.config['AUDIO_FOLDER'], audio_filename) | ||
            tts.save(audio_filepath) | ||
            audio_url = url_for('static', filename=f'audio/{audio_filename}') | ||
|
||
    return render_template('index.html', caption=caption, image_url=image_url, audio_url=audio_url) | ||
|
||
if __name__ == '__main__': | ||
    app.run(debug=True) |
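As committed, `app.py` saves each upload under its sanitized name and derives the audio name by swapping the extension for `.mp3`. Two uploads named `cat.jpg` and `cat.png` would therefore write the same `cat.mp3`, and a name that `secure_filename` reduces to an empty string would produce a path pointing at the upload folder itself. One possible way to avoid both cases is sketched below; `storage_names` and `ALLOWED_EXTENSIONS` are hypothetical and not part of this commit, and the function assumes its input has already been passed through `secure_filename`.

```python
import os
import uuid

# Hypothetical allow-list; the commit itself accepts whatever the browser sends.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def storage_names(sanitized_filename, token=None):
    """Return (image_name, audio_name) that will not collide across uploads.

    sanitized_filename is assumed to be the output of werkzeug's
    secure_filename. token is a unique prefix; a random one is generated
    when none is supplied.
    """
    stem, ext = os.path.splitext(sanitized_filename)
    if not stem or ext.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported or empty filename: {sanitized_filename!r}")

    token = token or uuid.uuid4().hex
    image_name = f"{token}_{stem}{ext.lower()}"
    audio_name = f"{token}_{stem}.mp3"
    return image_name, audio_name
```

In the route, the returned names would replace `filename` and `audio_filename`, and a `ValueError` could be turned into an error message on the page instead of an unhandled exception.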
7 changes: 7 additions & 0 deletions
Deep_Learning/Image Caption Generation with Audio Output/requirements.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
Flask==2.3.2 | ||
Pillow==10.0.0 | ||
transformers==4.31.0 | ||
torch==2.0.1 | ||
gTTS==2.3.2 | ||
Werkzeug==2.3.6 | ||
gunicorn |
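Six of the seven entries carry an exact `==` pin; `gunicorn` does not, so the version installed depends on when `pip install` runs. A small check such as the following could flag entries like that. It is a simplified sketch: `unpinned_requirements` is a hypothetical helper, and it treats any line without `==` as unpinned, so option lines such as `-r other.txt` would also be reported.

```python
def unpinned_requirements(requirements_text):
    """Return requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if line and "==" not in line:
            unpinned.append(line)
    return unpinned


# Contents of requirements.txt as added in this commit.
REQUIREMENTS = """\
Flask==2.3.2
Pillow==10.0.0
transformers==4.31.0
torch==2.0.1
gTTS==2.3.2
Werkzeug==2.3.6
gunicorn
"""

print(unpinned_requirements(REQUIREMENTS))  # prints ['gunicorn']
```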
1 change: 1 addition & 0 deletions
Deep_Learning/Image Caption Generation with Audio Output/static/ignore
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
|
122 changes: 122 additions & 0 deletions
Deep_Learning/Image Caption Generation with Audio Output/templates/index.html
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
<!DOCTYPE html> | ||
<html lang="en"> | ||
<head> | ||
<meta charset="UTF-8"> | ||
<meta name="viewport" content="width=device-width, initial-scale=1.0"> | ||
<title>Image Caption Generator</title> | ||
<style> | ||
* { | ||
box-sizing: border-box; | ||
margin: 0; | ||
padding: 0; | ||
} | ||
|
||
body { | ||
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; | ||
background-color: #f4f4f4; | ||
color: #333; | ||
display: flex; | ||
justify-content: center; | ||
align-items: center; | ||
min-height: 100vh; | ||
padding: 20px; | ||
} | ||
|
||
.container { | ||
background-color: white; | ||
box-shadow: 0px 4px 10px rgba(0, 0, 0, 0.1); | ||
border-radius: 10px; | ||
padding: 30px; | ||
width: 100%; | ||
max-width: 600px; | ||
} | ||
|
||
h1 { | ||
font-size: 2.5em; | ||
color: #333; | ||
text-align: center; | ||
margin-bottom: 20px; | ||
} | ||
|
||
form { | ||
display: flex; | ||
flex-direction: column; | ||
gap: 15px; | ||
} | ||
|
||
input[type="file"] { | ||
border: 1px solid #ddd; | ||
padding: 10px; | ||
border-radius: 5px; | ||
font-size: 1em; | ||
cursor: pointer; | ||
} | ||
|
||
button { | ||
background-color: #4CAF50; | ||
color: white; | ||
padding: 15px 20px; | ||
border: none; | ||
border-radius: 5px; | ||
font-size: 1.2em; | ||
cursor: pointer; | ||
transition: background-color 0.3s ease; | ||
} | ||
|
||
button:hover { | ||
background-color: #45a049; | ||
} | ||
|
||
h2 { | ||
font-size: 1.8em; | ||
margin-top: 30px; | ||
color: #333; | ||
} | ||
|
||
img { | ||
max-width: 100%; | ||
border-radius: 10px; | ||
margin-top: 15px; | ||
} | ||
|
||
.caption { | ||
font-size: 1.2em; | ||
margin-top: 10px; | ||
padding: 15px; | ||
background-color: #f9f9f9; | ||
border-left: 4px solid #4CAF50; | ||
border-radius: 5px; | ||
} | ||
|
||
audio { | ||
margin-top: 20px; | ||
width: 100%; | ||
} | ||
</style> | ||
</head> | ||
<body> | ||
<div class="container"> | ||
<h1>Image Caption Generator</h1> | ||
<form action="/" method="post" enctype="multipart/form-data"> | ||
<label for="photo">Upload an image:</label> | ||
<input type="file" id="photo" name="photo" accept="image/*" required> | ||
<button type="submit">Generate Caption</button> | ||
</form> | ||
|
||
{% if caption %} | ||
<h2>Generated Caption:</h2> | ||
<div class="caption">{{ caption }}</div> | ||
{% if image_url %} | ||
<img src="{{ image_url }}" alt="Uploaded Image"> | ||
{% endif %} | ||
{% if audio_url %} | ||
<h3>Audio:</h3> | ||
<audio controls> | ||
<source src="{{ audio_url }}" type="audio/mpeg"> | ||
Your browser does not support the audio element. | ||
</audio> | ||
{% endif %} | ||
{% endif %} | ||
</div> | ||
</body> | ||
</html> |