Merge pull request #872 from Anushka-Pote/main
Image Caption Generation with Audio Outputs (Issue #869)
UTSAVS26 authored Oct 30, 2024
2 parents 6158af3 + a678d2c commit f98cbbf
Showing 5 changed files with 277 additions and 0 deletions.
91 changes: 91 additions & 0 deletions Deep_Learning/Image Caption Generation with Audio Output/README.md
@@ -0,0 +1,91 @@
# Image Caption Generator with TTS

This project is a Flask web application that lets users upload an image and generate a caption for it with a pre-trained model. The generated caption is also converted to speech using Google Text-to-Speech (gTTS), and the resulting audio can be played or downloaded directly from the webpage.

## Features
- Uploads an image file and generates a caption using the `Salesforce/blip-image-captioning-base` model.
- Converts the generated caption into audio using Google Text-to-Speech (gTTS).
- Displays the uploaded image along with the generated caption and an audio player for listening to the caption.
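
The audio file for a caption reuses the uploaded image's base name with an `.mp3` extension. The sketch below mirrors that mapping in isolation; the helper name `audio_name_for` is hypothetical and is not part of `app.py`.

```python
def audio_name_for(image_filename: str) -> str:
    """Map an image filename to its audio filename (hypothetical helper).

    Drops everything after the last dot and appends ".mp3",
    the same rule app.py applies inline.
    """
    stem = image_filename.rsplit(".", 1)[0]
    return f"{stem}.mp3"


if __name__ == "__main__":
    for name in ("cat.jpg", "cat.png", "my.photo.jpeg"):
        print(name, "->", audio_name_for(name))
```

One consequence of this rule is that `cat.jpg` and `cat.png` both map to `cat.mp3`, so a later upload would overwrite the earlier audio file. Adding a unique suffix to the name would be one way to avoid that.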


## Project Structure

```
project/
├── app.py                # Main Flask app
├── static/               # Static files (uploads and audio)
│   ├── uploads/          # Folder for uploaded images
│   └── audio/            # Folder for audio files generated by gTTS
├── templates/
│   └── index.html        # HTML file for rendering the webpage
├── requirements.txt      # Python dependencies
└── README.md             # Project documentation
```
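
The `static/uploads` and `static/audio` folders do not need to be created by hand, because `app.py` creates them at startup with `os.makedirs(..., exist_ok=True)`. A minimal sketch of the same idea using `pathlib` follows; the function name `ensure_static_dirs` and the `root` parameter are assumptions made for illustration.

```python
from pathlib import Path


def ensure_static_dirs(root: Path) -> list:
    """Create static/uploads and static/audio under root if missing.

    Hypothetical helper: app.py does the equivalent inline with os.makedirs.
    """
    dirs = [root / "static" / "uploads", root / "static" / "audio"]
    for directory in dirs:
        # parents=True creates static/ as well; exist_ok=True makes reruns safe
        directory.mkdir(parents=True, exist_ok=True)
    return dirs


if __name__ == "__main__":
    for created in ensure_static_dirs(Path(".")):
        print("ready:", created)
```

Because `exist_ok=True` is set, calling the function repeatedly is harmless, which is why the app can run it on every start.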

## Installation and Setup

1. **Clone the repository:**

```bash
git clone https://github.com/payal83/image-caption-generator.git
cd image-caption-generator
```

2. **Create a virtual environment:**

```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

3. **Install dependencies:**

```bash
pip install -r requirements.txt
```

4. **Run the Flask application:**

```bash
python app.py
```

5. **Open your browser and navigate to:**

```
http://127.0.0.1:5000/
```

## Dependencies

This project relies on the following libraries:

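- **Flask**: Web framework used to create the application.
- **Pillow**: Opens uploaded images and converts them to RGB before captioning.
- **transformers**: Hugging Face library used to load the image captioning model.
- **torch**: PyTorch, pinned in `requirements.txt` as the backend the model runs on.
- **gTTS**: Google Text-to-Speech library for converting text into audio.
- **Werkzeug**: Provides `secure_filename`, which sanitizes the names of uploaded files.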

To install the dependencies, use:

```bash
pip install -r requirements.txt
```
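
If the app fails to start after installation, it can help to compare the installed package versions with the pinned ones. The sketch below is one possible check using only the standard library; `parse_pins` and `report_mismatches` are hypothetical helper names, and the `REQUIREMENTS` string copies the pins from the `requirements.txt` added in this commit.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from this commit's requirements.txt (gunicorn is unpinned there)
REQUIREMENTS = """\
Flask==2.3.2
Pillow==10.0.0
transformers==4.31.0
torch==2.0.1
gTTS==2.3.2
Werkzeug==2.3.6
gunicorn
"""


def parse_pins(lines):
    """Return {package: pinned version or None} from requirement lines."""
    pins = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, pinned = line.partition("==")
        pins[name.strip()] = pinned.strip() if sep else None
    return pins


def report_mismatches(pins):
    """Return {package: (wanted, installed)} for missing or mismatched packages."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed is None or (wanted is not None and installed != wanted):
            problems[name] = (wanted, installed)
    return problems


if __name__ == "__main__":
    for pkg, (wanted, installed) in report_mismatches(
        parse_pins(REQUIREMENTS.splitlines())
    ).items():
        print(f"{pkg}: wanted {wanted}, installed {installed}")
```

This only handles plain names and `==` pins, which is all the requirements file here uses; other specifier forms would need a real requirements parser.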

## Usage

1. **Upload an Image**:
Upload any image file (e.g., `.jpg`, `.png`) through the web interface.

2. **Generate Caption**:
Click **Generate Caption**. The model then generates a caption based on the content of the uploaded image.

3. **Play Caption as Audio**:
The caption will also be converted to speech using Google Text-to-Speech (gTTS). An audio player will appear, allowing you to listen to the caption.
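
The upload form restricts the file picker to images, but `app.py` in this commit does not itself check the file type on the server. As a hedged sketch of what such a check could look like, the snippet below tests the file extension before processing; the helper `is_allowed_image` and the `ALLOWED_EXTENSIONS` set are assumptions for illustration, not part of the project.

```python
# Assumed set of accepted extensions; adjust to the formats you want to support
ALLOWED_EXTENSIONS = {"jpg", "jpeg", "png"}


def is_allowed_image(filename: str) -> bool:
    """Return True if filename ends in an allowed image extension.

    Hypothetical helper: the comparison is case-insensitive and only
    looks at the text after the last dot.
    """
    if "." not in filename:
        return False
    extension = filename.rsplit(".", 1)[1].lower()
    return extension in ALLOWED_EXTENSIONS


if __name__ == "__main__":
    for name in ("cat.JPG", "notes.txt", "noext", "a.b.png"):
        print(name, "->", is_allowed_image(name))
```

An extension check is only a first filter. Opening the file with Pillow, as the app already does, is what actually confirms that the content is a readable image.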


## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

56 changes: 56 additions & 0 deletions Deep_Learning/Image Caption Generation with Audio Output/app.py
@@ -0,0 +1,56 @@
from flask import Flask, render_template, request, url_for
from werkzeug.utils import secure_filename
import os
from PIL import Image
from transformers import pipeline
from gtts import gTTS

app = Flask(__name__)

# Configure upload folder
app.config['UPLOAD_FOLDER'] = 'static/uploads'
app.config['AUDIO_FOLDER'] = 'static/audio'
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024 # Limit to 16 MB

# Create uploads and audio directories if they don't exist
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)
os.makedirs(app.config['AUDIO_FOLDER'], exist_ok=True)

# Initialize the image-to-text pipeline
image_to_text = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

@app.route('/', methods=['GET', 'POST'])
def index():
    caption = ''
    image_url = ''
    audio_url = ''

    if request.method == 'POST' and 'photo' in request.files:
        # Process the uploaded photo
        photo = request.files['photo']
        filename = secure_filename(photo.filename)

        # secure_filename() can return an empty string; render the empty page in that case
        if not filename:
            return render_template('index.html', caption=caption, image_url=image_url, audio_url=audio_url)

        filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
        photo.save(filepath)

        # Convert the image to RGB and process
        image = Image.open(filepath).convert('RGB')

        # Generate caption
        captions = image_to_text(image)
        caption = captions[0]['generated_text']

        # Set image URL for display
        image_url = url_for('static', filename=f'uploads/{filename}')

        # Convert caption to audio using gTTS
        if caption:
            tts = gTTS(text=caption, lang='en')
            audio_filename = f"{filename.rsplit('.', 1)[0]}.mp3"  # Same name but with .mp3 extension
            audio_filepath = os.path.join(app.config['AUDIO_FOLDER'], audio_filename)
            tts.save(audio_filepath)
            audio_url = url_for('static', filename=f'audio/{audio_filename}')

    return render_template('index.html', caption=caption, image_url=image_url, audio_url=audio_url)


if __name__ == '__main__':
    app.run(debug=True)
@@ -0,0 +1,7 @@
Flask==2.3.2
Pillow==10.0.0
transformers==4.31.0
torch==2.0.1
gTTS==2.3.2
Werkzeug==2.3.6
gunicorn
@@ -0,0 +1 @@

@@ -0,0 +1,122 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Image Caption Generator</title>
<style>
* {
box-sizing: border-box;
margin: 0;
padding: 0;
}

body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
background-color: #f4f4f4;
color: #333;
display: flex;
justify-content: center;
align-items: center;
min-height: 100vh;
padding: 20px;
}

.container {
background-color: white;
box-shadow: 0px 4px 10px rgba(0, 0, 0, 0.1);
border-radius: 10px;
padding: 30px;
width: 100%;
max-width: 600px;
}

h1 {
font-size: 2.5em;
color: #333;
text-align: center;
margin-bottom: 20px;
}

form {
display: flex;
flex-direction: column;
gap: 15px;
}

input[type="file"] {
border: 1px solid #ddd;
padding: 10px;
border-radius: 5px;
font-size: 1em;
cursor: pointer;
}

button {
background-color: #4CAF50;
color: white;
padding: 15px 20px;
border: none;
border-radius: 5px;
font-size: 1.2em;
cursor: pointer;
transition: background-color 0.3s ease;
}

button:hover {
background-color: #45a049;
}

h2 {
font-size: 1.8em;
margin-top: 30px;
color: #333;
}

img {
max-width: 100%;
border-radius: 10px;
margin-top: 15px;
}

.caption {
font-size: 1.2em;
margin-top: 10px;
padding: 15px;
background-color: #f9f9f9;
border-left: 4px solid #4CAF50;
border-radius: 5px;
}

audio {
margin-top: 20px;
width: 100%;
}
</style>
</head>
<body>
<div class="container">
<h1>Image Caption Generator</h1>
<form action="/" method="post" enctype="multipart/form-data">
<label for="photo">Upload an image:</label>
<input type="file" id="photo" name="photo" accept="image/*" required>
<button type="submit">Generate Caption</button>
</form>

{% if caption %}
<h2>Generated Caption:</h2>
<div class="caption">{{ caption }}</div>
{% if image_url %}
<img src="{{ image_url }}" alt="Uploaded Image">
{% endif %}
{% if audio_url %}
<h3>Audio:</h3>
<audio controls>
<source src="{{ audio_url }}" type="audio/mpeg">
Your browser does not support the audio element.
</audio>
{% endif %}
{% endif %}
</div>
</body>
</html>
