🔥 [ECCV 2024] Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models
This repository hosts the code and resources associated with our paper on multiple-object generation and attribute binding in text-to-image generation models like Stable Diffusion.
Text-to-image diffusion models have shown great success in generating high-quality text-guided images. Yet, these models may still fail to semantically align generated images with the provided text prompts, leading to problems like incorrect attribute binding and/or catastrophic object neglect. Given the pervasive object-oriented structure underlying text prompts, we introduce a novel object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address the aforementioned problems. We show that an object-centric attribute binding loss naturally emerges by approximately maximizing the log-likelihood of a z-parameterized energy-based model with the help of the negative sampling technique.
Clone this repository and create a conda environment:
`conda env create -f environment.yaml`
`conda activate ebama`
If you would rather use an existing environment, just run:
`pip install -r requirements.txt`
Finally, run `python -m spacy download en_core_web_trf` to install the transformer-based spaCy NLP parser.
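As a quick sanity check (not part of the repository's scripts), the snippet below verifies that the transformer-based parser loads and shows how it decomposes a prompt into object noun phrases and their modifiers, which is the kind of object-oriented prompt structure the method relies on.

```python
# Sanity check (not a repo script): load the transformer-based parser and
# inspect the object/attribute structure of a prompt.
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("a purple crown and a blue suitcase")

# Noun chunks correspond to the object phrases; POS/dependency tags expose the modifiers.
for chunk in doc.noun_chunks:
    print(chunk.text, "->", [(tok.text, tok.pos_, tok.dep_) for tok in chunk])
```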
In this work, we use the following datasets:
- AnE dataset from Attend-and-Excite. We provide the AnE dataset in `ane_data.py` in the `data` folder.
- DVMP dataset from SynGen. Please follow the SynGen repo to randomly generate the DVMP dataset.
- ABC-6K dataset from StrDiffusion. We provide the full ABC-6K dataset `ABC-6K.txt` in the `data` folder and a subset of the dataset in `data_abc.py` (a minimal loading sketch follows this list).
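For convenience, here is a minimal, hypothetical loading sketch; it assumes `data/ABC-6K.txt` stores one prompt per line (check the `data` folder for the actual layout, and `ane_data.py` / `data_abc.py` for the Python-based prompt lists).

```python
# Minimal sketch for loading evaluation prompts; assumes data/ABC-6K.txt stores
# one prompt per line (verify against the actual file in the data folder).
from pathlib import Path

def load_abc6k_prompts(path="data/ABC-6K.txt"):
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]

prompts = load_abc6k_prompts()
print(len(prompts), "prompts; first:", prompts[0])
```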
To test our method on a specific prompt, run:
`python inference.py --prompt "a purple crown and a blue suitcase" --seed 12345`
Note that this will download the Stable Diffusion model `CompVis/stable-diffusion-v1-4`. If you would rather use an existing copy of the model, provide its absolute path via `--model_path`. For example, you can use `runwayml/stable-diffusion-v1-5` for Stable Diffusion v1.5.
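If you want to sweep several prompts and seeds, a simple driver like the following works; it relies only on the `--prompt` and `--seed` flags shown above (add `--model_path` if you use a local copy of the model), and the prompt/seed lists are purely illustrative.

```python
# Illustrative batch driver around inference.py using only the documented flags
# (--prompt, --seed, and optionally --model_path); prompts and seeds are arbitrary.
import subprocess

prompts = [
    "a purple crown and a blue suitcase",
    "a frog and a purple balloon",
]
seeds = [12345, 42]

for prompt in prompts:
    for seed in seeds:
        subprocess.run(
            ["python", "inference.py", "--prompt", prompt, "--seed", str(seed)],
            check=True,
        )
```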
We mainly use the following metrics to evaluate the generated images:
- Text-Image Full Similarity
- Text-Image Min Similarity
- Text-Caption Similarity
In addition, we provide code to compute the following metrics, as defined in Attend-and-Excite:
- Text-Image Max Similarity
- Text-Image Avg Similarity
We provide the evaluation code in the `metrics` folder. To evaluate the generated images and captions, run, for example:
`python metrics/compute_clip_similarity.py`
You can set the paths to the generated images, the captions, and the save path in `metrics/path_name`.
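For reference, the sketch below shows one way to compute CLIP text-image similarities outside the provided scripts, using Hugging Face `transformers`. The image path, the choice of CLIP checkpoint, and the interpretation of the min similarity as the minimum over per-object sub-prompts are assumptions; treat this as a rough illustration, not the repository's metric implementation.

```python
# Rough, standalone CLIP similarity sketch (not the repo's metric script).
# Assumptions: checkpoint choice, image path, and min-over-sub-prompts reading.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def clip_similarity(image_path, texts):
    """Return cosine similarity between one image and each text."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(0)

prompt = "a purple crown and a blue suitcase"
sub_prompts = ["a purple crown", "a blue suitcase"]  # per-object phrases
sims = clip_similarity("outputs/example.png", [prompt] + sub_prompts)  # hypothetical path
print("full:", sims[0].item(), "min:", sims[1:].min().item())
```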
We would like to give credit to the following repositories, from which we adapted certain code components for our research:
If you find this code or our results useful, please cite as:
@inproceedings{zhang2024object,
  title     = {Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models},
  author    = {Zhang, Yasi and Yu, Peiyu and Wu, Ying Nian},
  booktitle = {Computer Vision -- ECCV 2024},
  editor    = {Leonardis, Ale{\v{s}} and Ricci, Elisa and Roth, Stefan and Russakovsky, Olga and Sattler, Torsten and Varol, G{\"u}l},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {55--71},
  isbn      = {978-3-031-72946-1},
  year      = {2025}
}