For benchmarking purposes, this repo hosts the generated test samples of "V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models", AAAI 2024. ([arXiv] [project])
Authors: Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai, from the University of Sydney and Dolby Laboratories.
Compared to the previous methods Im2Wav and CLIPSonic, our V2A-Mapper is trained with 86% fewer parameters yet achieves improvements of 53% in Fréchet Distance (FD, fidelity) and 19% in CLIP-Score (CS, relevance).
VGGSound contains 199,176 10-second video clips with audio-visual correspondence, extracted from videos uploaded to YouTube. Following the original train/test split, we evaluate performance on the 15,446 test samples. Our generated test samples (~5 GB) for VGGSound can be downloaded from here.
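After downloading and extracting the VGGSound samples, it can be useful to confirm that the archive is complete before running any evaluation. The snippet below is a minimal sketch of such a check, not part of the released code. It assumes the samples were extracted to a local folder and are stored as audio files with one of the listed extensions; the folder name `v2a_mapper_vggsound` and the extension list are hypothetical, so adjust them to match what you actually find in the archive.

```python
from pathlib import Path

# Test-set size stated in this README.
EXPECTED_VGGSOUND_TEST = 15_446


def count_audio_files(root, extensions=(".wav", ".flac", ".mp3")):
    """Recursively count files under `root` with an audio extension.

    The extension list is an assumption; check the extracted archive.
    """
    root = Path(root)
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )


def check_sample_count(root, expected=EXPECTED_VGGSOUND_TEST):
    """Return (number of audio files found, whether it matches `expected`)."""
    found = count_audio_files(root)
    return found, found == expected


if __name__ == "__main__":
    # Hypothetical extraction folder; replace with your own path.
    sample_dir = Path("v2a_mapper_vggsound")
    if sample_dir.is_dir():
        found, ok = check_sample_count(sample_dir)
        print(f"found {found} audio files, expected {EXPECTED_VGGSOUND_TEST}, match={ok}")
    else:
        print(f"{sample_dir} not found; extract the archive first")
```

A mismatch usually points to an interrupted download or a partial extraction, so it is worth re-checking before comparing FD or CS numbers against the paper.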
To test the generalization ability of our V2A-Mapper, we also evaluate on the out-of-distribution dataset ImageHear, which contains 101 images from 30 visual classes (2-8 images per class). Our generated test samples (~33 MB) for ImageHear can be downloaded from here.
If you need sample results from V2A-Mapper for your own datasets, we are happy to generate them for you. Please send your request to heng.wang@sydney.edu.au and jianbo.ma@dolby.com.
If you find our work helpful in your research, please cite our paper:
@inproceedings{v2a-mapper,
title = {V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models},
author = {Wang, Heng and Ma, Jianbo and Pascual, Santiago and Cartwright, Richard and Cai, Weidong},
booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
year = {2024},
}
If you have any questions or suggestions about this repo, please feel free to contact me! (heng.wang@sydney.edu.au)