Xudong Xu*, Hang Zhou*, Ziwei Liu, Bo Dai, Xiaogang Wang, and Dahua Lin
Stereophonic audio, especially binaural audio, plays an essential role in immersive viewing environments. Recent research has explored generating stereophonic audio guided by visual cues and multi-channel audio collections in a fully-supervised manner. However, due to the requirement of professional recording devices, existing datasets are limited in scale and variety, which impedes the generalization of supervised methods to real-world scenarios. In this work, we propose PseudoBinaural, an effective pipeline that is free of binaural recordings. The key insight is to carefully build pseudo visual-stereo pairs with mono data for training. Specifically, we leverage spherical harmonic decomposition and the head-related impulse response (HRIR) to identify the relationship between the location of a sound source and the received binaural audio. Then, in the visual modality, corresponding visual cues of the mono data are manually placed at the sound source positions to form the pairs. Compared to fully-supervised paradigms, our binaural-recording-free pipeline shows great stability in cross-dataset evaluation and comparable performance under subjective preference. Moreover, combined with binaural-recorded data, our method is able to further boost the performance of binaural audio generation under supervised settings.
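For intuition, the core of the pseudo-pair construction can be sketched as follows: a mono clip is convolved with the left/right HRIRs of the direction where the source is placed, producing a pseudo binaural signal. The snippet below is only a minimal illustration of this idea; the function name, arguments, and the omission of the spherical-harmonic (ambisonic) decomposition step are simplifications, not the repo's actual implementation.

```python
# Minimal sketch (not the repo's actual code): render a mono clip as pseudo
# binaural audio by convolving it with the HRIR pair of the chosen direction.
import numpy as np
from scipy.signal import fftconvolve

def pseudo_binaural(mono, hrir_left, hrir_right):
    """mono, hrir_left, hrir_right: 1-D float arrays at the same sample rate."""
    left = fftconvolve(mono, hrir_left, mode="full")[:len(mono)]
    right = fftconvolve(mono, hrir_right, mode="full")[:len(mono)]
    return np.stack([left, right])  # shape (2, num_samples)

# Several mono sources placed at different positions can simply be summed:
# binaural = sum(pseudo_binaural(m, hl, hr) for m, (hl, hr) in sources)
```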
- Python 3.7 is used. Basic requirements are listed in 'requirements.txt':
pip install -r requirements.txt
FAIR-Play can be accessed here. MUSIC21 can be accessed here. YT-Music can be accessed here.
All the training and testing bash scripts can be found in './scripts'. For the FAIR-Play dataset, we create five non-overlapping splits in the 'new_splits' folder, as illustrated in the paper. Before training, please replace the contained items 'xxxxxx.mp3' with their absolute paths, and make sure the 'audio_resave' folder and the 'frames' folder are located in the same directory. Note that each item in 'data/mono_sources' corresponds to both an audio file and a cropped object patch. For each video listed in 'data/mono_sources', we crop the object out and store the patches in the 'new_patches' folder. The Faster-RCNN model from this repo is adopted to do the cropping. The model trained on the non-overlapping split1 can be found here.
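As a concrete, hypothetical example of the path-replacement step, the helper below rewrites a split file that lists one 'xxxxxx.mp3' entry per line into absolute paths and checks that 'audio_resave' and 'frames' sit in the same directory. The file format, argument names, and folder layout are assumptions based on the description above, so adapt it to your local setup.

```python
import os

def expand_split(split_txt, data_root, out_txt):
    """Rewrite 'xxxxxx.mp3' entries (one per line, assumed format) into
    absolute paths under data_root/audio_resave."""
    audio_dir = os.path.join(data_root, "audio_resave")
    frames_dir = os.path.join(data_root, "frames")
    # 'audio_resave' and 'frames' are expected to live in the same directory
    assert os.path.isdir(audio_dir) and os.path.isdir(frames_dir)

    with open(split_txt) as fin, open(out_txt, "w") as fout:
        for line in fin:
            name = line.strip()
            if name:
                fout.write(os.path.join(audio_dir, name) + "\n")
```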
We have tried two different schemes for creating the pseudo visual-stereo pairs. One is to paste the visual patches onto a pre-defined background image and leverage Poisson blending to refine the boundary. The other is to place the visual patches on an empty background. We found that the empty-background scheme performs slightly better than the blending one.
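To make the two schemes concrete, here is a minimal sketch using OpenCV (chosen only for illustration; it is an assumption, not necessarily what the scripts use). It either Poisson-blends a cropped patch onto a pre-defined background via cv2.seamlessClone or pastes it onto an empty canvas.

```python
import cv2
import numpy as np

def place_patch(patch, center, background=None, canvas_hw=(224, 448)):
    """Place a cropped object patch at `center` (x, y). If `background` is
    given, refine the boundary with Poisson blending; otherwise paste the
    patch onto an empty (black) canvas. Canvas size and argument names are
    illustrative; boundary clipping is omitted for brevity."""
    if background is not None:
        mask = 255 * np.ones(patch.shape[:2], np.uint8)
        return cv2.seamlessClone(patch, background, mask, center, cv2.NORMAL_CLONE)

    canvas = np.zeros((*canvas_hw, 3), np.uint8)
    ph, pw = patch.shape[:2]
    x0, y0 = center[0] - pw // 2, center[1] - ph // 2
    canvas[y0:y0 + ph, x0:x0 + pw] = patch
    return canvas
```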
This software is released under the CC-BY-4.0 license.
@inproceedings{xu2021visually,
title={Visually Informed Binaural Audio Generation without Binaural Audios},
author={Xu, Xudong and Zhou, Hang and Liu, Ziwei and Dai, Bo and Wang, Xiaogang and Lin, Dahua},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2021}
}
The structure of this codebase is borrowed from 2.5D Visual Sound and SepStereo.