
🔥 AudioBench 🔥


⚡ A repository for evaluating AudioLLMs in various tasks 🚀 ⚡
⚡ AudioBench: A Universal Benchmark for Audio Large Language Models 🚀 ⚡

Change log

  • July 2024: We are working hard on the leaderboard and speech translation dataset. Stay tuned!
  • July 2024: Supported all 26 datasets listed in the AudioBench manuscript.

🔧 Installation

Installation with pip:

pip install -r requirements.txt
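
A minimal from-scratch setup might look like the following sketch (the virtual environment is optional and its name is just an example):

# Clone the repository and install dependencies in a fresh environment
git clone https://github.com/AudioLLMs/AudioBench.git
cd AudioBench
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt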

For model-as-judge evaluation, the judge model is served as a service via vLLM on port 5000.
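
The host script used in the Quick Start below (Step 1) essentially starts such a service. As a rough sketch, assuming vLLM's OpenAI-compatible server entrypoint and flags (the actual host_model_judge_llama_3_70b_instruct.sh may differ), it could look like:

# Sketch only: serve Llama-3-70B-Instruct as the judge on port 5000 with vLLM
# --tensor-parallel-size 2 matches the 2x H100 80GB used in the demo below
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --port 5000 \
    --tensor-parallel-size 2

# Quick sanity check that the judge service is up (OpenAI-compatible endpoint)
curl http://localhost:5000/v1/models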

⏩ Quick Start

This example hosts Llama-3-70B-Instruct as the judge model and evaluates the cascade Whisper + Llama-3 model.

# Step 1:
# Serve the judge model.
# The script auto-downloads the model and may require Hugging Face authentication.
# In this demo, we use 2x H100 80GB GPUs to host the model.
# With smaller VRAM, you may need to use a smaller judge model.
bash host_model_judge_llama_3_70b_instruct.sh

# Step 2:
# This example uses 3x H100 80GB GPUs.
# AudioLLM inference runs on GPU 2, since GPUs 0 and 1 host the model-as-judge service.
# This setting evaluates on only 50 samples.
MODEL_NAME=whisper_large_v3_with_llama_3_8b_instruct
GPU=2
BATCH_SIZE=1
METRICS=llama3_70b_judge
OVERWRITE=True
NUMBER_OF_SAMPLES=50

DATASET=cn_college_listen_test

bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES

# Step 3:
# The results should look like:
#    "llama3_70b_judge": {
#        "judge_score": 3.12,
#        "success_rate": 1.0
#    }

The example above shows how to get started. To evaluate on the full datasets, please refer to Examples.

# After the model weights are downloaded, run the evaluation script on all datasets
bash examples/eval_SALMONN_7B.sh
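
Alternatively, a single model can be evaluated over several datasets with a simple loop over eval.sh; the dataset identifiers below are illustrative guesses based on the table in the next section, so check the dataset configurations in the repository for the exact names:

# Sketch: loop one model over several datasets (dataset names are examples)
MODEL_NAME=whisper_large_v3_with_llama_3_8b_instruct
GPU=2
BATCH_SIZE=1
METRICS=llama3_70b_judge
OVERWRITE=True
NUMBER_OF_SAMPLES=50   # increase or set to the full split size for a complete run

for DATASET in cn_college_listen_test slue_p2_sqa5_test public_sg_speechqa_test; do
    bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
done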

📚 Supported Models and Datasets

Datasets

SU=Speech Understanding
  ASR=Automatic Speech Recognition
  SQA=Speech Question Answering
  SI=Speech Instruction

ASU=Audio Scene Understanding
  AC=Audio Captioning
  ASQA=Audio Scene Question Answering

VU=Voice Understanding
  AR=Accent Recognition
  GR=Gender Recognition
  ER=Emotion Recognition
| Dataset | Category | Task | Metrics | Status |
|---|---|---|---|---|
| LibriSpeech-Clean | SU | ASR | WER | |
| LibriSpeech-Other | SU | ASR | WER | |
| CommonVoice-15-EN | SU | ASR | WER | |
| Peoples-Speech | SU | ASR | WER | |
| GigaSpeech | SU | ASR | WER | |
| Earning21 | SU | ASR | WER | |
| Earning22 | SU | ASR | WER | |
| Tedlium3 | SU | ASR | WER | |
| Tedlium3-Longform | SU | ASR | WER | |
| CN-College-Listen | SU | SQA | Model-as-Judge | |
| SLUE-P2-SQA5 | SU | SQA | Model-as-Judge | |
| Public-SG-SpeechQA | SU | SQA | Model-as-Judge | |
| DREAM-TTS | SU | SQA | Model-as-Judge | |
| OpenHermes-Audio | SU | SI | Model-as-Judge | |
| ALPACA-Audio | SU | SI | Model-as-Judge | |
| AudioCaps | ASU | AC | Model-as-Judge / METEOR | |
| WavCaps | ASU | AC | Model-as-Judge / METEOR | |
| Clotho-AQA | ASU | ASQA | Model-as-Judge | |
| AudioCaps-QA | ASU | ASQA | Model-as-Judge | |
| WavCaps-QA | ASU | ASQA | Model-as-Judge | |
| VoxCeleb-Accent | VU | AR | Model-as-Judge | |
| VoxCeleb-Gender | VU | GR | Model-as-Judge | |
| IEMOCAP-Gender | VU | GR | Model-as-Judge | |
| IEMOCAP-Emotion | VU | ER | Model-as-Judge | |
| MELD-Sentiment | VU | ER | Model-as-Judge | |
| MELD-Emotion | VU | ER | Model-as-Judge | |

Models

| Model | Size | Notes | Status |
|---|---|---|---|
| Whisper-Large + Llama-3-8B-Instruct | ~8B | Cascade Models | |
| SALMONN-7B | ~7B | AudioLLM - Fusion Model | |
| Qwen-Audio | ~8B | AudioLLM - Fusion Model | TODO |
| Qwen2-Audio | ~8B | AudioLLM - Fusion Model | TODO |

📖 Citation

If you find our work useful, please consider citing our paper!

@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={arXiv preprint arXiv:2406.16020},
  year={2024}
}