This repo contains the code for a comparative analysis of different audio representation models.
It uses the MagnaTagATune (MTT) dataset to evaluate the performance of different music representation models on the downstream task of music tagging.
The audio files for the MagnaTagATune dataset can be downloaded here. Extract them into the `audios` directory inside the `MTT` folder, so the directory structure looks as shown below:
.
├── MTT
│   ├── audios
│   │   ├── 0
│   │   ├── 1
│   │   └── ...
│   └── magnatagatune.json
├── evaluate_clap.py
├── evaluate_mert.py
└── ...
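As a rough illustration of how this layout is consumed, the annotation file can be read and joined with the audio directory before running the evaluation scripts. The exact schema of `magnatagatune.json` is an assumption here (a per-clip relative path plus a tag list), so adapt the field names to the actual file.

```python
import json
import os

MTT_ROOT = "MTT"

# Load the MagnaTagATune annotations. The field names "path" and "tags"
# are assumptions about the JSON schema, used only for illustration.
with open(os.path.join(MTT_ROOT, "magnatagatune.json")) as f:
    annotations = json.load(f)

for entry in annotations:
    audio_path = os.path.join(MTT_ROOT, "audios", entry["path"])
    tags = entry["tags"]
    # ... pass audio_path and tags to the evaluation scripts
```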
We use the same split as Jukebox.
We evaluate the following music representation models (an embedding-extraction sketch follows the list):
- MERT (https://arxiv.org/abs/2306.00107)
- CLAP (https://arxiv.org/abs/2211.06687)
- ImageBind (https://arxiv.org/abs/2305.05665)
- Wav2CLIP (https://arxiv.org/abs/2110.11499)
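As a minimal sketch, a frozen model such as MERT can be loaded through the Hugging Face `transformers` API and used to produce clip-level embeddings for the probe. The checkpoint name, example file path, and mean-pooling choice below are illustrative assumptions and may differ from what `evaluate_mert.py` actually does.

```python
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# Illustrative checkpoint; the repo may use a different MERT size.
model_id = "m-a-p/MERT-v1-95M"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained(model_id, trust_remote_code=True)

# Hypothetical clip under the MTT layout shown above.
waveform, sr = torchaudio.load("MTT/audios/0/example.mp3")
mono = waveform.mean(dim=0)
mono = torchaudio.functional.resample(mono, sr, processor.sampling_rate)

inputs = processor(mono.numpy(), sampling_rate=processor.sampling_rate, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Mean-pool over time to get one embedding per transformer layer;
# a downstream probe can select or average layers from this stack.
layer_embeddings = torch.stack(out.hidden_states).mean(dim=-2)  # (layers, batch, dim)
print(layer_embeddings.shape)
```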
A comparison of the models is shown below:
| Model     | MTT AUC | MTT AP |
|-----------|---------|--------|
| ImageBind | 88.55%  | 40.19% |
| JukeBox   | 91.50%  | 41.40% |
| OpenL3    | 89.35%  | 42.88% |
| CLAP      | 70.04%  | 27.95% |
| Wav2CLIP  | 90.15%  | 49.12% |
| MERT      | 93.91%  | 59.57% |
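AUC and AP here refer to the standard multi-label tagging metrics on MTT: macro-averaged ROC-AUC and average precision over the tag set. A minimal sketch of computing them with scikit-learn is shown below; the probe that produces the tag scores is not specified by this snippet, and the array shapes are illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score


def tagging_metrics(y_true: np.ndarray, y_score: np.ndarray):
    """y_true: (n_clips, n_tags) binary tag matrix;
    y_score: (n_clips, n_tags) predicted tag probabilities."""
    auc = roc_auc_score(y_true, y_score, average="macro")
    ap = average_precision_score(y_true, y_score, average="macro")
    return auc, ap


# Random placeholder data, just to show the call shape.
rng = np.random.default_rng(0)
y_true = (rng.random((100, 50)) > 0.5).astype(int)
y_score = rng.random((100, 50))
print(tagging_metrics(y_true, y_score))
```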