INTERSPEECH2023: Multi-band Time-frequency Attention Network for Singing Melody Extraction from Polyphonic Music


MTANet

Introduction

The official implementation of "MTANet: Multi-band Time-frequency Attention Network for Singing Melody Extraction from Polyphonic Music" (INTERSPEECH 2023).

We propose a more powerful singing melody extractor for polyphonic music, named the multi-band time-frequency attention network (MTANet). Experimental results show that MTANet achieves promising performance compared with existing state-of-the-art methods while using only a small number of network parameters.
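The multi-band idea can be sketched in a few lines. The snippet below is an illustrative NumPy stand-in with an assumed band count and spectrogram size, not the repository's implementation:

```python
import numpy as np

def split_bands(spec, n_bands=4):
    """Split a (freq, time) spectrogram into n_bands equal sub-bands
    along the frequency axis (illustrative only; the paper's band
    boundaries may differ)."""
    freq_bins = spec.shape[0]
    band_size = freq_bins // n_bands
    return [spec[i * band_size:(i + 1) * band_size] for i in range(n_bands)]

spec = np.random.rand(320, 128)          # (frequency bins, time frames), assumed sizes
bands = split_bands(spec, n_bands=4)
print([b.shape for b in bands])          # four (80, 128) sub-bands
```

Each sub-band can then be processed by its own attention branch before the branch outputs are fused.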

MTANet Architecture

Important update

2023. 03. 19

(i) Due to an oversight by the author, Figure 3 in the submitted manuscript shows an earlier version of the architecture, which may cause some misunderstanding for reviewers and readers. I am very sorry for this! The picture below is the revised version for reference, and a formal correction will be made in a subsequent manuscript.

Hourglass sub-network

(ii) Renamed MMNet to MTANet.

2023. 03. 20

The author has contacted the chairs and applied for a modification. If the modification is approved, please ignore the update above. I am very sorry for the inconvenience to reviewers and readers.

2023. 05. 20

The paper has been accepted by INTERSPEECH 2023, and the official version awaits release.

The rest of the code will be sorted out and published soon.

2023. 06. 11

All the code is uploaded.

2023. 08. 19

When I reread the paper, I found a mistake in one of the dimension-tracking descriptions in Figure 4. Specifically, the dimension after the concatenation operation differs between stages. For example, the input feature size of the first MFA module is (B, 32, F, T), so the feature size after concatenation should be (B, 32+4×16, F, T). In the subsequent MFA modules, however, the feature size after concatenation is (B, 16+4×16, F, T) (i.e., (B, (N+1)×C, F, T)).
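This bookkeeping can be checked mechanically. The sketch below uses NumPy arrays as stand-ins for the network tensors, with N=4 branches and C=16 channels as in the example above; the helper `concat_channels` is hypothetical, not code from the repository:

```python
import numpy as np

B, F, T = 2, 320, 128      # batch, frequency bins, time frames (assumed sizes)
N, C = 4, 16               # number of branches and their channels

def concat_channels(x, branches):
    # Concatenate the branch outputs with the input along the channel axis.
    return np.concatenate([x] + branches, axis=1)

branches = [np.zeros((B, C, F, T)) for _ in range(N)]

first_in = np.zeros((B, 32, F, T))      # first MFA module input
later_in = np.zeros((B, 16, F, T))      # subsequent MFA module inputs

print(concat_channels(first_in, branches).shape[1])  # 32 + 4*16 = 96
print(concat_channels(later_in, branches).shape[1])  # 16 + 4*16 = 80
```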

Although the original intention was to aid understanding and readability, we overlooked the strict correspondence between the paper and the code. Since the paper can no longer be modified, we apologize for any confusion this causes readers.

Getting Started

Download Datasets

After downloading the data, use the txt files in the data folder and extract the CFP features with feature_extraction.py.

Note that the label data corresponding to the frame shift should be prepared before feature generation.
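As a rough illustration of what "label data corresponding to the frame shift" means, frame-level labels are usually aligned by computing each analysis frame's time from the hop size. The hop size and sample rate below are assumed values, not necessarily those used by feature_extraction.py:

```python
def frame_times(n_frames, hop_size=80, sample_rate=8000):
    """Start time (seconds) of each analysis frame; hop_size and
    sample_rate are assumed values, not the repository's settings."""
    return [i * hop_size / sample_rate for i in range(n_frames)]

print(frame_times(4))   # [0.0, 0.01, 0.02, 0.03]
```

The reference F0 track is then sampled at these times to produce one label per frame.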

main.py is the entry point of this project.

Model implementation

Refer to the file: mtanet.py

The replication code for other comparison models has been uploaded and can be found in the folder: control group model.

Result

Prediction result

The visualizations illustrate that the proposed MTANet reduces octave errors and melody detection errors.
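An octave error means the predicted F0 is a whole number of octaves away from the reference (a factor of two per octave). The helper below is a hypothetical, minimal check; published melody-extraction evaluations typically use mir_eval instead:

```python
import math

def is_octave_error(f_ref, f_est, tol_cents=50):
    """True if f_est is a nonzero whole number of octaves away from
    f_ref, within tol_cents (hypothetical helper, assumed tolerance)."""
    if f_ref <= 0 or f_est <= 0:
        return False                       # unvoiced frames cannot octave-err
    cents = 1200 * math.log2(f_est / f_ref)
    octaves = round(cents / 1200)
    return octaves != 0 and abs(cents - 1200 * octaves) <= tol_cents

print(is_octave_error(220.0, 440.0))   # True: one octave high
print(is_octave_error(220.0, 221.0))   # False: within tolerance of correct
```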

(Figures: melody estimation examples)

Comprehensive result

The scores here are taken either from the respective papers or from our own implementations. Experimental results show that the proposed MTANet achieves promising performance compared with existing state-of-the-art methods.

(Table: comprehensive results)

  • Correction: the number of parameters for TONet is corrected from 214M to 147M.

Ablation study result

We conducted seven ablations to verify the effectiveness of each design in the proposed network. Due to the page limit, only the ADC2004 dataset was used for the ablation study in the paper; more detailed results are presented here.

(Figures: ablation results on ADC2004, MIREX 05, and MEDLEY DB)

Special thanks

Citing

@inproceedings{gao23i_interspeech,
  author={Yuan Gao and Ying Hu and Liusong Wang and Hao Huang and Liang He},
  title={{MTANet: Multi-band Time-frequency Attention Network for Singing Melody Extraction from Polyphonic Music}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={5396--5400},
  doi={10.21437/Interspeech.2023-2494}
}
