Request for Dataset Details - Perceiver-Music-Transformer in Music Cognition Research #4
Replies: 4 comments
-
Hey Xiaoxuan, I have some time to chat, so here is my response. I hope you do not mind if we chat here, in public. I think this discussion is very interesting and may be beneficial to others as well. First of all, I want to say that your research project sounds very interesting and I would like to learn more about it. I am also flattered that you chose my implementation for your research. It means a lot to me :)

Now, to answer your main question... GIGA-Piano and Euterpe (as well as the original Perceiver models) were trained on the combined LAKH+MMD+GiantMIDI datasets. GIGA-Piano used the solo-Piano extract, while Euterpe and Perceiver used the Multi-Instrumental extracts from these datasets. Unfortunately, my first implementations of all of these projects did not produce very good results for the following reasons:
Now, in regard to your research... The Perceiver architecture is definitely promising in terms of helping the model to learn and improvise, instead of just memorizing and overfitting. However, I think the problem really lies in the incredible complexity of music and also in the way we feed and model the data. Music is a bit more complicated than text or images, so IMHO current transformer architectures struggle with its complexity. Also, IMHO, in order to make a NN improvise (in particular), there needs to be a proper and clever way of "explaining" music to the NN instead of just brute-forcing it.

If you have seen my GIGA-Piano XL implementation, I experimented with KNN and Euclidean-distance algorithms to help the model generate music, but so far the results were unimpressive. KNN shows very good promise and is already being used in SOTA implementations, but there is still something lacking (especially for music) to produce interesting results.

I will be re-training the Perceiver model shortly with the new MIDI processor and music encoding/feed, which should (in theory) produce good results, so please check back soon for the new model and training code.

I hope this answers your questions and I am looking forward to your thoughts.

Most sincerely,

Alex.

PS. If you want to chat in private, do not hesitate to let me know.
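For readers unfamiliar with the idea, here is a minimal sketch of a KNN lookup over Euclidean distance like the one mentioned above. It is not the actual GIGA-Piano XL code; the fixed phrase length, the vocabulary size, and the randomly generated corpus are illustrative assumptions only.

```python
# Hedged sketch: KNN retrieval of reference phrases by Euclidean distance.
# The corpus, phrase length, and vocabulary size below are placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors

PHRASE_LEN = 32      # tokens per phrase (assumed)
VOCAB_SIZE = 512     # size of the token vocabulary (assumed)

# Hypothetical reference corpus: 10,000 phrases of PHRASE_LEN token ids each.
corpus = np.random.randint(0, VOCAB_SIZE, size=(10_000, PHRASE_LEN))

# Build a Euclidean (L2) nearest-neighbour index over the corpus phrases.
knn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(corpus)

def closest_phrases(generated: np.ndarray) -> np.ndarray:
    """Return the 5 corpus phrases closest to a generated phrase of token ids."""
    _, idx = knn.kneighbors(generated.reshape(1, -1))
    return corpus[idx[0]]

# Example: compare one (random) generated phrase against the corpus.
generated = np.random.randint(0, VOCAB_SIZE, size=PHRASE_LEN)
print(closest_phrases(generated).shape)  # (5, 32)
```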
-
PPS. If you have time, you might want to check out these two samples of mine. They are the best I could achieve in terms of teaching the models/transformers to improvise... The first one was the result of the POP909 dataset @ ~0.15 loss, and the second one was the result of the chordification of solo-Piano music (the most interesting part starts at around the 15-minute mark). These were auto-generated by auto-regression, so it is not supervised music...
-
Thank you for your comprehensive response! It is essential for our research project to know that the models were trained on the LAKH, MMD, and GiantMIDI datasets. As you mentioned, while there is still room for improvement in the composition capabilities of the current GIGA-Piano and Euterpe models, they already perform quite similarly to human participants in "listening" experiments.

For the implementation of "listening," we essentially manipulate the autoregressive generative process of the models in an invasive manner. During the generation of each token, we record the probability the model assigns to it and then feed a pre-selected token (based on the piece we want the model to "listen" to) for the next generation round. In this way, the model emulates listening to music and continuously reports its level of surprise.

Our analysis of melodic and harmonic predictions yielded results generally consistent with data from human subjects. We also obtained intriguing outcomes when examining polyphonic predictions. It is worth noting that composing music directly from a given musical context may be challenging for the average human listener; therefore, these composition-trained models perform satisfactorily in the theoretically simpler listening experiments. We are interested not only in areas where the model falls short compared to human subjects but also in aspects where it outperforms humans. To model human listening more accurately, we might even need to weaken some of the model's attentional abilities.

I concur with your observation regarding the token system, and I am curious about the extent to which different MIDI encoding methods might impact the model's performance. I look forward to the forthcoming improvements to these models and the performance differences they will introduce in our listening experiments.
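For concreteness, here is a minimal sketch of the teacher-forced "listening" loop described above. It is not the project's actual experimental code: `model` is assumed to be any causal transformer that returns logits of shape (batch, seq_len, vocab_size), and `piece_tokens` is a hypothetical pre-encoded token sequence.

```python
# Hedged sketch of the "listening" procedure: instead of sampling, each next
# token is taken from the pre-selected piece, and the probability the model
# assigned to it is recorded as a per-note surprisal value.
import torch
import torch.nn.functional as F

@torch.no_grad()
def listening_surprisal(model, piece_tokens: torch.Tensor) -> torch.Tensor:
    """Per-token surprisal (-log2 p) for a 1-D tensor of token ids."""
    surprisals = []
    for t in range(1, len(piece_tokens)):
        context = piece_tokens[:t].unsqueeze(0)       # tokens "heard" so far
        logits = model(context)[:, -1, :]             # prediction for step t
        probs = F.softmax(logits, dim=-1)
        p_next = probs[0, piece_tokens[t]]            # prob. of the actual note
        surprisals.append(-torch.log2(p_next))
    return torch.stack(surprisals)

# Usage: surprise = listening_surprisal(model, torch.tensor(encoded_piece))
```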
-
@Xiaoxuan-Wang You are welcome, and if you have more questions about those implementations, feel free to ask at any time :) Thank you for explaining your project, methodology, and approach. It sounds very interesting and I wish you success with your research. In regard to encoding, I have personally been using triplets lately (3-token encoding) to encode MIDI notes. I found such a design to be efficient and easy to use. Now, in terms of encoding structure, I found the following things useful for helping the model to learn and play well:
These are the most important points IMHO about encoding and its structure. Please stand by for the Perceiver and Euterpe updates. I should post everything this week and hopefully the results will be better. I do not plan to update GIGA-Piano (I do not see any point atm), so only Perceiver and Euterpe will be updated. The Euterpe update will be posted here:

Most sincerely,

Alex.
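To make the triplet idea above concrete, here is a minimal illustration of one possible 3-tokens-per-note layout. The (delta-time, duration, pitch) split, the time resolution, and the token offsets are assumptions chosen for illustration only; the repository's actual encoder may differ.

```python
# Hedged sketch of a triplet-style (3-tokens-per-note) MIDI encoding.
TIME_BINS = 128      # delta-time / duration values 0..127 (assumed resolution)
DUR_OFFSET = 128     # duration tokens occupy 128..255
PITCH_OFFSET = 256   # pitch tokens occupy 256..383 (MIDI pitches 0..127)

def encode_note(delta_time: int, duration: int, pitch: int) -> list[int]:
    """Encode one MIDI note as three tokens: time shift, duration, pitch."""
    return [
        min(delta_time, TIME_BINS - 1),
        DUR_OFFSET + min(duration, TIME_BINS - 1),
        PITCH_OFFSET + pitch,
    ]

def decode_note(tokens: list[int]) -> tuple[int, int, int]:
    """Invert encode_note for a single 3-token group."""
    dt, dur, pit = tokens
    return dt, dur - DUR_OFFSET, pit - PITCH_OFFSET

# Example: a middle-C note starting 16 ticks after the previous note.
print(encode_note(16, 32, 60))       # -> [16, 160, 316]
print(decode_note([16, 160, 316]))   # -> (16, 32, 60)
```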
-
Dear Alex,
My name is Xiaoxuan, and I am a graduate student at the University of Cambridge, Centre for Music and Science. I am currently working on a research project under the supervision of Dr. Peter MC Harrison. Our focus is on employing deep learning to model music cognition, specifically by adapting autoregressive music generation models into perception models. These perception models will report the surprise (based on probability) at each note in a given musical sequence, similar to a human participant. Your implementation of the Perceiver AR model (Perceiver-Music-Transformer) appears to be a perfect fit for our research.
However, as the models learn musical structures in a self-supervised manner, it is vital to examine these "black boxes" through a cognitive science lens. We are developing psychological paradigms to test the musical structures learned by the models and their learning process. In these tests, it is crucial to know the models' exposure, as unbiased stimulus design depends on it. If the stimuli contain exact phrases that the model has already been trained on, we would only be testing the model's memory instead of its general understanding of musical structures.
Therefore, I am writing to request more detailed information about the dataset (GIGA-Piano & Euterpe Training Data) you used to train the solo-piano and multi-instrumental models.
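With that dataset information in hand, a simple exposure check could be run on candidate stimuli, for example testing whether any exact n-gram of tokens from a stimulus also occurs verbatim in the training corpus, so the experiment probes generalization rather than memorization. The sketch below is purely illustrative; the token sequences and the n-gram length are hypothetical.

```python
# Hedged sketch: reject stimuli whose token n-grams appear in the training set.
from typing import Iterable

def ngrams(tokens: list[int], n: int) -> set[tuple[int, ...]]:
    """All contiguous n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def stimulus_is_unseen(stimulus: list[int],
                       training_pieces: Iterable[list[int]],
                       n: int = 8) -> bool:
    """True if no n-gram of the stimulus appears verbatim in the training set."""
    stim = ngrams(stimulus, n)
    return all(stim.isdisjoint(ngrams(piece, n)) for piece in training_pieces)

# Usage: keep only stimuli for which stimulus_is_unseen(stim, corpus) is True.
```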
If you are interested, we can arrange a Zoom meeting for a more in-depth discussion of our research project. My email address is xw407@cam.ac.uk.
Best,
Xiaoxuan