Request for Dataset Details - Perceiver-Music-Transformer in Music Cognition Research #4
Replies: 4 comments
-
Hey Xiaoxuan, I have some time to chat, so here is my response. I hope you do not mind if we chat here, in public. I think this discussion is very interesting and may be beneficial to others as well. First of all, I want to say that your research project sounds very interesting and I would like to learn more about it. I am also flattered that you chose my implementation for your research. It means a lot to me :)

Now, to answer your main question... GIGA-Piano and Euterpe (as well as the original Perceiver models) were trained on the combined LAKH+MMD+GiantMIDI datasets. GIGA-Piano used the solo-Piano extract, while Euterpe and Perceiver used the Multi-Instrumental extracts from these datasets. Unfortunately, my first implementations of all of these projects did not produce very good results for the following reasons:
Now, in regard to your research... The Perceiver architecture is definitely promising in terms of helping the model to learn and improvise, instead of just memorizing and overfitting. However, I think the problem really lies in the incredible complexity of music and also in the way we feed and model the data. Music is a bit more complicated than text or images, so IMHO current transformer architectures struggle with its complexity. Also, IMHO, in order to make a NN improvise (in particular), there needs to be a proper and clever way of "explaining" music to the NN instead of just brute-forcing it.

If you have seen my GIGA-Piano XL implementation, I experimented with KNN and Euclidean-distance algorithms to help the model generate music, but so far the results were unimpressive. KNN shows very good promise and is already being used in SOTA implementations, but there is still something lacking (especially for music) to produce interesting results.

I will be re-training the Perceiver model shortly with the new MIDI processor and music encoding/feed, which should (in theory) produce good results, so please check back soon for the new model and training code.

I hope this answers your questions and I am looking forward to your thoughts.

Most sincerely,

Alex.

PS. If you want to chat in private, do not hesitate to let me know.
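For readers unfamiliar with the idea, here is a minimal sketch of a KNN lookup over Euclidean distance like the one mentioned above. It is not the actual GIGA-Piano XL code; the fixed phrase length, the vocabulary size, and the randomly generated corpus are illustrative assumptions only.

```python
# Hedged sketch: KNN retrieval of reference phrases by Euclidean distance.
# The corpus, phrase length, and vocabulary size below are placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors

PHRASE_LEN = 32      # tokens per phrase (assumed)
VOCAB_SIZE = 512     # size of the token vocabulary (assumed)

# Hypothetical reference corpus: 10,000 phrases of PHRASE_LEN token ids each.
corpus = np.random.randint(0, VOCAB_SIZE, size=(10_000, PHRASE_LEN))

# Build a Euclidean (L2) nearest-neighbour index over the corpus phrases.
knn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(corpus)

def closest_phrases(generated: np.ndarray) -> np.ndarray:
    """Return the 5 corpus phrases closest to a generated phrase of token ids."""
    _, idx = knn.kneighbors(generated.reshape(1, -1))
    return corpus[idx[0]]

# Example: compare one (random) generated phrase against the corpus.
generated = np.random.randint(0, VOCAB_SIZE, size=PHRASE_LEN)
print(closest_phrases(generated).shape)  # (5, 32)
```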
-
PPS. If you have time, you might want to check out these two samples of mine. They are the best I could achieve in terms of teaching the models/transformers to improvise... The first one was the result of the POP909 dataset @ ~0.15 loss, and the second one was the result of the chordification of solo-Piano music (the most interesting part starts at around the 15-minute mark). These were auto-generated by auto-regression, so it is not supervised music...
-
Thank you for your comprehensive response! It is essential for our research project to know that the models were trained on the LAKH, MMD, and GiantMIDI datasets. As you mentioned, while there is still room for improvement in the composition capabilities of the current GIGA-Piano and Euterpe models, they already perform quite similarly to human participants in "listening" experiments.

For the implementation of "listening," we essentially manipulate the autoregressive generative process of the models in an invasive manner. During the generation of each token, we record the probability the model assigns to it and then feed a pre-selected token (based on the piece we want the model to "listen" to) for the next generation round. In this way, the model emulates listening to music and continuously reports its level of surprise.

Our analysis of melodic and harmonic predictions yielded results generally consistent with data from human subjects. We also obtained intriguing outcomes when examining polyphonic predictions. It is worth noting that composing music directly from a given musical context may be challenging for the average human listener; therefore, these composition-trained models perform satisfactorily in the theoretically simpler listening experiments. We are interested not only in areas where the model falls short compared to human subjects but also in aspects where it outperforms humans. To model human listening more accurately, we might even need to weaken some of the model's attentional abilities.

I concur with your observation regarding the token system, and I am curious about the extent to which different MIDI encoding methods might impact the model's performance. I look forward to the forthcoming improvements to these models and the performance differences they will introduce in our listening experiments.
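For concreteness, here is a minimal sketch of the teacher-forced "listening" loop described above. It is not the project's actual experimental code: `model` is assumed to be any causal transformer that returns logits of shape (batch, seq_len, vocab_size), and `piece_tokens` is a hypothetical pre-encoded token sequence.

```python
# Hedged sketch of the "listening" procedure: instead of sampling, each next
# token is taken from the pre-selected piece, and the probability the model
# assigned to it is recorded as a per-note surprisal value.
import torch
import torch.nn.functional as F

@torch.no_grad()
def listening_surprisal(model, piece_tokens: torch.Tensor) -> torch.Tensor:
    """Per-token surprisal (-log2 p) for a 1-D tensor of token ids."""
    surprisals = []
    for t in range(1, len(piece_tokens)):
        context = piece_tokens[:t].unsqueeze(0)       # tokens "heard" so far
        logits = model(context)[:, -1, :]             # prediction for step t
        probs = F.softmax(logits, dim=-1)
        p_next = probs[0, piece_tokens[t]]            # prob. of the actual note
        surprisals.append(-torch.log2(p_next))
    return torch.stack(surprisals)

# Usage: surprise = listening_surprisal(model, torch.tensor(encoded_piece))
```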
-
@Xiaoxuan-Wang You are welcome, and if you have more questions about those implementations, feel free to ask at any time :) Thank you for explaining your project, methodology, and approach. It sounds very interesting and I wish you success with your research. In regard to encoding, I have personally been using triplets lately (3-token encoding) to encode MIDI notes. I found such a design to be efficient and easy to use. Now, in terms of encoding structure, I found the following things useful for helping the model to learn and play well:
These are the most important points IMHO about encoding and its structure. Please stand by for the Perceiver and Euterpe updates. I should post everything this week and hopefully the results will be better. I do not plan to update GIGA-Piano (I do not see any point atm), so only Perceiver and Euterpe will be updated. The Euterpe update will be posted here:

Most sincerely,

Alex.
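To make the triplet idea above concrete, here is a minimal illustration of one possible 3-tokens-per-note layout. The (delta-time, duration, pitch) split, the time resolution, and the token offsets are assumptions chosen for illustration only; the repository's actual encoder may differ.

```python
# Hedged sketch of a triplet-style (3-tokens-per-note) MIDI encoding.
TIME_BINS = 128      # delta-time / duration values 0..127 (assumed resolution)
DUR_OFFSET = 128     # duration tokens occupy 128..255
PITCH_OFFSET = 256   # pitch tokens occupy 256..383 (MIDI pitches 0..127)

def encode_note(delta_time: int, duration: int, pitch: int) -> list[int]:
    """Encode one MIDI note as three tokens: time shift, duration, pitch."""
    return [
        min(delta_time, TIME_BINS - 1),
        DUR_OFFSET + min(duration, TIME_BINS - 1),
        PITCH_OFFSET + pitch,
    ]

def decode_note(tokens: list[int]) -> tuple[int, int, int]:
    """Invert encode_note for a single 3-token group."""
    dt, dur, pit = tokens
    return dt, dur - DUR_OFFSET, pit - PITCH_OFFSET

# Example: a middle-C note starting 16 ticks after the previous note.
print(encode_note(16, 32, 60))       # -> [16, 160, 316]
print(decode_note([16, 160, 316]))   # -> (16, 32, 60)
```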
-
Dear Alex,
My name is Xiaoxuan, and I am a graduate student at the University of Cambridge, Centre for Music and Science. I am currently working on a research project under the supervision of Dr. Peter MC Harrison. Our focus is on employing deep learning to model music cognition, specifically by adapting autoregressive music generation models into perception models. These perception models will report the surprise (based on probability) at each note in a given musical sequence, similar to a human participant. Your implementation of the Perceiver AR model (Perceiver-Music-Transformer) appears to be a perfect fit for our research.
However, as the models learn musical structures in a self-supervised manner, it is vital to examine these "black boxes" through a cognitive science lens. We are developing psychological paradigms to test the musical structures learned by the models and their learning process. In these tests, it is crucial to know the models' exposure, as unbiased stimulus design depends on it. If the stimuli contain exact phrases that the model has already been trained on, we would only be testing the model's memory instead of its general understanding of musical structures.
Therefore, I am writing to request more detailed information about the dataset (GIGA-Piano & Euterpe Training Data) you used to train the solo-piano and multi-instrumental models.
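With that dataset information in hand, a simple exposure check could be run on candidate stimuli, for example testing whether any exact n-gram of tokens from a stimulus also occurs verbatim in the training corpus, so the experiment probes generalization rather than memorization. The sketch below is purely illustrative; the token sequences and the n-gram length are hypothetical.

```python
# Hedged sketch: reject stimuli whose token n-grams appear in the training set.
from typing import Iterable

def ngrams(tokens: list[int], n: int) -> set[tuple[int, ...]]:
    """All contiguous n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def stimulus_is_unseen(stimulus: list[int],
                       training_pieces: Iterable[list[int]],
                       n: int = 8) -> bool:
    """True if no n-gram of the stimulus appears verbatim in the training set."""
    stim = ngrams(stimulus, n)
    return all(stim.isdisjoint(ngrams(piece, n)) for piece in training_pieces)

# Usage: keep only stimuli for which stimulus_is_unseen(stim, corpus) is True.
```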
If you are interested, we can arrange a Zoom meeting for a more in-depth discussion of our research project. My email address is xw407@cam.ac.uk.
Best,
Xiaoxuan