Digitally process audio data with AI & ML
For the course Digital Creativity we explored the open source library Google Magenta DDSP.
We decided to work mostly on Google Colab because it is much more convenient regarding installations, dependencies and training on a GPU. The only exception is working with the dataset: it was downloaded from Google Cloud to a local disk and sorted there using this notebook.
There are already notebooks on converting your own WAV data to the required format (TFRecord, TFR for short) when working with DDSP. Since we did not have enough suitable WAV data we used a TFR dataset with prepared MIDI samples.
We familiarized ourselves with DDSP by working through many of the tutorials (DDSP TUTORIALS).
Afterwards we used our gathered and sorted TFR data for a small training run on a single instrument type and then predicted a sample of another instrument with the help of an adjusted DDSP NOTEBOOK (we recommend working with this version for reproduction, continuation etc.). Prediction in this sense means predicting how a sample (e.g. a keyboard tone) would sound with the sound characteristics (timbre) of a different instrument (e.g. a string), or put more simply: how would a keyboard tone sound if it were played with a string timbre?
One song of 3 minutes corresponds to roughly 1 million time steps, BUT the relevant information is much sparser! The art is to extract those features and find a meaningful representation for music. If music is only structured as a bit stream of 1s and 0s it is very difficult to know what is going on.
With strided convolution models, waveforms are represented as overlapping frames. In reality, however, sound oscillates with shifting phase, so the waveform would have to be aligned precisely between two fixed frames, otherwise the phase misalignment leads to bias.
Another widely used method is to learn the waveform frames as Fourier coefficients: decompose them into sine and cosine waves and finally recreate the sound wave from these Fourier components. However, the frames overlap, and therefore this procedure again leads to bias.
Autoregressive models try to mitigate these problems by constructing the waveform sample by sample, so they do not suffer from the same bias the others do.
However, the waveform shapes still do not perfectly correlate with human perception and get corrected inconsistently during model training:
For example, the waveforms on the right sound the same to humans but cause different perceptual losses for the model. Moreover, these models need a lot of data to work.
Oscillation is defined as the process of repeating variations of any quantity or measure about its equilibrium value in time.
Most of the things in nature oscillate (vibrate) at a characteristic (natural) frequency or frequencies.
Some familiar examples are the motion of a clock pendulum or a playground swing, the up and down motion of small boats, ocean waves, and the motion of strings or reeds on musical instruments.
Rather than predicting waveforms or Fourier coefficients, these models directly generate the oscillations.
These analysis/synthesis models use expert knowledge and hand-tuned heuristics to extract synthesis parameters (analysis) that are interpretable (loudness in dB, frequencies in Hz) and can be used by the generative algorithm (synthesis); a small sketch follows the parameter list below.
- Fundamental Frequency F0 (Hz)
- Harmonics (integer multiples of F0: odd, even, ...)
- Amplitude (dB)
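To make these parameters concrete, here is a minimal, library-independent sketch (plain NumPy, not DDSP code) of how a fundamental frequency, a set of harmonic amplitudes and an overall amplitude can be turned into an oscillation:

```python
import numpy as np

SAMPLE_RATE = 16000   # samples per second
DURATION = 2.0        # seconds

def additive_synth(f0_hz, harmonic_amps, amplitude,
                   sample_rate=SAMPLE_RATE, duration=DURATION):
    """Generate a waveform as a sum of harmonics (integer multiples of F0)."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    audio = np.zeros_like(t)
    for k, amp_k in enumerate(harmonic_amps, start=1):   # k-th harmonic = k * F0
        audio += amp_k * np.sin(2 * np.pi * k * f0_hz * t)
    # Normalize, then scale by the overall amplitude.
    return amplitude * audio / max(np.max(np.abs(audio)), 1e-8)

# Example: 220 Hz fundamental with decaying harmonic amplitudes.
audio = additive_synth(f0_hz=220.0, harmonic_amps=[1.0, 0.5, 0.25, 0.125], amplitude=0.8)
```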
For our first trial we used the nsynth/full dataset, but then realized its features were not optimally suited for working with DDSP, so we switched to nsynth/gansynth_subset.f0_and_loudness/2.3.3, which includes the f0 and loudness features that were missing.
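If you just want to inspect the dataset before downloading anything, one option (not the exact code from our notebooks) is to stream it via TensorFlow Datasets using the catalog name mentioned above:

```python
import tensorflow_datasets as tfds

# Streams the GANSynth subset with precomputed f0 and loudness features.
# Note: preparing the dataset on first use can take a long time.
ds = tfds.load('nsynth/gansynth_subset.f0_and_loudness:2.3.3', split='train')

for example in ds.take(1):
    print(example.keys())         # e.g. audio, pitch, instrument, f0, loudness, ...
    print(example['instrument'])  # instrument metadata (family, label, source)
```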
If you would like to try out a training on a single instrument without downloading the whole dataset: we uploaded two TFR files (containing a lot of samples!) for string and keyboard (data folder), which used up our whole GitHub LFS (Large File Storage) quota. The data there should be enough to train on the string samples and predict with one keyboard sample, or the other way around.
For more efficient training we downloaded the whole GANSYNTH 2.3.3 (sub)dataset from Google Cloud Storage with this link.
To download multiple items at once you need to use gsutil. This requires parts of the Google Cloud CLI to be installed on your computer:
1.) install the Google Cloud CLI
2.) make sure gsutil is installed with the Google Cloud CLI (e.g. try gsutil ls in a command prompt: is the command recognized?)
3.) download the files with the gsutil command from the cloud to a (local) storage location (an external drive, e.g. "E:\gansynth", is recommended for big data amounts):
gsutil -m cp -r "gs://tfds-data/PATH" "STORAGE_PATH"
For our project we used the TensorFlow GAN subset of the NSynth dataset. It offers preprocessed samples that contain the most relevant features (amplitude and frequency) ready to use with the DDSP library.
For efficient training we downloaded the samples of the 11 instrument types instead of streaming them. Since the data was not sorted by instrument type, we had to do this sorting as an additional step in order to observe the effects of training on a single instrument type.
We read the TFRecord files into Python, parsed them to JSON to identify the instrument label and then wrote them back to TFRecord files with the help of this notebook. For this to work properly, we had to continuously remove the written objects from memory so that it did not overflow.
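The exact procedure is in the notebook linked above; the following is only a rough sketch of the idea (the storage path and the feature key 'instrument/family' are assumptions and may differ in the actual files): read the raw records, look up the instrument label, and append each record to a per-instrument TFRecord file.

```python
import tensorflow as tf

INPUT_FILES = tf.io.gfile.glob('E:/gansynth/*.tfrecord*')  # hypothetical storage path
LABEL_KEY = 'instrument/family'                            # assumed feature name

writers = {}  # one TFRecordWriter per instrument family

for raw_record in tf.data.TFRecordDataset(INPUT_FILES):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    family = example.features.feature[LABEL_KEY].int64_list.value[0]

    if family not in writers:
        writers[family] = tf.io.TFRecordWriter(f'instrument_{family}.tfrecord')
    writers[family].write(raw_record.numpy())

    del example  # free the parsed proto so memory does not fill up

for w in writers.values():
    w.close()
```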
All in all this procedure took around 10 hours to sort the samples of the first dataset and significantly less time (30-60 minutes) for the second (smaller) set.
To get our TFR data working with DDSP (e.g. the training notebook) we had to adjust the classes slightly to accept feature names with slashes instead of underscores (e.g. f0/hz instead of f0_hz); otherwise we would have had to redo the whole sorting process just to change the feature names.
The features are stored as FloatList tensors which contain the values over very small time steps (e.g. a length of 64,000, i.e. 4 seconds of audio at 16 kHz).
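Putting the last two points together, here is a hedged sketch of how such a record could be parsed and the slashed feature names mapped to the underscore names used in the DDSP training code (the exact keys, frame counts and file name are assumptions based on our dataset):

```python
import tensorflow as tf

N_SAMPLES = 64000   # 4 seconds of audio at 16 kHz
N_FRAMES = 1000     # assumed frame count of the f0/loudness tracks

# Assumed feature spec: audio plus framewise f0 and loudness, stored with slashes.
feature_spec = {
    'audio': tf.io.FixedLenFeature([N_SAMPLES], tf.float32),
    'f0/hz': tf.io.FixedLenFeature([N_FRAMES], tf.float32),
    'loudness/db': tf.io.FixedLenFeature([N_FRAMES], tf.float32),
}

def parse_and_rename(raw_record):
    parsed = tf.io.parse_single_example(raw_record, feature_spec)
    # Rename the slashed keys to the underscore names DDSP expects.
    return {
        'audio': parsed['audio'],
        'f0_hz': parsed['f0/hz'],
        'loudness_db': parsed['loudness/db'],
    }

dataset = (tf.data.TFRecordDataset(['instrument_0.tfrecord'])  # hypothetical file
           .map(parse_and_rename))
```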
For efficient processing, the features of the input data have to be aligned with the architecture of the neural network.
The DDSP architecture is based on an autoencoder network.
That is where the DDSP library comes in: it offers sound modules (synthesizers) that are differentiable and can therefore use backpropagation to tune their synthesizer parameters (analogous to recreating a sound on a synthesizer), and with the help of specialized, structured layers they do not learn as much bias as the other models.
Thanks to these layer types the autoencoders train faster and give quick feedback, which offers a more instrument-like workflow than iterating for 16 hours of training before you can make further changes.
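To illustrate what "differentiable synthesizer" means in practice, here is a hedged sketch (assuming ddsp.synths.Harmonic and ddsp.losses.SpectralLoss behave as in the library version we used; the control shapes are assumptions): the synthesis controls are ordinary tensors, so a spectral loss against a target recording can be backpropagated straight into them.

```python
import tensorflow as tf
import ddsp

SAMPLE_RATE = 16000
N_SAMPLES = SAMPLE_RATE * 4         # 4 seconds of audio
N_FRAMES, N_HARMONICS = 1000, 60    # framewise controls (assumed shapes)

synth = ddsp.synths.Harmonic(n_samples=N_SAMPLES, sample_rate=SAMPLE_RATE)
loss_fn = ddsp.losses.SpectralLoss()   # multi-scale spectrogram loss

# Trainable synthesis controls: overall amplitude, harmonic distribution, f0.
amps = tf.Variable(tf.zeros([1, N_FRAMES, 1]))
harmonic_distribution = tf.Variable(tf.zeros([1, N_FRAMES, N_HARMONICS]))
f0_hz = tf.Variable(220.0 * tf.ones([1, N_FRAMES, 1]))

target_audio = tf.random.normal([1, N_SAMPLES])   # placeholder for a real recording
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

for step in range(100):
    with tf.GradientTape() as tape:
        audio = synth(amps, harmonic_distribution, f0_hz)   # differentiable synthesis
        loss = loss_fn(target_audio, audio)
    grads = tape.gradient(loss, [amps, harmonic_distribution, f0_hz])
    optimizer.apply_gradients(zip(grads, [amps, harmonic_distribution, f0_hz]))
```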
For training on a TFR dataset we recommend using this notebook as a starting point.
We received the following outputs when training with 3 different synthesizers (= neural layers) on the same string data (until the learning curve flattens, usually around 4.5-5) and predicting on the same keyboard sample.
We can observe from the spectrograms that the harmonic synthesizer has, as you would probably expect, the richest harmonic distribution.
Since time for this project was scarce and the complexity relatively high, we have not yet completed a full, large training run. To continue with the gathered data and the lessons learned from a small training on a single instrument, options for a long training would be:
- try a bigger training run with the timbre transfer notebook
- train a VST on the VST notebook
- ...
We also prepared the timbre transfer notebook, since the original version did not work with the updated dependencies.
For more content, just have a look at the DDSP demos: there are a lot of (new) ideas once you are familiar with the library and the data!
YouTube: Google staff research scientist Jesse Engel explaining DDSP
All notebook sources in the folder ddsp_notebooks_adjusted belong to Google Magenta's DDSP research team.
# Copyright 2021 Google LLC. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
@inproceedings{
engel2020ddsp,
title={DDSP: Differentiable Digital Signal Processing},
author={Jesse Engel and Lamtharn (Hanoi) Hantrakul and Chenjie Gu and Adam Roberts},
booktitle={International Conference on Learning Representations},
year={2020},
url={https://openreview.net/forum?id=B1x1ma4tDr}
}
training on single instrument notebook
timbre transfer notebook