This page documents our progress toward Singing Style Transfer, measured on one of our aligned benchmark examples.
For all of the examples below, our target output is this:
And the source content is this:
Post-processing network results go here!
We have a functioning 1D-PatchMatch baseline using both spectrogram and DeepSpeech features. DeepSpeech features produce a noisier output, but one that is approaching the range where post-processing can clean it up. Spectrogram features produce a cleaner output that is less stylistically correct. (A sketch of the matching step follows below.)
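The write-up above doesn't include the matching code, so here is a minimal sketch of what 1D PatchMatch over feature frames could look like. The function name `patchmatch_1d`, the patch length, and the squared-error patch cost are illustrative assumptions, not necessarily the exact implementation used here.

```python
import numpy as np

def patchmatch_1d(content, style, patch=8, iters=5, seed=0):
    """Approximate nearest-neighbor field from content patches to style patches.

    content, style: (T, D) feature matrices (e.g. log-spectrogram frames or
    DeepSpeech activations). Returns, for each content patch start t, the
    index of a matching style patch start. Patch size and cost are assumptions.
    """
    rng = np.random.default_rng(seed)
    nc = content.shape[0] - patch + 1  # number of content patch positions
    ns = style.shape[0] - patch + 1    # number of style patch positions

    def cost(t, s):
        # Squared-error distance between two length-`patch` feature windows.
        d = content[t:t + patch] - style[s:s + patch]
        return float(np.sum(d * d))

    # Random initialization of the nearest-neighbor field.
    nnf = rng.integers(0, ns, size=nc)
    best = np.array([cost(t, nnf[t]) for t in range(nc)])

    for it in range(iters):
        # Alternate scan direction each iteration, as in 2D PatchMatch.
        step = 1 if it % 2 == 0 else -1
        order = range(nc) if step == 1 else range(nc - 1, -1, -1)
        for t in order:
            # Propagation: try the (shifted) match of the preceding patch.
            tp = t - step
            if 0 <= tp < nc:
                cand = nnf[tp] + step
                if 0 <= cand < ns:
                    c = cost(t, cand)
                    if c < best[t]:
                        nnf[t], best[t] = cand, c
            # Random search: sample around the current match at shrinking radii.
            radius = ns
            while radius >= 1:
                cand = int(np.clip(nnf[t] + rng.integers(-radius, radius + 1),
                                   0, ns - 1))
                c = cost(t, cand)
                if c < best[t]:
                    nnf[t], best[t] = cand, c
                radius //= 2
    return nnf
```

Stylized output is then assembled by cloning (and overlap-adding) the style patches that `nnf` selects for each content position.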
Neural Features:
Spectral Features:
Perfect Features:
(This is the result we would get if we magically selected the optimum style patch for each input patch. It represents an upper bound on the quality of stylized audio achievable with a patch-cloning approach and no post-processing.)
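One way to read "optimum style patch" is an exhaustive argmin under the same patch cost, which is what this sketch computes. It assumes the squared-error cost and `(T, D)` feature layout from the sketch above; the brute-force O(nc·ns) search is only feasible because the benchmark clips are short.

```python
import numpy as np

def oracle_nnf(content, style, patch=8):
    """Exhaustive search: the best style patch for every content patch.

    Upper-bound ("perfect features") counterpart to patchmatch_1d: same
    squared-error cost, but a brute-force argmin instead of an
    approximate randomized search.
    """
    nc = content.shape[0] - patch + 1
    ns = style.shape[0] - patch + 1
    nnf = np.empty(nc, dtype=int)
    for t in range(nc):
        costs = np.array([np.sum((content[t:t + patch] - style[s:s + patch]) ** 2)
                          for s in range(ns)])
        nnf[t] = int(np.argmin(costs))
    return nnf
```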