Commit 5d971db: Published code to github/gitlab

zingmars committed Jul 17, 2019 (0 parents)
Showing 17 changed files with 911 additions and 0 deletions.
16 changes: 16 additions & 0 deletions .gitignore
@@ -0,0 +1,16 @@
__pycache__/*
*.sw[a-p]
*.ini
log.*
*.log
data/*
test
test-valid
ignored/*
*.save
*.weights
*.out
*.wav
*.h5
*.csv
*.png
7 changes: 7 additions & 0 deletions LICENSE
@@ -0,0 +1,7 @@
Copyright 2019 Ingmars Daniels Melkis <contact@zingmars.me>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
59 changes: 59 additions & 0 deletions README.MD
@@ -0,0 +1,59 @@
Vocal and music separation using a CNN
===

# Description

This CNN attempts to separate the vocals from the music. It trains on the amplitude data of the audio file and estimates where the voiced parts are. Vocal separation is done by generating a binary mask over the time-frequency bins that the network thinks contain vocals and applying that mask to the original file's spectrogram.
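In rough terms, the separation step looks like the sketch below (a minimal sketch, assuming the defaults from `config.py` and that librosa and soundfile are installed; `model_predict` is a hypothetical stand-in for the trained network, not a function in this repository):

```python
import librosa
import numpy as np
import soundfile as sf

# Load the mixture, compute its STFT, mask out the non-vocal bins,
# and resynthesise the estimated vocals.
audio, sr = librosa.load("audio.wav", sr=22050, mono=True)
spectrogram = librosa.stft(audio, n_fft=1024, hop_length=256)
mask = model_predict(np.abs(spectrogram))  # hypothetical: one 0/1 value per time-frequency bin
vocals = librosa.istft(spectrogram * mask, hop_length=256)
sf.write("vocals.wav", vocals, sr)
```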

# Requirements

* Python 3
* TensorFlow (tested with tensorflow-gpu) and Keras
* A few other Python libraries, which you can install by running `pip install -r requirements.txt` in the root directory

# Dataset

* The script was only tested with .wav files (16-bit and 24-bit signed WAVs should work, 32-bit float doesn't). Other formats might work if your version of librosa is capable of opening them.
* The training data folder should contain an individual folder for each song. Each song folder should hold two files: `mixture.wav` (the full song) and `vocals.wav` (the original vocals). See below for a list of data sets that you could use to train this network.
* To see an example of how the directory structure should look, refer to [structure.md](structure.md).
* To make things faster, all songs should already have the configured sampling rate (I only tested 22050 Hz, but other sample rates should work) and should be in mono. If they aren't, the script converts them on the fly (see the sketch after this list), but the result isn't saved anywhere and the conversion takes a while.
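When a file doesn't match the expected rate or channel count, the conversion the script performs amounts to something like this sketch (`librosa.load` resamples and downmixes on the fly; the song path is a placeholder):

```python
import librosa

# sr forces a resample to 22050 Hz and mono=True downmixes to one channel;
# both happen in memory only, so the file on disk is never updated.
audio, sr = librosa.load("data/somesong/mixture.wav", sr=22050, mono=True)
```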

## Example data sets

* DSD100: https://sigsep.github.io/datasets/dsd100.html
* MUSDB: https://sigsep.github.io/datasets/musdb.html
* MedleyDB: https://medleydb.weebly.com/

# Setting up

1. `pip install -r requirements.txt`
2. `python main.py`

# Running

1. `python main.py -h` to see all arguments
2. `python main.py` will train the network with the default options
3. `python main.py --mode=separate --file=audio.wav` will attempt source separation on `audio.wav` and will output `vocals.wav`
4. `python main.py --mode=evaluate` will evaluate the effectiveness of audio source separation. More information below.

## Configuring

All relevant settings are located in the `config.ini` file. The file doesn't exist in the repository; it will be created and prepopulated with the default values on first run. For information on what each option does, see `config.py`.
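Based on the defaults in `config.py`, the generated file should look roughly like this:

```ini
[logging]
logfile = log.txt
loglevel = INFO
logtype = console

[song]
sample_size = 22050
window_size = 1024
hop_length = 256
sample_length = 25

[model]
save_history = true
history_filename = history.csv
```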

## Evaluating

This program also includes a simple wrapper around BSS-Eval which can be used to determine how effective the audio source separation is. To use it you need four files: the original vocals (`vocals.wav`), the original accompaniment (`accompaniment.wav`), the estimated vocals (`estimated_vocals.wav`) and the estimated accompaniment (`estimated_accompaniment.wav`); the underscored names are what the evaluator looks for. If you don't have the accompaniment but have a mixture and vocals, you can generate it using the `apply_vocal_mask.py` script in the `misc` folder. To get the estimated accompaniment, perform separation with the `--save_accompaniment` flag set to true. Once you have all four files, create a data directory that contains a directory named after the song and copy all four files into it.

Note that librosa by default outputs a 32-bit float WAV file, which it can't load back without ffmpeg, so you either need to add an extra conversion step between separation and evaluation or install ffmpeg and add it to your PATH. All files also need to have the same format and bit depth for evaluation to succeed.
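For example, assuming ffmpeg is installed and on your PATH, a conversion along the lines of `ffmpeg -i estimated_vocals.wav -c:a pcm_s16le estimated_vocals_16bit.wav` produces a 16-bit signed WAV that librosa can load without ffmpeg.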

### Bug

For a reason I haven't had the time to track down yet, the neural network's output sometimes has slightly fewer samples than the original (about 76 samples, which is around 0.004s of audio). The evaluation script accounts for this, but be advised that some samples are lost during evaluation.

## Weights files when training

While training, the network saves its weights every 5 epochs to avoid data loss should you have a power failure or a similar issue. These files may be deleted after training.

## Misc

The `misc` directory contains a few scripts that might be useful but aren't required to run the neural net.
31 changes: 31 additions & 0 deletions config.py
@@ -0,0 +1,31 @@
import configparser

def prepare_config(filename):
config = configparser.ConfigParser()
config.read(filename)

# Set defaults
config_get(config, 'logging', 'logfile', 'log.txt')
config_get(config, 'logging', 'loglevel', 'INFO') #debug,info,warning,critical
config_get(config, 'logging', 'logtype', 'console') #file/console

config_get(config, 'song', 'sample_size', "22050") #Sample rate of the audio we will work with. If loaded audio doesn't match, it will be resampled.
    config_get(config, 'song', 'window_size', "1024") #We get window_size / 2 + 1 frequency bins to work with (513 at the default). 1024-1568 seems to be the ideal range.
    config_get(config, 'song', 'hop_length', "256") #Duration of each bin = hop_length / sample rate (roughly 11.6ms at the defaults). The smaller it is, the more bins we get, but we don't need that much resolution.
    config_get(config, 'song', 'sample_length', "25") #How many time frames we give to the neural net for context. Fewer frames mean more guesswork from the network, but also more samples from each song.

config_get(config, 'model', 'save_history', "true") #Saves keras accuracy and loss history per epoch
config_get(config, 'model', 'history_filename', "history.csv")

with open(filename, 'w') as configfile: # If the file didn't exist, write default values to it
config.write(configfile)
return config

# Ensures that config[section][key] exists, filling in the given default if the
# section or the option is missing. Existing values are left untouched.
def config_get(config, section, key, default):
    try:
        config.get(section, key)
    except configparser.NoSectionError:
        config.add_section(section)
        config.set(section, key, default)
    except configparser.NoOptionError:
        config.set(section, key, default)
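For illustration, a minimal sketch of how `prepare_config` is meant to be used (assuming a `config.ini` in the working directory):

```python
from config import prepare_config

# Creates config.ini with default values if it doesn't exist yet,
# then returns the populated ConfigParser.
config = prepare_config("config.ini")
sample_rate = config.getint("song", "sample_size")  # typed access, as used in dataset.py
```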
76 changes: 76 additions & 0 deletions dataset.py
@@ -0,0 +1,76 @@
import os
import sys
import logging
from song import Song
import numpy as np

# Dataset: loads and passes training data to the model
class Dataset:
def __init__(self, logger, config):
self.logger=logger
self.config=config
# Raw data
self.mixtures = []
self.vocals = []
# Outputs for CNN
self.mixture_windows = []
self.labels = []

    # Loads the mixture and vocals for each song and generates STFTs for them
def load(self, folder):
if os.path.isdir(folder):
for root, dirs, files in os.walk(folder):
for file in filter(lambda f: f.endswith(".wav"), files):
self.logger.info("Loading song %s and computing stft for it.", os.path.join(root, file))
song_type = os.path.splitext(file)[0].lower()
if song_type == "mixture" or song_type == "vocals":
song = Song(self.logger, os.path.basename(root), self.config)
song.load_file(os.path.join(root,file))
song.compute_stft()
if(song_type == "mixture"):
self.mixtures.append(song)
elif(song_type == "vocals"):
self.vocals.append(song)
self.logger.debug("%s loaded successfully.", song_type)
else:
self.logger.debug("File %s is not named correctly. Ignoring...", song_type)
else:
self.logger.critical("Folder %s does not exist!", folder)
sys.exit(8)
        if(len(self.mixtures) != len(self.vocals)):
self.logger.critical("There doesn't appear to be a vocal track for each mixture (or the other way around).")
sys.exit(15)

def get_data_for_cnn(self):
length = self.config.getint("song", "sample_length")
self.logger.info("Preparing data of type 'mixture' for the CNN...")
if len(self.mixtures) == 0:
self.logger.critical("No mixtures for training found. Did you name them wrong?")
sys.exit(9)
self.logger.debug("Preparing %i songs...", len(self.mixtures))
amplitudes = None
        # Always consume element 0 and delete it afterwards so memory is freed as
        # we go; range() is evaluated once, so we still iterate once per song.
        for num in range(0, len(self.mixtures)):
            if amplitudes is None:
                amplitudes = self.mixtures[0].split_spectrogram(length)
            else:
                amplitudes = np.vstack((amplitudes, self.mixtures[0].split_spectrogram(length)))
            del self.mixtures[0]
self.logger.debug("Got %i slices. Each slice has %i frequency bins, and each frequency bin has %i time slices.", len(amplitudes), len(amplitudes[0]), len(amplitudes[0][0]))
self.logger.debug("Adding a 4th dimension to placate the CNN model...")
# Add a dimension to make the CNN accept the data. Signifies that we have a greyscale "picture"
amplitudes = np.array(amplitudes).reshape(len(amplitudes), len(amplitudes[0]), len(amplitudes[0][0]), 1)
self.mixture_windows = amplitudes

def get_labels_for_cnn(self):
length = self.config.getint("song", "sample_length")
self.logger.info("Preparing data of type 'vocals' for the CNN...")
if len(self.vocals) == 0:
self.logger.critical("No original vocals for training found. Did you name them wrong?")
sys.exit(10)
self.logger.debug("Preparing %i songs...", len(self.vocals))
labels = []
        # Same destructive pattern as in get_data_for_cnn: consume element 0 and
        # delete it each iteration to free memory.
        for num in range(0, len(self.vocals)):
            labels.extend(self.vocals[0].get_labels(length))
            del self.vocals[0]
self.logger.debug("Got %i slices.", len(labels))
self.labels = np.array(labels)
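For illustration, a sketch of how `Dataset` is driven during training (logger and config construction omitted; the data folder name is an assumption):

```python
# Load each song's mixture/vocals pair, then build the CNN inputs and labels.
dataset = Dataset(logger, config)
dataset.load("data")          # expects data/<song>/mixture.wav and data/<song>/vocals.wav
dataset.get_data_for_cnn()    # fills dataset.mixture_windows, shape (N, bins, sample_length, 1)
dataset.get_labels_for_cnn()  # fills dataset.labels
```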
96 changes: 96 additions & 0 deletions evaluate.py
@@ -0,0 +1,96 @@
# Evaluate the accuracy of the neural network by calculating SDR (distortion)
# SIR (interference from other sources) and SAR (artifacts)
import numpy as np
import museval
import os
import sys
from song import Song

class Evaluator:
def __init__(self, logger, config):
self.logger=logger
self.config=config
self.vocals=None
self.accompaniments=None
self.estimated_vocals=None
self.estimated_accompaniments=None
self.names=None

def load_data(self, folder):
self.vocals=[]
self.accompaniments=[]
self.estimated_vocals=[]
self.estimated_accompaniments=[]
if os.path.isdir(folder):
        for root, dirs, files in os.walk(folder):
for file in filter(lambda f: f.endswith(".wav"), files):
song_type = os.path.splitext(file)[0].lower()
self.logger.info("Loading song %s.", os.path.join(root, file))
if song_type == "vocals" or song_type == "accompaniment" or song_type == "estimated_vocals" or song_type == "estimated_accompaniment":
song = Song(self.logger, os.path.basename(root), self.config)
song.load_file(os.path.join(root,file))
if(song_type == "vocals"):
self.vocals.append(song)
elif(song_type == "accompaniment"):
self.accompaniments.append(song)
elif(song_type == "estimated_vocals"):
self.estimated_vocals.append(song)
elif(song_type == "estimated_accompaniment"):
self.estimated_accompaniments.append(song)
self.logger.debug("%s loaded successfully.", song_type)
else:
self.logger.debug("File %s is not named correctly. Ignoring...", song_type)
else:
self.logger.critical("Folder %s does not exist!", folder)
sys.exit(13)
if (len(self.vocals) != len(self.accompaniments)) or (len(self.accompaniments) != len(self.estimated_vocals)) or (len(self.estimated_vocals) != len(self.estimated_accompaniments)):
self.logger.critical("Array size mismatch. Did you misname a file?")
sys.exit(14)

# Extracts data from the dataset and sets the correct dimensions
def prepare_data(self):
self.names = []
for element in range(0, len(self.vocals)):
self.logger.debug("Processing %s...", self.vocals[element].get_name())
self.names.append(self.vocals[element].get_name())
self.vocals[element] = np.expand_dims(self.vocals[element].get_raw_data(), 1)
self.accompaniments[element] = np.expand_dims(self.accompaniments[element].get_raw_data(), 1)
self.estimated_vocals[element] = np.expand_dims(self.estimated_vocals[element].get_raw_data(), 1)
self.estimated_accompaniments[element] = np.expand_dims(self.estimated_accompaniments[element].get_raw_data(), 1)
        # Keep these as plain Python lists: songs can have different lengths, so
        # they wouldn't form a regular numpy array anyway, and the trimmed arrays
        # are reassigned to individual elements below.
        # Since the neural net outputs slightly less data than the original, we
        # cut off the part that we can't compare. Simply padding might seem like a
        # better idea, but we can't assume that the last few milliseconds have
        # nothing going on in them.
for element in range(0, len(self.vocals)):
if np.shape(self.vocals[element])[0] > np.shape(self.estimated_vocals[element])[0]:
self.logger.debug("Reshaping arrays for %s...", self.names[element])
difference = np.shape(self.vocals[element])[0] - np.shape(self.estimated_vocals[element])[0]
                self.vocals[element] = self.vocals[element][:-difference, :]
                self.accompaniments[element] = self.accompaniments[element][:-difference, :]

def calculate_metrics(self):
        sdr, sir, sar = [], [], []  # distinct lists; all three are replaced on the first iteration
for element in range(0, len(self.vocals)):
original_data = np.stack((self.vocals[element], self.accompaniments[element]))
estimated_data = np.stack((self.estimated_vocals[element], self.estimated_accompaniments[element]))
museval.metrics.validate(original_data, estimated_data)
self.logger.info("Calculating metrics for %s...", self.names[element])
obtained_sdr, _, obtained_sir, obtained_sar, _ = museval.metrics.bss_eval(original_data, estimated_data, window=np.inf, hop=0)
if element == 0:
sdr = obtained_sdr
sir = obtained_sir
sar = obtained_sar
else:
sdr = np.column_stack((sdr, obtained_sdr))
sir = np.column_stack((sir, obtained_sir))
sar = np.column_stack((sar, obtained_sar))
return sdr, sir, sar

def print_metrics(self, sdr, sir, sar):
self.logger.info("Printing results...")
for element in range(0, len(self.names)):
self.logger.info("Song name: %s", self.names[element])
self.logger.info("Vocals: SDR: %.2f, SIR: %.2f, SAR: %.2f", sdr[0][element], sir[0][element], sar[0][element])
self.logger.info("Accompaniments: SDR: %.2f, SIR: %.2f, SAR: %.2f", sdr[1][element], sir[1][element], sar[1][element])