Commit 5d971db: Published code to github/gitlab

zingmars committed Jul 17, 2019 (0 parents)
Showing 17 changed files with 911 additions and 0 deletions.
16 changes: 16 additions & 0 deletions .gitignore
@@ -0,0 +1,16 @@
__pycache__/*
*.sw[a-p]
*.ini
log.*
*.log
data/*
test
test-valid
ignored/*
*.save
*.weights
*.out
*.wav
*.h5
*.csv
*.png
7 changes: 7 additions & 0 deletions LICENSE
@@ -0,0 +1,7 @@
Copyright 2019 Ingmars Daniels Melkis <contact@zingmars.me>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
59 changes: 59 additions & 0 deletions README.MD
@@ -0,0 +1,59 @@
Vocal and music separation using a CNN
===

# Description

This CNN attempts to separate the vocals from the music. It trains on the amplitude data of the audio file and estimates where the voiced parts are. Vocal separation is done by generating a binary mask over the time-frequency bins that the network thinks contain vocals and applying that mask to the original file's spectrogram.
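In rough terms, the separation step looks like the sketch below (a minimal sketch, assuming the defaults from `config.py` and that librosa and soundfile are installed; `model_predict` is a hypothetical stand-in for the trained network, not a function in this repository):

```python
import librosa
import numpy as np
import soundfile as sf

# Load the mixture, compute its STFT, mask out the non-vocal bins,
# and resynthesise the estimated vocals.
audio, sr = librosa.load("audio.wav", sr=22050, mono=True)
spectrogram = librosa.stft(audio, n_fft=1024, hop_length=256)
mask = model_predict(np.abs(spectrogram))  # hypothetical: one 0/1 value per time-frequency bin
vocals = librosa.istft(spectrogram * mask, hop_length=256)
sf.write("vocals.wav", vocals, sr)
```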

# Requirements

* Python 3
* TensorFlow (tested with tensorflow-gpu) and Keras
* A few other Python libraries, which you can install by running `pip install -r requirements.txt` in the root directory

# Dataset

* The script was only tested with .wav files (16-bit and 24-bit signed WAVs should work, 32-bit float doesn't). Other formats might work if your version of librosa is capable of opening them.
* The training data folder should contain an individual folder for each song. Each song folder should hold two files: `mixture.wav` (the full song) and `vocals.wav` (the original vocals). See below for a list of data sets that you could use to train this network.
* To see an example of how the directory structure should look, refer to [structure.md](structure.md).
* To make things faster, all songs should already have the configured sampling rate (I only tested 22050 Hz, but other sample rates should work) and should be in mono. If they aren't, the script converts them on the fly (see the sketch after this list), but the result isn't saved anywhere and the conversion takes a while.
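When a file doesn't match the expected rate or channel count, the conversion the script performs amounts to something like this sketch (`librosa.load` resamples and downmixes on the fly; the song path is a placeholder):

```python
import librosa

# sr forces a resample to 22050 Hz and mono=True downmixes to one channel;
# both happen in memory only, so the file on disk is never updated.
audio, sr = librosa.load("data/somesong/mixture.wav", sr=22050, mono=True)
```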

## Example data sets

* DSD100: https://sigsep.github.io/datasets/dsd100.html
* MUSDB: https://sigsep.github.io/datasets/musdb.html
* MedleyDB: https://medleydb.weebly.com/

# Setting up

1. `pip install -r requirements.txt`
2. `python main.py`

# Running

1. `python main.py -h` to see all arguments
2. `python main.py` will train the network with the default options
3. `python main.py --mode=separate --file=audio.wav` will attempt source separation on `audio.wav` and will output `vocals.wav`
4. `python main.py --mode=evaluate` will evaluate the effectiveness of audio source separation. More information below.

## Configuring

All relevant settings are located in the `config.ini` file. The file doesn't exist in the repository; it will be created and prepopulated with the default values on first run. For information on what each option does, see `config.py`.
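Based on the defaults in `config.py`, the generated file should look roughly like this:

```ini
[logging]
logfile = log.txt
loglevel = INFO
logtype = console

[song]
sample_size = 22050
window_size = 1024
hop_length = 256
sample_length = 25

[model]
save_history = true
history_filename = history.csv
```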

## Evaluating

This program also includes a simple wrapper around BSS-Eval which can be used to determine how effective the audio source separation is. To use it you need four files: the original vocals (`vocals.wav`), the original accompaniment (`accompaniment.wav`), the estimated vocals (`estimated_vocals.wav`) and the estimated accompaniment (`estimated_accompaniment.wav`); the underscored names are what the evaluator looks for. If you don't have the accompaniment but have a mixture and vocals, you can generate it using the `apply_vocal_mask.py` script in the `misc` folder. To get the estimated accompaniment, perform separation with the `--save_accompaniment` flag set to true. Once you have all four files, create a data directory that contains a directory named after the song and copy all four files into it.

Note that librosa by default outputs a 32-bit float WAV file, which it can't load back without ffmpeg, so you either need to add an extra conversion step between separation and evaluation or install ffmpeg and add it to your PATH. All files also need to have the same format and bit depth for evaluation to succeed.
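For example, assuming ffmpeg is installed and on your PATH, a conversion along the lines of `ffmpeg -i estimated_vocals.wav -c:a pcm_s16le estimated_vocals_16bit.wav` produces a 16-bit signed WAV that librosa can load without ffmpeg.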

### Bug

For a reason I haven't had the time to track down yet, the neural network's output sometimes has slightly fewer samples than the original (about 76 samples, which is around 0.004s of audio). The evaluation script accounts for this, but be advised that some samples are lost during evaluation.

## Weights files when training

While training, the network saves its weights every 5 epochs to avoid data loss should you have a power failure or a similar issue. These files may be deleted after training.

## Misc

The `misc` directory contains a few scripts that might be useful but aren't required to run the neural net.
31 changes: 31 additions & 0 deletions config.py
@@ -0,0 +1,31 @@
import configparser

def prepare_config(filename):
config = configparser.ConfigParser()
config.read(filename)

# Set defaults
config_get(config, 'logging', 'logfile', 'log.txt')
config_get(config, 'logging', 'loglevel', 'INFO') #debug,info,warning,critical
config_get(config, 'logging', 'logtype', 'console') #file/console

config_get(config, 'song', 'sample_size', "22050") #Sample rate of the audio we will work with. If loaded audio doesn't match, it will be resampled.
    config_get(config, 'song', 'window_size', "1024") #We get window_size / 2 + 1 frequency bins to work with (513 at the default). 1024-1568 seems to be the ideal range.
    config_get(config, 'song', 'hop_length', "256") #Duration of each bin = hop_length / sample rate (roughly 11.6ms at the defaults). The smaller it is, the more bins we get, but we don't need that much resolution.
    config_get(config, 'song', 'sample_length', "25") #How many time frames we give to the neural net for context. Fewer frames mean more guesswork from the network, but also more samples from each song.

config_get(config, 'model', 'save_history', "true") #Saves keras accuracy and loss history per epoch
config_get(config, 'model', 'history_filename', "history.csv")

with open(filename, 'w') as configfile: # If the file didn't exist, write default values to it
config.write(configfile)
return config

# Ensures that config[section][key] exists, filling in the given default if the
# section or the option is missing. Existing values are left untouched.
def config_get(config, section, key, default):
    try:
        config.get(section, key)
    except configparser.NoSectionError:
        config.add_section(section)
        config.set(section, key, default)
    except configparser.NoOptionError:
        config.set(section, key, default)
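For illustration, a minimal sketch of how `prepare_config` is meant to be used (assuming a `config.ini` in the working directory):

```python
from config import prepare_config

# Creates config.ini with default values if it doesn't exist yet,
# then returns the populated ConfigParser.
config = prepare_config("config.ini")
sample_rate = config.getint("song", "sample_size")  # typed access, as used in dataset.py
```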
76 changes: 76 additions & 0 deletions dataset.py
@@ -0,0 +1,76 @@
import os
import sys
import logging
from song import Song
import numpy as np

# Dataset: loads and passes training data to the model
class Dataset:
def __init__(self, logger, config):
self.logger=logger
self.config=config
# Raw data
self.mixtures = []
self.vocals = []
# Outputs for CNN
self.mixture_windows = []
self.labels = []

    # Loads the mixture and vocals for each song and generates STFTs for them
def load(self, folder):
if os.path.isdir(folder):
for root, dirs, files in os.walk(folder):
for file in filter(lambda f: f.endswith(".wav"), files):
self.logger.info("Loading song %s and computing stft for it.", os.path.join(root, file))
song_type = os.path.splitext(file)[0].lower()
if song_type == "mixture" or song_type == "vocals":
song = Song(self.logger, os.path.basename(root), self.config)
song.load_file(os.path.join(root,file))
song.compute_stft()
if(song_type == "mixture"):
self.mixtures.append(song)
elif(song_type == "vocals"):
self.vocals.append(song)
self.logger.debug("%s loaded successfully.", song_type)
else:
self.logger.debug("File %s is not named correctly. Ignoring...", song_type)
else:
self.logger.critical("Folder %s does not exist!", folder)
sys.exit(8)
        if(len(self.mixtures) != len(self.vocals)):
self.logger.critical("There doesn't appear to be a vocal track for each mixture (or the other way around).")
sys.exit(15)

def get_data_for_cnn(self):
length = self.config.getint("song", "sample_length")
self.logger.info("Preparing data of type 'mixture' for the CNN...")
if len(self.mixtures) == 0:
self.logger.critical("No mixtures for training found. Did you name them wrong?")
sys.exit(9)
self.logger.debug("Preparing %i songs...", len(self.mixtures))
amplitudes = None
        # Always consume element 0 and delete it afterwards so memory is freed as
        # we go; range() is evaluated once, so we still iterate once per song.
        for num in range(0, len(self.mixtures)):
            if amplitudes is None:
                amplitudes = self.mixtures[0].split_spectrogram(length)
            else:
                amplitudes = np.vstack((amplitudes, self.mixtures[0].split_spectrogram(length)))
            del self.mixtures[0]
self.logger.debug("Got %i slices. Each slice has %i frequency bins, and each frequency bin has %i time slices.", len(amplitudes), len(amplitudes[0]), len(amplitudes[0][0]))
self.logger.debug("Adding a 4th dimension to placate the CNN model...")
# Add a dimension to make the CNN accept the data. Signifies that we have a greyscale "picture"
amplitudes = np.array(amplitudes).reshape(len(amplitudes), len(amplitudes[0]), len(amplitudes[0][0]), 1)
self.mixture_windows = amplitudes

def get_labels_for_cnn(self):
length = self.config.getint("song", "sample_length")
self.logger.info("Preparing data of type 'vocals' for the CNN...")
if len(self.vocals) == 0:
self.logger.critical("No original vocals for training found. Did you name them wrong?")
sys.exit(10)
self.logger.debug("Preparing %i songs...", len(self.vocals))
labels = []
        # Same destructive pattern as in get_data_for_cnn: consume element 0 and
        # delete it each iteration to free memory.
        for num in range(0, len(self.vocals)):
            labels.extend(self.vocals[0].get_labels(length))
            del self.vocals[0]
self.logger.debug("Got %i slices.", len(labels))
self.labels = np.array(labels)
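For illustration, a sketch of how `Dataset` is driven during training (logger and config construction omitted; the data folder name is an assumption):

```python
# Load each song's mixture/vocals pair, then build the CNN inputs and labels.
dataset = Dataset(logger, config)
dataset.load("data")          # expects data/<song>/mixture.wav and data/<song>/vocals.wav
dataset.get_data_for_cnn()    # fills dataset.mixture_windows, shape (N, bins, sample_length, 1)
dataset.get_labels_for_cnn()  # fills dataset.labels
```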
96 changes: 96 additions & 0 deletions evaluate.py
@@ -0,0 +1,96 @@
# Evaluate the accuracy of the neural network by calculating SDR (distortion)
# SIR (interference from other sources) and SAR (artifacts)
import numpy as np
import museval
import os
import sys
from song import Song

class Evaluator:
def __init__(self, logger, config):
self.logger=logger
self.config=config
self.vocals=None
self.accompaniments=None
self.estimated_vocals=None
self.estimated_accompaniments=None
self.names=None

def load_data(self, folder):
self.vocals=[]
self.accompaniments=[]
self.estimated_vocals=[]
self.estimated_accompaniments=[]
if os.path.isdir(folder):
        for root, dirs, files in os.walk(folder):
for file in filter(lambda f: f.endswith(".wav"), files):
song_type = os.path.splitext(file)[0].lower()
self.logger.info("Loading song %s.", os.path.join(root, file))
if song_type == "vocals" or song_type == "accompaniment" or song_type == "estimated_vocals" or song_type == "estimated_accompaniment":
song = Song(self.logger, os.path.basename(root), self.config)
song.load_file(os.path.join(root,file))
if(song_type == "vocals"):
self.vocals.append(song)
elif(song_type == "accompaniment"):
self.accompaniments.append(song)
elif(song_type == "estimated_vocals"):
self.estimated_vocals.append(song)
elif(song_type == "estimated_accompaniment"):
self.estimated_accompaniments.append(song)
self.logger.debug("%s loaded successfully.", song_type)
else:
self.logger.debug("File %s is not named correctly. Ignoring...", song_type)
else:
self.logger.critical("Folder %s does not exist!", folder)
sys.exit(13)
if (len(self.vocals) != len(self.accompaniments)) or (len(self.accompaniments) != len(self.estimated_vocals)) or (len(self.estimated_vocals) != len(self.estimated_accompaniments)):
self.logger.critical("Array size mismatch. Did you misname a file?")
sys.exit(14)

# Extracts data from the dataset and sets the correct dimensions
def prepare_data(self):
self.names = []
for element in range(0, len(self.vocals)):
self.logger.debug("Processing %s...", self.vocals[element].get_name())
self.names.append(self.vocals[element].get_name())
self.vocals[element] = np.expand_dims(self.vocals[element].get_raw_data(), 1)
self.accompaniments[element] = np.expand_dims(self.accompaniments[element].get_raw_data(), 1)
self.estimated_vocals[element] = np.expand_dims(self.estimated_vocals[element].get_raw_data(), 1)
self.estimated_accompaniments[element] = np.expand_dims(self.estimated_accompaniments[element].get_raw_data(), 1)
        # Keep these as plain Python lists: songs can have different lengths, so
        # they wouldn't form a regular numpy array anyway, and the trimmed arrays
        # are reassigned to individual elements below.
        # Since the neural net outputs slightly less data than the original, we
        # cut off the part that we can't compare. Simply padding might seem like a
        # better idea, but we can't assume that the last few milliseconds have
        # nothing going on in them.
for element in range(0, len(self.vocals)):
if np.shape(self.vocals[element])[0] > np.shape(self.estimated_vocals[element])[0]:
self.logger.debug("Reshaping arrays for %s...", self.names[element])
difference = np.shape(self.vocals[element])[0] - np.shape(self.estimated_vocals[element])[0]
                self.vocals[element] = self.vocals[element][:-difference, :]
                self.accompaniments[element] = self.accompaniments[element][:-difference, :]

def calculate_metrics(self):
        sdr, sir, sar = [], [], []  # distinct lists; all three are replaced on the first iteration
for element in range(0, len(self.vocals)):
original_data = np.stack((self.vocals[element], self.accompaniments[element]))
estimated_data = np.stack((self.estimated_vocals[element], self.estimated_accompaniments[element]))
museval.metrics.validate(original_data, estimated_data)
self.logger.info("Calculating metrics for %s...", self.names[element])
obtained_sdr, _, obtained_sir, obtained_sar, _ = museval.metrics.bss_eval(original_data, estimated_data, window=np.inf, hop=0)
if element == 0:
sdr = obtained_sdr
sir = obtained_sir
sar = obtained_sar
else:
sdr = np.column_stack((sdr, obtained_sdr))
sir = np.column_stack((sir, obtained_sir))
sar = np.column_stack((sar, obtained_sar))
return sdr, sir, sar

def print_metrics(self, sdr, sir, sar):
self.logger.info("Printing results...")
for element in range(0, len(self.names)):
self.logger.info("Song name: %s", self.names[element])
self.logger.info("Vocals: SDR: %.2f, SIR: %.2f, SAR: %.2f", sdr[0][element], sir[0][element], sar[0][element])
self.logger.info("Accompaniments: SDR: %.2f, SIR: %.2f, SAR: %.2f", sdr[1][element], sir[1][element], sar[1][element])