Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* draft is done. backend test is done
* more work on spectrogram test
* all works well except saving filterbank layer
* fixed filterbank issue, add more test
* update readme
* update readme
* update notebook
* fix code in readme
* add scripts
* add pip install on readme

Co-authored-by: Keunwoo Choi <keunwoochoi@KCs-qmul-mbp.local>
Co-authored-by: keunwoochoi <gnuchoi+github@gmail.com>
1 parent bb0ea2c, commit 8cdbb16
Showing 22 changed files with 1,149 additions and 2,425 deletions.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,304 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# How to use Kapre - example" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 13, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"2020/8/14\n", | ||
"Tensorflow: 2.3.0\n", | ||
"Librosa: 0.8.0\n", | ||
"Image data format: channels_last\n", | ||
"Kapre: 0.3.0-rc\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import librosa\n", | ||
"import kapre\n", | ||
"import tensorflow as tf\n", | ||
"from tensorflow.keras.models import Sequential\n", | ||
"import numpy as np\n", | ||
"\n", | ||
"from datetime import datetime\n", | ||
"now = datetime.now()\n", | ||
"\n", | ||
"print('%s/%s/%s' % (now.year, now.month, now.day))\n", | ||
"print('Tensorflow: {}'.format(tf.__version__))\n", | ||
"print('Librosa: {}'.format(librosa.__version__))\n", | ||
"print('Image data format: {}'.format(tf.keras.backend.image_data_format()))\n", | ||
"print('Kapre: {}'.format(kapre.__version__))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Loading an mp3 file" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 19, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Audio length: 453888 samples, 10.29 seconds. \n", | ||
"Audio sample rate: 44100 Hz\n" | ||
] | ||
}, | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"/Users/keunwoochoi/miniconda3/envs/kapre/lib/python3.7/site-packages/librosa/core/audio.py:162: UserWarning: PySoundFile failed. Trying audioread instead.\n", | ||
" warnings.warn(\"PySoundFile failed. Trying audioread instead.\")\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"src, sr = librosa.load('../srcs/bensound-cute.mp3', sr=None, mono=True)\n", | ||
"print('Audio length: %d samples, %04.2f seconds. \\n' % (len(src), len(src) / sr) +\n", | ||
" 'Audio sample rate: %d Hz' % sr)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Trim it and make it 2D\n", | |
"\n", | |
"If your file is mono, `librosa.load` returns a 1D array. Kapre always expects a 2D array, so add a channel axis.\n", | |
"\n", | |
"On my computer, I use the default `image_data_format == 'channels_last'`. I'll keep it that way for the audio data, too." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 20, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"The shape of an item (44100, 1)\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"len_second = 1.0 # Let's trim it to make it quick\n", | ||
"src = src[:int(sr*len_second)]\n", | ||
"src = np.expand_dims(src, axis=1)\n", | ||
"input_shape = src.shape\n", | ||
"print('The shape of an item', input_shape)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Let's make it a batch of 4 items\n", | ||
"\n", | ||
"to make it more like a proper dataset. In practice, you would have many different files." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 21, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"The shape of a batch: (4, 44100, 1)\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"x = np.array([src] * 4)\n", | ||
"print('The shape of a batch: ',x.shape)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# A Keras model\n", | ||
"\n", | ||
"A simple model for 10-class, single-label classification." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 33, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Model: \"sequential_5\"\n", | ||
"_________________________________________________________________\n", | ||
"Layer (type) Output Shape Param # \n", | ||
"=================================================================\n", | ||
"stft-layer (STFT) (None, 42, 1025, 1) 0 \n", | ||
"=================================================================\n", | ||
"Total params: 0\n", | ||
"Trainable params: 0\n", | ||
"Non-trainable params: 0\n", | ||
"_________________________________________________________________\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from kapre.time_frequency import STFT, Magnitude, MagnitudeToDecibel\n", | ||
"\n", | ||
"\n", | ||
"model = Sequential()\n", | ||
"# An STFT layer\n", | |
"model.add(STFT(n_fft=2048, win_length=2018, hop_length=1024,\n", | ||
" window_fn=None, pad_end=False,\n", | ||
" input_data_format='channels_last', output_data_format='channels_last',\n", | ||
" input_shape=input_shape,\n", | ||
" name='stft-layer'))\n", | ||
"model.summary()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"- The model has no trainable parameters because the `STFT` layer uses the `tf.signal.stft()` function, which is just an implementation of the FFT-based short-time Fourier transform.\n", | |
"- The output shape is `(batch, time, frequency, channels)`. \n", | |
" - `42` (time) is the number of STFT frames. A shorter hop length would make it (nearly) proportionally longer. If `pad_end=True`, the input audio signals are padded at the end, so the number of frames increases, too.\n", | |
" - `1025` is the number of STFT bins, computed as `n_fft // 2 + 1`. \n", | |
" - `1` channel: because the input signal was single-channel.\n", | |
"- The output of the `STFT` layer is complex-valued." | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Let's add more layers like a real model!" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 34, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU, GlobalAveragePooling2D, Dense, Softmax" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 35, | ||
"metadata": { | ||
"scrolled": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"model.add(Magnitude())\n", | ||
"model.add(MagnitudeToDecibel())\n", | ||
"model.add(Conv2D(32, (3, 3), strides=(2, 2)))\n", | ||
"model.add(BatchNormalization())\n", | ||
"model.add(ReLU())\n", | ||
"model.add(GlobalAveragePooling2D())\n", | ||
"model.add(Dense(10))\n", | ||
"model.add(Softmax())\n", | ||
"\n", | ||
"# Compile the model\n", | ||
"model.compile('adam', 'categorical_crossentropy') # if single-label classification\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 36, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Model: \"sequential_5\"\n", | ||
"_________________________________________________________________\n", | ||
"Layer (type) Output Shape Param # \n", | ||
"=================================================================\n", | ||
"stft-layer (STFT) (None, 42, 1025, 1) 0 \n", | ||
"_________________________________________________________________\n", | ||
"magnitude (Magnitude) (None, 42, 1025, 1) 0 \n", | ||
"_________________________________________________________________\n", | ||
"magnitude_to_decibel (Magnit (None, 42, 1025, 1) 0 \n", | ||
"_________________________________________________________________\n", | ||
"conv2d_1 (Conv2D) (None, 20, 512, 32) 320 \n", | ||
"_________________________________________________________________\n", | ||
"batch_normalization (BatchNo (None, 20, 512, 32) 128 \n", | ||
"_________________________________________________________________\n", | ||
"re_lu (ReLU) (None, 20, 512, 32) 0 \n", | ||
"_________________________________________________________________\n", | ||
"global_average_pooling2d (Gl (None, 32) 0 \n", | ||
"_________________________________________________________________\n", | ||
"dense (Dense) (None, 10) 330 \n", | ||
"_________________________________________________________________\n", | ||
"softmax (Softmax) (None, 10) 0 \n", | ||
"=================================================================\n", | ||
"Total params: 778\n", | ||
"Trainable params: 714\n", | ||
"Non-trainable params: 64\n", | ||
"_________________________________________________________________\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"model.summary()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"- I added `Magnitude()`, which is a simple `abs()` operation on the complex numbers.\n", | |
"- `MagnitudeToDecibel` maps the numbers to a decibel scale." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.7" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 1 | ||
} |
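The shapes and parameter counts printed in the notebook summaries can be re-derived by hand. The following is a minimal sketch in plain Python, not part of the commit: the helper names `stft_output_shape` and `conv2d_valid_shape` are hypothetical, and the `pad_end=True` branch assumes that frames are counted as `ceil(n_samples / hop_length)`, which is how I understand `tf.signal.frame` to behave.

```python
def stft_output_shape(n_samples, n_fft, win_length, hop_length, n_ch=1, pad_end=False):
    """Return (time, frequency, channels) of a channels_last STFT output."""
    if pad_end:
        # Assumption: with end padding, every hop starts a frame (ceil division).
        n_frames = -(-n_samples // hop_length)
    else:
        # Only frames that fit entirely inside the signal are kept.
        n_frames = 1 + (n_samples - win_length) // hop_length
    return n_frames, n_fft // 2 + 1, n_ch


def conv2d_valid_shape(shape, n_filters, kernel=(3, 3), strides=(2, 2)):
    """Output shape of a Conv2D with 'valid' padding (the Keras default)."""
    t, f, _ = shape
    return ((t - kernel[0]) // strides[0] + 1, (f - kernel[1]) // strides[1] + 1, n_filters)


# Values taken from the notebook: 1 second of 44.1 kHz mono audio.
stft_shape = stft_output_shape(44100, n_fft=2048, win_length=2018, hop_length=1024)
conv_shape = conv2d_valid_shape(stft_shape, n_filters=32)

conv_params = 3 * 3 * 1 * 32 + 32   # kernel weights + biases
bn_params = 4 * 32                  # gamma, beta, moving mean, moving variance
dense_params = 32 * 10 + 10         # weights + biases

print(stft_shape)                                # (42, 1025, 1)
print(conv_shape)                                # (20, 512, 32)
print(conv_params + bn_params + dense_params)    # 778
```

Half of the batch-normalization parameters (the moving mean and variance, 64 values) are not updated by gradient descent, which matches the 714 trainable and 64 non-trainable parameters in the summary.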
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,10 +1,6 @@ | ||
__version__ = '0.2.0' | ||
__version__ = '0.3.0' | ||
VERSION = __version__ | ||
|
||
from . import time_frequency | ||
from . import composed | ||
from . import backend | ||
from . import backend_keras | ||
|
||
from . import augmentation | ||
from . import filterbank | ||
from . import utils |
This file was deleted.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,187 @@ | ||
from .time_frequency import STFT, Magnitude, Phase, MagnitudeToDecibel, ApplyFilterbank | ||
from . import backend | ||
|
||
from tensorflow.keras import Sequential | ||
|
||
|
||
def get_melspectrogram_layer( | ||
input_shape=None, | ||
n_fft=2048, | ||
win_length=None, | ||
hop_length=None, | ||
window_fn=None, | ||
pad_end=False, | ||
sample_rate=22050, | ||
n_mels=128, | ||
mel_f_min=0.0, | ||
mel_f_max=None, | ||
mel_htk=False, | ||
mel_norm='slaney', | ||
return_decibel=False, | ||
db_amin=1e-5, | ||
db_ref_value=1.0, | ||
db_dynamic_range=80.0, | ||
input_data_format='default', | ||
output_data_format='default', | ||
): | ||
"""A function that returns a melspectrogram layer, which is a Sequential model consisting of | |
`STFT`, `Magnitude`, `ApplyFilterbank(_mel_filterbank)`, and optionally `MagnitudeToDecibel`. | ||
Args: | ||
input_shape (None or tuple of integers): input shape of the model if this melspectrogram layer is | |
the first layer of your model (see `keras.models.Sequential()` for more details) | |
n_fft (int): number of FFT points in `STFT` | ||
win_length (int): window length of `STFT` | ||
hop_length (int): hop length of `STFT` | ||
window_fn (function or None): windowing function of `STFT`. | ||
Defaults to `None`, which would follow tf.signal.stft default (hann window at the moment) | ||
pad_end (bool): whether to pad the input signal at the end in `STFT`. | ||
sample_rate (int): sample rate of the input audio | ||
n_mels (int): number of mel bins in the mel filterbank | ||
mel_f_min (float): lowest frequency of the mel filterbank | ||
mel_f_max (float): highest frequency of the mel filterbank | ||
mel_htk (bool): whether to follow the HTK mel filterbank formula or not | |
mel_norm ('slaney' or int): normalization policy of the mel filterbank triangles | ||
return_decibel (bool): whether to apply decibel scaling at the end | ||
db_amin (float): noise floor of decibel scaling input. See `MagnitudeToDecibel` for more details. | ||
db_ref_value (float): reference value of decibel scaling. See `MagnitudeToDecibel` for more details. | ||
db_dynamic_range (float): dynamic range of the decibel scaling result. | ||
input_data_format (str): the audio data format of input waveform batch. | ||
`'channels_last'` if it's `(batch, time, channels)` | ||
`'channels_first'` if it's `(batch, channels, time)` | ||
Defaults to the setting of your Keras configuration. (tf.keras.backend.image_data_format()) | ||
output_data_format (str): the data format of output mel spectrogram. | ||
`'channels_last'` if you want `(batch, time, frequency, channels)` | ||
`'channels_first'` if you want `(batch, channels, time, frequency)` | ||
Defaults to the setting of your Keras configuration. (tf.keras.backend.image_data_format()) | ||
""" | ||
waveform_to_stft = STFT( | ||
n_fft=n_fft, | ||
win_length=win_length, | ||
hop_length=hop_length, | ||
window_fn=window_fn, | ||
pad_end=pad_end, | ||
input_data_format=input_data_format, | ||
output_data_format=output_data_format, | ||
input_shape=input_shape, | ||
) | ||
|
||
stft_to_stftm = Magnitude() | ||
|
||
kwargs = { | ||
'sample_rate': sample_rate, | ||
'n_freq': n_fft // 2 + 1, | ||
'n_mels': n_mels, | ||
'f_min': mel_f_min, | ||
'f_max': mel_f_max, | ||
'htk': mel_htk, | ||
'norm': mel_norm, | ||
} | ||
stftm_to_melgram = ApplyFilterbank( | ||
type='mel', filterbank_kwargs=kwargs, data_format=output_data_format | ||
) | ||
|
||
layers = [waveform_to_stft, stft_to_stftm, stftm_to_melgram] | ||
if return_decibel: | ||
mag_to_decibel = MagnitudeToDecibel( | ||
ref_value=db_ref_value, amin=db_amin, dynamic_range=db_dynamic_range | ||
) | ||
layers.append(mag_to_decibel) | ||
|
||
return Sequential(layers) | ||
|
||
|
||
def get_log_frequency_spectrogram_layer( | ||
input_shape=None, | ||
n_fft=2048, | ||
win_length=None, | ||
hop_length=None, | ||
window_fn=None, | ||
pad_end=False, | ||
sample_rate=22050, | ||
log_n_bins=84, | ||
log_f_min=None, | ||
log_bins_per_octave=12, | ||
log_spread=0.125, | ||
return_decibel=False, | ||
db_amin=1e-5, | ||
db_ref_value=1.0, | ||
db_dynamic_range=80.0, | ||
input_data_format='default', | ||
output_data_format='default', | ||
): | ||
"""A function that returns a log-frequency STFT layer, which is a Sequential model consisting of | |
`STFT`, `Magnitude`, `ApplyFilterbank(_log_filterbank)`, and optionally `MagnitudeToDecibel`. | ||
Args: | ||
input_shape (None or tuple of integers): input shape of the model if this log-frequency spectrogram layer is | |
the first layer of your model (see `keras.models.Sequential()` for more details) | |
n_fft (int): number of FFT points in `STFT` | ||
win_length (int): window length of `STFT` | ||
hop_length (int): hop length of `STFT` | ||
window_fn (function or None): windowing function of `STFT`. | ||
Defaults to `None`, which would follow tf.signal.stft default (hann window at the moment) | ||
pad_end (bool): whether to pad the input signal at the end in `STFT`. | ||
sample_rate (int): sample rate of the input audio | ||
log_n_bins (int): number of the bins in the log-frequency filterbank | ||
log_f_min (float): lowest frequency of the filterbank | ||
log_bins_per_octave (int): number of bins in each octave in the filterbank | ||
log_spread (float): spread constant (Q value) in the log filterbank. | ||
return_decibel (bool): whether to apply decibel scaling at the end | ||
db_amin (float): noise floor of decibel scaling input. See `MagnitudeToDecibel` for more details. | ||
db_ref_value (float): reference value of decibel scaling. See `MagnitudeToDecibel` for more details. | ||
db_dynamic_range (float): dynamic range of the decibel scaling result. | ||
input_data_format (str): the audio data format of input waveform batch. | ||
`'channels_last'` if it's `(batch, time, channels)` | ||
`'channels_first'` if it's `(batch, channels, time)` | ||
Defaults to the setting of your Keras configuration. (tf.keras.backend.image_data_format()) | ||
output_data_format (str): the data format of output log-frequency spectrogram. | |
`'channels_last'` if you want `(batch, time, frequency, channels)` | ||
`'channels_first'` if you want `(batch, channels, time, frequency)` | ||
Defaults to the setting of your Keras configuration. (tf.keras.backend.image_data_format()) | ||
""" | ||
waveform_to_stft = STFT( | ||
n_fft=n_fft, | ||
win_length=win_length, | ||
hop_length=hop_length, | ||
window_fn=window_fn, | ||
pad_end=pad_end, | ||
input_data_format=input_data_format, | ||
output_data_format=output_data_format, | ||
input_shape=input_shape, | ||
) | ||
|
||
stft_to_stftm = Magnitude() | ||
|
||
_log_filterbank = backend.filterbank_log( | ||
sample_rate=sample_rate, | ||
n_freq=n_fft // 2 + 1, | ||
n_bins=log_n_bins, | ||
bins_per_octave=log_bins_per_octave, | ||
f_min=log_f_min, | ||
spread=log_spread, | ||
) | ||
kwargs = { | ||
'sample_rate': sample_rate, | ||
'n_freq': n_fft // 2 + 1, | ||
'n_bins': log_n_bins, | ||
'bins_per_octave': log_bins_per_octave, | ||
'f_min': log_f_min, | ||
'spread': log_spread, | ||
} | ||
|
||
stftm_to_loggram = ApplyFilterbank( | ||
type='log', filterbank_kwargs=kwargs, data_format=output_data_format | ||
) | ||
|
||
layers = [waveform_to_stft, stft_to_stftm, stftm_to_loggram] | ||
|
||
if return_decibel: | ||
mag_to_decibel = MagnitudeToDecibel( | ||
ref_value=db_ref_value, amin=db_amin, dynamic_range=db_dynamic_range | ||
) | ||
layers.append(mag_to_decibel) | ||
|
||
return Sequential(layers) |
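Both functions above expose `db_amin`, `db_ref_value`, and `db_dynamic_range`, but the diff does not show how `MagnitudeToDecibel` uses them. The sketch below illustrates one common (librosa-style) convention in plain Python. It is an assumption for illustration, not Kapre's implementation: the function name `magnitude_to_decibel`, the factor of 10, and the clipping relative to the maximum are all choices made here.

```python
import math


def magnitude_to_decibel(values, ref_value=1.0, amin=1e-5, dynamic_range=80.0):
    """Librosa-style decibel scaling of non-negative values (illustrative sketch)."""
    # Reference level in dB; `amin` also guards the reference against log(0).
    ref_db = 10.0 * math.log10(max(amin, ref_value))
    # `amin` acts as a noise floor: smaller inputs are raised to it before the log.
    db = [10.0 * math.log10(max(v, amin)) - ref_db for v in values]
    # Keep only the top `dynamic_range` dB below the loudest value.
    lowest = max(db) - dynamic_range
    return [max(d, lowest) for d in db]


# With the defaults, 1e-12 is floored at amin = 1e-5, i.e. about -50 dB.
default_db = magnitude_to_decibel([1.0, 0.1, 1e-12])
# A narrower dynamic range clips that value further, to about -30 dB.
narrow_db = magnitude_to_decibel([1.0, 0.1, 1e-12], dynamic_range=30.0)
print(default_db, narrow_db)
```

Under this convention `amin` bounds how negative a single value can get, while `dynamic_range` bounds the spread relative to the loudest value in the input.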
This file was deleted.
Large diffs are not rendered by default.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
black kapre | ||
black tests |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
#!/bin/bash | ||
|
||
python setup.py sdist | ||
pip install twine | ||
twine upload dist/* |
Binary file not shown.
Binary file not shown.
Binary file not shown.
Large diffs are not rendered by default.
This file was deleted.