Speech keyword detection uses a deep learning model to recognize a keyword when it is spoken.
- librosa
- numpy
- pandas
- scikit-learn
- tensorflow>=1.15
```shell
pip install -r requirements.txt
```
Extract Mel-spectrogram and MFCC features from the audio dataset.
```shell
python ./utils/feature_extraction.py
```
```shell
python train.py --model [MODEL_TYPE] --data [DATA_FEATURE_TYPE]
```
```shell
python test.py --model [MODEL_TYPE] --data [DATA_FEATURE_TYPE]
```
- FC DNN
- CNN
- ResNet
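The training and testing entry points take `--model` and `--data` flags. A minimal sketch of how such a CLI might be parsed with `argparse`; the accepted values here are hypothetical, so check `train.py` and `test.py` for the actual flag values:

```python
import argparse

def build_parser():
    # Hypothetical choices; the real scripts may use different value names.
    parser = argparse.ArgumentParser(description="Keyword detection training")
    parser.add_argument("--model", choices=["dnn", "cnn", "resnet"], required=True,
                        help="model architecture to train")
    parser.add_argument("--data", choices=["log_mel", "mfcc"], required=True,
                        help="input feature type")
    return parser

# Example invocation with explicit arguments instead of sys.argv.
args = build_parser().parse_args(["--model", "cnn", "--data", "log_mel"])
print(args.model, args.data)
```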
The experiment was designed specifically to run on a small embedded system such as the NVIDIA Jetson Nano 2GB.
In this experiment, the dataset was created manually. It was recorded with three different microphone locations and three different reverberation times, giving nine combinations, and mixed with six different genres of TV programs.
- Training dataset = 24,006
- Testing dataset = 31,968
The CNN trained on non-normalized log-Mel features was the best model in this experiment.
| Normalization | Feature | Model | Validation Accuracy | EER |
|---|---|---|---|---|
| No-Normalization | Log Mels | CNN | 95.75% | 4.04% |
| No-Normalization | Log Mels | FC DNN | - | - |
| No-Normalization | Log Mels | ResNet | 88.59% | 13.60% |
| No-Normalization | MFCC | CNN | 77.43% | 15.32% |
| No-Normalization | MFCC | FC DNN | - | - |
| No-Normalization | MFCC | ResNet | 92.08% | 7.91% |
| Max-Normalization | Log Mels | CNN | 89.15% | 13.37% |
| Max-Normalization | Log Mels | FC DNN | 91.58% | 8.30% |
| Max-Normalization | Log Mels | ResNet | 81.06% | 18.58% |
| Max-Normalization | MFCC | CNN | 81.13% | 15.78% |
| Max-Normalization | MFCC | FC DNN | 90.33% | 17.18% |
| Max-Normalization | MFCC | ResNet | 87.24% | 12.84% |
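The EER (equal error rate) column above is the operating point where the false-acceptance rate equals the false-rejection rate. A minimal NumPy sketch of how it can be estimated from detection scores, using toy data rather than the experiment's actual outputs:

```python
import numpy as np

def equal_error_rate(labels, scores):
    """Estimate the EER: the threshold where FPR and FNR cross.

    labels: 1 for keyword, 0 for background; scores: higher = more keyword-like.
    """
    thresholds = np.sort(np.unique(scores))
    pos, neg = scores[labels == 1], scores[labels == 0]
    fpr = np.array([(neg >= t).mean() for t in thresholds])  # false accepts
    fnr = np.array([(pos < t).mean() for t in thresholds])   # false rejects
    i = np.argmin(np.abs(fpr - fnr))
    return (fpr[i] + fnr[i]) / 2

# Toy scores: keyword examples centered at +1, background at -1.
rng = np.random.default_rng(0)
labels = np.repeat([1, 0], 500)
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```

Averaging FPR and FNR at the closest crossing keeps the estimate stable when the two curves do not intersect exactly at a sampled threshold.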