OpenSesame is a speaker identification and speech recognition system. It leverages machine learning, namely a neural network and a support vector machine, to identify whether the correct speaker says the correct keyword, such as "open sesame". It records and identifies the speaker in real time. After detecting the speaker and the keyword, the program unlocks the protected data.
OpenSesame consists of four parts: 1) live recording, 2) neural network, 3) support vector machine, and 4) the decision block, as shown in Fig. 1. In the following, we briefly introduce the main components of the program.
- Live Recording: The live recordings are saved in three waveform audio files, each overlapping the preceding recording by 1.34 seconds (two thirds of the recording time, here 2 seconds), to catch the case where the keyword is split across two recordings. The prediction is then run over all three recordings.
- Neural Network (NN): The neural network computes a prediction value between 0 and 1. If it surpasses a certain threshold (here `THRESHOLD_NN=0.6`), the system recognises that the correct speaker said the correct keyword.
- Support Vector Machine (SVM): The SVM computes a prediction value between 0 and 1. If it surpasses a certain threshold (here `THRESHOLD_SVM=0.75`), the system recognises that the correct speaker said the correct keyword.
- Decision: If and only if both the NN and the SVM thresholds are surpassed does the system unlock; a sketch of this logic follows the list. Using two different models gives us a kind of fail-safe for the case that the NN or the SVM predicts a high value even though it should not. The unlock screen is shown in Fig. 2.
The neural network used for OpenSesame is a feed-forward network that implements 6 dense layers. The first layer expands the feature vector from 40 to 256 dimensions. Every following layer halves the dimensionality, namely 128, 64, 32, 16, down to a single output unit. All layers use a ReLU activation function, except the last layer, which uses a sigmoid activation. This gives us a value between 0 and 1 that represents the probability that the correct speaker said the correct keyword. The model is trained for 40 epochs. In Fig. 3, we can see that it converges to 98% accuracy on the validation dataset.
Figure 3: Accuracy and Loss of the Training process
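For illustration, a minimal Keras sketch of the architecture described above is shown below. The layer sizes and activations follow the text; the optimiser, loss, and any other hyperparameters are assumptions and not taken from the training script.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

def build_feed_forward():
    # 40-dimensional input, six dense layers, sigmoid output as described above
    model = Sequential([
        Input(shape=(40,)),
        Dense(256, activation="relu"),
        Dense(128, activation="relu"),
        Dense(64, activation="relu"),
        Dense(32, activation="relu"),
        Dense(16, activation="relu"),
        Dense(1, activation="sigmoid"),  # probability that the correct speaker said the keyword
    ])
    # Optimiser and loss are assumptions, not taken from the report.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_feed_forward()
# model.fit(X_train, y_train, epochs=40, validation_data=(X_val, y_val))
```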
Since we unpack every recording into 197 individual vectors, we evaluate the trained model twice: first, how well does the model generalise on the individual sample vectors, and second, how well does it generalise on entire recordings?
Figure 4: Confusion Matrix of Prediction on all samples | Figure 5: Confusion Matrix of Prediction on all recordings |
---|---|
Performance Metrics on individual samples:
Classification Report over all sample vectors:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0.0          | 0.68      | 0.78   | 0.73     | 2919    |
| 1.0          | 0.81      | 0.72   | 0.76     | 3779    |
| accuracy     |           |        | 0.75     | 6698    |
| macro avg    | 0.75      | 0.75   | 0.74     | 6698    |
| weighted avg | 0.75      | 0.75   | 0.75     | 6698    |
Performance Metrics on recordings:
Classification Report over all recordings:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0.0          | 0.88      | 1.00   | 0.94     | 15      |
| 1.0          | 1.00      | 0.89   | 0.94     | 19      |
| accuracy     |           |        | 0.94     | 34      |
| macro avg    | 0.94      | 0.95   | 0.94     | 34      |
| weighted avg | 0.95      | 0.94   | 0.94     | 34      |
As we can see, our model performs well at predicting whether an entire recording contains the correct keyword spoken by the correct user. Therefore, the per-sample classification results are less important for our use case. On entire recordings the model generalises well to previously unseen data and achieves an accuracy of 94%.
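One plausible way to turn the 197 per-sample predictions into a single per-recording decision is to aggregate them, for example by averaging. This is an assumption for illustration; the aggregation rule actually used in the project may differ.

```python
import numpy as np

def predict_recording(model, sample_vectors, threshold=0.6):
    """sample_vectors: assumed array of shape (197, 40), one row per sample vector."""
    per_sample = model.predict(sample_vectors).ravel()  # 197 probabilities in [0, 1]
    return float(np.mean(per_sample)) > threshold       # one decision for the whole recording
```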
The SVM model uses the default `SVC()` provided by the scikit-learn library, which uses a radial basis function (RBF) kernel. After training we get the following results:
Figure 6: Confusion Matrix of Prediction on all samples | Figure 7: Confusion Matrix of Prediction on all recordings |
---|---|
Performance metrics on all samples:
Classification Report over all sample vectors:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0.0          | 0.69      | 0.76   | 0.72     | 3042    |
| 1.0          | 0.78      | 0.71   | 0.75     | 3656    |
| accuracy     |           |        | 0.73     | 6698    |
| macro avg    | 0.73      | 0.74   | 0.73     | 6698    |
| weighted avg | 0.74      | 0.73   | 0.74     | 6698    |
Performance metrics on recordings:
Classification Report over all recordings:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0.0          | 0.76      | 1.00   | 0.87     | 13      |
| 1.0          | 1.00      | 0.81   | 0.89     | 21      |
| accuracy     |           |        | 0.88     | 34      |
| macro avg    | 0.88      | 0.90   | 0.88     | 34      |
| weighted avg | 0.91      | 0.88   | 0.88     | 34      |
The SVM does not classify the test data as well as the NN. For our use case it does not have to perform as well as the neural network. However, it is still necessary to integrate both models for safety reasons. If we were to rely only on the NN, a single false positive would be enough to grant unauthorised access. In contrast, if we use both models, both have to classify the speaker as the correct one and identify the keyword before the system unlocks. This gives us additional protection against possible intruders.
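As a reference for how such a model could be trained, here is a minimal scikit-learn sketch matching the description above. The function, variable names, and data shapes are assumptions, not the project's actual training script.

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

def train_svm(X_train, y_train, X_test, y_test):
    """X_*: assumed (n_samples, 40) feature-vector arrays, y_*: 0/1 labels."""
    # The default SVC() uses the radial basis function (RBF) kernel.
    svm = SVC()
    svm.fit(X_train, y_train)
    print(classification_report(y_test, svm.predict(X_test)))
    return svm

# To obtain a value in [0, 1] that can be thresholded like the NN output,
# one option (an assumption) is SVC(probability=True) together with predict_proba.
```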
For training we collected 156 recordings, split evenly into positive and negative samples. Each recording is split into 197 vectors, which are fed to the model during training. This results in 30,732 training samples per epoch. For testing our model, we use the exact same strategy as for training. A sketch of this unpacking step is shown below the table.
|          | Recordings for Training | Recordings for Testing |
|----------|-------------------------|------------------------|
| Positive | 78                      | 17                     |
| Negative | 78                      | 17                     |
| Total    | 156                     | 34                     |
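For context, the sketch below shows one way a recording could be unpacked into fixed-size feature vectors. Treating the 40-dimensional features as MFCCs is an assumption; the actual preprocessing lives in `src/utils/preprocess_data.py` and may differ.

```python
import librosa

def recording_to_vectors(path, n_mfcc=40):
    """Return one 40-dimensional feature vector per frame of the recording."""
    signal, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # shape (40, n_frames)
    return mfcc.T                                                # shape (n_frames, 40)
```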
├── README.md
├── data
│ ├── live
│ ├── test
│ └── train
├── images
├── requirements.txt
└── src
├── ascii_art
│ ├── closed_lock.txt
│ └── open_lock.txt
├── feed-forward
│ ├── feed-forward_train.py
│ ├── models
│ ├── old
│ └── plots
├── gmm
│ ├── test_gmm.py
│ ├── testing_set
│ ├── train_gmm.py
│ ├── trained_models
│ └── training_set
├── main.py
├── rnn
│ ├── models
│ ├── plots
│ ├── rnn-validate.py
│ └── rnn_train.py
├── svm
│ ├── models
│ ├── old
│ ├── plots
│ └── svm_train.py
└── utils
├── __init__.py
├── im2a.py
├── preprocess_data.py
└── record.py
All code is stored in the `./src` directory. It contains the `main.py` file, which implements the actual program logic and loads the models that are used. Additionally, there are further models (e.g. `./src/rnn`) that were used for training but were not considered feasible due to computation costs or other factors that had to be taken into account.

Every directory named after a model (e.g. `./src/feed-forward`) contains the training file for that model as well as the trained models that are used in `main.py`.

The `./src/utils` directory contains all the files needed to record and preprocess our training and testing data.

Lastly, the `./requirements.txt` file specifies which libraries we used, in case you want to use this repository.
This project was developed as part of the Business Intelligence lecture in our graduate program. The team behind it consists of three graduate engineering students currently enrolled in the EIT Autonomous Systems program at Polytech Nice-Sophia.
- Filippo Zeggio: https://github.com/curcuman
- Philipp Ahrendt: https://github.com/phiahr
- Dalim Wahby: https://github.com/citrovin