This project tackles the complex challenge of identifying emotions from voice recordings. Emotions are inherently subjective and are typically inferred from visual cues such as facial expressions and body language, which makes voice-based recognition a difficult task. Our goal is a model that reliably classifies the emotional tone of vocal expressions. To that end, we built and compared three models, each operating on features of the kind shown in the extraction sketch after this list:
- A CNN trained on the Mel spectrograms of the audio files.
- A CNN trained on the Mel Frequency Cepstral Coefficients (MFCCs) of the audio files.
- A CRNN trained on the same MFCC features.
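For context, the sketch below shows how these two feature types can be extracted with librosa. The file path, sample rate, and feature sizes are illustrative assumptions, not the exact settings used in the notebook.

```python
import librosa
import numpy as np

# Illustrative path and parameters -- the actual dataset and settings live in the notebook.
AUDIO_PATH = "example.wav"

# Load the audio; sr=22050 is librosa's default sample rate.
signal, sr = librosa.load(AUDIO_PATH, sr=22050)

# Mel spectrogram: a time-frequency representation on the perceptual Mel scale,
# converted to decibels so the CNN sees a log-compressed dynamic range.
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# MFCCs: a compact summary of the spectral envelope, used by the other two models.
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)

print(mel_db.shape, mfccs.shape)  # (n_mels, frames), (n_mfcc, frames)
```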
The project pipeline covers the following stages:
- Gathering Data
- Data Organization and Cleaning
- Data Exploration, Preparation, and Visualization
- Data Preprocessing
- Model Implementation
All of these steps are detailed in the `speech_emotion_recognition.ipynb` Jupyter notebook; a rough sketch of the model-implementation step follows.
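The sketch below shows a minimal Keras 2D CNN classifying fixed-size MFCC inputs. The input shape, layer sizes, and the assumption of eight emotion classes are illustrative, not the notebook's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8             # assumed number of emotion labels
INPUT_SHAPE = (40, 174, 1)  # (n_mfcc, frames, channels) -- illustrative fixed size

def build_mfcc_cnn():
    """A small 2D CNN over MFCC 'images'; a sketch, not the notebook's exact model."""
    model = models.Sequential([
        layers.Input(shape=INPUT_SHAPE),
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.3),  # dropout as a simple regularizer against overfitting
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # assumes integer class labels
                  metrics=["accuracy"])
    return model

model = build_mfcc_cnn()
model.summary()
```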
The Mel spectrogram CNN was effective but struggled to differentiate some emotions. The CNN trained on MFCCs was the most successful, suggesting that MFCCs are better suited to emotion recognition in audio than Mel spectrograms. The CRNN trained on MFCCs also performed well but was prone to overfitting and did not surpass the MFCC CNN.
The models were assessed using precision, recall, and F1 scores, which offer a more nuanced view of their effectiveness than accuracy alone. The MFCC CNN emerged as the top performer, scoring highest on these metrics.
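A minimal sketch of that evaluation using scikit-learn, assuming the true and predicted labels are available as integer arrays (`y_test` and `y_pred` are placeholder names, and the values below are dummy data):

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Placeholder labels -- in practice these come from the test split and the model's predictions.
y_test = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

# Per-class precision, recall, and F1, plus macro and weighted averages.
print(classification_report(y_test, y_pred))

# Single macro-averaged numbers for comparing the three models head-to-head.
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="macro")
print(f"macro precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```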