Multimodal datasets can contain personally identifiable information. We propose a general framework for privacy-aware representation of audio-visual (AV) data.
Datasets: VidTIMIT (Video Dynamic TIMIT), DeepfakeTIMIT, MSP-Improv (Multimodal Sensitive Periods Improvisation Corpus)
- Feature Extraction Using AV-HuBERT
- Privacy Transformer
- Differential Privacy Filter
- Speaker Recognition
- Emotion Recognition
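
The source does not say which mechanism the differential privacy filter uses. As a minimal sketch, assuming the filter clips each extracted feature vector and adds Gaussian noise (the classical Gaussian mechanism, usually stated for 0 < epsilon < 1), it could look like the following. The function names, the 4-dimensional toy embedding, and the default `clip_norm`, `epsilon`, and `delta` values are all hypothetical and are not taken from the original work.

```python
import math
import random


def clip_l2(vec, clip_norm):
    """Scale vec so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= clip_norm or norm == 0.0:
        return list(vec)
    scale = clip_norm / norm
    return [x * scale for x in vec]


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism.

    The bound sigma = sensitivity * sqrt(2 ln(1.25 / delta)) / epsilon
    is the textbook form, stated for 0 < epsilon < 1.
    """
    if not (0.0 < epsilon < 1.0):
        raise ValueError("epsilon must lie in (0, 1) for this bound")
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def dp_filter(embedding, clip_norm=1.0, epsilon=0.5, delta=1e-5, rng=None):
    """Clip an embedding and add Gaussian noise (illustrative assumption only)."""
    rng = rng or random.Random()
    clipped = clip_l2(embedding, clip_norm)
    # Any two clipped vectors differ by at most 2 * clip_norm in L2 norm,
    # so that value is used as the sensitivity.
    sigma = gaussian_sigma(2.0 * clip_norm, epsilon, delta)
    return [x + rng.gauss(0.0, sigma) for x in clipped]


# Toy stand-in for one feature vector from the feature extractor.
embedding = [0.3, -1.2, 0.8, 2.0]
noisy = dp_filter(embedding, rng=random.Random(0))
```

Under these assumptions, a smaller `epsilon` means a larger noise scale, so the downstream speaker and emotion classifiers see features that carry less of the original signal.
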
Method | Accuracy (VidTIMIT) |
---|---|
AV-HuBERT | 88.24 (batches of 2) |
Differential Privacy filter | 50 (batches of 2) |
Transformer Privacy filter | 58 (batches of 2) |

Method | F1 Score | Accuracy |
---|---|---|
---|---|---|
AV-HuBERT | 41 | 41 |
Differential Privacy filter | 22 | 22 |
Transformer Privacy filter | 36 | 36 |
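
The F1 and accuracy columns above are identical in every row. The source does not state how F1 was averaged. One possible explanation, offered here only as an assumption, is micro-averaging: in single-label multiclass classification, micro-averaged F1 is always equal to accuracy, while macro-averaged F1 generally is not. The sketch below illustrates this on hypothetical emotion labels that are not taken from the original experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_counts(y_true, y_pred, label):
    """True positives, false positives, false negatives for one label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """Pool the counts over all labels, then compute a single F1."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, lab) for lab in labels]
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)


def macro_f1(y_true, y_pred):
    """Compute F1 per label, then take the unweighted mean."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [f1_from_counts(*per_class_counts(y_true, y_pred, lab))
              for lab in labels]
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only.
y_true = ["happy", "sad", "angry", "neutral", "sad", "happy"]
y_pred = ["happy", "angry", "angry", "sad", "sad", "neutral"]

acc = accuracy(y_true, y_pred)    # 0.5
micro = micro_f1(y_true, y_pred)  # 0.5, equal to accuracy
macro = macro_f1(y_true, y_pred)  # 11/24, about 0.458
```

On this toy input the micro-averaged score matches accuracy exactly, while the macro-averaged score is lower because the class with no correct predictions contributes a per-class F1 of zero.
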