TensorFlow implementation of "Multimodal Speech Emotion Recognition using Audio and Text," IEEE SLT-18

chenyangjun45/multimodal-speech-emotion

 
 

multimodal-speech-emotion

This repository contains the source code used in the following paper:

Multimodal Speech Emotion Recognition using Audio and Text, IEEE SLT-18, [paper]


[requirements]

tensorflow==1.4 (tested on cuda-8.0, cudnn-6.0)
python==2.7
scikit-learn==0.20.0
nltk==3.3

[download data corpus]

  • IEMOCAP [link] [paper]
  • download IEMOCAP data from its original web-page (license agreement is required)

[preprocessed-data schema (our approach)]

  • Get the preprocessed dataset [application link]

    To download the preprocessed dataset, first obtain a license from the IEMOCAP team.

  • For the preprocessing, refer to the code in "./preprocessing".

  • We cannot publish the ASR-processed transcripts because of a license restriction (commercial API). However, we expect that extracting ASR transcripts from the audio signal yourself is reasonably straightforward (we used google-cloud-speech-api).

  • Format of the data for our experiments:

    MFCC : MFCC features of the audio signal (ex. train_audio_mfcc.npy)
    [#samples, 750, 39] - (#samples, sequence (max 7.5s), dims)

    MFCC-SEQN : valid length of the sequence of the audio signal (ex. train_seqN.npy)
    [#samples] - (#samples)

    PROSODY : prosody features of the audio signal (ex. train_audio_prosody.npy)
    [#samples, 35] - (#samples, dims)

    TRANS : sequences of transcription (indexed) of the data (ex. train_nlp_trans.npy)
    [#samples, 128] - (#samples, sequence (max))

    LABEL : target label of the audio signal (ex. train_label.npy)
    [#samples] - (#samples)
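
    The schema above can be sanity-checked with a short script. The following is a minimal sketch, not code from this repository: the helper names (load_split, frame_mask, write_dummy_split) are hypothetical, the dummy values (4 label classes, vocabulary of 100 indices) are assumptions for illustration only, and it assumes the MFCC-SEQN values count valid frames along the 750-frame axis.

```python
import os
import tempfile

import numpy as np

# Dimensions taken from the data format listed above.
MAX_FRAMES, N_MFCC = 750, 39   # MFCC    : [#samples, 750, 39]
N_PROSODY = 35                 # PROSODY : [#samples, 35]
MAX_TOKENS = 128               # TRANS   : [#samples, 128]


def load_split(data_dir, prefix="train"):
    """Load the five arrays of one split and check them against the schema."""
    names = {
        "mfcc": prefix + "_audio_mfcc.npy",
        "seq_n": prefix + "_seqN.npy",
        "prosody": prefix + "_audio_prosody.npy",
        "trans": prefix + "_nlp_trans.npy",
        "label": prefix + "_label.npy",
    }
    data = {}
    for key, name in names.items():
        data[key] = np.load(os.path.join(data_dir, name))
    n = data["label"].shape[0]
    assert data["mfcc"].shape == (n, MAX_FRAMES, N_MFCC)
    assert data["seq_n"].shape == (n,)
    assert data["prosody"].shape == (n, N_PROSODY)
    assert data["trans"].shape == (n, MAX_TOKENS)
    return data


def frame_mask(seq_n, max_frames=MAX_FRAMES):
    """Boolean mask [#samples, max_frames]; True marks valid (non-padded) frames.

    Assumes seq_n holds the number of valid frames per sample.
    """
    return np.arange(max_frames)[None, :] < np.asarray(seq_n)[:, None]


def write_dummy_split(data_dir, n=4, prefix="train"):
    """Write random arrays with the documented shapes (illustration only)."""
    rng = np.random.RandomState(0)

    def save(suffix, array):
        np.save(os.path.join(data_dir, prefix + suffix), array)

    save("_audio_mfcc.npy", rng.randn(n, MAX_FRAMES, N_MFCC).astype(np.float32))
    save("_seqN.npy", rng.randint(1, MAX_FRAMES + 1, size=n))
    save("_audio_prosody.npy", rng.randn(n, N_PROSODY).astype(np.float32))
    # Assumed values: token indices below 100 and 4 emotion classes.
    save("_nlp_trans.npy", rng.randint(0, 100, size=(n, MAX_TOKENS)))
    save("_label.npy", rng.randint(0, 4, size=n))


if __name__ == "__main__":
    tmp_dir = tempfile.mkdtemp()
    write_dummy_split(tmp_dir)
    data = load_split(tmp_dir)
    mask = frame_mask(data["seq_n"])
    print(mask.shape)
    print(mask.sum(axis=1), data["seq_n"])
```

    To check real data, point load_split at the directory holding the preprocessed .npy files. A mask such as the one built by frame_mask is one way to keep padded frames out of pooling or loss computations; whether the models in this repository use the sequence lengths this way is not stated here.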

[source code]

  • The repository contains code for the following models:

    Audio Recurrent Encoder (ARE)
    Text Recurrent Encoder (TRE)
    Multimodal Dual Recurrent Encoder (MDRE)
    Multimodal Dual Recurrent Encoder with Attention (MDREA)


[training]

  • refer to "reference_script.sh"
  • the final result will be stored in "./TEST_run_result.txt"

[cite]

  • Please cite our paper when you use our code | model | dataset

    @inproceedings{yoon2018multimodal,
    title={Multimodal Speech Emotion Recognition Using Audio and Text},
    author={Yoon, Seunghyun and Byun, Seokhyun and Jung, Kyomin},
    booktitle={2018 IEEE Spoken Language Technology Workshop (SLT)},
    pages={112--118},
    year={2018},
    organization={IEEE}
    }
