
Speaker diarization


Note: Unfinished text! Niko Partanen keeps the rights to publish this as part of an article later, but comments and edits will be acknowledged. Please contact Niko Partanen before doing major changes.

Speaker diarization refers to the task of detecting who speaks when in a recording, that is, telling the different speakers apart. It is closely connected to segmentation, and the two tasks more or less have to be done hand in hand. I'm no expert in it, but it seems obvious that good speaker diarization would be very useful.

I will try speaker diarization with these instructions, which use pyAudioAnalysis.

conda create --name diarization_test python=2.7 scipy matplotlib scikit-learn simplejson
pip install --user eyeD3
pip install --user pathlib
pip install --user pydub
pip install --user hmmlearn

This creates a new Anaconda environment that uses Python 2.7 and has scipy, matplotlib, scikit-learn and simplejson installed; the remaining packages had to be installed with pip. It probably doesn't matter much which tool one uses to manage the environments, but it is really useful to have a fresh environment for each small project, one that can also be easily shared with others. That said, Python package management is still somewhat of a mystery to me, but with setups like this I can get it to work nicely.
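For the sharing part, the environment can be exported and recreated on another machine with standard conda commands:

conda env export --name diarization_test > environment.yml
conda env create -f environment.yml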

source activate diarization_test

Now, inside the pyAudioAnalysis directory, we can run:

python audioAnalysis.py speakerDiarization -i data/diarizationExample.wav --num 4

Which produces something like this:

[Image: speaker diarization plot]

Of course this is going to be a bit more complicated when we deal with very long files, but the principle should be similar.
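The same analysis can also be run directly from Python, which is what we will want when scripting this. A minimal sketch, assuming the pyAudioAnalysis API of this time, where speakerDiarization() returns one speaker label per analysis window (the 0.2 second step is, as far as I can tell, the library default):

from pyAudioAnalysis import audioSegmentation as aS

# One speaker label per analysis window of the recording.
flags = aS.speakerDiarization("data/diarizationExample.wav", 4)

# Collapse the per-window labels into (start, end, speaker) segments;
# 0.2 s is assumed to be the default mid-term step of the library.
step = 0.2
segments = []
seg_start = 0.0
for i in range(1, len(flags) + 1):
    if i == len(flags) or flags[i] != flags[i - 1]:
        segments.append((seg_start, i * step, int(flags[i - 1])))
        seg_start = i * step

for start, end, speaker in segments:
    print("%.2f\t%.2f\tspeaker_%d" % (start, end, speaker))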

I'm assuming now that creating segmentations + forced alignment requires some clever combination of:

  • silence detection
  • speaker diarization
  • forced alignment
  • speech recognition (?)

All of these we have, and each produces information about the structure of the audio file. And we can do all of this from Python very easily, so having it run as a script with the needed files as input should be easy. Or, if the input is an ELAN file, we can assume that the CMDI and WAV files are in the same directory under the same basename with different extensions. Something like the sketch below could tie the pieces together.
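A rough pipeline sketch, under the assumption that pyAudioAnalysis's readAudioFile, silenceRemoval and speakerDiarization functions work as used here (forced alignment and speech recognition are left as comments, and the file layout is an assumption about our corpus, not something the library knows about):

import os
from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis import audioSegmentation as aS

def analyze(input_path, n_speakers=4):
    # If the input is an ELAN file, assume the WAV sits in the same
    # directory under the same basename (our corpus layout assumption).
    if input_path.endswith(".eaf"):
        wav_path = os.path.splitext(input_path)[0] + ".wav"
    else:
        wav_path = input_path

    # Silence detection: silenceRemoval returns [start, end] pairs
    # (in seconds) for the non-silent parts.
    [fs, x] = audioBasicIO.readAudioFile(wav_path)
    speech_segments = aS.silenceRemoval(x, fs, 0.05, 0.05)

    # Speaker diarization: one speaker label per analysis window.
    flags = aS.speakerDiarization(wav_path, n_speakers)

    # Forced alignment and speech recognition would plug in here,
    # using the detected segments as their input.
    return speech_segments, flags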

Creating training files

It is possible to train a speaker diarization model from the following kind of file:

0.01,9.90,speech
9.90,10.70,silence
10.70,23.50,speech
23.50,184.30,music
184.30,185.10,silence
185.10,200.75,speech
...
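If I understand the pyAudioAnalysis code correctly, such a file can be used for HMM training roughly like this (the file names, model name, and the 1.0 / 0.1 window and step sizes are just placeholders):

from pyAudioAnalysis import audioSegmentation as aS

# Train an HMM segmenter from a WAV file and its annotation file.
aS.trainHMM_fromFile("data/example.wav", "data/example.segments",
                     "hmm_speech_silence", 1.0, 0.1)

# The stored model can then be used to segment new recordings.
aS.hmmSegmentation("data/new_recording.wav", "hmm_speech_silence")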

We maybe don't need to recognize the music parts here, although that also sounds interesting; in principle it should be entirely possible to turn each manually segmented ELAN file into this kind of training file with speech and silence segments.
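A minimal sketch of that conversion, assuming pympi is used for reading the ELAN files; the tier name is hypothetical, and overlaps between annotations are not handled:

from pympi import Elan

def eaf_to_training_file(eaf_path, tier_name, out_path):
    # pympi returns (start, end, value) tuples in milliseconds.
    eaf = Elan.Eaf(eaf_path)
    annotations = sorted(eaf.get_annotation_data_for_tier(tier_name))

    rows = []
    previous_end = 0.0
    for start_ms, end_ms, _value in annotations:
        start, end = start_ms / 1000.0, end_ms / 1000.0
        # Gaps between annotated segments are treated as silence.
        if start > previous_end:
            rows.append("%.2f,%.2f,silence" % (previous_end, start))
        rows.append("%.2f,%.2f,speech" % (start, end))
        previous_end = end

    with open(out_path, "w") as out:
        out.write("\n".join(rows) + "\n")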
