MirasVoice: A Bilingual (English-Farsi) Speech Corpus

This repository contains a sample of the MirasVoice corpus and a description of it. The sample contains 100 minutes of English and Farsi speech from both male and female speakers. It is hosted on Google Drive, and you can download it here.
To use the complete dataset and text materials, submit a request to amir@miras-tech.com.

MirasVoice Description

MirasVoice contains 33 hours of audio data from 50 individuals who are native Farsi speakers and also fluent in English. There are approximately 40 minutes of audio per speaker: 20 minutes in English and 20 minutes in Farsi.
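The stated total follows from the per-speaker figures. The short Python check below is only an illustration of that arithmetic; the variable names are ours and are not part of the corpus or its tooling.

```python
# Figures taken from the description above.
SPEAKERS = 50
MINUTES_PER_LANGUAGE = 20
LANGUAGES = ("English", "Farsi")

# Each speaker recorded roughly the same amount in both languages.
minutes_per_speaker = MINUTES_PER_LANGUAGE * len(LANGUAGES)
total_minutes = SPEAKERS * minutes_per_speaker
total_hours = total_minutes / 60

print(f"{minutes_per_speaker} min per speaker, "
      f"{total_minutes} min in total, about {total_hours:.1f} hours")
```

Because the per-speaker duration is approximate, the result (about 33.3 hours) matches the stated 33 hours only to within rounding.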

The text material read by the volunteers includes a number of words, sentences, and numbers in English. This text has been translated into Farsi by educated native speakers, and the translated text has also been read by all participants. As shown in the table below, the text contains 250 words, 63 sentences, 80 numbers, and 17 questions.

Type of Text Material	Count
Words	250
Sentences	63
Numbers	80
Questions	17

We also gathered information on each participant's smoking status, blood pressure, age, height, accent, birth country, mother's birthplace (province), father's birthplace (province), and the province they grew up in, as well as the time of recording.

The audio data was recorded using a microphone with a sample rate of 48 kHz, a frequency response of 20 Hz to 20 kHz, a maximum Sound Pressure Level (SPL) of 120 dB, and a bit depth of 16 bits.
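If you want to confirm that a downloaded recording matches the stated sample rate and sample size, a minimal sketch using Python's standard `wave` module could look like the following. It assumes the recordings are uncompressed WAV files, which this README does not state, and the function name `check_recording_format` is hypothetical.

```python
import wave

# Values taken from the recording description above.
EXPECTED_SAMPLE_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample


def check_recording_format(path):
    """Return a list of mismatches between a WAV file and the stated format.

    Assumption: the file at `path` is an uncompressed (PCM) WAV file.
    An empty list means the file matches both expected values.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
    if rate != EXPECTED_SAMPLE_RATE_HZ:
        problems.append(
            f"sample rate is {rate} Hz, expected {EXPECTED_SAMPLE_RATE_HZ} Hz"
        )
    if width != EXPECTED_SAMPLE_WIDTH_BYTES:
        problems.append(
            f"sample width is {width * 8} bits, "
            f"expected {EXPECTED_SAMPLE_WIDTH_BYTES * 8} bits"
        )
    return problems
```

Calling `check_recording_format` on the path of a downloaded file returns an empty list when both values match. The frequency response and maximum SPL are properties of the microphone and cannot be read from the file header.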

For more information about MirasVoice, you can read the paper here.

Corpus Applications

At Miras Technologies International, we are using MirasVoice to develop speaker verification systems. MirasVoice is a bilingual speech corpus and can be used for:

Speaker Identification
Speech Recognition

Please let us know if you have used MirasVoice for any other purposes so that we can add them to this list.

Cite

Please cite the following paper in your publication if you are using MirasVoice in your research:

@InProceedings{VAHEB18.443,
  author = {Amir Vaheb and Ali Janalizadeh Choobbasti and Mahdi Mortazavi and Saeid Safavi and Behnam Sabeti},
  title = "{MirasVoice: A bilingual (English-Persian) speech corpus}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {may},
  date = {7-12},
  location = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {979-10-95546-00-9},
  language = {english}
  }
