JSPEECH: A Multi-Lingual Conversational Speech Corpus

This corpus contains 1332 hours of conversational speech in 47 languages and can be used in a variety of studies. Scraped from 106 public chat groups, it supports the study of the effects of device, language, speaker, and speech variability on the performance of speaker recognition and automatic language detection systems.

The corpus is hosted on Google Drive. You can download it after completing the Letter of Consent here and sending a signed copy, along with your Gmail address, to ali.janalizadeh@outlook.com.

JSpeech Description

JSpeech contains 452,007 audio messages scraped from public groups, comprising a total of 1332 hours of conversational audio data. The discussions in these groups are unstructured and involve multiple speakers. JSpeech is a multilingual corpus with audio speech data in 47 different languages from over 12,140 different speakers. The most notable feature of this audio data is the presence of different uncontrolled environments surrounding the speakers. This is useful for developing speech technologies that are robust to different kinds of background noise.

The audio data has been downloaded directly from Telegram using the Telethon API in OGG format. Metadata of each file is stored in an SQLite database.

To convert the audio to the WAV format, run the following commands while in the directory containing the OGG files:

# Install ffmpeg (on Debian/Ubuntu-based systems)
apt-get install ffmpeg
# Convert a single file from OGG to WAV
ffmpeg -i audio.ogg audio.wav
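
If you need to convert the whole corpus rather than a single file, a simple loop will do. The sketch below assumes all OGG files sit in the current directory and leaves the originals in place:

for f in *.ogg; do
  ffmpeg -i "$f" "${f%.ogg}.wav"   # write a .wav next to each .ogg
done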

To ensure the diversity and adequacy of the corpus, a set of 106 group chats from different backgrounds and languages was scraped from the public groups of the Telegram messaging application. Each voice message record carries the fields described in the table below.

Field Name        Description
Voice_id          Unique ID assigned to each voice message
User_id           Unique ID assigned to each speaker
Fwd_from          ID of the user this message was forwarded from
Reply_to_msg_id   ID of the message this message replies to
Date              Timestamp of the message
Size              Size of the voice message, in bytes
Duration          Duration of the voice message, in seconds
Chat_name         Name of the group the message belongs to
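
As an illustration of how this metadata can be inspected, the sketch below queries the SQLite database with the sqlite3 command-line tool. The database filename (metadata.db) and table name (voice_messages) are assumptions made for illustration, not names documented by the corpus; adjust them to match your copy.

# List a few voice messages with their duration and source group
# (metadata.db and voice_messages are assumed names)
sqlite3 metadata.db \
  "SELECT Voice_id, User_id, Duration, Chat_name FROM voice_messages LIMIT 10;"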

As shown in the bar chart below, the majority of the voice messages contain speech spoken in English, but a noticeable amount of audio data is also available in other languages such as Farsi, Spanish, and French.

[Bar chart: number of voice messages per language]

In addition, the chart below shows the distribution of the number of speakers available for each language.

[Bar chart: number of speakers per language]

If you would like more information about JSpeech, you can read the paper here.

Corpus Applications

It is expected that the availability of multilingual speech corpora recorded in a variety of background environments will boost R&D in the fields of automatic speaker recognition and voice activity detection.

At Miras Technologies International we are using JSpeech to develop Speaker and Speech Recognition Systems.

Cite

Please cite the following paper in your publication if you are using JSpeech in your research:

@article{choobbastijspeech,
  title={JSPEECH: A MULTI-LINGUAL CONVERSATIONAL SPEECH CORPUS},
  author={Choobbasti, Ali Janalizadeh and Gholamian, Mohammad Erfan and Vaheb, Amir and Safavi, Saeid}
}
