JSPEECH: A Multi-Lingual Conversational Speech Corpus

This corpus contains 1332 hours of conversational speech in 47 languages and can be used in a variety of studies. Scraped from 106 public chat groups, it supports the study of the effects of device, language, speaker, and speech variability on the performance of speaker recognition and automatic language detection systems.

The corpus is hosted on Google Drive. You can download it after completing the Letter of Consent here and sending a signed copy, along with your Gmail address, to ali.janalizadeh@outlook.com.

JSpeech Description

JSpeech contains 452,007 audio messages scraped from public groups, comprising a total of 1332 hours of conversational audio data. The discussions in these groups are unstructured and involve multiple speakers. JSpeech is a multilingual corpus with audio speech data in 47 different languages from over 12,140 different speakers. The most notable feature of this audio data is the presence of different uncontrolled environments surrounding the speakers. This is useful for developing speech technologies that are robust to different kinds of background noise.

The audio data has been downloaded directly from Telegram using the Telethon API in OGG format. Metadata of each file is stored in an SQLite database.

To convert the audio to the WAV format, run the following commands while in the directory containing the OGG files:

# Install ffmpeg (on Debian/Ubuntu-based systems)
apt-get install ffmpeg
# Convert a single file from OGG to WAV
ffmpeg -i audio.ogg audio.wav
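
If you need to convert the whole corpus rather than a single file, a simple loop will do. The sketch below assumes all OGG files sit in the current directory and leaves the originals in place:

for f in *.ogg; do
  ffmpeg -i "$f" "${f%.ogg}.wav"   # write a .wav next to each .ogg
done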

To ensure the diversity and adequacy of the corpus, a set of 106 group chats from different backgrounds and languages was scraped from the public groups of the Telegram messaging application. Each voice message record carries the fields described in the table below.

Field Name        Description
Voice_id          Unique ID assigned to each voice message
User_id           Unique ID assigned to each speaker
Fwd_from          ID of the user this message was forwarded from
Reply_to_msg_id   ID of the message this message replies to
Date              Timestamp of the message
Size              Size of the voice message, in bytes
Duration          Duration of the voice message, in seconds
Chat_name         Name of the group the message belongs to
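
As an illustration of how this metadata can be inspected, the sketch below queries the SQLite database with the sqlite3 command-line tool. The database filename (metadata.db) and table name (voice_messages) are assumptions made for illustration, not names documented by the corpus; adjust them to match your copy.

# List a few voice messages with their duration and source group
# (metadata.db and voice_messages are assumed names)
sqlite3 metadata.db \
  "SELECT Voice_id, User_id, Duration, Chat_name FROM voice_messages LIMIT 10;"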

As shown in the bar chart below, the majority of the voice messages contain speech spoken in English, but a noticeable amount of audio data is also available in other languages such as Farsi, Spanish, and French.

[Bar chart: number of voice messages per language]

In addition, the chart below shows the distribution of the number of speakers available for each language.

[Bar chart: number of speakers per language]

If you would like more information about JSpeech, you can read the paper here.

Corpus Applications

It is expected that the availability of multilingual speech corpora recorded in a variety of background environments will boost R&D in the fields of automatic speaker recognition and voice activity detection.

At Miras Technologies International we are using JSpeech to develop Speaker and Speech Recognition Systems.

Cite

Please cite the following paper in your publication if you are using JSpeech in your research:

@article{choobbastijspeech,
  title={JSPEECH: A MULTI-LINGUAL CONVERSATIONAL SPEECH CORPUS},
  author={Choobbasti, Ali Janalizadeh and Gholamian, Mohammad Erfan and Vaheb, Amir and Safavi, Saeid}
}
