In this repo you will find various datasets, scripts, etc. that will help you understand our APIs better and faster.
We formatted various openly available NLU datasets so that they can be directly used on our platform.
These datasets can be found in the datasets/nlu folder.
There are different sub-folders for different languages.
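
As a rough illustration of that layout, the following Python sketch lists the files found under each language sub-folder. It assumes the structure is `datasets/nlu/<language>/...` and that the script runs from the repository root. The helper name `list_datasets_by_language` is hypothetical and not part of any NeuralSpace tooling, and the sketch makes no assumption about file formats.

```python
from pathlib import Path


def list_datasets_by_language(root):
    """Map each language sub-folder under `root` to the files it contains.

    Assumes a layout of <root>/<language>/..., which is an assumption
    about this repository rather than a documented guarantee.
    """
    root = Path(root)
    if not root.is_dir():
        return {}
    result = {}
    # Only direct sub-folders are treated as languages; loose files are skipped.
    for language_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        result[language_dir.name] = [
            path.relative_to(language_dir).as_posix()
            for path in sorted(language_dir.rglob("*"))
            if path.is_file()
        ]
    return result


if __name__ == "__main__":
    for language, files in list_datasets_by_language("datasets/nlu").items():
        print(f"{language}: {len(files)} file(s)")
```

If the folder is missing, the function returns an empty dictionary instead of raising, so the script can be run safely from any directory.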
Information regarding the datasets can be found in the table below.
For citations of these datasets, see the citation information.
Dataset Name | Languages | License | Description |
---|---|---|---|
Hard | Arabic | | 93700 hotel reviews from booking.com |
Miam | English, French, German, Italian, Spanish | CC BY-SA 4.0 | Covers a variety of domains including spontaneous speech, scripted scenarios, and joint task completion |
Ask Ubuntu | English | CC BY-SA 3.0 | 162 questions and answers from https://askubuntu.com. |
Chatbot Corpus | English | CC BY-SA 3.0 | 206 questions from a Telegram chatbot for public transport in Munich |
Web Application Corpus | English | CC BY-SA 3.0 | 89 questions and answers from https://webapps.stackexchange.com. |
Hope_edi | English, Tamil, Malayalam | CC BY-SA 4.0 | A Hope Speech dataset for Equality, Diversity and Inclusion (HopeEDI) containing user-generated comments from the social media platform YouTube. |
Atis | English | Apache-2.0 License | Word sequences with IOB slot tags and the intent label |
Snips | English | Apache-2.0 License | Word sequences with IOB slot tags and the intent label |
Multilingual Task Oriented | English, Spanish | ||
It Helpdesk | English | ||
Allocine | French | MIT License | French-language dataset for sentiment analysis |
Flue | French | CC BY-SA 4.0 | FLUE is an evaluation setup for French NLP systems similar to the popular GLUE benchmark |
Facebook Post Aggression Identification | Hindi, Hinglish | CC-BY-NC-SA 4.0 | Dataset with 3-way classification between 'Overtly Aggressive (OAG)', 'Covertly Aggressive (CAG)' and 'Non-aggressive (NAG)' over text data |
Ilist | Hindi, Braj Bhasha, Awadhi, Bhojpuri, Magahi | Apache-2.0 License | This dataset was introduced in a task aimed at identifying 5 closely related languages of the Indo-Aryan language family: Hindi (also known as Khari Boli), Braj Bhasha, Awadhi, Bhojpuri and Magahi |
Dravidian Codemix HASOC 2020 | Tanglish, Manglish (Tamil and Malayalam written in Roman script) | | The dataset was collected from YouTube comments and Tweets. Each comment/post is annotated with an offensive language label at the comment/post level. |
Telugu News | Telugu | | This dataset contains Telugu language news articles along with respective topic labels (business, editorial, entertainment, nation, sport) extracted from the daily Andhra Jyoti |
Profanity | Turkish | | Annotation follows the hierarchical tagset proposed in the Offensive Language Identification Dataset (OLID) |
Banking77 | English | CC BY-SA 4.0 | The dataset is based on the banking domain and has 77 intents |
SMP2019 | Chinese | | The dataset is based on 29 domains, including: app, email... |
Rasa Dataset Chinese | Chinese | | The dataset is based on the Rasa dataset translated to Chinese |
JointDSF | Vietnamese | GNU Affero General Public License v3.0 | The dataset is based on the ATIS dataset translated to Vietnamese |
Urdu Fake News | Urdu | | The dataset is based on fake news detection in Urdu taken from Hugging Face |
Malayalam News Classification | Malayalam | CC BY-SA 4.0 | The dataset is based on news classification in Malayalam language from AI4Bharat |
Marathi News Classification | Marathi | CC BY-SA 4.0 | The dataset is based on news classification in Marathi language from AI4Bharat |
NeuralSpace does not own any rights to these datasets, and they are not for commercial use. The license for each of these datasets will be added here soon.