lexicalcomputing/hamod

HAMOD

a High Agreement Multi-lingual Outlier Detection dataset

Data

This site hosts a multi-lingual dataset of manually prepared data for the outlier detection exercise. Outlier detection is the task of selecting the outlier: the word that does not fit a given set of words according to some (typically semantic) criterion.

Examples

  • blue, red, green, yellow, orange, black, brown, white, table. Obviously, the last word is the outlier: all the others are names of colours.
  • bricklayer, lawyer, shop assistant, gentleman, waitress, meteorologist. Gentleman is the outlier because it is not a job.

Dataset format

The dataset consists of plain text files, each containing 8 lines of in-domain words, a blank line, and 8 lines of outlier words. Each file therefore yields 8 exercises for sets of 9 words (the 8 in-domain words plus one of the 8 outliers), or many more exercises for shorter sets: for example, 8 × 8 = 64 exercises for sets of 8 words, obtained by choosing one of the 8 outliers and removing one of the 8 in-domain words. An example file for English covering the set of birds:

swan
duck
seagull
eagle
dove
crow
stork
goose

monkey
salmon
grasshopper
fly
egg
plane
woman
cliff
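
The exercise counts described above can be reproduced with a short script. The following is a minimal sketch, not part of the dataset's tooling: the function names `parse_hamod` and `make_exercises` are hypothetical, and the code assumes only the file layout described in this section (in-domain lines, a blank line, outlier lines).

```python
from itertools import product


def parse_hamod(text):
    """Split the contents of one dataset file into (in_domain, outliers).

    Assumes the layout described above: in-domain words, a blank line,
    then outlier words, one entry per line.
    """
    in_domain, outliers = [], []
    target = in_domain
    for line in text.splitlines():
        word = line.strip()
        if not word:
            # The first blank line after the in-domain words starts the outliers.
            if in_domain:
                target = outliers
            continue
        target.append(word)
    return in_domain, outliers


def make_exercises(in_domain, outliers, drop_one=False):
    """Yield (words, outlier) pairs.

    drop_one=False: all in-domain words plus one outlier (8 sets of 9 words).
    drop_one=True:  one in-domain word removed as well (64 sets of 8 words).
    """
    if not drop_one:
        for outlier in outliers:
            yield in_domain + [outlier], outlier
    else:
        for outlier, i in product(outliers, range(len(in_domain))):
            yield in_domain[:i] + in_domain[i + 1:] + [outlier], outlier


# The English birds file shown above, inlined so the sketch is self-contained.
SAMPLE = """swan
duck
seagull
eagle
dove
crow
stork
goose

monkey
salmon
grasshopper
fly
egg
plane
woman
cliff
"""

in_domain, outliers = parse_hamod(SAMPLE)
print(len(list(make_exercises(in_domain, outliers))))
print(len(list(make_exercises(in_domain, outliers, drop_one=True))))
```

Each yielded pair holds the word set and the expected answer, so the outlier can be shuffled into the set before presenting it to an annotator or a model.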

Motivation

The outlier detection task features very high agreement (typically over 90%) among human annotators and can be used e.g. for the evaluation of distributional thesauri. Please read the papers referenced below for all the details.
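
As an illustration of such an evaluation, the sketch below scores a word similarity function on outlier detection exercises. It is an assumption-laden example, not the procedure used in the referenced papers: it guesses the outlier as the word with the lowest mean similarity to the rest of the set, and `pick_outlier`, `accuracy`, `cosine` and `TOY_VECTORS` are all hypothetical names with made-up toy values. In practice the similarity function would come from the thesaurus or embedding model under evaluation.

```python
import math


def pick_outlier(words, similarity):
    """Return the word with the lowest mean similarity to the other words."""
    def mean_similarity(word):
        others = [other for other in words if other != word]
        return sum(similarity(word, other) for other in others) / len(others)
    return min(words, key=mean_similarity)


def accuracy(exercises, similarity):
    """Fraction of (words, outlier) exercises where the guess is correct."""
    correct = sum(
        pick_outlier(words, similarity) == outlier
        for words, outlier in exercises
    )
    return correct / len(exercises)


# Hypothetical 2-dimensional toy vectors, for demonstration only.
TOY_VECTORS = {
    "swan": (1.0, 0.1),
    "duck": (0.9, 0.2),
    "crow": (0.8, 0.1),
    "plane": (0.1, 1.0),
}


def cosine(a, b):
    """Cosine similarity between the toy vectors of words a and b."""
    va, vb = TOY_VECTORS[a], TOY_VECTORS[b]
    dot = sum(x * y for x, y in zip(va, vb))
    return dot / (math.hypot(*va) * math.hypot(*vb))


toy_exercises = [(["swan", "duck", "crow", "plane"], "plane")]
print(pick_outlier(toy_exercises[0][0], cosine))
print(accuracy(toy_exercises, cosine))
```

Passing the similarity function as a parameter keeps the scoring independent of any particular model, so the same exercises can be reused to compare several thesauri or embeddings.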

Languages

At the moment the dataset consists of the following languages:

  • Czech
  • German
  • English
  • Estonian
  • French
  • Italian
  • Slovak

If you would like to collaborate with us on adding a new language, please use the contact below.

Authors

This dataset was developed by Lexical Computing, particularly by (in alphabetical order) Michal Cukr, Ondřej Herman, Miloš Jakubíček, Vojtěch Kovář, Emma Romani and Pavel Rychlý.

Contact

Please use inquiries@sketchengine.eu for any questions or requests.

License

The dataset is licensed under the CC-BY-SA 4.0 license. Attribution in any research context shall be carried out by properly citing the papers referenced below. We would appreciate it if you let us know about any derived work.

How to cite

Please cite:

  • Romani, E. (2022). Building A Multilingual Outlier Detection Dataset For The Evaluation Of Distributional Thesauri And Word Embeddings. Master's thesis, University of Pavia.

    BibTeX:

    @mastersthesis{hamod_thesis,
      title={Building A Multilingual Outlier Detection Dataset For The Evaluation Of Distributional Thesauri And Word Embeddings},
      author={Romani, Emma},
      school={The University of Pavia},
      year={2022}
    }
    
  • Jakubíček, M., Romani, E., Rychlý, P., & Herman, O. (2021). Development of HAMOD: a High Agreement Multi-lingual Outlier Detection dataset. In RASLAN 2021 Recent Advances in Slavonic Natural Language Processing, 177-183.

    BibTeX:

    @inproceedings{hamod,
      title={Development of HAMOD: a High Agreement Multi-lingual Outlier Detection dataset},
      author={Jakubíček, Miloš and Romani, Emma and Rychlý, Pavel and Herman, Ondřej},
      booktitle={Proceedings of the Fifteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2021},
      year={2021},
      pages={177--183},
      publisher={Tribun EU}
    }
    
  • Rychlý, P. (2019). Evaluation of Czech Distributional Thesauri. In RASLAN 2019 Recent Advances in Slavonic Natural Language Processing, 137-142.

    BibTeX:

    @inproceedings{thesaurievaluation,
      title={Evaluation of Czech Distributional Thesauri},
      author={Rychlý, Pavel},
      booktitle={Proceedings of the Thirteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2019},
      pages={137--142},
      year={2019},
      publisher={Tribun EU}
    }