Building the components requires the `build-essential` and `python-dev` packages, which can be installed with `sudo apt-get install build-essential python-dev`. You must also have setuptools installed for Python.
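A minimal setup sketch (assuming `pip` is available; installing `python-setuptools` via apt works as well):

```sh
# C/C++ toolchain and Python headers needed to build the components
sudo apt-get install build-essential python-dev

# setuptools (assumption: pip is present; alternatively
# sudo apt-get install python-setuptools)
pip install setuptools
```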
Install the newest version of 4lang. Notes:
- the downloadable pre-compiled graphs are sufficient
- you don't have to modify the config files
- set only the `FOURLANGPATH` and `HUNTOOLSBINPATH` environment variables
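A sketch of the environment setup (both paths are placeholders for wherever you installed 4lang and the huntools binaries; adjust them to your machine):

```sh
# Assumption: 4lang was cloned to ~/4lang and the huntools binaries
# live in ~/4lang/hunmorph -- both paths are hypothetical examples
export FOURLANGPATH=~/4lang
export HUNTOOLSBINPATH=~/4lang/hunmorph
```

Adding these lines to your shell profile (e.g. `~/.bashrc`) makes them persistent.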
Install the newest version of each of the following resources: SENNA, Huang, word2vec, GloVe, SP, Paragram, and SimLex (per-resource preparation notes follow the directory tree below).
After preparing the resources you should get the following directory structure:
```
wordsim
└───resources
    ├───embeddings
    │   ├───senna
    │   │   └───combined.txt
    │   ├───huang
    │   │   └───combined.txt
    │   ├───word2vec
    │   │   └───GoogleNews-vectors-negative300.bin
    │   ├───glove
    │   │   └───glove.840B.300d.w2v
    │   ├───sympat
    │   │   └───sp_plus_embeddings_500.w2v
    │   └───paragram_300
    │       └───paragram_300_sl999.txt
    └───sim_data
        └───simlex
            └───SimLex-999.txt
```
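The empty layout can be created up front and the prepared files moved into place afterwards (a minimal sketch that simply mirrors the tree above):

```sh
mkdir -p wordsim/resources/embeddings/{senna,huang,word2vec,glove,sympat,paragram_300}
mkdir -p wordsim/resources/sim_data/simlex
```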
- SENNA: download and extract the package, then run `paste hash/words.lst embeddings/embeddings.txt > combined.txt`.
- Huang: download and extract the ACL2012_wordVectorsTextFile.zip file, then run `paste vocab.txt wordVectors.txt > combined.txt`.
- word2vec: download and extract the GoogleNews-vectors-negative300.bin.gz file.
- GloVe: download and extract the glove.840B.300d.zip file.
- SP: download and extract the sp_plus_embeddings_500.dat.gz file, then insert the line `152229 500` at the beginning of the .dat file with `echo '152229 500' | cat - sp_plus_embeddings_500.dat > sp_plus_embeddings_500.w2v`.
- Paragram: download and extract the paragram_300_sl999.zip file.
- SimLex: download and extract the SimLex-999.zip file.
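The per-resource steps above can be scripted once the archives are downloaded. A consolidated sketch, assuming every archive sits in the working directory and each resulting file is then moved into its subdirectory of the tree above; the GloVe-to-w2v conversion via gensim is an assumption, since the notes only say to extract the zip while the tree expects a .w2v file:

```sh
# SENNA (run inside the extracted SENNA directory):
# join the word list with the vector rows
paste hash/words.lst embeddings/embeddings.txt > combined.txt

# Huang (run inside the extracted archive's directory):
unzip ACL2012_wordVectorsTextFile.zip
paste vocab.txt wordVectors.txt > combined.txt

# word2vec: just decompress
gunzip GoogleNews-vectors-negative300.bin.gz

# GloVe: extract, then convert to word2vec text format
# (assumption: gensim's glove2word2vec script is one way to
# produce the .w2v file the directory tree expects)
unzip glove.840B.300d.zip
python -m gensim.scripts.glove2word2vec --input glove.840B.300d.txt --output glove.840B.300d.w2v

# SP: decompress, then prepend the "<vocab size> <dimensions>" header
gunzip sp_plus_embeddings_500.dat.gz
echo '152229 500' | cat - sp_plus_embeddings_500.dat > sp_plus_embeddings_500.w2v

# Paragram and SimLex: just extract
unzip paragram_300_sl999.zip
unzip SimLex-999.zip
```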
Run `python src/wordsim/regression.py configs/default.cfg` to get a regression on features from 6 embeddings (6 features) + wordnet metrics (4 features) + 4lang (2 features). You should get `average correlation: 0.755074732764` as the result.
NOTE: wordsim requires ca. 15 GB of RAM to load all models
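For reference, a successful run should print the correlation line below (intermediate output elided; exact surrounding output may differ):

```sh
$ python src/wordsim/regression.py configs/default.cfg
...
average correlation: 0.755074732764
```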
If you use the wordsim system in your experiments, please cite:

Gábor Recski, Eszter Iklódi, Katalin Pajkossy, András Kornai: Measuring semantic similarity of words using concept networks. In: Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.
@InProceedings{Recski:2016c,
  author    = {Recski, G\'{a}bor and Ikl\'{o}di, Eszter and Pajkossy, Katalin and Kornai, Andr\'{a}s},
title = {Measuring Semantic Similarity of Words Using Concept Networks},
booktitle = {Proceedings of the 1st Workshop on Representation Learning for NLP},
year = {2016},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {193--200}
}