Building the components requires the `build-essential` and `python-dev` packages, which can be installed with `sudo apt-get install build-essential python-dev`. You must also have setuptools installed for Python.
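A minimal setup sketch (assuming `pip` is available; installing `python-setuptools` via apt works as well):

```sh
# C/C++ toolchain and Python headers needed to build the components
sudo apt-get install build-essential python-dev

# setuptools (assumption: pip is present; alternatively
# sudo apt-get install python-setuptools)
pip install setuptools
```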
Install the newest version of 4lang. Notes:
- the downloadable pre-compiled graphs are sufficient
- you don't have to modify the config files
- set only the `FOURLANGPATH` and `HUNTOOLSBINPATH` environment variables
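A sketch of the environment setup (both paths are placeholders for wherever you installed 4lang and the huntools binaries; adjust them to your machine):

```sh
# Assumption: 4lang was cloned to ~/4lang and the huntools binaries
# live in ~/4lang/hunmorph -- both paths are hypothetical examples
export FOURLANGPATH=~/4lang
export HUNTOOLSBINPATH=~/4lang/hunmorph
```

Adding these lines to your shell profile (e.g. `~/.bashrc`) makes them persistent.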
Install the newest version of each of the following resources: SENNA, Huang, word2vec, GloVe, SP, Paragram, and SimLex (per-resource preparation notes follow the directory tree below).
After preparing the resources you should get the following directory structure:
```
wordsim
└───resources
    ├───embeddings
    │   ├───senna
    │   │   └───combined.txt
    │   ├───huang
    │   │   └───combined.txt
    │   ├───word2vec
    │   │   └───GoogleNews-vectors-negative300.bin
    │   ├───glove
    │   │   └───glove.840B.300d.w2v
    │   ├───sympat
    │   │   └───sp_plus_embeddings_500.w2v
    │   └───paragram_300
    │       └───paragram_300_sl999.txt
    └───sim_data
        └───simlex
            └───SimLex-999.txt
```
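The empty layout can be created up front and the prepared files moved into place afterwards (a minimal sketch that simply mirrors the tree above):

```sh
mkdir -p wordsim/resources/embeddings/{senna,huang,word2vec,glove,sympat,paragram_300}
mkdir -p wordsim/resources/sim_data/simlex
```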
- SENNA: download and extract the package, then run `paste hash/words.lst embeddings/embeddings.txt > combined.txt`.
- Huang: download and extract the ACL2012_wordVectorsTextFile.zip file, then run `paste vocab.txt wordVectors.txt > combined.txt`.
- word2vec: download and extract the GoogleNews-vectors-negative300.bin.gz file.
- GloVe: download and extract the glove.840B.300d.zip file.
- SP: download and extract the sp_plus_embeddings_500.dat.gz file, then insert the line `152229 500` at the beginning of the .dat file with `echo '152229 500' | cat - sp_plus_embeddings_500.dat > sp_plus_embeddings_500.w2v`.
- Paragram: download and extract the paragram_300_sl999.zip file.
- SimLex: download and extract the SimLex-999.zip file.
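The per-resource steps above can be scripted once the archives are downloaded. A consolidated sketch, assuming every archive sits in the working directory and each resulting file is then moved into its subdirectory of the tree above; the GloVe-to-w2v conversion via gensim is an assumption, since the notes only say to extract the zip while the tree expects a .w2v file:

```sh
# SENNA (run inside the extracted SENNA directory):
# join the word list with the vector rows
paste hash/words.lst embeddings/embeddings.txt > combined.txt

# Huang (run inside the extracted archive's directory):
unzip ACL2012_wordVectorsTextFile.zip
paste vocab.txt wordVectors.txt > combined.txt

# word2vec: just decompress
gunzip GoogleNews-vectors-negative300.bin.gz

# GloVe: extract, then convert to word2vec text format
# (assumption: gensim's glove2word2vec script is one way to
# produce the .w2v file the directory tree expects)
unzip glove.840B.300d.zip
python -m gensim.scripts.glove2word2vec --input glove.840B.300d.txt --output glove.840B.300d.w2v

# SP: decompress, then prepend the "<vocab size> <dimensions>" header
gunzip sp_plus_embeddings_500.dat.gz
echo '152229 500' | cat - sp_plus_embeddings_500.dat > sp_plus_embeddings_500.w2v

# Paragram and SimLex: just extract
unzip paragram_300_sl999.zip
unzip SimLex-999.zip
```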
Run `python src/wordsim/regression.py configs/default.cfg` to get a regression on features from 6 embeddings (6 features) + wordnet metrics (4 features) + 4lang (2 features). You should get `average correlation: 0.755074732764` as the result.
NOTE: wordsim requires ca. 15 GB of RAM to load all models
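For reference, a successful run should print the correlation line below (intermediate output elided; exact surrounding output may differ):

```sh
$ python src/wordsim/regression.py configs/default.cfg
...
average correlation: 0.755074732764
```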
If you use the wordsim system in your experiments, please cite:

Gábor Recski, Eszter Iklódi, Katalin Pajkossy, András Kornai: Measuring semantic similarity of words using concept networks. In: Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.
@InProceedings{Recski:2016c,
  author    = {Recski, G\'{a}bor and Ikl\'{o}di, Eszter and Pajkossy, Katalin and Kornai, Andr\'{a}s},
title = {Measuring Semantic Similarity of Words Using Concept Networks},
booktitle = {Proceedings of the 1st Workshop on Representation Learning for NLP},
year = {2016},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {193--200}
}