IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces (EMNLP 2022)

kellymarchisio/isovec

IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces

This is an implementation of the experiments and combination system presented in the paper IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces (EMNLP 2022).

If you use this software for academic research, please cite the paper above.

Requirements

  • python3
  • pytorch
  • sklearn
  • scipy
  • numpy
  • indic-nlp-library
  • torchtext
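
Before running the setup scripts, it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of the repository. The import names in `REQUIRED` are assumptions (for example, that pytorch imports as `torch` and indic-nlp-library imports as `indicnlp`), and `missing_packages` is a hypothetical helper.

```python
import importlib.util

# Requirement name from the list above -> assumed top-level import name.
REQUIRED = {
    "pytorch": "torch",
    "sklearn": "sklearn",
    "scipy": "scipy",
    "numpy": "numpy",
    "indic-nlp-library": "indicnlp",
    "torchtext": "torchtext",
}


def missing_packages(required=REQUIRED):
    """Return the requirement names whose import module cannot be found."""
    return sorted(
        name
        for name, module in required.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

An empty result only shows that the modules can be located. It does not check versions.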

Setup

  • Download third party packages: cd third_party && sh get_third_party.sh && cd ..
    • Note: On a Mac with an M1 chip, word2vec might not build. To fix this, change -march=native to -mcpu=apple-m1 in word2vec's makefile and replace fgetc_unlocked/fputc_unlocked with getc_unlocked/putc_unlocked. You will also need to use gshuf instead of shuf in src/train.py.
  • Download and make data: cd data && sh make_data.sh
  • Download and make train/dev/test dictionaries: cd data/dicts && sh create_dicts.sh

Usage

To reproduce Table 1 in the paper (Baselines), run:

  • sh baseline.sh $system $lang $seed
    • For instance, run sh baseline.sh w2v uk for official word2vec trained on Ukrainian.
    • system choices: {isovec, w2v}
    • lang choices: {uk, bn, ta, en}
  • After you train English and Ukrainian baseline w2v spaces, for instance, you can map them and evaluate the dictionary precision with: sh map-and-eval.sh baseline w2v uk en dev
    • Results will be in exps/baseline/w2v/uk-en/*out
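
To give a sense of what "map and evaluate the dictionary precision" means, here is a minimal sketch of one common approach: an orthogonal Procrustes mapping fit on seed pairs, followed by nearest-neighbor precision@1. This is an illustration under stated assumptions, not the logic of map-and-eval.sh, which may use a different mapping method and retrieval criterion. The names `procrustes` and `precision_at_1` are hypothetical, and the data in the demo is synthetic.

```python
import numpy as np


def procrustes(src, tgt):
    """Orthogonal W minimizing ||src @ W - tgt||_F (rows are paired words)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


def precision_at_1(src_emb, tgt_emb, pairs):
    """Fraction of source words whose cosine nearest neighbor is a gold translation.

    pairs: iterable of (source_index, target_index) dictionary entries.
    """
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    gold = {}
    for i, j in pairs:
        gold.setdefault(i, set()).add(j)
    hits = 0
    for i, targets in gold.items():
        predicted = int(np.argmax(t @ s[i]))  # index of most similar target word
        hits += predicted in targets
    return hits / len(gold)


if __name__ == "__main__":
    # Synthetic check: the "target" space is an exact rotation of the "source".
    rng = np.random.default_rng(0)
    source = rng.normal(size=(50, 8))
    rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
    target = source @ rotation
    w = procrustes(source[:20], target[:20])  # fit on 20 seed pairs
    test_pairs = [(i, i) for i in range(20, 50)]
    print(precision_at_1(source @ w, target, test_pairs))
```

Real embedding spaces are only approximately related by a rotation, which is why the degree of isomorphism between them affects the precision this kind of mapping can reach.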

To run IsoVec in reference to a fixed embedding space (main experiments):

  • Example Goal: Train a Ukrainian embedding space with RSIM-U, in reference to a fixed English space.
  • Step 1: Train the fixed English space with sh baseline.sh isovec en
  • Step 2: Train the Ukrainian space with: sh run-isovec.sh rsim-u uk en
    • Choices of IsoVec training algorithm are l2, proc-l2, proc-l2-init, rsim, rsim-init, rsim-u, and evs-u. These correspond to L2, Proc-L2, Proc-L2+Init, RSIM, RSIM-U, and EVS-U as detailed in Sections 4.3 and 4.4 of the paper.
  • Step 3: Map & Evaluate the spaces with: sh map-and-eval.sh isovec rsim-u uk en dev
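
The three steps above can be chained from a small driver. The sketch below only assembles the commands exactly as they appear in this README and prints them by default. `isovec_pipeline` and `run_pipeline` are hypothetical helpers, not part of the repository, and the sketch assumes it is run from the repository root.

```python
import subprocess


def isovec_pipeline(algorithm, lang, ref_lang, split="dev"):
    """Build the Step 1-3 commands for one IsoVec experiment."""
    return [
        # Step 1: train the fixed reference space.
        ["sh", "baseline.sh", "isovec", ref_lang],
        # Step 2: train the new space in reference to the fixed one.
        ["sh", "run-isovec.sh", algorithm, lang, ref_lang],
        # Step 3: map the two spaces and evaluate.
        ["sh", "map-and-eval.sh", "isovec", algorithm, lang, ref_lang, split],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline(isovec_pipeline("rsim-u", "uk", "en"))
```

Calling `run_pipeline(..., dry_run=False)` would execute the scripts in order. Since training the reference space is the same for every algorithm, Step 1 only needs to run once per reference language.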
