subword2vec is the code repository for training word embeddings enriched with sub-word knowledge such as character n-grams, lemmas, morphological tags, and phonemes. This library was used in our paper Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations (EMNLP 2018). subword2vec is based on the fastText and prop2vec libraries.
- gcc-4.6.3 or newer (for compiling)
Each word is represented with its sub-word information. For instance, the Hindi word नदी (river) is represented as w:नदी~l:नद~m:N+Fem+Dir+Sg~ipa:nədiː. An example input file is provided in the example directory.
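To make the format concrete, here is a minimal Python sketch (not part of the repository) that builds one token in the w:...~l:...~m:...~ipa:... layout shown above; the annotate() helper is hypothetical, and the lemma, morphological tag, and IPA transcription are assumed to come from your own annotation pipeline.

```python
# Minimal sketch (not part of the repository): build one token in the
# w:...~l:...~m:...~ipa:... input format described above. The annotate()
# helper is illustrative; your own annotation pipeline supplies the lemma,
# morphological tag, and IPA transcription.

def annotate(word, lemma, morph, ipa):
    """Join the surface form and its sub-word units with '~' separators."""
    return "~".join([f"w:{word}", f"l:{lemma}", f"m:{morph}", f"ipa:{ipa}"])

# Reproduces the example above:
print(annotate("नदी", "नद", "N+Fem+Dir+Sg", "nədiː"))
# w:नदी~l:नद~m:N+Fem+Dir+Sg~ipa:nədiː
```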
cd Embeddings
make
./fasttext skipgram -input example/sample_input.txt -output example/sample_output -lr 0.025 -dim 100 -t 1e-3 -props w+l+m -minCount 2 -ws 3 -bucket 2000000 -lemmaoutput example/sample_lemma_output -morphoutput example/sample_morph_output
This command trains embeddings by averaging the sub-word units: character n-grams (w:), lemma (l:), and morphological tags (m:). Different combinations can be provided in the -props field depending on your requirements. For example, to combine character n-grams and morphological tags, provide -props w+m. The arguments -lemmaoutput and -morphoutput output the embeddings of each sub-word unit; see the loading sketch below.
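Assuming the sub-word output files follow the standard text embedding layout (a `<count> <dim>` header followed by one token-vector line each, as in the pretrained format shown in the next section), they can be loaded with a short script. This is only a sketch; the file name and extension below are assumptions.

```python
# Minimal sketch (assumption: the sub-word output files use the usual
# "<count> <dim>" header followed by "<token> <v1> ... <vdim>" lines;
# the file name below is illustrative).
import numpy as np

def load_vec(path):
    """Load a text-format embedding file into a {token: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        _count, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:dim + 1], dtype=np.float32)
    return vectors

lemma_vectors = load_vec("example/sample_lemma_output.vec")
```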
To initialize the sub-word vectors with pretrained embeddings, the pretrained embeddings should have the following format:
102345 100
Punc 1.4163 1.697 -0.95646 0.4587 1.2924 0 ...
where 102345 denotes the number of unique sub-words and 100 denotes the embedding size. Then run the following:
./fasttext skipgram -input example/sample_input.txt -output example/sample_output_pretrained_with_morph -ws 3 -t 1e-3 -minCount 2 -lr 0.025 -bucket 2000000 -props w+l+m -pretrainedVectors example/morph_output.vec
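If you need to produce such a pretrained-vector file yourself, the following minimal sketch writes one in the format described above; the tokens, values, and output path are purely illustrative.

```python
# Minimal sketch: write sub-word vectors in the "<count> <dim>" header +
# one "<token> <v1> ... <vdim>" line format expected by -pretrainedVectors.
# The tokens, values, and output path below are illustrative only.
import numpy as np

def write_vec(path, vectors):
    """vectors: {token: 1-D numpy array}, all with the same dimension."""
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(vectors)} {dim}\n")
        for token, vec in vectors.items():
            f.write(token + " " + " ".join(f"{x:.5f}" for x in vec) + "\n")

write_vec("example/my_pretrained.vec",
          {"Punc": np.random.randn(100), "N+Fem+Dir+Sg": np.random.randn(100)})
```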
The embeddings that gave the best performance on the NER task in our paper Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations (EMNLP 2018) are available in the embeddings_released folder.
If you make use of this software for research purposes, we would appreciate it if you cite the following:
@InProceedings{D18-1366,
author = "Chaudhary, Aditi
and Zhou, Chunting
and Levin, Lori
and Neubig, Graham
and Mortensen, David R.
and Carbonell, Jaime",
title = "Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "3285--3295",
location = "Brussels, Belgium",
url = "http://aclweb.org/anthology/D18-1366"
}
For any issues, please feel free to reach out to aschaudh@andrew.cmu.edu.