In a Vector Space of La Mancha, whose Position I Do Wish to Recall... 🍇


This repository presents a comprehensive evaluation of Spanish word embeddings, with paper and code. We trained word vectors on a corpus of texts about La Solana using the fastText library. You can visualize them by following this link to the TensorFlow Embedding Projector. Try searching for 'la_casota', 'ofrecimiento' or 'moje'. Do you have any idea what these words mean in La Solana?
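The same kind of lookup can be done offline. The following is a minimal sketch, assuming the vectors are available in the plain-text `.vec` layout that fastText can export (a header line with the vector count and dimension, then one word and its values per line). The helper names `parse_vec`, `cosine` and `nearest` are hypothetical, and the toy vectors are made up for illustration only.

```python
import math


def parse_vec(lines):
    """Parse the text vector format: a 'count dim' header, then 'word v1 ... vd' per line."""
    it = iter(lines)
    _count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vectors, word, k=5):
    """Return the k words most similar to `word`, as (word, similarity) pairs."""
    target = vectors[word]
    scored = [(cosine(target, v), w) for w, v in vectors.items() if w != word]
    scored.sort(reverse=True)
    return [(w, s) for s, w in scored[:k]]


# Toy data, NOT taken from the real embeddings.
toy_lines = ["3 2", "a 1.0 0.0", "b 0.9 0.1", "c 0.0 1.0"]
toy_vectors = parse_vec(toy_lines)
top = nearest(toy_vectors, "a", k=1)
```

With the real file, one would pass an open file handle to `parse_vec` and query a word such as 'moje' instead of the toy key.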

Summary and links for lasolana-embeddings:

| Corpus | Size | Algorithm | Vectors | Vec-Dim |
|---|---|---|---|---|
| Collected corpus | 4.1M | FastText | 29,682 | 150 |

La Solana FastText embeddings

Links to the embeddings (#dimensions=150, #vectors=29,682):

Corpus

All the digitized text about La Solana that we have access to:

  • Corpus size: over 4 million words
  • Preprocessing: explained in the training_eval notebook

Algorithm

Implementation: FastText with the skip-gram model and subword n-grams disabled

Hyperparameters

  • min subword-ngram = 0
  • max subword-ngram = 0
  • neg = 5
  • ws = 5
  • epochs = 5
  • dim = 150
  • all other parameters set as default
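
As a rough illustration, the settings above map onto the `fasttext` Python package as sketched below. This assumes the official Python bindings (`pip install fasttext`) and a preprocessed corpus with one text per line. The path `corpus.txt` and the function `train` are hypothetical, and the authors' actual training code lives in their notebook and may differ.

```python
# Hyperparameters from the list above, using the argument names of
# fasttext.train_unsupervised. minn = maxn = 0 turns subword n-grams off.
HYPERPARAMS = {
    "model": "skipgram",
    "minn": 0,   # min subword n-gram length
    "maxn": 0,   # max subword n-gram length
    "neg": 5,    # number of negative samples
    "ws": 5,     # context window size
    "epoch": 5,  # passes over the corpus
    "dim": 150,  # vector dimensionality
}


def train(corpus_path):
    """Train skip-gram vectors on a plain-text corpus and return the model."""
    import fasttext  # imported lazily so the settings can be inspected without it

    return fasttext.train_unsupervised(corpus_path, **HYPERPARAMS)


# Example call (hypothetical path):
# model = train("corpus.txt")
# model.save_model("lasolana.bin")
```

All parameters not listed in `HYPERPARAMS` fall back to the library defaults, matching the last item of the list.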