Support for fastText, word2vec, and text embeddings

@danieldk danieldk released this 10 Sep 09:00

The largest change in this release is support for reading fastText, word2vec, and text embeddings, in addition to finalfusion embeddings.

  • Add support for reading fastText (Embeddings.read_fasttext()), text (Embeddings.read_text()), textdims (Embeddings.read_text_dims()), and word2vec (Embeddings.read_word2vec()) formats.
  • The read method for each of these newly supported formats takes a keyword argument lossy. If set, the embeddings are read lossily, which permits invalid UTF-8 in words.
  • Add the embedding_similarity method, which looks up words that are similar to a given embedding. The method for traditional word-based lookups has been renamed from similarity to word_similarity.
  • Iteration over embeddings returned (word, embedding) tuples in previous releases. It now returns instances of the Embedding class, which provide word, embedding, and norm properties. norm is the l2 norm that the embedding had before it was normalized.
  • Add support for memory mapping quantized embedding matrices.
  • Add the ngram_indices and subword_indices methods to the Vocab class. These methods return the subword indices for a given word, which can be used to retrieve the subword embeddings individually. The ngram_indices method returns each subword with its index, whereas subword_indices only returns the indices.
  • Update to pyo3 0.8.
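
To make three of the changes above more concrete, here is a minimal pure-Python sketch of the ideas behind lossy reading, the norm property, and lookup by embedding. It does not call finalfusion. The names Embedding, decode_word, normalize, and embedding_similarity are hypothetical stand-ins written for this illustration, and the sketch assumes that lossy reading replaces invalid byte sequences with U+FFFD (these notes only state that invalid UTF-8 is permitted).

```python
import math
from typing import List, NamedTuple, Tuple


class Embedding(NamedTuple):
    """Hypothetical stand-in for the item yielded when iterating embeddings."""
    word: str
    embedding: List[float]  # l2-normalized vector
    norm: float             # l2 norm of the vector before normalization


def decode_word(raw: bytes, lossy: bool = False) -> str:
    # Assumption: lossy decoding substitutes U+FFFD for invalid bytes.
    # Strict decoding raises UnicodeDecodeError instead.
    return raw.decode("utf-8", errors="replace" if lossy else "strict")


def normalize(word: str, vector: List[float]) -> Embedding:
    # Keep the original l2 norm so the unnormalized vector can be recovered
    # as norm * embedding.
    norm = math.sqrt(sum(x * x for x in vector))
    return Embedding(word, [x / norm for x in vector], norm)


def embedding_similarity(
    entries: List[Embedding], query: List[float], limit: int = 1
) -> List[Tuple[str, float]]:
    # Rank stored words by cosine similarity to an arbitrary query vector,
    # rather than to the vector of a known word.
    query_norm = math.sqrt(sum(x * x for x in query))
    unit_query = [x / query_norm for x in query]
    scored = [
        (sum(a * b for a, b in zip(entry.embedding, unit_query)), entry.word)
        for entry in entries
    ]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:limit]]


# Toy data: two 2-dimensional embeddings.
entries = [normalize("a", [3.0, 4.0]), normalize("b", [0.0, 2.0])]

for entry in entries:
    print(entry.word, entry.embedding, entry.norm)

print(embedding_similarity(entries, [0.0, 1.0], limit=1))
print(repr(decode_word(b"caf\xff", lossy=True)))
```

Storing the norm next to the normalized vector is what makes it possible to reconstruct the original vector after normalization, which is presumably why the property is exposed.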