Embeddings Distance Demos

A set of XUnit tests showcasing how embeddings represent both semantic and contextual information about words and phrases.
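
Every demo below reduces to a cosine-distance comparison between embedding vectors. The following is a minimal sketch of that computation in C#, along with a hypothetical GetEmbeddingAsync placeholder standing in for whichever embedding provider the test suite wires up; both names are illustrative assumptions, not part of the original tests.

```csharp
using System;
using System.Threading.Tasks;

public static class EmbeddingTestHelpers
{
    // Cosine distance = 1 - cosine similarity. A value near 0 means the texts
    // are closely related; larger values mean they are less related.
    public static double CosineDistance(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimensionality.");

        double dot = 0, magA = 0, magB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot  += (double)a[i] * b[i];
            magA += (double)a[i] * a[i];
            magB += (double)b[i] * b[i];
        }
        return 1.0 - dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
    }

    // Placeholder: replace with a call to your embedding provider, which
    // should return the embedding vector for the given text.
    public static Task<float[]> GetEmbeddingAsync(string text) =>
        throw new NotImplementedException("Wire up an embedding provider here.");
}
```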

Homonyms

Embeddings encode some of the context of the text, so homonyms (e.g., "ram" the animal and "Ram" the truck) do not generally share the same embedding values; a phrase containing a homonym lands closer in the embedding space to its intended meaning. In this example, "I'm getting new RAM for my PC" has an embedding close to "Memory", while "I'm going to pull my boat with my Ram" has an embedding close to "Truck", and "I'm getting a ram and a ewe" has an embedding close to "Sheep".
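
As a sketch of how one of these demos might look as an XUnit test, using the CosineDistance and GetEmbeddingAsync helpers above (the test name and exact comparison strings are assumptions):

```csharp
using System.Threading.Tasks;
using Xunit;
using static EmbeddingTestHelpers;

public class HomonymTests
{
    [Fact]
    public async Task RamSentences_LandNearTheirIntendedMeanings()
    {
        float[] pc   = await GetEmbeddingAsync("I'm getting new RAM for my PC");
        float[] boat = await GetEmbeddingAsync("I'm going to pull my boat with my Ram");
        float[] ewe  = await GetEmbeddingAsync("I'm getting a ram and a ewe");

        float[] memory = await GetEmbeddingAsync("Memory");
        float[] truck  = await GetEmbeddingAsync("Truck");
        float[] sheep  = await GetEmbeddingAsync("Sheep");

        // Each sentence should be closer to its intended meaning than to either alternative.
        Assert.True(CosineDistance(pc, memory) < CosineDistance(pc, truck));
        Assert.True(CosineDistance(pc, memory) < CosineDistance(pc, sheep));

        Assert.True(CosineDistance(boat, truck) < CosineDistance(boat, memory));
        Assert.True(CosineDistance(boat, truck) < CosineDistance(boat, sheep));

        Assert.True(CosineDistance(ewe, sheep) < CosineDistance(ewe, memory));
        Assert.True(CosineDistance(ewe, sheep) < CosineDistance(ewe, truck));
    }
}
```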

Interestingly, "I'm getting a new ram", which could conceivably refer to any of the three meanings, aligns most closely with "Memory" in the model's embedding space. This exposes a potential bias in the model, likely a result of its training data. Since the model was trained by technical people, predominantly on documents from the Internet that are themselves often technical, it may be more inclined to associate "ram" with computer memory than with a truck or a male sheep. This highlights the importance of diverse and representative training data in machine learning.

Idioms

Embeddings encode the idiomatic nature of certain expressions, so that an idiom has values similar to a literal statement with the same meaning. A literal variant of the same phrase does not exhibit this feature. Thus, "He kicked the bucket" is close to "He died" in the embedding space, but "He kicked a bucket" is not.
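
A minimal sketch of that comparison as an XUnit test, again assuming the helpers above:

```csharp
using System.Threading.Tasks;
using Xunit;
using static EmbeddingTestHelpers;

public class IdiomTests
{
    [Fact]
    public async Task Idiom_IsCloserToItsMeaningThanTheLiteralVariantIs()
    {
        float[] idiom   = await GetEmbeddingAsync("He kicked the bucket");
        float[] literal = await GetEmbeddingAsync("He kicked a bucket");
        float[] meaning = await GetEmbeddingAsync("He died");

        // The idiom should sit nearer to its meaning than the literal phrasing does.
        Assert.True(CosineDistance(idiom, meaning) < CosineDistance(literal, meaning));
    }
}
```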

Sarcasm

Embeddings encode the sarcastic nature of certain expressions, so that they have values similar to a sincere statement with the same meaning. When a statement is made sarcastically, it means the opposite of its literal words. For instance, saying "Great job" in a sarcastic tone actually implies criticism, not praise. In the embedding space, a sarcastically used "Great job" would ideally be closer to expressions of disappointment or criticism than to actual compliments. However, since sarcasm often relies heavily on tone of voice and context, which cannot always be captured in text, models may struggle to encode these expressions correctly. That said, when sarcasm can be detected within the text itself, as with the expression "Well, look who's on time", the model encodes that nuance, resulting in an embedding closer to the literal statement "actually late".
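
Where the sarcasm is detectable from the text alone, the check might look like this (the sincere comparison strings are assumptions):

```csharp
using System.Threading.Tasks;
using Xunit;
using static EmbeddingTestHelpers;

public class SarcasmTests
{
    [Fact]
    public async Task SarcasticGreeting_ReadsAsActuallyLate()
    {
        float[] sarcastic = await GetEmbeddingAsync("Well, look who's on time");
        float[] late      = await GetEmbeddingAsync("He is actually late");
        float[] punctual  = await GetEmbeddingAsync("He is actually on time");

        // The sarcastic remark should land closer to lateness than to punctuality.
        Assert.True(CosineDistance(sarcastic, late) < CosineDistance(sarcastic, punctual));
    }
}
```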

Language

Embeddings encode the meaning of expressions independent of the language used, so an expression has values similar to an equivalent statement in another language, though not identical values, since the language itself is also encoded. Therefore "How old are you?" has an embedding very close to "What is your age?", and only slightly further from the Spanish equivalent, "¿Cuántos años tienes?".
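
A sketch of that ordering as a test, assuming the helpers above:

```csharp
using System.Threading.Tasks;
using Xunit;
using static EmbeddingTestHelpers;

public class LanguageTests
{
    [Fact]
    public async Task Rephrasing_IsCloserThanTranslation()
    {
        float[] english   = await GetEmbeddingAsync("How old are you?");
        float[] rephrased = await GetEmbeddingAsync("What is your age?");
        float[] spanish   = await GetEmbeddingAsync("¿Cuántos años tienes?");

        // The English rephrasing is nearest; the Spanish equivalent is close,
        // but slightly further away because the language itself is also encoded.
        Assert.True(CosineDistance(english, rephrased) < CosineDistance(english, spanish));
    }
}
```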

Antonyms

The largest cosine distances occur between words and phrases that are both semantically and contextually different. Antonyms are not necessarily maximally distant from each other, since they share many common traits and typically appear in similar contexts. As a result, the embeddings for antonyms are often fairly close to each other in vector space.
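
A sketch of that counterintuitive result; the word choices here are illustrative assumptions:

```csharp
using System.Threading.Tasks;
using Xunit;
using static EmbeddingTestHelpers;

public class AntonymTests
{
    [Fact]
    public async Task Antonyms_AreCloserThanUnrelatedText()
    {
        float[] hot       = await GetEmbeddingAsync("hot");
        float[] cold      = await GetEmbeddingAsync("cold");
        float[] unrelated = await GetEmbeddingAsync("quarterly tax filing");

        // "hot" and "cold" are opposites, yet they share a context (temperature),
        // so they sit closer together than either does to an unrelated phrase.
        Assert.True(CosineDistance(hot, cold) < CosineDistance(hot, unrelated));
    }
}
```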
