Similarity metrics #19

YeonwooSung opened this issue May 11, 2021 · 0 comments

  1. Euclidean distance

Euclidean distance (the L2 norm of the difference between two vectors) is the most intuitive of the metrics.

d(u, v) = ‖u − v‖ = √( Σᵢ (uᵢ − vᵢ)² )

However, the Euclidean distance only considers magnitude, not orientation (the direction of the vectors). To overcome this issue, we can adopt either the dot product or cosine similarity.
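To make the formula concrete, here is a minimal sketch of Euclidean distance in plain Python (the vectors are arbitrary illustrative values, not from any real dataset):

```python
import math

def euclidean_distance(u, v):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two illustrative 2-D vectors
u = [3.0, 4.0]
v = [0.0, 0.0]
print(euclidean_distance(u, v))  # 5.0
```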

  2. Dot product

One drawback of Euclidean distance is the lack of orientation considered in the calculation — it is based solely on magnitude. And this is where we can use our other two metrics. The first of those is the dot product.

The dot product considers direction (orientation) and also scales with vector magnitude.

We care about orientation because similar meaning (as we will often find) can be represented by the direction of the vector — not necessarily the magnitude of it.

For example, we may find that a vector's magnitude correlates with the frequency of the word it represents in our dataset. Now, the word hi means the same as hello, but that similarity may not be reflected if our training data contained hi 1,000 times and hello just twice.

So, vectors' orientation is often seen as just as important as (if not more important than) their magnitude.

The dot product is calculated using:

u · v = Σᵢ uᵢvᵢ = ‖u‖ ‖v‖ cos θ

The dot product depends on the angle between the vectors: where the angle is ~0°, the cos θ component of the formula is ~1. At 90° (orthogonal/perpendicular vectors), cos θ is 0, and as the angle approaches 180° (opposing vectors), cos θ approaches −1.

Therefore, the cos θ component increases the result where there is less of an angle between the two vectors. So, a higher dot product correlates with closer alignment in direction.

Clearly, the dot product calculation is straightforward (the simplest of the three) — and this gives us benefits in terms of computation time.

However, there is one drawback: it is not normalized, meaning larger vectors will tend to score higher dot products despite being less similar in direction.

So, in reality, the dot product is used to identify the general orientation of two vectors, because:

  1. Two vectors that point in a similar direction return a positive dot-product.

  2. Two perpendicular vectors return a dot-product of zero.

  3. Vectors that point in opposing directions return a negative dot-product.
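The three sign cases above can be sketched in a few lines of plain Python (the vectors b, c, and d are hypothetical values chosen to be aligned with, perpendicular to, and opposite to a):

```python
def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

a = [1.0, 2.0]
b = [2.0, 4.0]    # same direction as a
c = [-2.0, 1.0]   # perpendicular to a
d = [-1.0, -2.0]  # opposite direction to a

print(dot(a, b))  # 10.0 — positive: similar direction
print(dot(a, c))  # 0.0  — zero: perpendicular
print(dot(a, d))  # -5.0 — negative: opposing
```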

  3. Cosine similarity

Cosine similarity considers vector orientation, independent of vector magnitude.

cos θ = (u · v) / (‖u‖ ‖v‖)

The first thing we should be aware of in this formula is that the numerator is, in fact, the dot product — which considers both magnitude and direction.

In the denominator, we have the strange double vertical bars — these mean ‘the length of’. So, we have the length of u multiplied by the length of v. The length, of course, considers magnitude.

When we take a function that considers both magnitude and direction and divide that by a function that considers just magnitude — those two magnitudes cancel out, leaving us with a function that considers direction independent of magnitude.

We can think of cosine similarity as a normalized dot product! And it clearly works.
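A short sketch in plain Python shows the normalization at work: two vectors pointing the same way but with very different magnitudes (arbitrary illustrative values) get a huge dot product, yet a cosine similarity of ~1:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Length (L2 norm) of a vector — the '‖u‖' in the formula."""
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Dot product divided by the product of the two lengths."""
    return dot(u, v) / (norm(u) * norm(v))

u = [1.0, 1.0]
v = [100.0, 100.0]  # same direction as u, much larger magnitude
print(dot(u, v))                # 200.0 — dot product rewards magnitude
print(cosine_similarity(u, v))  # ≈ 1.0 — cosine ignores magnitude
```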
