Skip to content
This repository has been archived by the owner on Nov 13, 2024. It is now read-only.

Use BOW values projected to dense as stub embeddings for test #80

Merged
merged 10 commits into from
Oct 22, 2023

Conversation

acatav
Copy link
Contributor

@acatav acatav commented Oct 17, 2023

Problem

Current stub encoder is flaky and sometimes puts very different texts very close
Also, it's not putting similar text together, only the same text on itself

Solution

Implement stub dense encoder that project bag of words sparse vector to a dense vector

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Infrastructure change (CI configs, etc)
  • Non-code change (docs, etc)
  • None of the above: (explain here)

Test Plan

Part of the tests framework itself

Copy link
Contributor

@igiloh-pinecone igiloh-pinecone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about the memory usage of the random_matrix.
Isn't there some simpler solution that doesn't require pre-allocating such a huge matrix?

self.dimension = dimension
rng = np.random.default_rng(seed)
self.random_matrix = rng.standard_normal((self.input_dim, self.dimension))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait - isn't that a HUGE shape??
It's a dense matrix, isn't it? Won't it take a lot of memory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the need in actually holding this matrix, for the default dimensions it's few MB but for output dim 1536 it can become larger than we would like. Anyway, we not even holding a matrix any more

Copy link
Contributor

@igiloh-pinecone igiloh-pinecone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG(engouh)TM

@acatav acatav merged commit f34f5d2 into dev Oct 22, 2023
9 checks passed
@acatav acatav deleted the stub-bow-embedding branch October 22, 2023 10:06
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants