Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query by metadata #911

Merged
merged 11 commits into from
Jan 7, 2025
Merged

Query by metadata #911

merged 11 commits into from
Jan 7, 2025

Conversation

ahuang11
Copy link
Contributor

@ahuang11 ahuang11 commented Jan 2, 2025

Supersedes #883 based on:

I guess the obvious reason is that the metadata is currently only used for filtering and the embeddings are only computed for the contents? In which case I guess my question is whether we should automatically include the embeddings in the metadata somehow.

I suppose we could have another database column containing a joined text of everything, and have a toggle to query_metadata = True/False

Please augment your response with the following context if relevant:
- Receipt Invoice number 1234 Receipt number Date paid 1234 December 15, 2024 Payment method American Express - 1234 OpenAI, LLC 548 Market Street PMB 97273 San Francisco, California 1234 United States ar@openai.com Bill to Me Address United State $22.07 paid on 2024 ... (Relevance: 0.2 - Metadata: {'text': 'Filename: Receipt-2813-2096 '}

@ahuang11 ahuang11 requested a review from philippjfr January 2, 2025 18:17
@ahuang11 ahuang11 mentioned this pull request Jan 2, 2025
lumen/ai/tools.py Outdated Show resolved Hide resolved
Copy link

codecov bot commented Jan 2, 2025

Codecov Report

Attention: Patch coverage is 77.63158% with 17 lines in your changes missing coverage. Please review.

Project coverage is 58.46%. Comparing base (9efa066) to head (512c229).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
lumen/ai/tools.py 0.00% 7 Missing ⚠️
lumen/ai/controls.py 0.00% 5 Missing ⚠️
lumen/ai/vector_store.py 91.37% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #911      +/-   ##
==========================================
+ Coverage   58.44%   58.46%   +0.01%     
==========================================
  Files         109      109              
  Lines       13868    13884      +16     
==========================================
+ Hits         8105     8117      +12     
- Misses       5763     5767       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@philippjfr
Copy link
Member

Sorry I wasn't clear before. I wasn't saying I prefer this approach, I just wanted to weigh the pros and cons of both approaches.

@philippjfr
Copy link
Member

Well that isn't quite true, I definitely prefer treating the filename as metadata rather than having a distinct row for it, but as we are discovering it does have drawbacks.

@ahuang11
Copy link
Contributor Author

ahuang11 commented Jan 2, 2025

I personally prefer this approach.

Perhaps it could be more targeted with query_with_metadata = ["<key>", "<key2>"] so like query_with_metadata = ["filename"]

@ahuang11
Copy link
Contributor Author

ahuang11 commented Jan 2, 2025

Alternatively, could go back to the old approach, but store metadata as a separate table, then joined on the main table.

@ahuang11
Copy link
Contributor Author

ahuang11 commented Jan 2, 2025

Looking at ChromaDB, they have explicit filters for metadata filtering
https://docs.trychroma.com/docs/querying-collections/metadata-filtering

@philippjfr
Copy link
Member

My main question is whether it's really necessary to store a text and metadata column and compute the additional embeddings. Does always including the metadata in the embedding result in any appreciable performance degredation?

@ahuang11
Copy link
Contributor Author

ahuang11 commented Jan 3, 2025

Does always including the metadata in the embedding result in any appreciable performance degredation?

Do you mean rather than text | metadata | text_and_metadata, simply keep text_and_metadata | metadata, and then do postprocessing on text_and_metadata on retrieval, e.g. split by some delimiter like metadata tags ||||>

In this PR, we only look up by text_and_metadata, but return only text

f"({key}: {self._format_metadata_value(value)})"
for key, value in metadata.items()
]
text_and_metadata = " ".join(metadata_items)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's just metadata, not text_and_metadata?

lumen/ai/vector_store.py Outdated Show resolved Hide resolved
@ahuang11 ahuang11 mentioned this pull request Jan 3, 2025
@ahuang11
Copy link
Contributor Author

ahuang11 commented Jan 6, 2025

Seems to work decently now! I also bumped up the chunk size for better context.

image

Edit: there seems to be duplication; let me see if I can fix
image

@ahuang11 ahuang11 requested a review from philippjfr January 6, 2025 19:07
@ahuang11
Copy link
Contributor Author

ahuang11 commented Jan 6, 2025

Okay this is now ready.

I also added functionality where if user uploads the same file (perhaps with modified contents), it'll overwrite the existing document.

@philippjfr
Copy link
Member

Looks good!

@philippjfr philippjfr merged commit fa0ede5 into main Jan 7, 2025
12 checks passed
@philippjfr philippjfr deleted the query_by_metadata branch January 7, 2025 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants