Query by metadata #911

ahuang11 · 2025-01-02T18:17:47Z

Supersedes #883 based on:

I guess the obvious reason is that the metadata is currently only used for filtering and the embeddings are only computed for the contents? In which case I guess my question is whether we should automatically include the embeddings in the metadata somehow.

I suppose we could have another database column containing a joined text of everything, and have a toggle to query_metadata = True/False

Please augment your response with the following context if relevant:
- Receipt Invoice number 1234 Receipt number Date paid 1234 December 15, 2024 Payment method American Express - 1234 OpenAI, LLC 548 Market Street PMB 97273 San Francisco, California 1234 United States ar@openai.com Bill to Me Address United State $22.07 paid on 2024 ... (Relevance: 0.2 - Metadata: {'text': 'Filename: Receipt-2813-2096 '}
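The "joined text column plus toggle" idea above can be sketched as follows. This is only an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and `make_row`, `query`, and the `joined` column name are hypothetical, not taken from the PR.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def make_row(text: str, metadata: dict) -> dict:
    # Hypothetical row layout: the original text plus one extra "joined" column
    # that concatenates the text with every metadata key/value pair.
    joined = " ".join([text, *(f"{k}: {v}" for k, v in metadata.items())])
    return {"text": text, "metadata": metadata, "joined": joined}


def query(rows: list[dict], question: str, query_metadata: bool = False) -> list[tuple[float, str]]:
    # The toggle picks which column is compared against the question;
    # the returned value is always the plain text.
    column = "joined" if query_metadata else "text"
    q = embed(question)
    scored = [(cosine(q, embed(row[column])), row["text"]) for row in rows]
    return sorted(scored, reverse=True)


rows = [
    make_row("Invoice paid by credit card", {"filename": "receipt.pdf"}),
    make_row("Quarterly sales grew steadily", {"filename": "report.pdf"}),
]
```

With `query_metadata=False` a question that only mentions a filename matches nothing in the text column; with `query_metadata=True` the same question can match through the joined column.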

lumen/ai/tools.py

codecov · 2025-01-02T18:21:57Z

Codecov Report

Attention: Patch coverage is 77.63158% with 17 lines in your changes missing coverage. Please review.

Project coverage is 58.46%. Comparing base (9efa066) to head (512c229).
Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
lumen/ai/tools.py	0.00%	7 Missing ⚠️
lumen/ai/controls.py	0.00%	5 Missing ⚠️
lumen/ai/vector_store.py	91.37%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #911      +/-   ##
==========================================
+ Coverage   58.44%   58.46%   +0.01%     
==========================================
  Files         109      109              
  Lines       13868    13884      +16     
==========================================
+ Hits         8105     8117      +12     
- Misses       5763     5767       +4

philippjfr · 2025-01-02T20:02:36Z

Sorry I wasn't clear before. I wasn't saying I prefer this approach, I just wanted to weigh the pros and cons of both approaches.

philippjfr · 2025-01-02T20:06:26Z

Well that isn't quite true, I definitely prefer treating the filename as metadata rather than having a distinct row for it, but as we are discovering it does have drawbacks.

ahuang11 · 2025-01-02T20:08:11Z

I personally prefer this approach.

Perhaps it could be more targeted with query_with_metadata = ["<key>", "<key2>"] so like query_with_metadata = ["filename"]
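A minimal sketch of the targeted variant, assuming the selected metadata keys are simply prepended to the text before embedding. The function name `build_embedding_text` is hypothetical; only the `query_with_metadata` parameter name comes from the comment above.

```python
def build_embedding_text(text, metadata, query_with_metadata=None):
    """Return the string that would be embedded for one chunk.

    Only the metadata keys listed in ``query_with_metadata`` are included;
    keys that are missing from ``metadata`` are skipped silently.
    """
    keys = query_with_metadata or []
    parts = [f"({key}: {metadata[key]})" for key in keys if key in metadata]
    return " ".join([*parts, text])


metadata = {"filename": "Receipt.pdf", "page": 2}
with_filename = build_embedding_text("Invoice number 1234", metadata, ["filename"])
text_only = build_embedding_text("Invoice number 1234", metadata)
```

Here `with_filename` carries the filename into the embedded string while `page` stays filter-only, and `text_only` is unchanged.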

ahuang11 · 2025-01-02T20:12:17Z

Alternatively, could go back to the old approach, but store metadata as a separate table, then joined on the main table.
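The separate-table alternative could look roughly like this. SQLite is used only because it ships with Python; the table and column names are assumptions for illustration, and the storage backend actually used by the PR is not shown in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
    -- One row per metadata key/value pair, linked back to its document.
    CREATE TABLE document_metadata (
        doc_id INTEGER NOT NULL REFERENCES documents(id),
        key TEXT NOT NULL,
        value TEXT NOT NULL
    );
    """
)
conn.execute("INSERT INTO documents (id, text) VALUES (1, 'Invoice number 1234')")
conn.execute("INSERT INTO documents (id, text) VALUES (2, 'Quarterly sales summary')")
conn.executemany(
    "INSERT INTO document_metadata (doc_id, key, value) VALUES (?, ?, ?)",
    [(1, "filename", "receipt.pdf"), (2, "filename", "report.pdf")],
)


def find_by_metadata(conn, key, value):
    """Return the text of every document whose metadata has key == value."""
    rows = conn.execute(
        """
        SELECT d.text
        FROM documents AS d
        JOIN document_metadata AS m ON m.doc_id = d.id
        WHERE m.key = ? AND m.value = ?
        ORDER BY d.id
        """,
        (key, value),
    ).fetchall()
    return [text for (text,) in rows]
```

Calling `find_by_metadata(conn, "filename", "receipt.pdf")` joins the two tables and returns only the matching document text.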

ahuang11 · 2025-01-02T20:15:18Z

Looking at ChromaDB, they have explicit filters for metadata filtering
https://docs.trychroma.com/docs/querying-collections/metadata-filtering

philippjfr · 2025-01-03T13:17:55Z

My main question is whether it's really necessary to store a text and metadata column and compute the additional embeddings. Does always including the metadata in the embedding result in any appreciable performance degradation?

ahuang11 · 2025-01-03T16:57:56Z

Does always including the metadata in the embedding result in any appreciable performance degradation?

Do you mean that, rather than storing text | metadata | text_and_metadata, we simply keep text_and_metadata | metadata and then postprocess text_and_metadata on retrieval, e.g. by splitting on some delimiter like metadata tags ||||>?

In this PR, we only look up by text_and_metadata, but return only text
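The "look up by text_and_metadata, return only text" postprocessing discussed above might be sketched like this, assuming a single combined column and a delimiter that never occurs in the document text. `DELIMITER`, `combine`, and `extract_text` are hypothetical names, not the PR's implementation.

```python
# Hypothetical delimiter; it must not appear in any document text.
DELIMITER = "\n<<<METADATA>>>\n"


def combine(text: str, metadata: dict) -> str:
    """Build the combined string that would be embedded and stored."""
    metadata_text = " ".join(f"({key}: {value})" for key, value in metadata.items())
    return f"{text}{DELIMITER}{metadata_text}"


def extract_text(text_and_metadata: str) -> str:
    """Recover only the original text from a combined string on retrieval."""
    text, _, _ = text_and_metadata.partition(DELIMITER)
    return text


stored = combine("Invoice number 1234", {"filename": "Receipt.pdf"})
returned = extract_text(stored)
```

The trade-off is that only one embedding is computed per chunk, at the cost of relying on the delimiter staying unique.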

philippjfr · 2025-01-03T21:34:09Z

lumen/ai/vector_store.py

+            f"({key}: {self._format_metadata_value(value)})"
+            for key, value in metadata.items()
+        ]
+        text_and_metadata = " ".join(metadata_items)


So it's just metadata, not text_and_metadata?

lumen/ai/vector_store.py

ahuang11 · 2025-01-06T18:02:09Z

Seems to work decently now! I also bumped up the chunk size for better context.

Edit: there seems to be duplication; let me see if I can fix

ahuang11 · 2025-01-06T19:08:16Z

Okay this is now ready.

I also added functionality where if user uploads the same file (perhaps with modified contents), it'll overwrite the existing document.
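The overwrite-on-reupload behaviour can be illustrated with a small in-memory sketch, assuming documents are keyed by their filename metadata. `InMemoryDocumentStore` and its methods are hypothetical stand-ins and omit embeddings entirely.

```python
class InMemoryDocumentStore:
    """Hypothetical stand-in for a vector store; embeddings are omitted."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def add_file(self, filename: str, chunks: list[str]) -> None:
        # Drop every row left over from a previous upload of the same filename ...
        self.rows = [row for row in self.rows if row["metadata"]["filename"] != filename]
        # ... then insert the new chunks, so a re-upload overwrites instead of duplicating.
        self.rows.extend({"text": chunk, "metadata": {"filename": filename}} for chunk in chunks)

    def texts(self, filename: str) -> list[str]:
        """Return the stored chunks for one file, in insertion order."""
        return [row["text"] for row in self.rows if row["metadata"]["filename"] == filename]


store = InMemoryDocumentStore()
store.add_file("receipt.pdf", ["Invoice number 1234"])
store.add_file("report.pdf", ["Quarterly summary"])
# Same filename, modified contents: the earlier chunk is replaced.
store.add_file("receipt.pdf", ["Invoice number 5678", "Paid in full"])
```

Other files in the store are left untouched by the re-upload.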

philippjfr · 2025-01-07T13:52:08Z

Looks good!

query by filename

8e56663

ahuang11 requested a review from philippjfr January 2, 2025 18:17

ahuang11 mentioned this pull request Jan 2, 2025

Query by filename #883

Closed

rename

0f9d5f1

ahuang11 commented Jan 2, 2025

lumen/ai/tools.py (outdated, resolved)

Update lumen/ai/tools.py

252f144

philippjfr reviewed Jan 3, 2025

lumen/ai/vector_store.py (outdated, resolved)

ahuang11 mentioned this pull request Jan 3, 2025

LLM decides metadata #918

Closed

ahuang11 and others added 2 commits January 6, 2025 08:42

Merge branch 'main' into query_by_metadata

112e654

use both text and metadata embedding * simplify

392aff5

ahuang11 added 6 commits January 6, 2025 10:39

Fix filters

a054f17

fix duplicates and overwrite if duplicate

9074684

apply fixes

700b8e8

deduplicate doc sources

839ce70

match by filename

a17cf89

fix test name

512c229

ahuang11 requested a review from philippjfr January 6, 2025 19:07

philippjfr approved these changes Jan 7, 2025

philippjfr merged commit fa0ede5 into main Jan 7, 2025
12 checks passed

philippjfr deleted the query_by_metadata branch January 7, 2025 13:52
