Skip to content
This repository has been archived by the owner on Nov 13, 2024. It is now read-only.

Bug fix - source field was not copied to chunks #49

Merged
merged 7 commits into from
Oct 1, 2023

Conversation

igiloh-pinecone
Copy link
Contributor

Problem

There were actually multiple problems:

  1. The Chunker didn't copy the source field from the Document to all chunks
  2. During upsert(), the source field was not pushed into metadata
  3. During query(), the source field was not popped from metadata back into a dedicated field
  4. upsert_dataframe() didn't support the source column at all

Solution

Fixed all above.
In addition - I added an explicit check whether docs contain reserved metadata fields (such as 'text'), and raise and error (+ relevant tests).

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Test Plan

Added tests for all above + beefed up KB testing

This prevented sources from appearing in the final response
The source field was actually not pushed at all into `metadata` at upsert() time, and wasn't copied to the result `KbDocWithScore` at query() time!
This test was missing on multiple levels - chunker tests, KB E2E tests
resin/knoweldge_base/knowledge_base.py Show resolved Hide resolved
tests/unit/chunker/base_test_chunker.py Show resolved Hide resolved
resin/knoweldge_base/knowledge_base.py Show resolved Hide resolved
resin/knoweldge_base/knowledge_base.py Show resolved Hide resolved
resin/knoweldge_base/knowledge_base.py Outdated Show resolved Hide resolved
@acatav
Copy link
Contributor

acatav commented Sep 26, 2023

My main question is whether we would like to make the source not optional for first release. I think practically there are two options:

  1. Allow source to be optional, and along with that make the LLM not output anything about that at all
  2. Not allow source as optional at all, and maybe add it later as a feature (for my understanding this direction won't break API) - another advantage for this option is that our system will be less complex and also less prune to bugs like we fix in this PR

Currently what we have is allowing source as optional, but we make the LLM to output empty strings. The minimal fix for that can be to simply say to the LLM in the prompt to ignore empty sources.

@igiloh-pinecone
Copy link
Contributor Author

My main question is whether we would like to make the source not optional for first release. I think practically there are two

As @miararoy pointed out, the empty string is a feature, not a bug.
The user will see it in the returned context, and realize he can set the source field

@acatav
Copy link
Contributor

acatav commented Sep 26, 2023

For demos this is fine, but the question what happens when users actually want to use it in production. If they don't have any source that make sense to present to the user, we shouldn't force the model to present a source. So what I'm trying to say that in some sense right now the source is not really optional

@igiloh-pinecone igiloh-pinecone merged commit d359098 into dev Oct 1, 2023
4 checks passed
@igiloh-pinecone igiloh-pinecone deleted the bugfix/no_source branch October 1, 2023 12:30
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants