
Enhance document querying: Enable summaries and overviews for uploaded files #2214

Open
jeannotdamoiseaux opened this issue Dec 5, 2024 · 6 comments

@jeannotdamoiseaux (Contributor)

The current method of file uploading is confusing for our users. Documents are chunked, vectorized, and placed in a knowledge base, so users can only ask questions about specific passages of the document, in line with the known limitations of the "normal" RAG approach.

Users upload a document but cannot request a summary or overview, which confuses them. They wonder why they can't get a high-level view of the document they just uploaded.

To address this issue, we propose two potential solution directions:

  1. Implement advanced RAG techniques that facilitate "high-level" questions, such as GraphRAG (https://github.com/microsoft/graphrag) or a variant like LazyGraphRAG.
  2. Allow document uploading within a specific chat, where the document's content is loaded into the context window, similar to the current functionality in ChatGPT.
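
As a minimal illustration of direction 2, the upload would only need a text-extraction step, after which the document text is placed directly in the model's context for that chat. This is just a sketch: extract_text is a hypothetical helper and the model name is only an example.

    # Sketch of direction 2: put the uploaded document's text directly into the chat context.
    # `extract_text` and the model name are illustrative assumptions, not existing code.
    from openai import OpenAI

    client = OpenAI()

    def chat_with_document(document_text: str, question: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # any chat model with a large context window
            messages=[
                {"role": "system", "content": "Answer using the uploaded document below.\n\n" + document_text},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content

    # e.g. chat_with_document(extract_text(uploaded_file), "Give a high-level summary of this document.")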

This issue is high on our priority list as we scale from the pilot phase to full organizational rollout.

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

  1. Upload a document
  2. Attempt to ask for a summary or overview of the entire document
  3. Observe that only specific chunks of text can be queried

Any log messages given by the failure

N/A

Expected/desired behavior

Users should be able to ask for summaries or overviews of entire uploaded documents, in addition to querying specific chunks of text.

OS and Version?

N/A (Web-based application)

azd version?

N/A

Versions

N/A

@pamelafox (Collaborator)

I was asked about both #1 and #2 in a live stream yesterday. #1 is being discussed in a separate issue specifically about GraphRAG. #2 is an interesting one in terms of design. For example:

  • Would we just pass it back and forth between frontend and backend? Or store it somewhere (Blob storage) and delete it after some trigger? (The deletion part is hard to do at the right time)
  • Would we index it at all, or just send the entire document to the LLM each time? What if they've uploaded a very long document that exceeds the context length? (And just costs us a lot) Do we need in-memory search, like a BM25 that runs on the server?

I'm curious what you think about those questions. I imagine we can figure out the feature, as it is indeed a popular request, but I think it'll be a fairly different code path than our current flow.

@jeannotdamoiseaux (Contributor, Author)

jeannotdamoiseaux commented Dec 5, 2024

Thank you for the insights. My inclination is to send the entire document to the LLM each time the user interacts with it. Here’s my reasoning:

  1. User Expectations: Users expect full access to their uploaded document for summaries or insights. Sending the entire document avoids the confusion caused by chunking.
  2. Token Limits: With context windows like 128k tokens, documents must be very large to exceed the limit. In such cases, users generally understand the need to split files, which is still more intuitive than the restricted access caused by chunking.
  3. Cost Justification: While costs may be higher, they’re justified for us by the importance of this feature and the improved user experience.
  4. Precedent: ChatGPT and similar tools seem to handle documents this way (see example).

[Screenshot: example of ChatGPT answering questions about an uploaded document]
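
On the token-limit point (2 above), the app could count tokens up front and only take the full-document path when the file actually fits. A minimal sketch using tiktoken; the 128k budget and the encoding name are assumptions, not fixed requirements:

    # Rough guard: only send the whole document if it fits the context window.
    import tiktoken

    MAX_CONTEXT_TOKENS = 128_000  # assumed model context window; adjust per deployment

    def fits_in_context(document_text: str, reserved_for_chat: int = 8_000) -> bool:
        encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era OpenAI models
        doc_tokens = len(encoding.encode(document_text))
        return doc_tokens + reserved_for_chat <= MAX_CONTEXT_TOKENS

    # If this returns False, fall back to chunked RAG or ask the user to split the file.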

Regarding GraphRAG, the primary drawback lies in the high upfront costs of constructing the graph, a challenge that LazyGraphRAG, as referenced in #1928, may help mitigate.

@pamelafox (Collaborator)

pamelafox commented Dec 5, 2024

That makes sense. That's easier to implement, since it means we only need to run part of the ingestion pipeline (extraction).

Here's how ChatGPT seems to do it:

  • User attaches a file in the UI by clicking the paperclip in the text field.

  • This immediately kicks off a POST request to an endpoint specifically for uploading the file. That streams back the upload status:
    https://chatgpt.com/backend-api/files/process_upload_stream

    {"file_id":"file-U54F9rZAZdnGBdsVoJPFAb","event":"file.processing.started","message":"Start processing file: file-U54F9rZAZdnGBdsVoJPFAb","progress":0.0,"extra":null}
    {"file_id":"file-U54F9rZAZdnGBdsVoJPFAb","event":"file.processing.file_ready","message":"File file-U54F9rZAZdnGBdsVoJPFAb is ready to download","progress":20.0,"extra":null}
    {"file_id":"file-U54F9rZAZdnGBdsVoJPFAb","event":"file.indexing.in_progress","message":"","progress":null,"extra":{"retrieval_tenant_id":"user-Gtf1UhcL6iKu2oLuRoCNLu2q","retrieval_lro_id":"71ea530a3d0646e79e166d632956847f+0"}}
    {"file_id":"file-U54F9rZAZdnGBdsVoJPFAb","event":"file.indexing.in_progress","message":"","progress":40.0,"extra":{"total_tokens":220,"retrieval_tenant_id":"user-Gtf1UhcL6iKu2oLuRoCNLu2q","retrieval_lro_id":"71ea530a3d0646e79e166d632956847f+0"}}
    {"file_id":"file-U54F9rZAZdnGBdsVoJPFAb","event":"file.indexing.in_progress","message":"","progress":60.0,"extra":{"total_tokens":220,"retrieval_tenant_id":"user-Gtf1UhcL6iKu2oLuRoCNLu2q","retrieval_lro_id":"71ea530a3d0646e79e166d632956847f+0"}}
    {"file_id":"file-U54F9rZAZdnGBdsVoJPFAb","event":"file.indexing.in_progress","message":"","progress":80.0,"extra":{"retrieval_tenant_id":"user-Gtf1UhcL6iKu2oLuRoCNLu2q","retrieval_lro_id":"71ea530a3d0646e79e166d632956847f+0"}}
    {"file_id":"file-U54F9rZAZdnGBdsVoJPFAb","event":"file.indexing.completed","message":"","progress":null,"extra":{"retrieval_tenant_id":"user-Gtf1UhcL6iKu2oLuRoCNLu2q","retrieval_lro_id":"71ea530a3d0646e79e166d632956847f+0"}}
    {"file_id":"file-U54F9rZAZdnGBdsVoJPFAb","event":"file.indexing.done","message":"[DONE]","progress":null,"extra":null}
    {"file_id":"file-U54F9rZAZdnGBdsVoJPFAb","event":"file.processing.completed","message":"Succeeded processing file file-U54F9rZAZdnGBdsVoJPFAb","progress":100.0,"extra":null}
    
  • When the user asks a question, it includes the ID of the uploaded file:

    {"action":"next","messages":[{"id":"aaa2287c-8cc8-413e-8126-d7fb0e76a2ae","author":{"role":"user"},"content":{"content_type":"text","parts":["whats the total?"]},"metadata":{"attachments":[{"id":"file-7TYu7F6Nd5JPgb1es68aef","size":184852,"name":"receipt_retracted.pdf","mime_type":"application/pdf"}],"serialization_metadata":{"custom_symbol_offsets":[]}},"create_time":1733431367.645}],"conversation_id":"67520ff7-09b0-8012-99ad-83b889d8e225","parent_message_id":"3a165038-ee48-4bb9-91ca-ae4113ca455d","model":"auto","timezone_offset_min":480,"timezone":"America/Los_Angeles","suggestions":[],"history_and_training_disabled":false,"conversation_mode":{"kind":"primary_assistant"},"force_paragen":false,"force_paragen_model_slug":"","force_rate_limit":false,"reset_rate_limits":false,"websocket_request_id":"c9ea1c3e-3125-4803-b5e1-06aed0168d0c","system_hints":[],"supported_encodings":["v1"],"conversation_origin":null,"client_contextual_info":{"is_dark_mode":false,"time_since_loaded":228,"page_height":415,"page_width":1912,"pixel_ratio":1,"screen_height":1200,"screen_width":1920},"paragen_stream_type_override":null,"paragen_cot_summary_display_override":"allow","supports_buffering":true}
    

    So notably, ChatGPT doesn't seem to actually send the full data; it looks like they're storing it in a file store somewhere.

  • The user can also ask follow-up questions, and it seems to know about file IDs from earlier, despite them not being in the JSON itself. I assume it's looking up previous messages and including them, to reduce the size of messages sent to the server:

    {"action":"next","messages":[{"id":"aaa24356-b80a-4183-9da1-786abfd31441","author":{"role":"user"},"content":{"content_type":"text","parts":["was there shipping?"]},"metadata":{"serialization_metadata":{"custom_symbol_offsets":[]}},"create_time":1733431468.498}],"conversation_id":"67520ff7-09b0-8012-99ad-83b889d8e225","parent_message_id":"1b94d35e-1b12-4e4f-92cd-4bd05415f7ac","model":"auto","timezone_offset_min":480,"timezone":"America/Los_Angeles","suggestions":[],"history_and_training_disabled":false,"conversation_mode":{"kind":"primary_assistant"},"force_paragen":false,"force_paragen_model_slug":"","force_rate_limit":false,"reset_rate_limits":false,"websocket_request_id":"bcb0d3eb-6ce6-46fc-abe5-029ec89959e0","system_hints":[],"supported_encodings":["v1"],"conversation_origin":null,"client_contextual_info":{"is_dark_mode":false,"time_since_loaded":329,"page_height":415,"page_width":1912,"pixel_ratio":1,"screen_height":1200,"screen_width":1920},"paragen_stream_type_override":null,"paragen_cot_summary_display_override":"allow","supports_buffering":true}
    

I think we could potentially send the full data over the wire, for simplicity, but that will incur higher costs, and we'd need to decide whether to use base64 data URIs inside JSON or a multipart request with both JSON and an attachment.
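
For comparison, the two transport options would look roughly like this on the backend. This is only a sketch using the standard library; the field names are made up:

    # Option A: base64 data URI embedded in the JSON chat request.
    # The "attachment"/"data" field names are illustrative, not an existing API.
    import base64, json

    def decode_attachment_from_json(body: bytes) -> bytes:
        payload = json.loads(body)
        data_uri = payload["attachment"]["data"]   # e.g. "data:application/pdf;base64,JVBERi0..."
        b64_part = data_uri.split(",", 1)[1]       # strip the "data:...;base64," prefix
        return base64.b64decode(b64_part)          # raw file bytes (base64 adds ~33% on the wire)

    # Option B: multipart/form-data with a JSON part plus a binary file part,
    # which avoids the base64 overhead; e.g. in Quart:
    #   form = await request.form    # the JSON/chat fields
    #   files = await request.files  # the raw attachment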

@pamelafox (Collaborator)

Addendum: Another option is to store the document text inside localStorage, and have the client fetch it from there for follow-up questions. That avoids the need for a cloud DB. We could expire it after a while (I have a package called lscache that does time-based expiry for localStorage). That still has the drawback of increasing the size of data sent over the wire, however.

@jeannotdamoiseaux (Contributor, Author)

@pamelafox - I’m really impressed with how you broke down ChatGPT’s file upload process—it clarified a lot, especially since I initially thought the full file was loaded into the context window. That said, I’m curious why we’d send file content as Base64 data instead of using our prepdocs functions, which could make the content cleaner and more structured for processing. Wouldn’t this approach also help optimize costs and user experience while still keeping things manageable?

@pamelafox (Collaborator)

I'd still want to use prepdocs for file understanding (the parsing step), but not for indexing, since we wouldn't be doing that step.

For example, if I was copying the ChatGPT approach entirely:

  • /upload_doc_for_chat: Accepts a binary file, sends it through the appropriate file processors in prepdocs, and stores the full uploaded file in a data store (e.g. Blob). Returns the ID of the stored file, which gets appended to subsequent chat requests in that session.
  • /chat: When it sees a file_id, it pulls that text back out of the data store and uses it for the sources. Or it could concatenate it to the sources, if the expectation is to combine with RAG.
  • When the user clears the chat, the file ID is not sent any longer.
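
A rough sketch of that flow, assuming a Quart-style app and an async blob container client; extract_text_with_prepdocs and blob_container are placeholders for the real prepdocs file processors and storage setup, not existing names:

    # Sketch of the ChatGPT-style flow above. extract_text_with_prepdocs and
    # blob_container are placeholders, not existing functions in this repo.
    import uuid
    from quart import Quart, request, jsonify

    app = Quart(__name__)

    @app.post("/upload_doc_for_chat")
    async def upload_doc_for_chat():
        files = await request.files
        text = extract_text_with_prepdocs(files["file"])  # parse via the prepdocs file processors
        file_id = str(uuid.uuid4())
        await blob_container.upload_blob(file_id, text)   # store the extracted text, not an index
        return jsonify({"file_id": file_id})

    @app.post("/chat")
    async def chat():
        body = await request.get_json()
        sources = ""
        if file_id := body.get("file_id"):                # pull the stored text back for the prompt
            downloader = await blob_container.download_blob(file_id)
            sources = (await downloader.readall()).decode("utf-8")
        # ...combine `sources` with (or substitute for) the usual RAG sources and call the model...
        return jsonify({"answer": "..."})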

If I was going to use local storage:

  • /upload_doc_for_chat: Accepts a binary file, sends it through the appropriate file processors in prepdocs, and returns the full extracted text string. That is stored in localStorage with an ID, and the UI remembers the current file ID.
  • When a user makes a new chat request, it fetches the file from localStorage and sends it in the JSON to /chat.
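
The server side of this localStorage variant would be even simpler, since nothing is persisted in the cloud; again a sketch with the same placeholder names as above:

    # localStorage variant: extract and return the text, leaving storage to the browser.
    # Reuses the Quart app from the previous sketch; extract_text_with_prepdocs is a placeholder.
    @app.post("/upload_doc_for_chat")
    async def upload_doc_for_chat():
        files = await request.files
        text = extract_text_with_prepdocs(files["file"])
        # The client keeps this string in localStorage (e.g. via lscache) and sends it
        # back in the JSON body of later /chat requests.
        return jsonify({"file_text": text})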
