[Bug] Seems markdown table content inside .md file is not available in knowledge base and can not be known by AI #4200

Open
xqliu opened this issue Sep 29, 2024 · 8 comments
Labels
🐛 Bug Something isn't working | 缺陷 knowledge base knowledge base / RAG / file chunk

Comments

xqliu commented Sep 29, 2024

📦 Environment

Vercel

📌 Version

Latest

💻 Operating System

macOS

🌐 Browser

Chrome

🐛 Bug Description

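As the screenshot shows, the relevant content has already been chunked and the answer is in a markdown table, but when asked, the AI says it has no information on the topic.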

image

📷 Recurrence Steps

  1. Create a knowledge base and upload a markdown file that contains a table
  2. Ask about something covered in the table; the reply is wrong

🚦 Expected Behavior

Content inside markdown tables is part of the knowledge base, so it should be indexed as well and should be used to answer questions about it.

📝 Additional Information

No response

@xqliu xqliu added the 🐛 Bug Something isn't working | 缺陷 label Sep 29, 2024
@lobehubbot
Member

👀 @xqliu

Thank you for raising an issue. We will look into the matter and get back to you as soon as possible.
Please make sure you have given us as much context as possible.

dosubot bot commented Sep 29, 2024

The issue you're encountering where markdown table content inside .md files is not available in the knowledge base and cannot be known by the AI might be due to a limitation or bug in the current implementation.

There is existing functionality for extracting and indexing table content from markdown files. The system can parse and extract table content, including its representation in HTML, as demonstrated in the provided JSON content [1][2]. However, it seems that this functionality might not be working as expected in your case.

To address this issue, you might want to check the following:

  1. Ensure Proper Parsing: Verify that the markdown file is being parsed correctly and that the table content is being extracted and indexed. You can check the logs or the indexed data to confirm this.

  2. Check for Known Issues: There is a known issue documented regarding the handling of markdown files in the knowledge base, which might be related to your problem [3].

  3. Review the Codebase: The part of the codebase responsible for parsing and indexing uploaded files involves components like PostgreSQL, PGVector, S3-compatible Object Storage, OpenAI Embedding, and optionally Unstructured.io. Specifically, the src/store/file/slices/chat/action.ts file contains the logic for handling file uploads and processing [4][5].

If the issue persists, it might be necessary to further investigate the specific implementation details or consider reporting this as a bug for further assistance from the development team.
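
One way to carry out the first check is to compare the table rows in the uploaded file against the chunk texts shown in the knowledge base. The sketch below is a hypothetical diagnostic, not part of LobeChat: the names `table_cells` and `rows_missing_from_chunks` are made up for illustration, it only handles simple pipe tables, and it assumes the chunk texts can be copied out as plain strings.

```python
import re


def table_cells(markdown: str) -> list[list[str]]:
    """Return the cell texts of every pipe-table row, skipping separator rows."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        # Only consider lines shaped like "| a | b |".
        if len(line) <= 2 or not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip the header separator row, e.g. "| --- | :---: |".
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue
        rows.append(cells)
    return rows


def rows_missing_from_chunks(markdown: str, chunks: list[str]) -> list[list[str]]:
    """List table rows whose non-empty cells do not all appear in any single chunk."""
    missing = []
    for cells in table_cells(markdown):
        found = any(
            all(cell in chunk for cell in cells if cell) for chunk in chunks
        )
        if not found:
            missing.append(cells)
    return missing
```

If this reports rows as missing, the table was probably dropped or split during chunking. If it reports nothing, the table was chunked and the problem is more likely on the retrieval side.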

@arvinxx
Contributor

arvinxx commented Sep 29, 2024

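Open the reference source and look at the retrieved chunks. My guess is that the chunks in the reference source do not include this table?

This is often where RAG systems fall down: if the question fails to recall the chunk that holds the relevant content, the model cannot answer it. That is why full-text search requests like #4201 and #4005 are so necessary.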
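
To illustrate why full-text search helps here, the following is a minimal sketch of keyword-based chunk lookup that could run alongside vector retrieval. It is an assumption for illustration only, not how LobeChat or the linked feature requests implement search; `tokenize` and `keyword_search` are hypothetical names, and the `\w+` tokenizer is too crude for Chinese text, which would need a proper word segmenter.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"\w+", text.lower())


def keyword_search(
    query: str, chunks: list[str], top_k: int = 3
) -> list[tuple[int, int]]:
    """Rank chunks by how many occurrences of query tokens they contain.

    Returns (chunk_index, score) pairs, best first, omitting zero-score chunks.
    """
    query_tokens = set(tokenize(query))
    scored = []
    for index, chunk in enumerate(chunks):
        counts = Counter(tokenize(chunk))
        score = sum(counts[token] for token in query_tokens)
        if score > 0:
            scored.append((index, score))
    # Highest score first; ties broken by original chunk order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]
```

A table chunk tends to be short and dense with exact terms, so literal matching can surface it even when its embedding sits far from the embedding of a natural-language question.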

@xqliu
Author

xqliu commented Sep 29, 2024

image

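This is the chunk information. Regarding "the question fails to recall the chunk that holds the relevant content": is there anything I can adjust for this right now? I do think the suggestions in the two feature requests you linked should solve the problem.

Also, could I work around this for now by putting all of the documents into the assistant's System Prompt?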

@arvinxx
Contributor

arvinxx commented Sep 29, 2024

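This is the chunk information. Regarding "the question fails to recall the chunk that holds the relevant content": is there anything I can adjust for this right now? I do think the suggestions in the two feature requests you linked should solve the problem.

There is currently nowhere to adjust this; it is indeed something we want to build later. Allowing chunks to be optimized by hand would ensure that a knowledge base tuned this way performs at its best.

Also, could I work around this for now by putting all of the documents into the assistant's System Prompt?

Yes. RAG is essentially automatic context injection. Adding the documents manually has the same effect, but it costs more tokens.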
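
As a rough sketch of that workaround, the snippet below concatenates whole documents into one system prompt and estimates the token cost. Everything in it is an assumption for illustration: `build_system_prompt` and `rough_token_estimate` are hypothetical helpers, and the default of 4 characters per token is only a crude rule of thumb for English text. Real tokenizers differ, especially for Chinese.

```python
import math


def build_system_prompt(base_prompt: str, documents: dict[str, str]) -> str:
    """Append each document, under its file name, to the base system prompt."""
    sections = [base_prompt.strip()]
    for name, content in documents.items():
        sections.append(f"## {name}\n{content.strip()}")
    return "\n\n".join(sections)


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; chars_per_token is an assumed average, not exact."""
    return math.ceil(len(text) / chars_per_token)
```

Because the full prompt is sent with every request, this estimate is paid on each message, which is the token cost mentioned above.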

@arvinxx arvinxx added the knowledge base knowledge base / RAG / file chunk label Sep 29, 2024
Projects
Status: Roadmap - Chat 1.x
Development

No branches or pull requests

3 participants