Request for Assistance in Training LLMs Using RAG for Educational Chatbots #4287
Dear sir,

I hope this message finds you well. My name is Ali Sheikhali, and I am an AI developer. Over the past three months, I have been working on a specialized AI chatbot designed for educational purposes. The chatbot enables students to ask academic questions and receive answers. However, given the nature of some highly specific academic questions, a standalone AI may not always be able to provide accurate responses. To address this, I am interested in implementing a Retrieval-Augmented Generation (RAG) system to train language models such as Llama or Claude Sonnet.

I have a large dataset consisting of lesson-related PDFs and Word documents that include questions from subjects like Mathematics, Physics, and Biology. These documents also contain mathematical formulas, charts, and occasionally tables. To ensure accurate results, I aim to chunk the content of these PDFs and Word files correctly for training purposes.

While researching online, I came across your insightful YouTube video ("AutoGen Explained"), where you discussed topics such as multi-agent systems. I found your explanation highly relevant to my project, particularly in terms of effectively processing and structuring PDF content. To better explain my use case, I have attached a sample from my dataset as an image.

I would greatly appreciate your guidance on how to properly train LLMs using my dataset and how to integrate the RAG system into this process. Any advice, resources, or recommendations you can provide would be incredibly valuable for my project. Thank you very much for your time and assistance. I look forward to your response.
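To make the chunking question concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes the PDF/Word content has already been extracted to plain text (e.g. with a library such as pypdf or python-docx; extraction itself is not shown), and the function name and parameters are illustrative, not from any particular SDK:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows, preferring to break
    at whitespace so chunks do not cut words in half."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Prefer to end the chunk at the last whitespace inside the window.
        if end < len(text):
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        # Step back by `overlap` characters so adjacent chunks share context.
        start = end - overlap if end - overlap > start else end
    return [c for c in chunks if c]
```

For documents with formulas and tables, a structure-aware splitter (by heading, paragraph, or table boundary) usually retrieves better than raw character windows, but the overlap idea is the same.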
Hi @Alisheikhalii,

Thanks for your post!

In the end there are multiple concerns being discussed here. First, trying to get the model to give more accurate answers by augmenting its prompts with a dataset (RAG) concerns grounding the model, which is different from training a model (building the model weights through a computational process) or fine-tuning the model (adjusting the model weights based on your data). Most likely you can accomplish what you want with advanced LLMs such as GPT-4o and RAG, without specific fine-tuning. For smaller models that may not be true.

Now, language models are known for not being great at math; they are, after all, stochastic rather than deterministic. For logic and math reasoning there is a whole lot of research on how to steer the models in the right direction, and some models such as Phi perform better than others. I suggest doing some research on that front.

WRT RAG techniques and AutoGen, there really is not a specific integration or implementation; it's treated as a DIY/separate concern for the most part, so following RAG examples from other SDKs will work fine -- e.g. we have some samples with llama-index RAG and AutoGen.
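To illustrate the grounding point above, here is a minimal retrieve-then-augment sketch using only the Python standard library. The bag-of-words scorer is a deliberately crude stand-in for embedding similarity (which llama-index or another RAG SDK would provide); the retrieve/augment flow and prompt template are the part that carries over, and all names here are illustrative:

```python
from collections import Counter

def score(query: str, chunk: str) -> float:
    """Crude lexical relevance: count query words that also appear in the chunk."""
    q = Counter(query.lower().split())
    c = Counter(chunk.lower().split())
    return sum(min(q[w], c[w]) for w in q)

def build_grounded_prompt(question: str, chunks: list[str], top_k: int = 2) -> str:
    """Rank chunks by relevance and prepend the best ones as context,
    so the model answers from retrieved material rather than from memory."""
    ranked = sorted(chunks, key=lambda c: score(question, c), reverse=True)
    context = "\n---\n".join(ranked[:top_k])
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The resulting string is what you would send to the model; no weights are changed anywhere, which is why this is grounding rather than training.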