Request for Assistance in Training LLMs Using RAG for Educational Chatbots #4287
Dear sir,

I hope this message finds you well. My name is Ali Sheikhali, and I am an AI developer. Over the past three months, I have been working on a specialized AI chatbot designed for educational purposes. The chatbot enables students to ask academic questions and receive answers. However, given the nature of some highly specific academic questions, a standalone AI may not always be able to provide accurate responses. To address this, I am interested in implementing a Retrieval-Augmented Generation (RAG) system to train language models such as Llama or Claude Sonnet.

I have a large dataset consisting of lesson-related PDFs and Word documents that include questions from subjects like Mathematics, Physics, and Biology. These documents also contain mathematical formulas, charts, and occasionally tables. To ensure accurate results, I aim to chunk the content of these PDFs and Word files correctly for training purposes.

While researching online, I came across your insightful YouTube video ("AutoGen Explained"), where you discussed topics such as multi-agent systems. I found your explanation highly relevant to my project, particularly in terms of effectively processing and structuring PDF content. To better explain my use case, I have attached a sample from my dataset as an image.

I would greatly appreciate your guidance on how to properly train LLMs using my dataset and how to integrate the RAG system into this process. Any advice, resources, or recommendations you can provide would be incredibly valuable for my project. Thank you very much for your time and assistance. I look forward to your response.
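To make the chunking question concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes the PDF/Word content has already been extracted to plain text (e.g. with a library such as pypdf or python-docx; extraction itself is not shown), and the function name and parameters are illustrative, not from any particular SDK:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows, preferring to break
    at whitespace so chunks do not cut words in half."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Prefer to end the chunk at the last whitespace inside the window.
        if end < len(text):
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        # Step back by `overlap` characters so adjacent chunks share context.
        start = end - overlap if end - overlap > start else end
    return [c for c in chunks if c]
```

For documents with formulas and tables, a structure-aware splitter (by heading, paragraph, or table boundary) usually retrieves better than raw character windows, but the overlap idea is the same.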
Hi @Alisheikhalii,

Thanks for your post!

In the end there are multiple concerns being discussed here. First, trying to get the model to give more accurate answers by augmenting its prompts with a dataset (RAG) concerns grounding the model, which is different from training a model (building the model weights through a computational process) or fine-tuning the model (adjusting the model weights based on your data). Most likely you can accomplish what you want with advanced LLMs such as GPT-4o and RAG, without specific fine-tuning. For smaller models that may not be true.

Now, language models are known for not being great at math; they are, after all, stochastic rather than deterministic. For logic and math reasoning there is a whole lot of research on how to steer the models in the right direction, and some models such as Phi perform better than others. I suggest doing some research on that front.

WRT RAG techniques and AutoGen, there really is not a specific integration or implementation; it's treated as a DIY/separate concern for the most part, so following RAG examples from other SDKs will work fine -- e.g. we have some samples with llama-index RAG and AutoGen.
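To illustrate the grounding point above, here is a minimal retrieve-then-augment sketch using only the Python standard library. The bag-of-words scorer is a deliberately crude stand-in for embedding similarity (which llama-index or another RAG SDK would provide); the retrieve/augment flow and prompt template are the part that carries over, and all names here are illustrative:

```python
from collections import Counter

def score(query: str, chunk: str) -> float:
    """Crude lexical relevance: count query words that also appear in the chunk."""
    q = Counter(query.lower().split())
    c = Counter(chunk.lower().split())
    return sum(min(q[w], c[w]) for w in q)

def build_grounded_prompt(question: str, chunks: list[str], top_k: int = 2) -> str:
    """Rank chunks by relevance and prepend the best ones as context,
    so the model answers from retrieved material rather than from memory."""
    ranked = sorted(chunks, key=lambda c: score(question, c), reverse=True)
    context = "\n---\n".join(ranked[:top_k])
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The resulting string is what you would send to the model; no weights are changed anywhere, which is why this is grounding rather than training.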