
this is a repository for question and answer generation (QAG). here we train answer extraction (AE) and question generation (QG) models. models will soon be publicly available at pbe.achybl.com


MatousAc/qag


Environment Setup

Recreate

This directory includes several complex environments. You should be able to use the various .yml files to create new conda environments. If you don't already use conda, you need to install it first.

conda env create --name <envName> --file <envName>.yml

The two current environments in this folder target the SoC GPUs (the CUDA version) and a local (conda) Python installation. Note that you may have to install some CUDA toolkit software before the CUDA environment installs correctly.

Here is an explanation of each saved environment:

Current envs

  • qagHfCuda.yml - main environment to use for this repository and thesis. use this for training LLaMA 2 with the trainer, data formatter, data processor, and other scripts in the src/ directory.
  • qagHf.yml - the non-cuda, local version of the environment. should allow for data processing and inference based on the final model
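Choosing between the two current environment files can be sketched in Python. This is only an illustration: the helper names are hypothetical, the `nvidia-smi` check is an assumed heuristic for "this machine has an NVIDIA GPU", and the command uses `conda env create`, the conda subcommand that reads .yml environment files.

```python
import shutil
from pathlib import Path

# File names come from the list above; everything else here is illustrative.
CUDA_ENV = "qagHfCuda.yml"
LOCAL_ENV = "qagHf.yml"


def pick_env_file(has_gpu: bool) -> str:
    """Return the environment file that matches the machine."""
    return CUDA_ENV if has_gpu else LOCAL_ENV


def conda_command(env_file: str) -> list[str]:
    """Build the conda invocation, naming the environment after the file."""
    env_name = Path(env_file).stem
    return ["conda", "env", "create", "--name", env_name, "--file", env_file]


def detect_gpu() -> bool:
    """Assumed heuristic: NVIDIA drivers usually install the nvidia-smi tool."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is installed by accident.
    print(" ".join(conda_command(pick_env_file(detect_gpu()))))
```

The script only prints the command; copy it into a shell once it looks right for your machine.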

Archived Envs

  • fastT5.yml - used for generating ONNX models of the Potsawee T5 QAG model. Creating the models seemed to work, but opening them and running inference never worked.
  • qagLmqg.yml and qagLmqgCuda.yml - used to test the lmqg python package for model training. This did not work with LLaMA 2.
  • qagT5.yml and qagT5Cuda.yml - main environments used during the attempts to train T5 for QAG. never fully worked, but showed promise. also includes packages for Optimum ONNX generation which did work

Start from scratch

Otherwise start a conda environment from scratch with:

conda create -n aqg
conda activate aqg
conda install python=3.11.5

Install the python packages that you need.

For optimum and t5:

pip install datasets evaluate fastt5 huggingface kaggle pandas numpy onnx onnxruntime optimum tokenizers torch transformers nltk

For LLaMA 2 training:

pip install datasets evaluate huggingface numpy pandas transformers tokenizers torch

It's up to you to figure out how to install the CUDA toolkit on your machine.
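Once the packages are installed, a quick check like the one below can confirm whether torch actually sees a GPU. It is a minimal sketch, not part of the repository; `describe_cuda` is a hypothetical helper name, and it falls back to a message if torch is missing.

```python
def describe_cuda() -> str:
    """Summarize whether the installed torch build can use a CUDA device."""
    try:
        import torch
    except ImportError:
        return "torch is not installed in this environment"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__} sees no CUDA device"
    device = torch.cuda.get_device_name(0)
    return f"torch {torch.__version__} sees CUDA {torch.version.cuda} on {device}"


print(describe_cuda())
```

If the output reports no CUDA device on a GPU machine, the toolkit or driver install is the first thing to revisit.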

Training

To fine-tune a new model, set up the proper config for the model type you are training. There are three types:

  1. AE
  2. QG
  3. E2E

Then run trainer.py. The run stats should be available in the pbe_qag team on wandb.ai. Each model type is its own "project."

To run AE and QG training back-to-back, set type to AE in qag.ini and run:

python trainer.py; python trainer.py

Once the first training has started, change type to QG and save the file. AE training runs first; when it completes, the second run starts and trains QG, because that is what qag.ini specifies by then.
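The manual edit between runs could also be scripted. The sketch below is an assumption-heavy illustration: the helper names are hypothetical, the section of qag.ini that holds `type` is not documented here, so the code searches every section for it, and it assumes trainer.py reads qag.ini once at startup.

```python
import configparser
import subprocess
import sys


def set_model_type(config_path: str, model_type: str) -> str:
    """Set the first `type` option found in the ini file; return its section."""
    # interpolation=None keeps literal % characters (e.g. in prompts) intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve the original case of option names
    if not parser.read(config_path):
        raise FileNotFoundError(config_path)
    for section in parser.sections():
        if parser.has_option(section, "type"):
            parser.set(section, "type", model_type)
            with open(config_path, "w") as handle:
                parser.write(handle)
            return section
    raise KeyError(f"no 'type' option found in {config_path}")


def train_back_to_back(config_path: str = "qag.ini", types=("AE", "QG")) -> None:
    """Run trainer.py once per model type, updating the config before each run."""
    for model_type in types:
        set_model_type(config_path, model_type)
        # check=True stops the sequence if a run fails, unlike `;` in the shell.
        subprocess.run([sys.executable, "trainer.py"], check=True)


if __name__ == "__main__":
    train_back_to_back()
```

Note that configparser rewrites the whole file, so any comments in qag.ini would be lost; keep a backup if the file is annotated.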

Generating questions and answers

To run the model, specify whether you want pipeline or end-to-end generation in the config file, qag.ini. Values of AE or QG will result in pipeline generation. E2E will enable end-to-end generation. Then, just run generator.py.

python generator.py

You will be in an inference loop where you can enter a verse reference for generation or press enter for a random verse.

The configuration for the project is in qag.ini. This file determines the current model source, data source, and inference prompt used. Data and logs are kept in the data folder.
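The relationship between the type value and the generation mode described in this section can be summarized in a few lines. This is a hypothetical sketch, not the logic in generator.py; in particular, trimming whitespace and ignoring case are assumptions made for convenience.

```python
# Mapping taken from the text: AE and QG select pipeline generation,
# E2E selects end-to-end generation.
GENERATION_MODES = {"AE": "pipeline", "QG": "pipeline", "E2E": "end-to-end"}


def generation_mode(model_type: str) -> str:
    """Translate a qag.ini type value into the generation mode it selects."""
    key = model_type.strip().upper()  # normalization is an assumption
    if key not in GENERATION_MODES:
        valid = ", ".join(GENERATION_MODES)
        raise ValueError(f"unknown type {model_type!r}; expected one of {valid}")
    return GENERATION_MODES[key]


for value in ("AE", "QG", "E2E"):
    print(value, "->", generation_mode(value))
```

Rejecting unknown values early makes a typo in the config file fail loudly instead of silently picking a mode.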

Resources

The resources folder is for miscellaneous project resources.
