Júlia Tessler and Manoel Veríssimo
This repository contains the code for the final project of IA-368 (Deep Learning for Information Retrieval), a course at Unicamp (University of Campinas), taken during the first semester of 2023.
We strongly suggest creating and activating a virtual environment before running this project:
python -m venv /path/to/venv
source /path/to/venv/bin/activate
Then, install the requirements:
pip install -r requirements.txt
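If you are unsure whether the environment is actually active, a minimal check (a sketch, not part of this repository) compares the interpreter's prefixes. The function name `in_virtualenv` is only illustrative.

```python
import sys


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```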
You may not be able to run most of this code without a CUDA device.
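To find out early whether a CUDA device is visible, something like the following sketch can help. It assumes PyTorch is the framework installed by the requirements (this README does not say so explicitly) and falls back to looking for the `nvidia-smi` tool when PyTorch is missing. The helper name `cuda_status` is hypothetical.

```python
import importlib.util
import shutil


def cuda_status() -> str:
    """Return a short description of whether a CUDA device seems usable."""
    # Assumption: the project runs on PyTorch; only import it if installed.
    if importlib.util.find_spec("torch") is not None:
        import torch

        if torch.cuda.is_available():
            return "torch installed, CUDA device available"
        return "torch installed, no CUDA device visible"
    # Without PyTorch, the NVIDIA driver tool is only a rough hint.
    if shutil.which("nvidia-smi"):
        return "torch not installed, nvidia-smi found"
    return "torch not installed, nvidia-smi not found"


if __name__ == "__main__":
    print(cuda_status())
```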
You'll need a BM25 index, which you can build by following the steps from mMARCO. You'll also need the portuguese_queries.train.tsv file. If you save the data using the same paths, you can simply run:
python generate_dataset.py
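Before generating the dataset, it can be useful to check that the queries file loads as expected. The sketch below assumes the two-column `qid<TAB>query` layout used by MS MARCO query files. The function `load_queries` is illustrative and is not taken from generate_dataset.py.

```python
import csv
from pathlib import Path


def load_queries(tsv_path):
    """Read a `qid<TAB>query` file into a dict mapping query id to text."""
    queries = {}
    with Path(tsv_path).open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside queries untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            queries[row[0]] = row[1]
    return queries
```

Calling `load_queries("portuguese_queries.train.tsv")` and printing the number of entries gives a quick check that the file is where you expect it and is not truncated.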
To train a new checkpoint of the model, you'll need to clone the original ColBERT repo, since this project reuses much of its code. Then run:
python train.py
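If Python cannot find the `colbert` package after cloning, one option is to add the clone to the import path. This is a sketch under assumptions: the clone location you pass in is your own choice, `add_colbert_to_path` is a hypothetical helper, and the scripts in this repository may already handle this differently.

```python
import sys
from pathlib import Path


def add_colbert_to_path(repo_dir):
    """Put a local ColBERT clone on sys.path and return its resolved path."""
    repo = Path(repo_dir).expanduser().resolve()
    # The ColBERT repo keeps its code in a top-level `colbert` package.
    if not (repo / "colbert").is_dir():
        raise FileNotFoundError(f"no 'colbert' package found under {repo}")
    if str(repo) not in sys.path:
        sys.path.insert(0, str(repo))
    return repo
```

Installing the clone into the virtual environment with pip, if the repo supports it, is an alternative that avoids touching `sys.path`.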
For this step, you need a trained checkpoint. This step also depends on the original ColBERT repo. Edit the relevant settings in the indexing.py script and run:
python indexing.py
For this step, you need a trained checkpoint and an indexed collection. This step also depends on the original ColBERT repo. Edit the relevant settings in the retrieval.py script and run:
python retrieval.py