ColBERT-v2 PT-BR

This repository contains the code for the final project of IA-368 (Deep Learning for Information Retrieval), a course at Unicamp (University of Campinas), completed during the first semester of 2023.

Set-up

We strongly suggest creating and activating a virtual environment before running this project:

python -m venv /path/to/venv
source /path/to/venv/bin/activate

Then, install the requirements:

pip install -r requirements.txt

You may not be able to run most of this code without a CUDA device.
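
A quick way to confirm this before starting is sketched below. It is a minimal check and not part of this repository: it assumes PyTorch (`torch`) is the framework in use, and falls back to looking for the `nvidia-smi` tool on the PATH when PyTorch is not installed. The helper name `cuda_probably_available` is hypothetical.

```python
import importlib.util
import shutil


def cuda_probably_available() -> bool:
    """Best-effort check for a usable CUDA device (illustrative helper)."""
    # Preferred path: ask PyTorch directly, if it is installed.
    if importlib.util.find_spec("torch") is not None:
        import torch

        return bool(torch.cuda.is_available())
    # Fallback (assumption): an NVIDIA driver install usually ships nvidia-smi.
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    print("CUDA device detected:", cuda_probably_available())
```

A `False` result does not prove the code cannot run, but it is a strong hint that training and indexing will fail or be impractically slow.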

Usage

Generating triples with distillation

You'll need a BM25 index, which can be built by following the steps from mMARCO. You'll also need the portuguese_queries.train.tsv file. If you save the data under the same paths, you can simply run:

python generate_dataset.py
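
Before launching the script, it can help to confirm that the queries file is readable. The sketch below assumes the usual mMARCO layout of one query per line as `query_id<TAB>query_text`; the function name `load_queries` and the path in the example are illustrative and are not taken from generate_dataset.py.

```python
import csv
from pathlib import Path


def load_queries(path):
    """Read a two-column TSV (query id, query text) into a dict."""
    queries = {}
    with Path(path).open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quote characters in the query text untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            query_id, text = row[0], row[1]
            queries[query_id] = text
    return queries


if __name__ == "__main__":
    # Assumed location: adjust to wherever the file was saved.
    queries = load_queries("portuguese_queries.train.tsv")
    print(f"Loaded {len(queries)} queries")
```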

Training the checkpoint

To train a new checkpoint of the model, you'll need to clone the original ColBERT repo, since this project reuses much of its code. Then run:

python train.py
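
If training fails with an import error, the ColBERT clone is probably not visible to Python. The following minimal sketch checks for that; it assumes the clone exposes a top-level package named `colbert` (as the original repository does) and that it has been installed or added to `PYTHONPATH`. The helper name `colbert_importable` is hypothetical.

```python
import importlib.util


def colbert_importable() -> bool:
    """Return True if a top-level `colbert` package can be located."""
    return importlib.util.find_spec("colbert") is not None


if __name__ == "__main__":
    if colbert_importable():
        print("ColBERT package found.")
    else:
        print("ColBERT package not found: install the clone or add it to PYTHONPATH.")
```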

Indexing the collection

For this step, you need a trained checkpoint. This step also depends on the original ColBERT repo. Update the required settings in the indexing.py script, then run:

python indexing.py

Retrieval

For this step, you need a trained checkpoint and an indexed collection. This step also depends on the original ColBERT repo. Update the required settings in the retrieval.py script, then run:

python retrieval.py
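
The four steps above depend on each other and must run in order. As a rough sketch, they could be chained as shown below. This is an illustration and not a script shipped with the repository: it assumes all four scripts sit in the current directory and take no command-line arguments, as in the commands above. The names `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical.

```python
import subprocess
import sys

# Order matters: triples -> checkpoint -> index -> retrieval.
PIPELINE = ["generate_dataset.py", "train.py", "indexing.py", "retrieval.py"]


def build_commands(python=sys.executable):
    """Return one command (as an argument list) per pipeline step."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in build_commands():
        print("$", " ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

The dry run is the default because training and indexing are long-running; pass `dry_run=False` only after indexing.py and retrieval.py have been edited as described above.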

