IR like a SIR:Sense-enhanced Information Retrieval for Multiple Languages

SIR (Sense-enhanced Information Retrieval) brings Word Sense Disambiguation (WSD) and Information Retrieval closer and provides additional semantic information for the query via sense definitions. This semantic information leads to improvements over a baseline that does not access semantics in multiple languages.

If you find our paper, code or framework useful, please reference this work in your paper:

@inproceedings{blloshmi-etal-2021-ir,
  title     = {IR like a SIR: Sense-enhanced Information Retrieval for Multiple Languages},
  author    = {Blloshmi, Rexhina and Pasini, Tommaso and Campolungo, Niccolò, Banerjee, Somnath and Navigli, Roberto and Pasi, Gabriella},
  booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing {EMNLP-2021}},
  publisher = {Association for Computational Linguistics},
  year      = {2021},
}

1. Setup

I. Create a conda environment with Python 3.6 and PyTorch 1.7.1 and install the dependencies requirements.txt.

Via conda:

conda create -n sir python=3.6
conda activate sir
pip install -r requirements.txt

To enable wandb logging:

wandb login

II. Download all the artifacts:

Download ARES embeddings indeces and replace the empty ares directory inside data/

Anonymized Link: ARES Indeces

Download pretrained checkpoints and place them inside out/

Anonymized Link: Checkpoints

Via conda:

2. SIR Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

Train on Robust04 folds:

bash scripts/run_robust_sense_aware.sh

Evaluate on Robust04 folds:

bash scripts/rerank_robust_sense_aware.sh

*These scripts contain the paths to fold1 of the Robust04 data. To run for other folds as well, please replace f1 in train_pairs, valid_run and model_out_dir paths with the corresponding fold. Choices --> {f1, f2, f3, f4, f5}

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

Train in Robust04 merged folds:

bash scripts/run_robust_sense_aware_mult.sh

Evaluate in CLEF:

bash scripts/rerank_robust_sense_aware_mult.sh

3. VanillaBERT Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

Train on Robust04 folds:

bash scripts/run_robust.sh

Evaluate on Robust04 folds:

bash scripts/rerank_robust.sh

*These scripts contain the paths to fold1 of the Robust04 data. To run for other folds as well, please replace f1 in train_pairs, valid_run and model_out_dir paths with the corresponding fold. Choices --> {f1, f2, f3, f4, f5}

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

Train in Robust04 merged folds:

bash scripts/run_robust_mult.sh

Evaluate in CLEF:

bash scripts/rerank_robust_mult.sh

*Due to proprietary nature of CLEF data, we do not provide the data for testing the models in this setting. We provide all the preprocessing steps starting from the original corpora in src/dataset_creation.

conda create -n sir python=3.6
conda activate sir
pip install -r requirements.txt

To enable wandb logging:

wandb login

II. Download all the artifacts:

Download ARES embeddings indeces and replace the empty ares directory inside data/

Anonymized Link: ARES Indeces

Download pretrained checkpoints and place them inside out/

Anonymized Link: Checkpoints

2. SIR Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

Train on Robust04 folds:

bash scripts/run_robust_sense_aware.sh

Evaluate on Robust04 folds:

bash scripts/rerank_robust_sense_aware.sh

*These scripts contain the paths to fold1 of the Robust04 data. To run for other folds as well, please replace f1 in train_pairs, valid_run and model_out_dir paths with the corresponding fold. Choices --> {f1, f2, f3, f4, f5}

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

Train in Robust04 merged folds:

bash scripts/run_robust_sense_aware_mult.sh

Evaluate in CLEF:

bash scripts/rerank_robust_sense_aware_mult.sh

3. VanillaBERT Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

Train on Robust04 folds:

bash scripts/run_robust.sh

Evaluate on Robust04 folds:

bash scripts/rerank_robust.sh

*These scripts contain the paths to fold1 of the Robust04 data. To run for other folds as well, please replace f1 in train_pairs, valid_run and model_out_dir paths with the corresponding fold. Choices --> {f1, f2, f3, f4, f5}

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

Train in Robust04 merged folds:

bash scripts/run_robust_mult.sh

Evaluate in CLEF:

bash scripts/rerank_robust_mult.sh

*Due to proprietary nature of CLEF data, we do not provide the data for testing the models in this setting. We provide all the preprocessing steps starting from the original corpora in src/dataset_creation.

License

This project is released under the CC-BY-NC-SA 4.0 license (see LICENSE). If you use SIR, please put a link to this repo.

Acknowledgements

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 and the ELEXIS project No. 731015 under the European Union’s Horizon 2020 research and innovation programme.

This work was supported by the PerLIR project (Personal Linguistic resources in Information Retrieval) funded by the MIUR Progetti di ricerca di Rilevante Interesse Nazionale programme (PRIN 2017).
This work was supported in part by the MIUR under the grant "Dipartimenti di eccellenza 2018-2022" of the Department of Computer Science of the Sapienza University of Rome.

We adopted modules or code snippets from the open-source projects:

CEDR

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
img		img
scripts		scripts
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

IR like a SIR:Sense-enhanced Information Retrieval for Multiple Languages

1. Setup

2. SIR Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

3. VanillaBERT Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

2. SIR Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

3. VanillaBERT Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

License

Acknowledgements

About

Releases

Packages

Languages

License

SapienzaNLP/sir

Folders and files

Latest commit

History

Repository files navigation

IR like a SIR:Sense-enhanced Information Retrieval for Multiple Languages

1. Setup

2. SIR Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

3. VanillaBERT Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

2. SIR Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

3. VanillaBERT Training & Evaluation

I. English monolingual retrieval in TREC Robust04* data:

II. Non-English monolingual retrieval in CLEF 2000-2003* data:

License

Acknowledgements

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages