Solution for the Information Retrieval project of the Statistical Natural Language Processing class, 2020.
The task is to develop and evaluate a two-stage information retrieval model that, given a query, returns the n
most relevant documents and then ranks the sentences within those documents.
The baseline is a tf-idf based document retriever. The results are then improved using the Okapi BM25 model. The third part extends this model to rank sentences.
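The two-stage idea (BM25 over documents, then BM25 over the sentences of the top documents) can be sketched roughly as follows. This is only an illustrative sketch, not the code in bm25.py or bm25_sentence.py: the function names `tokenize`, `bm25_scores`, `top_n` and `retrieve`, the regex tokenizer, the period-based sentence split, the parameters k1=1.2 and b=0.75, and the non-negative idf variant log(1 + (N - df + 0.5) / (df + 0.5)) are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, texts, k1=1.2, b=0.75):
    """Return one Okapi BM25 score per text for the given query."""
    tokenized = [tokenize(t) for t in texts]
    n_texts = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_texts if n_texts else 0.0

    # Document frequency: number of texts containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Assumed idf variant that never goes negative.
            idf = math.log(1 + (n_texts - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_n(query, texts, n):
    """Indices of the n highest-scoring texts, best first."""
    scores = bm25_scores(query, texts)
    order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return order[:n]


def retrieve(query, docs, n_docs=2, n_sents=3):
    """Stage 1: rank documents. Stage 2: rank sentences inside each hit."""
    results = []
    for i in top_n(query, docs, n_docs):
        # Naive sentence split on periods (assumption for this sketch).
        sentences = [s.strip() for s in docs[i].split(".") if s.strip()]
        ranked = [sentences[j] for j in top_n(query, sentences, n_sents)]
        results.append((i, ranked))
    return results


if __name__ == "__main__":
    docs = [
        "The cat sat on the mat. Dogs bark loudly.",
        "Stock markets fell today. Investors worry.",
        "A cat chased a mouse. The mouse escaped.",
    ]
    print(retrieve("cat mouse", docs))
```

In this sketch the sentence stage recomputes idf statistics over the sentences of a single document. Computing them over the whole corpus is an equally plausible design.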
Requires Python 3.5 or above
- src - main directory
- dataset - contains the corpus and query files
-> generated - intermediate and final generated files (tf.json, idf.json, ranking.json, results.txt, plots)
- baseline.py - baseline model implementation
- bm25.py - bm25 ranking implementation
- bm25_sentence.py - sentence based bm25 model implementation
- run.py - main function to execute the code
- analysis.pdf - analysis and observations with explanation
- How to run the code - Execute command
-> python run.py