An information retrieval system for the Persian language.

radinshayanfar/AUT-IR

Information Retrieval Engine

An implementation of an information retrieval engine, built as the AUT IR course project.

Instructor: Dr. A. Nickabadi

Semester: Spring 2022

Usage

Run python main.py --help to see the available options.

usage: MyIRSystem [-h] [-hl] [-zl] [-l index.pkl | -s index.pkl] [-r | -b] collection.json

positional arguments:
  collection.json       collection .json file path

optional arguments:
  -h, --help            show this help message and exit
  -hl, --heaps-law      demonstrate heaps law
  -zl, --zipf-law       demonstrate zipf law
  -l index.pkl, --load index.pkl
                        load previously saved index
  -s index.pkl, --save index.pkl
                        save built index
  -r, --ranked          ranked retrieval using tf-idf
  -b, --boolean         boolean retrieval using positional index
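
The option layout in the usage text (two pairs of mutually exclusive switches plus one positional argument) can be reproduced with argparse. The following is a minimal sketch reconstructed from the help output only; the function name build_parser is hypothetical and the project's actual parser may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the usage text (a reconstruction, not the project's code)."""
    parser = argparse.ArgumentParser(prog="MyIRSystem")
    parser.add_argument("-hl", "--heaps-law", action="store_true",
                        help="demonstrate heaps law")
    parser.add_argument("-zl", "--zipf-law", action="store_true",
                        help="demonstrate zipf law")

    # -l and -s are exclusive: an index is either loaded, or built and saved.
    index_group = parser.add_mutually_exclusive_group()
    index_group.add_argument("-l", "--load", metavar="index.pkl",
                             help="load previously saved index")
    index_group.add_argument("-s", "--save", metavar="index.pkl",
                             help="save built index")

    # -r and -b select the retrieval mode and are likewise exclusive.
    mode_group = parser.add_mutually_exclusive_group()
    mode_group.add_argument("-r", "--ranked", action="store_true",
                            help="ranked retrieval using tf-idf")
    mode_group.add_argument("-b", "--boolean", action="store_true",
                            help="boolean retrieval using positional index")

    parser.add_argument("collection", metavar="collection.json",
                        help="collection .json file path")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this layout, passing both -r and -b (or both -l and -s) makes argparse print an error and exit, which matches the bracketed alternatives shown in the usage line.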

If the inverted index was saved beforehand, it can be loaded with the --load switch. Otherwise, a new index is built, which can be saved with the --save switch.
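
The load-or-build behaviour can be sketched as follows. The .pkl extension suggests the index is serialized with Python's pickle module, but that is an assumption; the helper names save_index, load_index and get_index are hypothetical, and build_index stands in for whatever function constructs the index.

```python
import pickle
from pathlib import Path


def save_index(index, path):
    """Serialize the index to disk with pickle."""
    with Path(path).open("wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_index(path):
    """Read back an index written by save_index."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def get_index(docs, build_index, load_path=None, save_path=None):
    """Load a saved index if a path is given; otherwise build one and optionally save it."""
    if load_path is not None:
        return load_index(load_path)
    index = build_index(docs)  # build_index is a placeholder for the real builder
    if save_path is not None:
        save_index(index, save_path)
    return index
```

Note that unpickling can execute arbitrary code, so only index files from a trusted source should be loaded this way.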

There are two operation modes: boolean and ranked. In ranked mode, documents are scored with a tf-idf weighting scheme. In boolean mode, results are ranked by the number of occurrences of the query terms. The supported operators in boolean mode are:

  • And: the default operator, applied when terms are separated by a space, e.g. A B
  • Not: placed as ! before a term, e.g. A ! B
  • Phrase: written with " around the phrase, e.g. "A B"
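
The three operators above can be evaluated over a positional index roughly as follows. This is a minimal sketch, not the project's code: the function names and the index layout (term to document to positions) are assumptions, documents are assumed to be tokenized already, and ranking by occurrence count is left out for brevity.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} from already-tokenized documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_docs(index, words):
    """Return the documents in which `words` occur at consecutive positions."""
    if not words or any(w not in index for w in words):
        return set()
    hits = set()
    common = set.intersection(*(set(index[w]) for w in words))
    for doc_id in common:
        later = [set(index[w][doc_id]) for w in words[1:]]
        starts = index[words[0]][doc_id]
        if any(all(p + i + 1 in s for i, s in enumerate(later)) for p in starts):
            hits.add(doc_id)
    return hits


def search(index, all_docs, query):
    """Evaluate a query with implicit AND, `!` for NOT, and "..." for phrases."""
    parts = re.findall(r'"([^"]*)"|(\S+)', query)
    result, negate = set(all_docs), False
    for phrase, token in parts:
        if token == "!":
            negate = True  # the next term or phrase is excluded
            continue
        if token.startswith("!"):  # also accept "!B" written without a space
            negate, token = True, token[1:]
        if phrase:
            matched = phrase_docs(index, phrase.split())
        else:
            matched = set(index.get(token, {}))
        result = result - matched if negate else result & matched
        negate = False
    return result


docs = {1: ["a", "b", "c"], 2: ["b", "a"], 3: ["a", "c"]}
index = build_positional_index(docs)
print(search(index, docs, "a b"))    # documents containing both a and b
print(search(index, docs, "a ! b"))  # documents containing a but not b
print(search(index, docs, '"a b"'))  # documents containing a immediately followed by b
```

Storing positions rather than just document ids is what makes the phrase operator possible: a phrase matches only when each later word appears exactly one position after the previous one.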

A sample input collection is available here.
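
For the ranked mode described earlier, tf-idf scoring can be sketched as below. The README does not state which tf-idf variant is used, so the weighting here, (1 + log10 tf) * log10(N / df) with length normalization of the document vector, is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return {doc_id: {term: weight}} with weight = (1 + log10 tf) * log10(N / df)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {
            t: (1 + math.log10(c)) * math.log10(n / df[t]) for t, c in tf.items()
        }
    return vectors


def rank(docs, query_tokens, k=10):
    """Return up to k document ids, best first, scored by normalized tf-idf weight."""
    scores = {}
    for doc_id, vec in tfidf_vectors(docs).items():
        norm = math.sqrt(sum(w * w for w in vec.values()))
        score = sum(vec.get(t, 0.0) for t in query_tokens)
        if score > 0 and norm > 0:
            scores[doc_id] = score / norm
    return sorted(scores, key=scores.get, reverse=True)[:k]


docs = {1: ["cat", "dog"], 2: ["cat", "cat", "fish"], 3: ["dog", "bird"]}
print(rank(docs, ["fish"]))
print(rank(docs, ["cat"]))
```

Dividing by the vector length keeps long documents from winning simply because they contain more terms; a term that appears in every document gets an idf of zero and therefore contributes nothing to the score.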

Preprocessing and tokenization are done by the hazm library (for the Persian language).