This project comprises three parts: `Indexer`, `Tokenizer`, and `QueryProcessor`. It also uses a helper class named `FileWorker` for loading the dataset and saving checkpoints for the indexer and tokenizer.
On the `QueryProcessor` side, we use the TF-IDF algorithm to process each user query. To determine the similarity between the query and each document's representation, we use the cosine similarity function in the vector space.

NOTE: The data preprocessing and augmentation in this project are based on the Persian language.
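As a rough illustration of this scoring scheme (a minimal sketch only, not the project's actual code; the helper functions and the toy corpus below are made up for the example), TF-IDF weighting plus cosine ranking can look like this:

```python
import math
from collections import Counter

def tf_idf_vector(tokens, idf):
    """Build a sparse TF-IDF vector (term -> weight) for one tokenized text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: (count / total) * idf.get(term, 0.0)
            for term, count in counts.items()}

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy corpus of already-tokenized Persian documents; real tokenization is
# handled by the project's Tokenizer and is omitted here.
docs = [["کتاب", "خوب", "است"], ["فیلم", "خوب", "است"], ["کتاب", "جدید"]]

# Inverse document frequency over the toy corpus.
n_docs = len(docs)
df = Counter(term for doc in docs for term in set(doc))
idf = {term: math.log(n_docs / freq) for term, freq in df.items()}

doc_vectors = [tf_idf_vector(doc, idf) for doc in docs]
query_vector = tf_idf_vector(["کتاب", "خوب"], idf)

# Rank documents by cosine similarity to the query (highest score first).
scores = sorted(
    ((cosine_similarity(query_vector, dv), i) for i, dv in enumerate(doc_vectors)),
    reverse=True,
)
print(scores)
```

Documents whose TF-IDF vectors point in nearly the same direction as the query vector get a cosine score close to 1 and are ranked first.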
To run this search engine, we have to run the `main` file. First, the `tokenizer` and `indexer` instances are created. Then, after initializing the `fileWorker` instance, we can load the dataset with either the `fileIndex` or the `labeledFileIndex` function of the `fileWorker` class.
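A minimal sketch of these first steps might look like the snippet below; the module paths, constructor arguments, and the exact behavior of `fileIndex` / `labeledFileIndex` are assumptions based on this description, so check the actual classes before relying on it:

```python
# Hypothetical sketch of the beginning of the `main` file.
# Module paths and constructor signatures are assumptions, not the
# project's confirmed API.
from tokenizer import Tokenizer
from indexer import Indexer
from fileWorker import FileWorker

tokenizer = Tokenizer()
indexer = Indexer()

# Initialize the helper and load the dataset; `labeledFileIndex` would be
# used instead when the labeled dataset is needed.
fileWorker = FileWorker()
documents = fileWorker.fileIndex()   # or: fileWorker.labeledFileIndex()
```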
Finally, after some preprocessing, we create the `queryProcessor` instance by passing the `indexer` and the `tokenizer` to its constructor. We can then type our queries in the terminal by calling the `startListening` function.
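The last step could look roughly like this (continuing the hypothetical sketch above; the `QueryProcessor` constructor order and the behavior of `startListening` are assumed from this description):

```python
# Hypothetical continuation of the `main` sketch above.
from queryProcessor import QueryProcessor

# Wire the index and the tokenizer into the query processor.
queryProcessor = QueryProcessor(indexer, tokenizer)

# Read queries from the terminal, score documents with TF-IDF and
# cosine similarity, and print the best-matching results.
queryProcessor.startListening()
```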