This project applies Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Sentence-BERT (SBERT) to the MS MARCO dataset to enable semantic search with each model. As a baseline, it embeds the documents with GloVe and compares them against the provided queries using cosine similarity. Precision, Recall, F1-score, Average Precision, and Mean Average Precision (MAP) are computed to assess and compare model performance.
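A minimal sketch of that GloVe cosine-similarity baseline (the word-vector averaging, the 300-dimensional size, and all names here are assumptions for illustration, not the project's exact implementation):

import numpy as np

def embed(text, glove):
    # Average the GloVe vectors of in-vocabulary tokens (assumed pooling strategy).
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rank(query, docs, glove):
    # glove is a dict-like word -> vector mapping (e.g., gensim KeyedVectors).
    q = embed(query, glove)
    scores = [cosine(q, embed(d, glove)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)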
-- nltk
-- tqdm
-- gensim
-- scipy
-- numpy
-- scikit-learn
-- sentence_transformers
-- PyTorch
-- GloVe embeddings
Clone the project
git clone https://github.com/zthsk/semantic_search.git
Go to the project directory
cd semantic_search
Install dependencies
pip install nltk
pip install tqdm
pip install gensim
pip install scipy
pip install numpy
pip install scikit-learn
pip install sentence-transformers
pip install torch torchvision torchaudio
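The GloVe embeddings listed under the requirements are not installed through pip. One convenient way to fetch a pre-trained set is gensim's downloader; which vector set the project actually expects is an assumption here:

import gensim.downloader as api
glove = api.load("glove-wiki-gigaword-300")  # downloads on first use; pick the set the project expects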
Train the LSA, LDA, SBERT, and GloVe models
python train_models.py --bert sbert_embeddings.npy
python train_models.py --lsa lsa_model.npy
python train_models.py --lda lda_model.npy
python train_models.py --glove glove_embeddings.npy
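To illustrate what the --bert and --lsa steps above plausibly do, here is a sketch under assumptions (the SBERT model name, preprocessing, and save formats are guesses, not train_models.py's actual code):

import numpy as np
from sentence_transformers import SentenceTransformer
from gensim import corpora, models

docs = ["first example passage", "second example passage"]  # MS MARCO passages in the real script

# --bert: encode every document once and cache the embedding matrix.
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # model name is an assumption
np.save("sbert_embeddings.npy", sbert.encode(docs, show_progress_bar=True))

# --lsa: fit a bag-of-words LSA (LSI) model with gensim.
tokenized = [d.lower().split() for d in docs]
dictionary = corpora.Dictionary(tokenized)
bow = [dictionary.doc2bow(t) for t in tokenized]
lsa = models.LsiModel(bow, id2word=dictionary, num_topics=200)
lsa.save("lsa_model.npy")  # keeps the README's filename; gensim uses its own save format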
Query the model with a single query
python query.py --model [bert, lsa, lda] --query "your query"
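Internally, a single query against the cached SBERT embeddings might look roughly like this (a sketch; query.py's real logic may differ):

import numpy as np
from sentence_transformers import SentenceTransformer

emb = np.load("sbert_embeddings.npy")            # document matrix produced during training
model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the model used at training time
q = model.encode(["your query"])[0]
scores = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-9)
print(np.argsort(-scores)[:10])                  # indices of the ten best-matching documents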
Query the model with a list of queries
./run_queries.sh # update queries.txt with the queries you want to run
Use the analysis.ipynb notebook to compute the evaluation metrics and reproduce the comparison plots for the models.
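For reference, the headline metrics can be computed from binary relevance judgments roughly as follows (a sketch; the notebook's exact code and plots are not reproduced here):

def precision_recall_f1(relevant, retrieved):
    # relevant and retrieved are sets of document ids.
    tp = len(relevant & retrieved)
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(relevant, ranking):
    # ranking is a list of document ids, best match first.
    hits, ap = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            ap += hits / k
    return ap / len(relevant) if relevant else 0.0

# MAP is the mean of average_precision over all evaluated queries.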