Warning: work in progress!
A basic search engine that indexes a corpus of documents so you can search it and rank the results. Built in Python using object-oriented programming principles to keep the project extendable and maintainable.
Features:
- Inverted Index - to improve search times.
- Results Ranking - with term frequency–inverse document frequency (TF-IDF) to order results by relevance.
- Query Expansion - to automatically add query terms (like synonyms) to improve result relevancy (see my testing analysis).
- Result Evaluation - test and compare results with human-evaluated relevancy scores to gauge performance.
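To give a feel for how the first two features fit together, here is a minimal sketch of an inverted index with TF-IDF ranking. It is not the project's actual code: the function names (`tokenize`, `build_index`, `search`), the whitespace tokenizer, and the plain `tf * log(N / df)` weighting are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive lowercase + whitespace split (an assumption; real query
    # normalization would also handle punctuation, stemming, etc.).
    return text.lower().split()


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term count in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, num_docs, query):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur anywhere in the corpus
        # Rare terms get a higher weight than common ones.
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus, for illustration only.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs and cats make good pets",
}
index = build_index(docs)
results = search(index, len(docs), "cat mat")
```

Because the index maps each term straight to the documents containing it, a query only touches the postings for its own terms instead of scanning every document. In this toy corpus, `d1` ranks above `d2` for the query "cat mat" since it matches both terms and "mat" is the rarer one.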
This started out as a course project, and I'm currently building it out further with more features. I'm also planning a front-end web interface so I can demo the project better.
ToDo:
- Split up files and organize into packages.
- Write Documentation!
- Finish implementing stop words functionality.
- Build a frontend web interface to demo the project.
- Result snippet generation.
- Implement advanced search operators (OR, NOT).
- Improve query normalization.
- Ranking improvements.
- Add caching and on-demand loading to improve memory efficiency.
I hope to write some more comprehensive documentation for this project in the near future.
Stay tuned :)