An intelligent document search engine that leverages natural language processing techniques to provide relevant and personalized search results. Powered by Flask, TF-IDF, and cosine similarity.
- Text Preprocessing: Tokenization, stop word removal, and lemmatization
- Inverted Index Construction: Allows efficient term-based lookups
- TF-IDF Calculation: Measures the importance of terms in each document
- Cosine Similarity: Computes similarity between the query and documents for ranking
- Spell Checking: Automatically corrects misspelled terms in user queries
- Web Interface: Search through documents using a simple HTML form
- Python 3.10+
- Internet connection (for downloading NLTK stopwords and spaCy model)
- Clone the Repository
    git clone https://github.com/Zilean12/Search-Engine.git
    cd Search-Engine
- Install Required Packages: Install the necessary Python packages listed in requirements.txt:
    pip install -r requirements.txt
- Download spaCy Model
    python -m spacy download en_core_web_sm
- Download NLTK Data: Download the stopwords dataset from NLTK, for example with python -m nltk.downloader stopwords
- Run the Application: Start the Flask app by running:
    python app.py
  The app will be available at http://127.0.0.1:5000.
1. app.py: Main application file with text processing, TF-IDF calculation, and Flask routes.
2. templates/index.html: HTML template for the search interface.
3. static/style.css: CSS file for styling the web interface.
4. requirements.txt: List of required Python packages.
- Open the app in your browser (http://127.0.0.1:5000).
- Enter a search query in the input box and click "Search."
- The application will display documents ranked by relevance to the query, showing their cosine similarity scores. Misspelled terms in the query will be automatically corrected.
The text is converted to lowercase, punctuation is removed, stop words are removed, and the remaining words are lemmatized (reduced to their base forms).
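The first three preprocessing steps can be sketched with the standard library alone. This is a minimal illustration, not the code in app.py: the function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the app is described as using NLTK's stop word list), and the final base-form step is left out because it depends on spaCy or NLTK.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration only;
# the app is described as using the NLTK stopwords dataset instead.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def preprocess(text: str) -> list[str]:
    """Lowercase the text, delete ASCII punctuation, and drop stop words."""
    lowered = text.lower()
    # str.translate with a deletion table removes every punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and keep only tokens that are not stop words.
    return [token for token in no_punct.split() if token not in STOP_WORDS]


tokens = preprocess("The cats are sleeping in the garden!")
print(tokens)  # ['cats', 'sleeping', 'garden']
```

A lemmatizer would additionally map tokens such as "cats" to "cat", so that singular and plural forms match the same index entries.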
An inverted index is created to store document IDs for each unique term, facilitating fast lookup of terms in documents.
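As a rough sketch of this structure, an inverted index can be modeled as a dictionary that maps each term to the set of document IDs containing it. The function name `build_inverted_index` and the sample documents are assumptions made for this example, not names taken from app.py.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map each unique term to the set of IDs of documents that contain it."""
    index: defaultdict[str, set[int]] = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    # Return a plain dict so that looking up an unknown term raises KeyError
    # instead of silently creating an empty entry.
    return dict(index)


# Hypothetical, already-preprocessed documents keyed by document ID.
docs = {0: ["cat", "garden"], 1: ["dog", "garden"], 2: ["cat", "dog"]}
index = build_inverted_index(docs)
print(index["garden"])  # {0, 1}
```

With this layout, finding candidate documents for a query term is a single dictionary lookup rather than a scan over every document.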
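A TF-IDF score is calculated for each term in each document. Term Frequency (TF) measures how often a term occurs in a document, and Inverse Document Frequency (IDF) down-weights terms that occur in many documents; their product measures how important a term is to a document.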
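The calculation can be illustrated as follows. TF-IDF has several common weighting variants, and the source does not say which one app.py uses, so this sketch assumes one simple form: TF is the term count divided by the document length, and IDF is the natural log of (number of documents / number of documents containing the term). The function name `compute_tf_idf` is hypothetical.

```python
import math
from collections import Counter


def compute_tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one sparse {term: weight} vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq: Counter[str] = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Assumed variant: relative term frequency times log-scaled IDF.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


vectors = compute_tf_idf([["cat", "garden"], ["dog", "garden"]])
```

In this example "garden" occurs in both documents, so its IDF is log(2/2) = 0 and it contributes nothing to the ranking, while "cat" and "dog" each receive a positive weight.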
The similarity between the query and each document is calculated using cosine similarity, which helps rank documents based on relevance.
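Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below assumes sparse vectors stored as {term: weight} dictionaries; the names `cosine_similarity` and `rank_documents` and the sample weights are illustrative, and the app itself may use NumPy arrays instead.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An all-zero vector has no direction; treat it as matching nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scores = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights for a one-term query and three documents.
query_vec = {"cat": 1.0}
doc_vecs = {0: {"cat": 1.0}, 1: {"dog": 1.0}, 2: {"cat": 1.0, "dog": 1.0}}
ranked = rank_documents(query_vec, doc_vecs)
```

Here document 0 scores 1.0 because it points in the same direction as the query, document 2 scores about 0.71 because it also contains "dog", and document 1 scores 0.0 because it shares no terms with the query.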
The application uses a custom spell checker to automatically correct misspelled terms in user queries, improving the search experience.
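The source does not show how the custom spell checker works. As one possible approach, the sketch below replaces each unknown query term with its closest match from the indexed vocabulary, using `difflib.get_close_matches` from the Python standard library rather than the app's own checker or rapidfuzz. The function name `correct_query`, the 0.8 cutoff, and the sample vocabulary are assumptions for this example.

```python
import difflib


def correct_query(query_terms, vocabulary, cutoff=0.8):
    """Replace terms missing from the vocabulary with their closest known match."""
    candidates = sorted(vocabulary)  # fixed order keeps results deterministic
    corrected = []
    for term in query_terms:
        if term in vocabulary:
            corrected.append(term)
            continue
        # get_close_matches returns up to n candidates scoring at least cutoff.
        matches = difflib.get_close_matches(term, candidates, n=1, cutoff=cutoff)
        # Keep the original term when nothing is similar enough.
        corrected.append(matches[0] if matches else term)
    return corrected


vocabulary = {"search", "engine", "document"}
print(correct_query(["serach", "engine", "xyz"], vocabulary))
# ['search', 'engine', 'xyz']
```

Restricting candidates to terms that already occur in the indexed documents means a correction always maps to something the search can actually find.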
- Flask: Web framework for Python, used for handling HTTP requests and serving the web application.
- NLTK (Natural Language Toolkit): Used for text preprocessing tasks, such as removing stopwords.
- NumPy: Provides support for numerical operations and vector calculations, essential for data processing.
- Tabulate: Formats data in tables for improved readability in the console.
- Colorama: Cross-platform library for adding color formatting to terminal output, making console messages more intuitive.
- spaCy: Advanced NLP library, used with the en_core_web_sm model to support text processing and tokenization.
- rapidfuzz: Library for fuzzy string matching, enhancing search capabilities by identifying approximate matches.