Skip to content

Develop a custom search engine using Information Retrieval

Notifications You must be signed in to change notification settings

Zilean12/Search-Engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🔍 Simple Search Engine

Python Version Flask Version

An intelligent document search engine that leverages natural language processing techniques to provide relevant and personalized search results. Powered by Flask, TF-IDF, and cosine similarity.

Table of Contents

📋 Features

  • Text Preprocessing: Tokenization, stop word removal, and lemmatization
  • Inverted Index Construction: Allows efficient term-based lookups
  • TF-IDF Calculation: Measures the importance of terms in each document
  • Cosine Similarity: Computes similarity between the query and documents for ranking
  • Spell Checking: Automatically corrects misspelled terms in user queries
  • Web Interface: Search through documents using a simple HTML form

🛠️ Prerequisites

  • Python 3.10+
  • Internet connection (for downloading NLTK stopwords and spaCy model)

🚀 Installation

  1. Clone the Repository

    git clone https://github.com/Zilean12/Search-Engine.git
    cd Search-Engine
  2. Install Required Packages Install the necessary Python packages listed in requirements.txt:

    pip install -r requirements.txt
  3. Download spaCy Model

    python -m spacy download en_core_web_sm
  4. Download NLTK Data Download the stopwords dataset from NLTK

  5. Run the Application Start the Flask app by running:

    python app.py
    

The app will be available at http://127.0.0.1:5000.

🗂️ Project Structure


  1. app.py: Main application file with text processing, TF-IDF calculation, and Flask routes.

 2. templates/index.html: HTML template for the search interface.

 3. static/style.css: CSS file for styling the web interface.

 4. requirements.txt: List of required Python packages.

🔍 Usage


  1. Open the app in your browser (http://127.0.0.1:5000).
  2. Enter a search query in the input box and click "Search."
  3. The application will display documents ranked by relevance to the query, showing their cosine similarity scores. Misspelled terms in the query will be automatically corrected.

🔑 Key Components


Text Preprocessing

The text is converted to lowercase, punctuation is removed, stop words are removed, and remaining words are stemmed.

Inverted Index

An inverted index is created to store document IDs for each unique term, facilitating fast lookup of terms in documents.

TF-IDF Calculation

The TF-IDF score is calculated for each term in each document. TF (Term Frequency) and IDF (Inverse Document Frequency) scores are used to measure term importance.

Cosine Similarity

The similarity between the query and each document is calculated using cosine similarity, which helps rank documents based on relevance.

Spell Checking

The application uses a custom spell checker to automatically correct misspelled terms in user queries, improving the search experience.

🧰 Dependencies


  • Flask: Web framework for Python, used for handling HTTP requests and serving the web application.
  • NLTK (Natural Language Toolkit): Used for text preprocessing tasks, such as removing stopwords.
  • NumPy: Provides support for numerical operations and vector calculations, essential for data processing.
  • Tabulate: Formats data in tables for improved readability in the console.
  • Colorama: Cross-platform library for adding color formatting to terminal output, making console messages more intuitive.
  • spaCy: Advanced NLP library, used with the en_core_web_sm model to support text processing and tokenization.
  • rapidfuzz: Library for fuzzy string matching, enhancing search capabilities by identifying approximate matches.

About

Develop a custom search engine using Information Retrieval

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published