Demonstrating techniques for text document processing, including vector space modeling, cosine similarity computation, and other information retrieval methods. Ideal for learning and implementing basic IR concepts.

Someshdiwan/Information-Retrieval

Text Document Processing

A collection of scripts and examples demonstrating techniques for text document processing, including vector space modeling, cosine similarity computation, and other information retrieval (IR) methods. This repository is ideal for learning and implementing basic IR concepts, text classification, web crawling, and document preprocessing.

🚀 Overview

This repository demonstrates core techniques in text document processing and information retrieval (IR): text preprocessing, vector space modeling, similarity computation, text classification, and web crawling.

Key Techniques:

  • Text Preprocessing: Text cleaning, stop word removal, stemming, and lemmatization.
  • Vector Space Model (VSM): Representing documents as vectors in a high-dimensional space for processing.
  • Cosine Similarity: Computing the similarity between documents using the cosine similarity measure.
  • Naive Bayes Classifier: Text classification using the Naive Bayes algorithm (GaussianNB).
  • Web Crawling: Crawling websites to extract news stories with domain filtering.
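
To make the vector space model and cosine similarity concrete, here is a minimal sketch using only the Python standard library. It is not taken from the repository's notebooks: the function names (`tokenize`, `term_vector`, `cosine_similarity`), the tiny stop-word list, and the sample documents are all illustrative assumptions. Each document becomes a sparse term-frequency vector, and similarity is the cosine of the angle between two such vectors.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; nltk provides a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "on", "of", "and", "in", "to"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def term_vector(text):
    """Represent a document as a sparse term-frequency vector."""
    return Counter(tokenize(text))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents.
docs = [
    "The cat sat on the mat",
    "A cat sat on a mat",
    "Stock markets fell sharply today",
]
vectors = [term_vector(d) for d in docs]
sim_close = cosine_similarity(vectors[0], vectors[1])  # same content words
sim_far = cosine_similarity(vectors[0], vectors[2])    # no shared terms
print(sim_close, sim_far)
```

The first two documents share every content word once stop words are removed, so their similarity is effectively 1; the third shares none, so its similarity is 0. In practice, raw term counts are often replaced with TF-IDF weights, which scikit-learn's vectorizers can produce.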


🔧 Features

  • Text Classification: Naive Bayes classifier for text classification and prediction tasks.
  • Document Preprocessing: Techniques for cleaning and preparing text documents for analysis.
  • Cosine Similarity: Implementation of cosine similarity to compare and measure the similarity between documents.
  • Web Crawling: Scripts for crawling news websites and collecting relevant text content.
  • XML Parsing: Basic example of parsing and modifying XML documents in Python.
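
As an illustration of crawling with domain filtering, the sketch below extracts links from an HTML string and keeps only those on an allowed domain. It is an assumption-laden example, not the repository's crawler: the names `LinkCollector`, `same_domain`, and `extract_links`, the `example.com` URLs, and the sample HTML are hypothetical, and it uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without a network connection or extra packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def same_domain(url, allowed_domain):
    """True for http(s) URLs on allowed_domain or one of its subdomains."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return parsed.scheme in ("http", "https") and (
        host == allowed_domain or host.endswith("." + allowed_domain)
    )


def extract_links(html, base_url, allowed_domain):
    """Return the in-domain links found in an HTML document, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [u for u in parser.links if same_domain(u, allowed_domain)]


# Hypothetical page content; a real crawler would fetch this with requests.
SAMPLE_HTML = """
<a href="/world/story-1">Story 1</a>
<a href="https://news.example.com/tech/story-2">Story 2</a>
<a href="https://ads.example.net/banner">Ad</a>
<a href="mailto:editor@example.com">Contact</a>
"""
links = extract_links(SAMPLE_HTML, "https://www.example.com/", "example.com")
print(links)
```

Filtering on the parsed hostname, rather than searching for the domain as a substring of the URL, avoids accepting look-alike hosts such as `example.com.evil.net`.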

🌐 Demo

You can try out the various techniques demonstrated in this repository by running the provided Python scripts or Jupyter notebooks. The projects include:

  • Text classification using Naive Bayes (GaussianNB)
  • Cosine similarity computation for document comparison
  • Web crawling to extract news stories from websites
  • XML document processing for parsing and modification
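
For the XML item, a minimal sketch of parsing and then modifying a document might look like the following. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra packages; `lxml.etree` exposes a largely compatible API, but the calls here are not taken from the repository's scripts. The `<library>` document and its element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document.
SAMPLE_XML = """
<library>
  <document id="1"><title>Intro to IR</title></document>
  <document id="2"><title>Vector Space Models</title></document>
</library>
"""

# Parse: build an element tree and read the text of each <title>.
root = ET.fromstring(SAMPLE_XML.strip())
titles = [doc.findtext("title") for doc in root.findall("document")]

# Modify: add an attribute to the existing documents, then append a new one.
for doc in root.findall("document"):
    doc.set("indexed", "true")
new_doc = ET.SubElement(root, "document", {"id": "3"})
ET.SubElement(new_doc, "title").text = "Cosine Similarity"

# Serialize the modified tree back to a string.
updated = ET.tostring(root, encoding="unicode")
print(titles)
print(updated)
```

To write the result to disk instead of a string, `ET.ElementTree(root).write(path)` can be used with a path of your choice.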

Dependencies:

To run the examples, you will need the following libraries:

  • Python 3.x
  • scikit-learn (for Naive Bayes and vectorizer)
  • pandas
  • numpy
  • requests
  • BeautifulSoup (for web scraping)
  • nltk (for text preprocessing)
  • lxml (for XML parsing)

Install them using pip:

pip install scikit-learn pandas numpy requests beautifulsoup4 nltk lxml


🛠️ Technologies Used

  • Python 3.x
  • scikit-learn (for machine learning and vector space modeling)
  • pandas
  • numpy
  • nltk (for natural language processing)
  • BeautifulSoup (for web scraping)
  • lxml (for XML parsing)
  • Jupyter Notebooks (for interactive demos)

📂 Project Structure

Text-Document-Processing/
├── notebooks/               # Jupyter notebooks for each technique
├── data/                    # Datasets for testing and training models
└── README.md                # Project documentation

Running the Code

Clone the repository:

git clone https://github.com/Someshdiwan/Text-Document-Processing


🌟 Show Your Support
If you like this project, please consider giving it a ⭐ on GitHub!

🤝 Contributing
We welcome contributions to improve the repository! If you have any enhancements, bug fixes, or new project ideas, feel free to fork the repository, make changes, and submit a pull request.
