This repository contains Python code for a text analysis application that demonstrates two text analysis techniques: Document Classification and Document Clustering. It uses the data in the ResearchPapers directory, computing TF and IDF values to produce TF-IDF scores for the documents. These TF-IDF scores are then used for both the Document Classification and Document Clustering tasks (the standard formulation is shown after the list below).
- Document Classification: This is a supervised learning technique that assigns a label to a text document. The labels can be binary, multi-class, or multi-label. The app uses the K-Nearest Neighbors algorithm to classify the text documents.
- Document Clustering: This is an unsupervised learning technique that groups similar text documents together. The goal is to discover the underlying structure in the text data. The app uses the K-Means algorithm to cluster the text documents.
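For reference, the TF-IDF scores mentioned above typically follow the standard formulation below; the exact variant computed by `weight_calculation.py` (e.g. with smoothing or length normalization) may differ:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where $\text{tf}(t, d)$ is the frequency of term $t$ in document $d$, $N$ is the total number of documents, and $\text{df}(t)$ is the number of documents containing $t$.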
- File: `weight_calculation.py`
- Description: This script calculates the Term Frequency (TF), Inverse Document Frequency (IDF), and TF-IDF weights for the terms in the text documents. It preprocesses the text files, filters out stopwords (from Stopword-List.txt), and saves the weights to CSV files.
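As a rough illustration of what this script does, here is a minimal sketch of the TF/IDF/TF-IDF pipeline; the helper names, output filename (`tfidf_weights.csv`), and the exact weighting variant are assumptions for illustration, not taken from the script:

```python
# Sketch of a TF / IDF / TF-IDF computation like the one in weight_calculation.py.
# Paths, column names, and the weighting variant are illustrative assumptions.
import csv
import math
import os
from collections import Counter

from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')


def load_stopwords(path="Stopword-List.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def preprocess(text, stopwords):
    # Lowercase, tokenize, and drop stopwords and non-alphabetic tokens.
    return [t for t in word_tokenize(text.lower()) if t.isalpha() and t not in stopwords]


def compute_tf_idf(doc_dir="ResearchPapers"):
    stopwords = load_stopwords()
    docs = {}  # filename -> term frequency Counter
    for name in os.listdir(doc_dir):
        with open(os.path.join(doc_dir, name), encoding="utf-8", errors="ignore") as f:
            docs[name] = Counter(preprocess(f.read(), stopwords))

    n_docs = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tf in docs.values() for term in tf)
    idf = {term: math.log(n_docs / freq) for term, freq in df.items()}

    # Write one TF-IDF weight per (document, term) pair to CSV.
    with open("tfidf_weights.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["document", "term", "tfidf"])
        for name, tf in docs.items():
            for term, count in tf.items():
                writer.writerow([name, term, count * idf[term]])


if __name__ == "__main__":
    compute_tf_idf()
```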
- File: `Homepage.py`
- Description: This script sets up the homepage of the application using Streamlit. It provides an overview of the app and its functionalities.
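As an illustration of this kind of setup, here is a minimal Streamlit homepage sketch; the page title and text are placeholders, not the script's actual content:

```python
# Minimal Streamlit homepage sketch; titles and copy are illustrative placeholders.
import streamlit as st

st.set_page_config(page_title="Text Analysis", layout="wide")
st.title("Text Analysis App")
st.write(
    "Use the sidebar to open the Document Classification and "
    "Document Clustering pages."
)
```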
- File: `Document_Classification.py`
- Description: This script implements the document classification functionality of the application. It trains a K-Nearest Neighbors model using TF-IDF weights and evaluates the model using accuracy, recall, F1-score, and precision metrics.
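The following is a minimal sketch of a comparable KNN pipeline using scikit-learn; it vectorizes a toy corpus inline, whereas the actual script loads the precomputed TF-IDF weights, and the texts, labels, and `k` value here are placeholders:

```python
# Illustrative KNN classification over TF-IDF features with scikit-learn.
# The toy corpus and labels are placeholders; the real script uses the saved weights.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

texts = [
    "deep learning for vision", "svm margin theory",
    "neural image models", "kernel methods",
]
labels = ["dl", "ml", "dl", "ml"]  # hypothetical class labels

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)  # k is a tunable hyperparameter
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_test, pred, average="macro", zero_division=0))
print("f1       :", f1_score(y_test, pred, average="macro", zero_division=0))
```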
- File: `Document_Clustering.py`
- Description: This script implements the document clustering functionality of the application. It trains a K-Means clustering model using TF-IDF weights and evaluates the model using purity, silhouette score, and adjusted Rand index metrics. It also generates an Elbow Chart and displays the top keywords for each cluster.
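A minimal sketch of a comparable K-Means pipeline with scikit-learn is shown below; the toy corpus, ground-truth labels, and cluster counts are placeholders, and purity is computed by hand since scikit-learn has no built-in purity metric:

```python
# Illustrative K-Means clustering over TF-IDF features, with the same kinds of
# evaluation and keyword extraction that Document_Clustering.py reports.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, silhouette_score

texts = [
    "neural networks for images", "convolutional image models",
    "information retrieval ranking", "query term weighting retrieval",
]
true_labels = [0, 0, 1, 1]  # hypothetical ground truth for purity and ARI

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)


def purity(y_true, y_pred):
    # Fraction of documents assigned to the majority true class of their cluster.
    total = 0
    for c in set(y_pred):
        members = [y_true[i] for i, p in enumerate(y_pred) if p == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(y_true)


print("purity    :", purity(true_labels, km.labels_))
print("silhouette:", silhouette_score(X, km.labels_))
print("ARI       :", adjusted_rand_score(true_labels, km.labels_))

# Elbow chart data: inertia (within-cluster SSE) for each candidate k.
for k in range(1, 4):
    print(k, KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)

# Top keywords per cluster: terms with the largest centroid weights.
terms = vec.get_feature_names_out()
for c, centroid in enumerate(km.cluster_centers_):
    top = [terms[i] for i in np.argsort(centroid)[::-1][:3]]
    print(f"cluster {c}: {top}")
```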
To run the information retrieval system, follow these steps:
- Ensure you have Python 3.12 installed.
- Install all the dependencies by running `pip install -r requirements.txt` in the terminal.
- Make sure Stopword-List.txt and the ResearchPapers directory containing all the documents are in your current working directory.
- The scripts can be run from an IDE or from the terminal.
- Run `nltk.download('punkt')` in a Python shell to download the tokenizer (see the snippet after these steps).
- Run the `weight_calculation.py` script first using `python weight_calculation.py` to create and save the weights.
- Run `python -m streamlit run Homepage.py` in the terminal to launch the app.
- Navigate through different pages of the Streamlit GUI using the sidebar.
- To close the app, close the browser tab and stop the Streamlit server with Ctrl+C in the terminal.
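The tokenizer download from the steps above can be done once in a Python shell:

```python
# One-time download of the NLTK 'punkt' tokenizer data used by the scripts.
import nltk

nltk.download('punkt')
```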
- Clone the repository: `git clone https://github.com/zohaibterminator/text-analysis.git`
- Install the required libraries: `pip install -r requirements.txt`
- Run the Streamlit app: `python -m streamlit run Homepage.py`
- Open the app in your browser and explore the text classification and text clustering functionalities.
This project is licensed under the MIT License.
- This project was inspired by information retrieval and machine learning concepts.
- Special thanks to the developers of NLTK for providing essential natural language processing tools.
- Special thanks to the developers of Streamlit, which was used to build the app's GUI.