Threshold-based Correlation Clustering (TBCC) Algorithm

This repository contains the research paper "Optimizing Document Clustering through Correlation-Driven Cluster Formation", which presents a correlation-driven approach to optimizing document clustering. It also contains the script used to test the accuracy of the clustering algorithm.

Introduction

This repository contains the implementation of the Threshold-based Correlation Clustering (TBCC) algorithm for document clustering. The algorithm optimizes clusters dynamically based on correlation thresholds, enhancing clustering accuracy and efficiency.

Methodology

Data Preprocessing: Converts raw text data into TF-IDF vectors, including tokenization, lemmatization, and stop word removal.
Correlation Matrix Calculation: Uses Spearman's Rank Correlation Coefficient to analyze semantic relationships.
Cluster Formation: Identifies cohesive semantic groups based on correlation values.
Cluster Optimization: Merges clusters iteratively using the Jaccard Coefficient to improve cohesion.
Cluster Refinement: Refines clusters by reassigning unassigned columns.
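
The steps above can be sketched on a toy corpus as follows. This is a minimal illustration, not the code in main.py. The example documents, the helper names `seed_groups`, `jaccard` and `merge_groups`, the 0.5 merge cut-off, and the choice to correlate documents with each other are all assumptions made for the example. Lemmatization and the refinement step are omitted.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): two biotechnology and two DBMS sentences.
docs = [
    "dna cloning uses plasmid vectors",
    "plasmid vectors carry dna for cloning",
    "a primary key identifies a table row",
    "a foreign key links a table row to a primary key",
]

# Data preprocessing: TF-IDF with stop word removal (lemmatization omitted).
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Correlation matrix: Spearman's rank correlation between documents.
# axis=1 treats each row (document) as one variable (an assumption here).
corr, _ = spearmanr(tfidf, axis=1)


def seed_groups(corr, threshold):
    """One candidate group per document: all documents correlated at or above the threshold."""
    return [set(np.flatnonzero(row >= threshold).tolist()) for row in corr]


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two non-empty sets."""
    return len(a & b) / len(a | b)


def merge_groups(groups, min_jaccard=0.5):
    """Merge pairs of groups until no pair has a Jaccard coefficient >= min_jaccard."""
    groups = [set(g) for g in groups]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if jaccard(groups[i], groups[j]) >= min_jaccard:
                    other = groups.pop(j)
                    groups[i] |= other
                    merged = True
                    break
            if merged:
                break
    return groups


# Threshold 0.25 is one of the values reported in this README; 0.5 is illustrative.
clusters = merge_groups(seed_groups(corr, threshold=0.25))
print(clusters)
```

Each document starts with its own candidate group, so overlapping groups collapse into one cluster during merging while unrelated groups stay separate.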

Experiments

Dataset: 200 questions on Biotechnology, DBMS, Networking, and Climate Change (arbitrary datasets).
Tools: Python, Scikit-learn, NumPy, Pandas, NLTK.
Parameter Settings: Threshold values varied from 0.2 to 0.3 in increments of 0.01. (This hyperparameter must be set for efficient clustering.)
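
As a rough illustration of such a sweep, the sketch below generates the eleven threshold values and counts groups at each one. The correlation matrix is made up, and `count_groups` is a hypothetical stand-in that simply links documents whose correlation reaches the threshold. It is not the TBCC cluster count.

```python
def count_groups(corr, threshold):
    """Count connected groups when documents with correlation >= threshold are linked.

    Stand-in for illustration only; TBCC forms clusters differently.
    """
    unseen = set(range(len(corr)))
    groups = 0
    while unseen:
        stack = [unseen.pop()]  # start a new group from any unvisited document
        groups += 1
        while stack:
            i = stack.pop()
            linked = {j for j in unseen if corr[i][j] >= threshold}
            unseen -= linked
            stack.extend(linked)
    return groups


# 0.20, 0.21, ..., 0.30, built from integers to avoid floating-point drift.
thresholds = [t / 100 for t in range(20, 31)]

# Made-up symmetric correlation matrix for four documents.
corr = [
    [1.00, 0.28, 0.05, 0.05],
    [0.28, 1.00, 0.05, 0.05],
    [0.05, 0.05, 1.00, 0.22],
    [0.05, 0.05, 0.22, 1.00],
]

for t in thresholds:
    print(f"threshold={t:.2f} groups={count_groups(corr, t)}")
```

Raising the threshold removes weaker links, so the number of groups can only stay the same or grow across the sweep.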

Number of Clusters vs Threshold Value

Identification of Optimum Threshold Value

From this analysis, the optimum threshold value for the given input files is 0.23 or 0.25.

Results

Comparison Algorithms and Indices:
- Compared Algorithms: K-means, Affinity Propagation, Gaussian Mixture Model, and Agglomerative Clustering.
- Comparison Indices: Silhouette Score, Calinski-Harabasz Score, Davies-Bouldin Index, Adjusted Rand Score, and Normalized Mutual Information Score.
Accuracy Analysis:
- Our TBCC algorithm achieved the following scores:
  - Silhouette Score: 0.2445
  - Calinski-Harabasz Score: 122.0241
  - Davies-Bouldin Index: 0.8524
  - Adjusted Rand Score: 0.8721
  - Normalized Mutual Information Score: 0.8694
Experiment Results: For a better evaluation of the algorithm, the clustering algorithms are compared in two settings:
- Comparative analysis when the number of clusters is set to the number of clusters obtained at the optimum threshold value (T=0.25)
- Comparative analysis when the number of clusters is set to the number of topics (N=4)
Overall Performance:
- The TBCC algorithm demonstrated superior performance across multiple indices, forming more cohesive and accurate clusters compared to traditional clustering methods. The dynamic adaptation to semantic relationships and iterative optimization techniques contributed significantly to its enhanced clustering capability.
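
The five indices listed above are all available in scikit-learn. The sketch below shows one way to compute them for a set of cluster labels. The synthetic data, the `evaluate` helper, and the use of K-means as the example algorithm are assumptions for illustration and do not reproduce the reported numbers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    normalized_mutual_info_score,
    silhouette_score,
)


def evaluate(features, predicted, truth):
    """Compute the five comparison indices for one clustering result."""
    return {
        # Internal indices: need only the feature vectors and predicted labels.
        "silhouette": silhouette_score(features, predicted),
        "calinski_harabasz": calinski_harabasz_score(features, predicted),
        "davies_bouldin": davies_bouldin_score(features, predicted),
        # External indices: compare predicted labels with known topic labels.
        "adjusted_rand": adjusted_rand_score(truth, predicted),
        "nmi": normalized_mutual_info_score(truth, predicted),
    }


# Synthetic stand-in data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0.0, 0.1, (10, 2)),
    rng.normal(5.0, 0.1, (10, 2)),
])
truth = [0] * 10 + [1] * 10

# Example algorithm with the number of clusters fixed in advance.
predicted = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

scores = evaluate(features, predicted, truth)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Higher is better for the Silhouette and Calinski-Harabasz scores, lower is better for the Davies-Bouldin Index, and the Adjusted Rand and Normalized Mutual Information scores require ground-truth topic labels.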

Conclusion

The TBCC algorithm effectively forms semantically cohesive clusters, adapting dynamically to varying document relationships. Future work includes exploring automated threshold selection for improved performance.

Future Discussion

Automated Threshold Selection: Developing methods to automatically select optimal correlation thresholds.
Scalability: Enhancing the algorithm's scalability for larger datasets.
Application Domains: Applying the algorithm to various domains like data science, machine learning, and information retrieval.

Limitations

Threshold Sensitivity: Performance is sensitive to the choice of correlation threshold.
Computational Complexity: May require significant computational resources for large datasets.

Testing the Algorithm

To test the TBCC algorithm using the provided main.py file:

Clone the repository:
```
git clone https://github.com/imsuraj675/Clustering-Algorithm/
```
Move to the Clustering-Algorithm directory:
```
cd Clustering-Algorithm
```
Ensure all dependencies are installed.
- Install virtualenv (skip this step if it is already installed)
```
pip install virtualenv
```
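- Create a virtual environment using the following command, then activate it before installing the requirements (for example, `source env/bin/activate` on Linux or macOS, or `env\Scripts\activate` on Windows).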
```
virtualenv env
```
- Install requirements
```
pip install -r requirements.txt
```
- Run install_req.py
```
python install_req.py
```
Put all the test document files inside the Input directory.
Run the main.py script:
```
python main.py
```
