PubMed Text Classification MeSH Model #566

Merged 8 commits on Feb 5, 2024
33 changes: 33 additions & 0 deletions PubMed Text Classification/Dataset/README.md
@@ -0,0 +1,33 @@
# PubMed Multi-label Text Classification Dataset

This dataset consists of approximately 50,000 research articles sourced from the PubMed repository. The documents were manually annotated by biomedical experts with MeSH (Medical Subject Headings) labels; each article carries 10-15 MeSH labels.

## Dataset Details

- **Original Annotation:** Biomedical experts manually annotated the documents with MeSH labels.
- **Label Space:** The original dataset had a very large number of distinct MeSH labels, resulting in an extensive output space and severe label sparsity.
- **MeSH Major Labels:** Each article is annotated with its MeSH major labels, which reduces the label space and mitigates sparsity.

## Data Processing and Label Mapping

To overcome the challenges of an extremely large output space and severe label sparsity, the dataset has undergone processing and mapping to its root labels. The following steps were taken:

1. **Label Reduction:** The original MeSH labels were reduced to their major categories.
2. **Root Mapping:** The major labels were mapped to their corresponding root categories to simplify the output space.
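The two steps above can be sketched as follows. This is a minimal illustration, not the preprocessing code from the repository: the `to_root` helper and the sample tree numbers are hypothetical, based on the MeSH convention that a descriptor's tree number (e.g. `C04.557`) begins with a letter identifying its root category.

```python
# Hypothetical sketch of reducing MeSH tree numbers to root categories.
# MeSH tree numbers look like "C04.557.470"; the leading letter names the
# root branch of the hierarchy (e.g. "B" = Organisms, "C" = Diseases).
def to_root(tree_number: str) -> str:
    """Reduce a full MeSH tree number to its root-category letter."""
    return tree_number[0]

article_labels = ["C04.557", "B01.050.150", "C14.280"]
root_labels = sorted(set(to_root(t) for t in article_labels))
print(root_labels)  # a much smaller output space: ['B', 'C']
```

Collapsing 10-15 fine-grained labels per article down to a handful of root categories is what makes the output space tractable.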

## Label Hierarchy

The MeSH major labels in the dataset have been organized in a hierarchical structure, allowing for a more structured and interpretable representation of the biomedical concepts.

## Data Statistics

- **Number of Articles:** 50,000
- **MeSH Labels:** Originally 10-15 per article, reduced to major labels.
- **Root Labels:** Reduced and mapped categories for more manageable classification.


## Acknowledgments

Thanks to the biomedical experts who manually annotated the documents, and to Kaggle for hosting the dataset.


Binary file added PubMed Text Classification/Images/pmt1.PNG
Binary file added PubMed Text Classification/Images/pmt10.PNG
Binary file added PubMed Text Classification/Images/pmt2.PNG
Binary file added PubMed Text Classification/Images/pmt3.PNG
Binary file added PubMed Text Classification/Images/pmt4.PNG
Binary file added PubMed Text Classification/Images/pmt5.PNG
Binary file added PubMed Text Classification/Images/pmt6.PNG
Binary file added PubMed Text Classification/Images/pmt7.PNG
Binary file added PubMed Text Classification/Images/pmt8.PNG
Binary file added PubMed Text Classification/Images/pmt9.PNG
92 changes: 92 additions & 0 deletions PubMed Text Classification/Model/README.md
@@ -0,0 +1,92 @@
# PubMed Multi-label Text Classification

This repository contains code and models for multi-label text classification on the PubMed dataset using BioBERT, RoBERTa, and XLNet.

## Table of Contents

- [Dataset](#dataset)
- [Dataset Analysis](#dataset-analysis)
- [Models and Accuracies](#models-and-accuracies)
- [Training Graphs](#training-graphs)
- [Training Loss vs Number of Epochs](#training-loss-vs-number-of-epochs)
- [F1 Validation Accuracy vs Number of Epochs](#f1-validation-accuracy-vs-number-of-epochs)
- [Flat Validation Accuracy vs Number of Epochs](#flat-validation-accuracy-vs-number-of-epochs)
- [Results](#results)
- [Conclusion](#conclusion)
- [Acknowledgments](#acknowledgments)


## Dataset

The dataset used in this project is available on Kaggle: [PubMed Multi-label Text Classification](https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification)

### Dataset Analysis

Class B has the highest number of articles, as shown in the bar chart below:

![Class Distribution](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt1.PNG)

## Models and Accuracies

- **BioBERT:** Achieved an accuracy of 87%
- **RoBERTa:** Achieved an accuracy of 84%
- **XLNet:** Achieved an accuracy of 85.5%

BioBERT outperformed the other models, demonstrating its effectiveness in handling biomedical text data.
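All three models treat the task as multi-label classification: each label receives an independent probability rather than competing in a softmax. A minimal sketch of the inference step (the 0.5 threshold is an assumption; the repository may tune it differently):

```python
import math

def predict_labels(logits, threshold=0.5):
    # Multi-label inference: apply a sigmoid to each logit independently,
    # then keep every class whose probability clears the threshold.
    # Unlike softmax, several classes (or none) can fire for one article.
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    return [i for i, p in enumerate(probs) if p >= threshold]

print(predict_labels([2.1, -0.3, 0.8]))  # -> [0, 2]
```

This per-label independence is why an article can legitimately carry several MeSH root categories at once.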

## Training Graphs of RoBERTa

### Training Loss vs Number of Epochs

![Training Loss](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt2.PNG)

### F1 Validation Accuracy vs Number of Epochs

![F1 Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt3.PNG)

### Flat Validation Accuracy vs Number of Epochs

![Flat Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt4.PNG)


## Training Graphs of BioBERT

### Training Loss vs Number of Epochs

![Training Loss](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt5.PNG)

### F1 Validation Accuracy vs Number of Epochs

![F1 Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt6.PNG)

### Flat Validation Accuracy vs Number of Epochs

![Flat Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt7.PNG)


## Training Graphs of XLNet

### Training Loss vs Number of Epochs

![Training Loss](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt8.PNG)

### F1 Validation Accuracy vs Number of Epochs

![F1 Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt9.PNG)

### Flat Validation Accuracy vs Number of Epochs

![Flat Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt10.PNG)

## Results
The trained models, evaluation results, classification reports, and additional details can be found in the `Results` directory.

## Conclusion

In conclusion, this project successfully tackled the PubMed multi-label text classification problem using BioBERT, RoBERTa, and XLNet. BioBERT emerged as the most effective model, achieving the highest accuracy. The dataset analysis revealed Class B as the category with the highest number of articles.

The training graphs provide insight into each model's learning process, illustrating the reduction in training loss and the improvement in F1 and flat validation accuracies over epochs.

## Acknowledgments

Thanks to Kaggle for providing the dataset, and to the maintainers and IWOC for this opportunity.


31 changes: 31 additions & 0 deletions PubMed Text Classification/README.md
@@ -0,0 +1,31 @@
# PubMed Multi-label Text Classification Project Overview

## Overview

This project focuses on multi-label text classification of biomedical research articles sourced from the PubMed repository. The dataset consists of approximately 50,000 documents, each manually annotated with MeSH (Medical Subject Headings) labels by biomedical experts. The primary challenges addressed in this project are the extensive output space and severe label sparsity of the original dataset.

## Key Features

- **Dataset:** 50,000 research articles from PubMed, annotated with MeSH labels.
- **Label Reduction:** Original MeSH labels reduced to major categories.
- **Root Mapping:** Major labels mapped to corresponding root categories for a simplified output space.
- **Models:** Utilized BioBERT, RoBERTa, and XLNet for multi-label text classification.
- **Results:** BioBERT demonstrated the highest accuracy at 87%, outperforming RoBERTa and XLNet.

## Dataset Processing

The dataset underwent preprocessing to address label sparsity and reduce the output space. Major MeSH labels were retained, and a hierarchical structure was introduced through root mapping for better interpretability.

## Model Training

- Three state-of-the-art pre-trained language models were employed: BioBERT, RoBERTa, and XLNet.
- Training involved optimization for multi-label classification with a focus on improving F1 and flat validation accuracies.
- Graphs depicting training loss, F1 validation accuracy, and flat validation accuracy over epochs are available in the `results` directory.
- The training-run charts can be viewed on the Weights & Biases dashboard: https://wandb.ai/ai-guru/Multi%20Label%20Classification%20of%20PubMed%20Articles%20%28Paper%20Night%20Presentation%29
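The two validation metrics tracked during training can be defined roughly as follows. These definitions are assumptions (the repository may use macro rather than micro F1, or define flat accuracy differently):

```python
def flat_accuracy(y_true, y_pred):
    # Fraction of individual (article, label) decisions that are correct,
    # treating each label slot as an independent binary prediction.
    correct = sum(t == p for rt, rp in zip(y_true, y_pred)
                  for t, p in zip(rt, rp))
    total = sum(len(rt) for rt in y_true)
    return correct / total

def micro_f1(y_true, y_pred):
    # Micro-averaged F1: pool true positives, false positives, and
    # false negatives across every label before computing the score.
    tp = fp = fn = 0
    for rt, rp in zip(y_true, y_pred):
        for t, p in zip(rt, rp):
            tp += t == 1 and p == 1
            fp += t == 0 and p == 1
            fn += t == 1 and p == 0
    return 2 * tp / (2 * tp + fp + fn)

y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
print(flat_accuracy(y_true, y_pred))  # 5 of 6 label decisions correct
print(micro_f1(y_true, y_pred))       # 0.8
```

Flat accuracy is easily inflated by the many absent labels per article, which is why F1 is the more informative of the two curves for sparse multi-label data.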

## Conclusion

BioBERT emerged as the most effective model, achieving an accuracy of 87%. The dataset analysis revealed Class B as the category with the highest number of articles. Training graphs illustrate the models' learning processes and performance improvements over epochs.



19 changes: 19 additions & 0 deletions PubMed Text Classification/Results/Classification_Report.csv
@@ -0,0 +1,19 @@
precision,recall,f1-score,support
0.7848658161020677,0.774137556953786,0.7794647733478972,4609.0
0.9580756734094958,0.9882162162162162,0.9729125645255707,9250.0
0.8690929878048781,0.8761044948136766,0.8725846565907788,5206.0
0.9108230516945182,0.9317782393353571,0.9211814879166008,6259.0
0.8145511180331516,0.9413731036256107,0.8733822389216914,7778.0
0.8653555219364599,0.6474250141482739,0.7406927808352218,1767.0
0.8226287262872629,0.892925430210325,0.8563368361661613,6799.0
0.6289308176100629,0.0819000819000819,0.14492753623188406,1221.0
0.7114285714285714,0.46629213483146065,0.5633484162895926,1068.0
0.7222222222222222,0.43333333333333335,0.5416666666666666,1110.0
0.751412429378531,0.3568075117370892,0.48385629831741706,1491.0
0.8826775431861804,0.869328922495274,0.8759523809523809,4232.0
0.8332532436328688,0.7535853976531942,0.7914194431766316,4602.0
0.749098774333093,0.6668806161745828,0.7056027164685909,1558.0
0.8565701800321421,0.8329411764705882,0.8445904441417621,56950.0
0.807458321218526,0.6914348609591616,0.7230949140290776,56950.0
0.8489845853503445,0.8329411764705882,0.8318580249612143,56950.0
0.8582284054834055,0.8397756632256632,0.8357095510578174,56950.0
@@ -0,0 +1,19 @@
precision,recall,f1-score,support
0.7828489562013917,0.8298980256020829,0.8056872037914693,4609.0
0.9597821533305404,0.9907027027027027,0.9749973401425683,9250.0
0.8788796366389099,0.892047637341529,0.8854146806482364,5206.0
0.9180457052797478,0.9306598498162646,0.9243097429387496,6259.0
0.8040067839728641,0.9751864232450501,0.881361840576342,7778.0
0.8503401360544217,0.7074136955291455,0.772320049428483,1767.0
0.8333104678282344,0.8933666715693485,0.8622941510505395,6799.0
0.5815602836879432,0.13431613431613432,0.21823020625415834,1221.0
0.7167070217917676,0.5543071161048689,0.6251319957761353,1068.0
0.7475,0.5387387387387388,0.6261780104712042,1110.0
0.7105809128630706,0.45942320590207913,0.5580448065173116,1491.0
0.8770189201661283,0.8981568998109641,0.8874620593042261,4232.0
0.8198771889924948,0.7833550630160799,0.8012001333481498,4602.0
0.7543103448275862,0.6739409499358151,0.7118644067796611,1558.0
0.8534698083876264,0.8579806848112379,0.8557193019325574,56950.0
0.8024834651167928,0.7329652224022002,0.7524640447876597,56950.0
0.847304453554072,0.8579806848112379,0.8467545343162335,56950.0
0.8554871608946609,0.8634571867021866,0.8469379506913796,56950.0
19 changes: 19 additions & 0 deletions PubMed Text Classification/Results/Classification_Report_XLNET.csv
@@ -0,0 +1,19 @@
precision,recall,f1-score,support
0.8322997416020672,0.6988500759383814,0.7597594055902818,4609.0
0.9641336739037889,0.9793513513513513,0.971682934677679,9250.0
0.8563110443275732,0.8757203227045717,0.8659069325735992,5206.0
0.9193157979667581,0.9102092986100015,0.9147398843930635,6259.0
0.8097951133998028,0.9502442787348933,0.8744158532978408,7778.0
0.7568306010928961,0.7838143746462931,0.7700861829302196,1767.0
0.836130867709815,0.8645389027798206,0.8500976209414997,6799.0
0.5080128205128205,0.2596232596232596,0.3436314363143631,1221.0
0.6659815005138746,0.6067415730337079,0.6349828515433612,1068.0
0.662148070907195,0.5720720720720721,0.6138231029482841,1110.0
0.5675487465181058,0.5466130114017438,0.5568841817560641,1491.0
0.8777540867093105,0.8754725897920604,0.8766118537797231,4232.0
0.8040313549832027,0.7800956106040852,0.7918826513731112,4602.0
0.7618699780861943,0.6694480102695763,0.7126750939528528,1558.0
0.8437549497544922,0.8418437225636524,0.8427982526302837,56950.0
0.7730116713023861,0.7409139093972729,0.7526557132908531,56950.0
0.8393603363514722,0.8418437225636524,0.8382552404240644,56950.0
0.8474893975468977,0.8457395340770341,0.8317706467175531,56950.0
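The reports above share a common four-column layout: per-class rows followed by averaged rows (the ones with support 56950.0). A short sketch of parsing one with the standard library, using an inline sample in place of the real file path:

```python
import csv
import io

# Inline sample mirroring the report layout; the real files live under
# "PubMed Text Classification/Results/".
sample = (
    "precision,recall,f1-score,support\n"
    "0.8322,0.6988,0.7597,4609.0\n"
    "0.9641,0.9793,0.9716,9250.0\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))
# Pick the class row with the highest F1 score.
best = max(rows, key=lambda r: float(r["f1-score"]))
print(best["f1-score"], best["support"])  # 0.9716 9250.0
```

One caveat of this layout: the rows carry no class names, so readers must rely on row order to match classes across the three reports.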
59 changes: 59 additions & 0 deletions PubMed Text Classification/Results/config.json
@@ -0,0 +1,59 @@
{
"_name_or_path": "distilroberta-base",
"architectures": [
"RobertaForSequenceClassification"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2",
"3": "LABEL_3",
"4": "LABEL_4",
"5": "LABEL_5",
"6": "LABEL_6",
"7": "LABEL_7",
"8": "LABEL_8",
"9": "LABEL_9",
"10": "LABEL_10",
"11": "LABEL_11",
"12": "LABEL_12",
"13": "LABEL_13"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_10": 10,
"LABEL_11": 11,
"LABEL_12": 12,
"LABEL_13": 13,
"LABEL_2": 2,
"LABEL_3": 3,
"LABEL_4": 4,
"LABEL_5": 5,
"LABEL_6": 6,
"LABEL_7": 7,
"LABEL_8": 8,
"LABEL_9": 9
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 6,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.24.0",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 50265
}
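The `id2label` / `label2id` tables in this config cover 14 classes and are mutual inverses. They can be regenerated programmatically; this is a sketch of that shape, not the code that produced the config:

```python
import json

num_labels = 14  # 14 root categories in this label setup
id2label = {str(i): f"LABEL_{i}" for i in range(num_labels)}
label2id = {label: int(i) for i, label in id2label.items()}

# Serialize the fragment the same way the config file stores it.
config_fragment = json.dumps({"id2label": id2label, "label2id": label2id})
print(id2label["13"], label2id["LABEL_13"])  # LABEL_13 13
```

Keeping the two maps consistent matters because inference code typically maps predicted indices back to names via `id2label` while training data is encoded via `label2id`.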