Merge pull request #566 from adi271001/PubMed-Text
PubMed Text Classification MeSH Model
abhisheks008 authored Feb 5, 2024
2 parents 931a47b + bd78fec commit b0f9006
Showing 28 changed files with 129,657 additions and 0 deletions.
33 changes: 33 additions & 0 deletions PubMed Text Classification/Dataset/README.md
@@ -0,0 +1,33 @@
# PubMed Multi-label Text Classification Dataset

This dataset consists of approximately 50,000 research articles sourced from the PubMed repository. The documents in this collection were manually annotated by biomedical experts with MeSH (Medical Subject Headings) labels, with each article described by 10-15 MeSH labels.

## Dataset Details

- **Original Annotation:** Biomedical experts manually annotated the documents with MeSH labels.
- **Label Space:** The original dataset had a large number of MeSH labels, resulting in an extensive output space and severe label sparsity issues.
- **MeSH Major Labels:** Each article is annotated with MeSH major labels, reducing the label space and addressing label sparsity.

## Data Processing and Label Mapping

To overcome the challenges of an extremely large output space and severe label sparsity, the labels have been processed and mapped to their root categories. The following steps were taken:

1. **Label Reduction:** The original MeSH labels were reduced to their major categories.
2. **Root Mapping:** The major labels were mapped to their corresponding root categories to simplify the output space (see the sketch below).
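
As a rough illustration of step 2, the sketch below collapses fine-grained MeSH major labels into root categories with a lookup table. The `MESH_ROOTS` dictionary and the example labels are hypothetical placeholders, not the actual mapping used for this dataset.

```python
# Illustrative only: collapse MeSH major labels into root categories.
# MESH_ROOTS is a hypothetical excerpt, not the real mapping used here.
MESH_ROOTS = {
    "Neoplasms": "C",                  # Diseases
    "Brain": "A",                      # Anatomy
    "Polymerase Chain Reaction": "E",  # Analytical/Diagnostic Techniques
}

def to_root_labels(mesh_labels):
    """Map a list of MeSH major labels to the sorted set of root categories."""
    return sorted({MESH_ROOTS[label] for label in mesh_labels if label in MESH_ROOTS})

print(to_root_labels(["Neoplasms", "Brain", "Unknown Term"]))  # ['A', 'C']
```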

## Label Hierarchy

The MeSH major labels in the dataset have been organized in a hierarchical structure, allowing for a more structured and interpretable representation of the biomedical concepts.

## Data Statistics

- **Number of Articles:** 50,000
- **MeSH Labels:** Originally 10-15 per article, reduced to major labels.
- **Root Labels:** Reduced and mapped categories for more manageable classification.


## Acknowledgments

I extend my gratitude to Kaggle and to the biomedical experts who manually annotated the documents and contributed this dataset.


Binary file added PubMed Text Classification/Images/pmt1.PNG
Binary file added PubMed Text Classification/Images/pmt10.PNG
Binary file added PubMed Text Classification/Images/pmt2.PNG
Binary file added PubMed Text Classification/Images/pmt3.PNG
Binary file added PubMed Text Classification/Images/pmt4.PNG
Binary file added PubMed Text Classification/Images/pmt5.PNG
Binary file added PubMed Text Classification/Images/pmt6.PNG
Binary file added PubMed Text Classification/Images/pmt7.PNG
Binary file added PubMed Text Classification/Images/pmt8.PNG
Binary file added PubMed Text Classification/Images/pmt9.PNG
92 changes: 92 additions & 0 deletions PubMed Text Classification/Model/README.md
@@ -0,0 +1,92 @@
# PubMed Multi-label Text Classification

This repository contains code and models for multi-label text classification on the PubMed dataset using BioBERT, RoBERTa, and XLNet.

## Table of Contents

- [Dataset](#dataset)
  - [Dataset Analysis](#dataset-analysis)
- [Models and Accuracies](#models-and-accuracies)
- [Training Graphs of RoBERTa](#training-graphs-of-roberta)
- [Training Graphs of BioBERT](#training-graphs-of-biobert)
- [Training Graphs of XLNet](#training-graphs-of-xlnet)
- [Results](#results)
- [Conclusion](#conclusion)
- [Acknowledgments](#acknowledgments)


## Dataset

The dataset used in this project is available on Kaggle: [PubMed Multi-label Text Classification](https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification)

### Dataset Analysis

Class B has the highest number of articles, as shown in the bar chart below:

![Class Distribution](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt1.PNG)
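
For orientation, a minimal sketch of how such a per-class count can be reproduced from the Kaggle CSV is shown below, assuming the processed file stores one binary column per root label; the file name and the single-character column heuristic are assumptions about the dataset layout, not the exact analysis code.

```python
# Sketch: plot article counts per MeSH root label from the Kaggle CSV.
# The file name and the single-letter label columns are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("PubMed Multi Label Text Classification Dataset Processed.csv")
label_cols = [c for c in df.columns if len(c) == 1]  # root labels such as "A", "B", "C", ...

counts = df[label_cols].sum().sort_values(ascending=False)
counts.plot(kind="bar")
plt.ylabel("Number of articles")
plt.title("Articles per MeSH root label")
plt.tight_layout()
plt.show()
```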

## Models and Accuracies

- **BioBERT:** Achieved an accuracy of 87%
- **RoBERTa:** Achieved an accuracy of 84%
- **XLNet:** Achieved an accuracy of 85.5%

BioBERT outperformed the other models, demonstrating its effectiveness in handling biomedical text data.
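
This README does not spell out the fine-tuning setup, but a typical Hugging Face configuration for this kind of multi-label task looks like the sketch below. The checkpoint names, the 0.5 decision threshold, and the tokenization details are assumptions rather than the exact settings used here; `num_labels=14` matches the `id2label` map in `Results/config.json`.

```python
# Sketch: a multi-label classification head on a pre-trained biomedical encoder.
# Checkpoint choice and the 0.5 decision threshold are assumptions, not the
# exact training setup used in this repository.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # or "roberta-base" / "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=14,                              # one logit per MeSH root label
    problem_type="multi_label_classification",  # BCEWithLogitsLoss under the hood
)

batch = tokenizer(
    ["Effect of aspirin on cardiovascular outcomes in diabetic patients."],
    return_tensors="pt", truncation=True, padding=True,
)
with torch.no_grad():
    logits = model(**batch).logits
predictions = (torch.sigmoid(logits) > 0.5).int()  # independent 0/1 flag per label
```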

## Training Graphs of RoBERTa

### Training Loss vs Number of Epochs

![Training Loss](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt2.PNG)

### F1 Validation Accuracy vs Number of Epochs

![F1 Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt3.PNG)

### Flat Validation Accuracy vs Number of Epochs

![Flat Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt4.PNG)


## Training Graphs of BioBERT

### Training Loss vs Number of Epochs

![Training Loss](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt5.PNG)

### F1 Validation Accuracy vs Number of Epochs

![F1 Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt6.PNG)

### Flat Validation Accuracy vs Number of Epochs

![Flat Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt7.PNG)


## Training Graphs of XLNet

### Training Loss vs Number of Epochs

![Training Loss](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt8.PNG)

### F1 Validation Accuracy vs Number of Epochs

![F1 Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt9.PNG)

### Flat Validation Accuracy vs Number of Epochs

![Flat Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt10.PNG)

## Results
The trained models, evaluation results, classification reports, and additional details can be found in the `Results` directory.
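
The per-class CSVs in `Results/` follow scikit-learn's classification-report layout (one row per label plus micro/macro/weighted/samples averages). A minimal sketch of how such a file can be exported from binarized predictions is shown below; `y_true` and `y_pred` are random placeholders standing in for real labels and model outputs.

```python
# Sketch: export a per-label precision/recall/F1 report like the CSVs in Results/.
# y_true and y_pred are random placeholders with shape (num_samples, num_labels).
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 14))
y_pred = rng.integers(0, 2, size=(100, 14))

report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
pd.DataFrame(report).transpose().to_csv("Classification_Report.csv")
```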

## Conclusion

In conclusion, this project successfully tackled the PubMed multi-label text classification problem using BioBERT, RoBERTa, and XLNet. BioBERT emerged as the most effective model, achieving the highest accuracy. The dataset analysis revealed Class B as the category with the highest number of articles.

The training graphs provide insights into the models' learning process, illustrating the reduction in training loss and the improvement in F1 and flat validation accuracies over epochs.

## Acknowledgments

Thanks to Kaggle for providing the dataset, and to the maintainers and IWOC for this opportunity.


31 changes: 31 additions & 0 deletions PubMed Text Classification/README.md
@@ -0,0 +1,31 @@
# PubMed Multi-label Text Classification Project Overview

## Overview

This project focuses on multi-label text classification of biomedical research articles sourced from the PubMed repository. The dataset consists of approximately 50,000 documents, each manually annotated with MeSH (Medical Subject Headings) labels by biomedical experts. The primary challenge addressed in this project is the extensive output space and severe label sparsity of the original dataset.

## Key Features

- **Dataset:** 50,000 research articles from PubMed, annotated with MeSH labels.
- **Label Reduction:** Original MeSH labels reduced to major categories.
- **Root Mapping:** Major labels mapped to corresponding root categories for a simplified output space.
- **Models:** Utilized BioBERT, RoBERTa, and XLNet for multi-label text classification.
- **Results:** BioBERT demonstrated the highest accuracy at 87%, outperforming RoBERTa and XLNet.

## Dataset Processing

The dataset underwent preprocessing to address label sparsity and reduce the output space. Major MeSH labels were retained, and a hierarchical structure was introduced through root mapping for better interpretability.

## Model Training

- Three state-of-the-art pre-trained language models were employed: BioBERT, RoBERTa, and XLNet.
- Training involved optimization for multi-label classification, with a focus on improving the F1 and flat validation accuracies (see the metric sketch after this list).
- Graphs depicting training loss, F1 validation accuracy, and flat validation accuracy over epochs are available in the `Results` directory.
- The training run charts can also be viewed in the Weights & Biases dashboard: https://wandb.ai/ai-guru/Multi%20Label%20Classification%20of%20PubMed%20Articles%20%28Paper%20Night%20Presentation%29
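
A rough sketch of the two validation metrics referred to above is shown here. "Flat" accuracy is interpreted as element-wise accuracy over the flattened label matrix and F1 is computed with micro averaging; the exact definitions used in the training notebook may differ.

```python
# Sketch: micro-averaged F1 and "flat" (element-wise) accuracy for
# binarized multi-label predictions. The notebook's exact metric
# definitions may differ from these.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([[1, 0, 1], [0, 1, 1]])  # toy ground-truth label matrix
y_pred = np.array([[1, 0, 0], [0, 1, 1]])  # toy thresholded predictions

f1_micro = f1_score(y_true, y_pred, average="micro")
flat_acc = accuracy_score(y_true.ravel(), y_pred.ravel())
print(f"F1 (micro): {f1_micro:.3f}, flat accuracy: {flat_acc:.3f}")
```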

## Conclusion

BioBERT emerged as the most effective model, achieving an accuracy of 87%. The dataset analysis revealed Class B as the category with the highest number of articles. Training graphs illustrate the models' learning processes and performance improvements over epochs.



19 changes: 19 additions & 0 deletions PubMed Text Classification/Results/Classification_Report.csv
@@ -0,0 +1,19 @@
precision,recall,f1-score,support
0.7848658161020677,0.774137556953786,0.7794647733478972,4609.0
0.9580756734094958,0.9882162162162162,0.9729125645255707,9250.0
0.8690929878048781,0.8761044948136766,0.8725846565907788,5206.0
0.9108230516945182,0.9317782393353571,0.9211814879166008,6259.0
0.8145511180331516,0.9413731036256107,0.8733822389216914,7778.0
0.8653555219364599,0.6474250141482739,0.7406927808352218,1767.0
0.8226287262872629,0.892925430210325,0.8563368361661613,6799.0
0.6289308176100629,0.0819000819000819,0.14492753623188406,1221.0
0.7114285714285714,0.46629213483146065,0.5633484162895926,1068.0
0.7222222222222222,0.43333333333333335,0.5416666666666666,1110.0
0.751412429378531,0.3568075117370892,0.48385629831741706,1491.0
0.8826775431861804,0.869328922495274,0.8759523809523809,4232.0
0.8332532436328688,0.7535853976531942,0.7914194431766316,4602.0
0.749098774333093,0.6668806161745828,0.7056027164685909,1558.0
0.8565701800321421,0.8329411764705882,0.8445904441417621,56950.0
0.807458321218526,0.6914348609591616,0.7230949140290776,56950.0
0.8489845853503445,0.8329411764705882,0.8318580249612143,56950.0
0.8582284054834055,0.8397756632256632,0.8357095510578174,56950.0
@@ -0,0 +1,19 @@
precision,recall,f1-score,support
0.7828489562013917,0.8298980256020829,0.8056872037914693,4609.0
0.9597821533305404,0.9907027027027027,0.9749973401425683,9250.0
0.8788796366389099,0.892047637341529,0.8854146806482364,5206.0
0.9180457052797478,0.9306598498162646,0.9243097429387496,6259.0
0.8040067839728641,0.9751864232450501,0.881361840576342,7778.0
0.8503401360544217,0.7074136955291455,0.772320049428483,1767.0
0.8333104678282344,0.8933666715693485,0.8622941510505395,6799.0
0.5815602836879432,0.13431613431613432,0.21823020625415834,1221.0
0.7167070217917676,0.5543071161048689,0.6251319957761353,1068.0
0.7475,0.5387387387387388,0.6261780104712042,1110.0
0.7105809128630706,0.45942320590207913,0.5580448065173116,1491.0
0.8770189201661283,0.8981568998109641,0.8874620593042261,4232.0
0.8198771889924948,0.7833550630160799,0.8012001333481498,4602.0
0.7543103448275862,0.6739409499358151,0.7118644067796611,1558.0
0.8534698083876264,0.8579806848112379,0.8557193019325574,56950.0
0.8024834651167928,0.7329652224022002,0.7524640447876597,56950.0
0.847304453554072,0.8579806848112379,0.8467545343162335,56950.0
0.8554871608946609,0.8634571867021866,0.8469379506913796,56950.0
19 changes: 19 additions & 0 deletions PubMed Text Classification/Results/Classification_Report_XLNET.csv
@@ -0,0 +1,19 @@
precision,recall,f1-score,support
0.8322997416020672,0.6988500759383814,0.7597594055902818,4609.0
0.9641336739037889,0.9793513513513513,0.971682934677679,9250.0
0.8563110443275732,0.8757203227045717,0.8659069325735992,5206.0
0.9193157979667581,0.9102092986100015,0.9147398843930635,6259.0
0.8097951133998028,0.9502442787348933,0.8744158532978408,7778.0
0.7568306010928961,0.7838143746462931,0.7700861829302196,1767.0
0.836130867709815,0.8645389027798206,0.8500976209414997,6799.0
0.5080128205128205,0.2596232596232596,0.3436314363143631,1221.0
0.6659815005138746,0.6067415730337079,0.6349828515433612,1068.0
0.662148070907195,0.5720720720720721,0.6138231029482841,1110.0
0.5675487465181058,0.5466130114017438,0.5568841817560641,1491.0
0.8777540867093105,0.8754725897920604,0.8766118537797231,4232.0
0.8040313549832027,0.7800956106040852,0.7918826513731112,4602.0
0.7618699780861943,0.6694480102695763,0.7126750939528528,1558.0
0.8437549497544922,0.8418437225636524,0.8427982526302837,56950.0
0.7730116713023861,0.7409139093972729,0.7526557132908531,56950.0
0.8393603363514722,0.8418437225636524,0.8382552404240644,56950.0
0.8474893975468977,0.8457395340770341,0.8317706467175531,56950.0
59 changes: 59 additions & 0 deletions PubMed Text Classification/Results/config.json
@@ -0,0 +1,59 @@
{
"_name_or_path": "distilroberta-base",
"architectures": [
"RobertaForSequenceClassification"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2",
"3": "LABEL_3",
"4": "LABEL_4",
"5": "LABEL_5",
"6": "LABEL_6",
"7": "LABEL_7",
"8": "LABEL_8",
"9": "LABEL_9",
"10": "LABEL_10",
"11": "LABEL_11",
"12": "LABEL_12",
"13": "LABEL_13"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_10": 10,
"LABEL_11": 11,
"LABEL_12": 12,
"LABEL_13": 13,
"LABEL_2": 2,
"LABEL_3": 3,
"LABEL_4": 4,
"LABEL_5": 5,
"LABEL_6": 6,
"LABEL_7": 7,
"LABEL_8": 8,
"LABEL_9": 9
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 6,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.24.0",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 50265
}
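
A note on the configuration above: `_name_or_path` points to a `distilroberta-base` backbone with 14 output labels. If needed, the file can be inspected programmatically as sketched below; the local path is an assumption about where the file sits after cloning the repository, and the saved model weights themselves are not part of this commit.

```python
# Sketch: inspect the saved configuration with transformers.
# The path assumes the repository layout of this commit; model weights
# are not included, so only the config (not the model) is loaded here.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("PubMed Text Classification/Results/config.json")
print(config.model_type)         # "roberta"
print(config.num_labels)         # 14
print(config.num_hidden_layers)  # 6 (distilled backbone)
```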