Merge pull request #566 from adi271001/PubMed-Text
PubMed Text Classification MeSH Model
Showing 28 changed files with 129,657 additions and 0 deletions.
@@ -0,0 +1,33 @@
# PubMed Multi-label Text Classification Dataset

This dataset consists of approximately 50,000 research articles sourced from the PubMed repository. The documents in this collection were manually annotated with MeSH (Medical Subject Headings) labels by biomedical experts, and each article is described by 10-15 MeSH labels.
## Dataset Details

- **Original Annotation:** Biomedical experts manually annotated the documents with MeSH labels.
- **Label Space:** The original dataset had a large number of MeSH labels, resulting in an extensive output space and severe label sparsity.
- **MeSH Major Labels:** Each article is annotated with MeSH major labels, which reduces the label space and addresses label sparsity.
## Data Processing and Label Mapping

To overcome the challenges of an extremely large output space and severe label sparsity, the dataset was processed and its labels were mapped to their root categories. The following steps were taken (a sketch of the idea follows the list):

1. **Label Reduction:** The original MeSH labels were reduced to their major categories.
2. **Root Mapping:** The major labels were mapped to their corresponding root categories to simplify the output space.
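Below is a minimal sketch of the label-reduction and root-mapping idea, assuming labels are available as MeSH tree numbers (e.g. `C04.557.470`) whose leading letter identifies the root category; the tree numbers and the `reduce_to_root` helper are illustrative, not taken from the project code.

```python
from collections import Counter

def reduce_to_root(tree_number: str) -> str:
    """Map a MeSH tree number such as 'C04.557.470' to its root category 'C'."""
    return tree_number.strip()[0].upper()

# Made-up per-article label lists for illustration.
article_labels = [
    ["C04.557.470", "D27.505.954", "E01.370.225"],
    ["C14.280.647", "C23.888.592"],
]

# Collapse each article's fine-grained labels into a set of root categories.
root_labels = [sorted({reduce_to_root(t) for t in labels}) for labels in article_labels]
print(root_labels)  # [['C', 'D', 'E'], ['C']]

# Root-level label frequency across the corpus.
print(Counter(root for roots in root_labels for root in roots))
```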
## Label Hierarchy

The MeSH major labels in the dataset have been organized in a hierarchical structure, allowing for a more structured and interpretable representation of the biomedical concepts.
## Data Statistics

- **Number of Articles:** ~50,000
- **MeSH Labels:** Originally 10-15 per article, reduced to major labels.
- **Root Labels:** Major labels mapped to root categories for more manageable classification.
## Acknowledgments

I extend my gratitude to Kaggle and to the biomedical experts who manually annotated the documents and contributed this dataset to Kaggle.
@@ -0,0 +1,92 @@
# PubMed Multi-label Text Classification

This repository contains code and models for multi-label text classification on the PubMed dataset using BioBERT, RoBERTa, and XLNet.
## Table of Contents

- [Dataset](#dataset)
  - [Dataset Analysis](#dataset-analysis)
- [Models and Accuracies](#models-and-accuracies)
- [Training Graphs of RoBERTa](#training-graphs-of-roberta)
  - [Training Loss vs Number of Epochs](#training-loss-vs-number-of-epochs)
  - [F1 Validation Accuracy vs Number of Epochs](#f1-validation-accuracy-vs-number-of-epochs)
  - [Flat Validation Accuracy vs Number of Epochs](#flat-validation-accuracy-vs-number-of-epochs)
- [Training Graphs of BioBERT](#training-graphs-of-biobert)
- [Training Graphs of XLNet](#training-graphs-of-xlnet)
- [Results](#results)
- [Conclusion](#conclusion)
- [Acknowledgments](#acknowledgments)
## Dataset

The dataset used in this project is available on Kaggle: [PubMed Multi-label Text Classification](https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification)
### Dataset Analysis

Class B has the highest number of articles, as shown in the bar chart below:

![Class Distribution](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt1.PNG)
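A minimal sketch of how such a class-distribution chart could be reproduced, assuming the processed Kaggle CSV exposes one binary indicator column per root label (the file name and column-detection rule below are illustrative, not taken from the notebook):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; adjust to the actual Kaggle CSV layout.
df = pd.read_csv("pubmed_multilabel_processed.csv")

# Assume root-label columns are single letters such as "A".."N".
label_columns = [c for c in df.columns if len(c) == 1 and c.isalpha()]

# Count how many articles carry each root label and plot the distribution.
counts = df[label_columns].sum().sort_values(ascending=False)
counts.plot(kind="bar", title="Articles per MeSH root label")
plt.xlabel("MeSH root label")
plt.ylabel("Number of articles")
plt.tight_layout()
plt.show()
```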
## Models and Accuracies

- **BioBERT:** Achieved an accuracy of 87%
- **RoBERTa:** Achieved an accuracy of 84%
- **XLNet:** Achieved an accuracy of 85.5%

BioBERT outperformed the other models, demonstrating its effectiveness in handling biomedical text data.
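The training notebook itself is not rendered in this diff ("Large diffs are not rendered by default"), so the following is only a minimal sketch of a Hugging Face multi-label fine-tuning setup consistent with the 14-label config shipped in this PR; the checkpoint name, hyperparameters, and example data are illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative BioBERT checkpoint; RoBERTa or XLNet checkpoints can be swapped in the same way.
model_name = "dmis-lab/biobert-base-cased-v1.1"
num_labels = 14  # root-level MeSH labels, matching the shipped config

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    problem_type="multi_label_classification",  # uses BCEWithLogitsLoss internally
)

texts = ["Association of gene expression with tumor progression ..."]
labels = torch.zeros((1, num_labels))
labels[0, [2, 4]] = 1.0  # made-up multi-hot target for the example

batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
outputs = model(**batch, labels=labels)
print(outputs.loss)                         # training loss for this batch
print(torch.sigmoid(outputs.logits) > 0.5)  # predicted label set at a 0.5 threshold
```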
## Training Graphs of RoBERTa

### Training Loss vs Number of Epochs

![Training Loss](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt2.PNG)

### F1 Validation Accuracy vs Number of Epochs

![F1 Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt3.PNG)

### Flat Validation Accuracy vs Number of Epochs

![Flat Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt4.PNG)
## Training Graphs of BioBERT

### Training Loss vs Number of Epochs

![Training Loss](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt5.PNG)

### F1 Validation Accuracy vs Number of Epochs

![F1 Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt6.PNG)

### Flat Validation Accuracy vs Number of Epochs

![Flat Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt7.PNG)
## Training Graphs of XLNet

### Training Loss vs Number of Epochs

![Training Loss](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt8.PNG)

### F1 Validation Accuracy vs Number of Epochs

![F1 Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt9.PNG)

### Flat Validation Accuracy vs Number of Epochs

![Flat Validation Accuracy](https://github.com/adi271001/ML-Crate/blob/PubMed-Text/PubMed%20Text%20Classification/Images/pmt10.PNG)
## Results

The trained models, evaluation results, classification reports, and additional details can be found in the `Results` directory.
## Conclusion

In conclusion, this project tackled the PubMed multi-label text classification problem using BioBERT, RoBERTa, and XLNet. BioBERT emerged as the most effective model, achieving the highest accuracy (87%). The dataset analysis revealed Class B as the category with the highest number of articles.

The training graphs provide insight into each model's learning process, illustrating the reduction in training loss and the improvement in F1 and flat validation accuracies over epochs.
## Acknowledgments

Thanks to Kaggle for providing the dataset, and to the maintainers and IWOC for this opportunity.
1 change: 1 addition & 0 deletions
PubMed Text Classification/Model/pubmed-text-classification.ipynb
Large diffs are not rendered by default.
@@ -0,0 +1,31 @@
# PubMed Multi-label Text Classification Project Overview

## Overview

This project focuses on multi-label text classification of biomedical research articles sourced from the PubMed repository. The dataset consists of approximately 50,000 documents, each manually annotated with MeSH (Medical Subject Headings) labels by biomedical experts. The primary challenge addressed in this project is the extensive output space and severe label sparsity of the original dataset.
## Key Features

- **Dataset:** 50,000 research articles from PubMed, annotated with MeSH labels.
- **Label Reduction:** Original MeSH labels reduced to major categories.
- **Root Mapping:** Major labels mapped to corresponding root categories for a simplified output space.
- **Models:** Utilized BioBERT, RoBERTa, and XLNet for multi-label text classification.
- **Results:** BioBERT demonstrated the highest accuracy at 87%, outperforming RoBERTa and XLNet.
## Dataset Processing

The dataset underwent preprocessing to address label sparsity and reduce the output space. Major MeSH labels were retained, and a hierarchical structure was introduced through root mapping for better interpretability.
## Model Training

- Three state-of-the-art pre-trained language models were employed: BioBERT, RoBERTa, and XLNet.
- Training involved optimization for multi-label classification, with a focus on improving F1 and flat validation accuracies (a sketch of these metrics follows this list).
- Graphs depicting training loss, F1 validation accuracy, and flat validation accuracy over epochs are available in the `results` directory.
- The training-run charts can also be viewed on the Weights & Biases dashboard: https://wandb.ai/ai-guru/Multi%20Label%20Classification%20of%20PubMed%20Articles%20%28Paper%20Night%20Presentation%29
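The report does not define the two validation metrics in code, so here is a minimal sketch of how micro-averaged F1 and a "flat" accuracy could be computed for multi-label predictions with scikit-learn; the threshold, example arrays, and the reading of "flat accuracy" as per-label accuracy over the flattened label matrix are assumptions, and the notebook may define these differently.

```python
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

# Illustrative ground truth and predicted probabilities for 4 articles x 5 labels.
y_true = np.array([[1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 0]])
y_prob = np.random.rand(4, 5)          # stand-in for sigmoid(model logits)
y_pred = (y_prob > 0.5).astype(int)    # 0.5 decision threshold

# Micro-averaged F1 across all labels (one reading of "F1 validation accuracy").
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)

# "Flat" accuracy read here as per-label accuracy over the flattened matrix;
# the notebook may instead use exact-match (subset) accuracy.
flat_acc = accuracy_score(y_true.flatten(), y_pred.flatten())

print(f"micro-F1: {micro_f1:.3f}, flat accuracy: {flat_acc:.3f}")
```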
## Conclusion

BioBERT emerged as the most effective model, achieving an accuracy of 87%. The dataset analysis revealed Class B as the category with the highest number of articles. The training graphs illustrate the models' learning processes and performance improvements over epochs.
19 changes: 19 additions & 0 deletions
PubMed Text Classification/Results/Classification_Report.csv
@@ -0,0 +1,19 @@
precision,recall,f1-score,support
0.7848658161020677,0.774137556953786,0.7794647733478972,4609.0
0.9580756734094958,0.9882162162162162,0.9729125645255707,9250.0
0.8690929878048781,0.8761044948136766,0.8725846565907788,5206.0
0.9108230516945182,0.9317782393353571,0.9211814879166008,6259.0
0.8145511180331516,0.9413731036256107,0.8733822389216914,7778.0
0.8653555219364599,0.6474250141482739,0.7406927808352218,1767.0
0.8226287262872629,0.892925430210325,0.8563368361661613,6799.0
0.6289308176100629,0.0819000819000819,0.14492753623188406,1221.0
0.7114285714285714,0.46629213483146065,0.5633484162895926,1068.0
0.7222222222222222,0.43333333333333335,0.5416666666666666,1110.0
0.751412429378531,0.3568075117370892,0.48385629831741706,1491.0
0.8826775431861804,0.869328922495274,0.8759523809523809,4232.0
0.8332532436328688,0.7535853976531942,0.7914194431766316,4602.0
0.749098774333093,0.6668806161745828,0.7056027164685909,1558.0
0.8565701800321421,0.8329411764705882,0.8445904441417621,56950.0
0.807458321218526,0.6914348609591616,0.7230949140290776,56950.0
0.8489845853503445,0.8329411764705882,0.8318580249612143,56950.0
0.8582284054834055,0.8397756632256632,0.8357095510578174,56950.0
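The CSV above follows the row layout of scikit-learn's per-label report (14 label rows followed by average rows), apparently with the label-name column dropped on save. A minimal sketch of how such a file could be produced; the label names and arrays are illustrative, not the project's actual data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Illustrative multi-label ground truth and predictions (6 articles x 3 labels).
labels = ["A", "B", "C"]  # made-up root-label names
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0],
                   [0, 0, 1], [1, 0, 0], [0, 1, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0],
                   [0, 1, 1], [1, 0, 0], [0, 1, 1]])

# Per-label precision/recall/F1/support plus micro/macro/weighted/samples averages.
report = classification_report(
    y_true, y_pred, target_names=labels, output_dict=True, zero_division=0
)
pd.DataFrame(report).transpose().to_csv("Classification_Report_example.csv")
```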
19 changes: 19 additions & 0 deletions
PubMed Text Classification/Results/Classification_Report_BERTSEQ.csv
@@ -0,0 +1,19 @@
precision,recall,f1-score,support
0.7828489562013917,0.8298980256020829,0.8056872037914693,4609.0
0.9597821533305404,0.9907027027027027,0.9749973401425683,9250.0
0.8788796366389099,0.892047637341529,0.8854146806482364,5206.0
0.9180457052797478,0.9306598498162646,0.9243097429387496,6259.0
0.8040067839728641,0.9751864232450501,0.881361840576342,7778.0
0.8503401360544217,0.7074136955291455,0.772320049428483,1767.0
0.8333104678282344,0.8933666715693485,0.8622941510505395,6799.0
0.5815602836879432,0.13431613431613432,0.21823020625415834,1221.0
0.7167070217917676,0.5543071161048689,0.6251319957761353,1068.0
0.7475,0.5387387387387388,0.6261780104712042,1110.0
0.7105809128630706,0.45942320590207913,0.5580448065173116,1491.0
0.8770189201661283,0.8981568998109641,0.8874620593042261,4232.0
0.8198771889924948,0.7833550630160799,0.8012001333481498,4602.0
0.7543103448275862,0.6739409499358151,0.7118644067796611,1558.0
0.8534698083876264,0.8579806848112379,0.8557193019325574,56950.0
0.8024834651167928,0.7329652224022002,0.7524640447876597,56950.0
0.847304453554072,0.8579806848112379,0.8467545343162335,56950.0
0.8554871608946609,0.8634571867021866,0.8469379506913796,56950.0
19 changes: 19 additions & 0 deletions
PubMed Text Classification/Results/Classification_Report_XLNET.csv
@@ -0,0 +1,19 @@
precision,recall,f1-score,support
0.8322997416020672,0.6988500759383814,0.7597594055902818,4609.0
0.9641336739037889,0.9793513513513513,0.971682934677679,9250.0
0.8563110443275732,0.8757203227045717,0.8659069325735992,5206.0
0.9193157979667581,0.9102092986100015,0.9147398843930635,6259.0
0.8097951133998028,0.9502442787348933,0.8744158532978408,7778.0
0.7568306010928961,0.7838143746462931,0.7700861829302196,1767.0
0.836130867709815,0.8645389027798206,0.8500976209414997,6799.0
0.5080128205128205,0.2596232596232596,0.3436314363143631,1221.0
0.6659815005138746,0.6067415730337079,0.6349828515433612,1068.0
0.662148070907195,0.5720720720720721,0.6138231029482841,1110.0
0.5675487465181058,0.5466130114017438,0.5568841817560641,1491.0
0.8777540867093105,0.8754725897920604,0.8766118537797231,4232.0
0.8040313549832027,0.7800956106040852,0.7918826513731112,4602.0
0.7618699780861943,0.6694480102695763,0.7126750939528528,1558.0
0.8437549497544922,0.8418437225636524,0.8427982526302837,56950.0
0.7730116713023861,0.7409139093972729,0.7526557132908531,56950.0
0.8393603363514722,0.8418437225636524,0.8382552404240644,56950.0
0.8474893975468977,0.8457395340770341,0.8317706467175531,56950.0
@@ -0,0 +1,59 @@
{
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_11": 11,
    "LABEL_12": 12,
    "LABEL_13": 13,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,
    "LABEL_9": 9
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
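The config above describes a 14-label distilroberta-base sequence classifier saved with transformers 4.24.0. A minimal sketch of reloading it for inference, assuming the config and weights were saved together via `save_pretrained`; the checkpoint directory path is illustrative.

```python
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical path; point it at the directory containing this config.json
# and the matching model weights produced by save_pretrained().
model_dir = "PubMed Text Classification/Results/roberta_checkpoint"

config = AutoConfig.from_pretrained(model_dir)
print(config.num_labels)  # 14, inferred from id2label in the config above

model = AutoModelForSequenceClassification.from_pretrained(model_dir, config=config)
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")  # base tokenizer

batch = tokenizer(["Example PubMed abstract text ..."], return_tensors="pt",
                  truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**batch).logits
print(torch.sigmoid(logits) > 0.5)  # multi-hot prediction at a 0.5 threshold
```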