This repository contains 5 models that were experimented with on the DRSM-Corpus. The models described in the sections below are variations of transformers combined with neural networks. Our team (Team CUNI-NU) implemented a similar approach in Track 5 of the BioCreative VII conference, a shared task on multi-label classification of COVID-19 literature annotation. As the DRSM-Corpus is a medical dataset, we applied similar techniques to achieve optimal results.
We have also documented the minimum setup required for running the attached notebooks.
These notebooks were built on Google Colab with the following configuration:
- GPU: Nvidia Tesla V100sxm2
- GPU Memory: 16160MiB
Data is retrieved from the DRSM-Corpus, an annotated literature corpus for NLP studies of "Disease Research State" based on different categories of research (DRSM stands for Disease Research State Model).
The initial-gold-standards file has the following set of columns: `ID_PAPER`, `TITLE`, `ABSTRACT`, `PRIMARY CATEGORY`, `SECONDARY CATEGORY`, `IRRELEVANT`, `DISEASE_NAME`.
Descriptions:
- Size - 8919
- Unique classes - `clinical characteristics or disease pathology`, `other`, `disease mechanism`, `therapeutics in the clinic`, `irrelevant`, `patient-based therapeutics`
- Class-related statistics:

| Class | Count |
|---|---|
| irrelevant | 109 |
| patient-based therapeutics | 342 |
| other | 342 |
| therapeutics in the clinic | 1166 |
| disease mechanism | 2801 |
| clinical characteristics or disease pathology | 4166 |
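Counts like those above can be reproduced directly from the gold-standard file. A minimal sketch, assuming the corpus is loaded into a pandas DataFrame with the columns listed earlier (the miniature inline data below is hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical miniature of the initial-gold-standards file,
# using the column names documented above.
df = pd.DataFrame({
    "ID_PAPER": [1, 2, 3],
    "PRIMARY CATEGORY": ["disease mechanism", "other", "disease mechanism"],
})

# Per-class counts, as in the statistics table.
counts = df["PRIMARY CATEGORY"].value_counts()
print(counts)
```

With the real CSV you would replace the inline DataFrame with `pd.read_csv(path)` and count the same column.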
We experimented with 5 different variations of transformer models. These variations combine various state-of-the-art BERT models with neural networks. One major component of many of these models is the label-wise attention network (LWAN). The LWAN architecture improves per-label predictability by paying particular attention to the output labels: it uses an attention-mechanism-like strategy that lets the model focus on specific words in the input rather than memorizing all of the essential features in a fixed-length vector.
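The label-wise attention idea can be sketched as follows: each label gets its own learned query vector, which attends over the encoder's token representations to build a label-specific document vector that is then scored for that label. This is a minimal NumPy sketch of that mechanism with randomly initialized parameters, not the repository's actual implementation; all shapes and names here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def label_wise_attention(H, Q, W, b):
    """H: (T, d) token embeddings from the encoder;
    Q: (L, d) one learned query per label;
    W: (L, d), b: (L,) per-label classifier parameters."""
    scores = Q @ H.T               # (L, T): each label scores every token
    alpha = softmax(scores, axis=1)  # attention weights over tokens, per label
    V = alpha @ H                  # (L, d): label-specific document vectors
    logits = (V * W).sum(axis=1) + b
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid for multi-label output

T, d, L = 12, 8, 6  # tokens, hidden size, number of DRSM classes
H = rng.normal(size=(T, d))
probs = label_wise_attention(
    H, rng.normal(size=(L, d)), rng.normal(size=(L, d)), np.zeros(L)
)
print(probs)  # one independent probability per label
```

In the actual models, `H` would come from a BERT-family encoder and the parameters would be trained end-to-end with a binary cross-entropy loss per label.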
- BioBERT: In this method we trained the base BioBERT model.
- PubMedBERT-LWAN: In this method we trained PubMedBERT along with LWAN.
- Specter-LWAN: In this method we trained SPECTER along with LWAN.
- Specter dual-attention LWAN: In this method we used SPECTER embeddings with a dual-attention module. The link to this paper can be found here.
| Model | Micro F1 score | Checkpoints | Notebooks |
|---|---|---|---|
| BioBERT | 0.8995 | Link | |
| PubMedBERT-LWAN | 0.9087 | Link | |
| Specter-LWAN | 0.9011 | Link | |
| Specter dual-attention LWAN | 0.9109 | Link | |
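For reference, the micro F1 metric reported above pools true positives, false positives, and false negatives across all labels before computing precision and recall. A small self-contained sketch (the toy labels below are illustrative, not from the corpus):

```python
def micro_f1(y_true, y_pred):
    """y_true, y_pred: lists of per-document label-index sets."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))  # correct labels
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))  # spurious labels
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))  # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Toy example: two documents, labels as index sets.
score = micro_f1([{0, 1}, {2}], [{0}, {2, 3}])
print(score)  # 2/3: tp=2, fp=1, fn=1
```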
Below we have also attached the label-wise scores from our best-performing model, the Specter dual-attention LWAN.
Here it is clearly visible that the `disease mechanism`, `therapeutics in the clinic`, and `irrelevant` classes have very few instances in the test dataset. Because of this large imbalance, especially in the case of `disease mechanism`, our model may not give optimal results. One way to solve this problem is to obtain more annotated data and try to maintain an equal number of instances for each label. Another solution is to develop a weighted approach to classification that can attenuate the problems caused by the imbalance.
As these notebooks were implemented using Google Colab, there is a basic setup required to run them. We recommend using Google Colab to avoid any complications.
- Step 1. Upload the notebook to Google Colab or use the link provided in the table above, and enable the GPU configuration.
- Step 2. Install all necessary dependencies mentioned in the initial cell blocks.
- Step 3. Connect the notebook to your Google Drive. You can see the tutorial here.
- Step 4. Download the data using the `wget` command. There is one cell dedicated to this command.
- Step 5. The downloaded data will be saved in the `content` directory, which is the runtime folder. Make sure you save this data in your Google Drive, as it will be deleted once the Colab session expires.
- Step 6. For every implementation we have provided a model-checkpoint link so that testing can be done easily. To use these checkpoints, download them from the table above and upload them to your connected Google Drive. After uploading, you can enter their path in the notebook.
- Step 7. After successfully completing the above steps, follow the instructions given in the notebook to get the final result.
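The Drive-connection step can be guarded so the same cell also runs outside Colab. A minimal sketch using the standard `google.colab` mount API (the mount point `/content/drive` is Colab's default):

```python
import importlib.util

def in_colab() -> bool:
    # True only when the google.colab package is importable, i.e. on Colab.
    return importlib.util.find_spec("google.colab") is not None

if in_colab():
    from google.colab import drive
    drive.mount("/content/drive")  # triggers Colab's interactive auth prompt
else:
    print("Not running in Colab; skipping the Drive mount.")
```

After mounting, checkpoint paths in your Drive are visible under `/content/drive/MyDrive/`.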
- Tirthankar Ghosal (Oak Ridge National Laboratory, US): tghosal@acm.org
- Aakash Bhatnagar (Navrachana University, Gujarat, India): akashbharat.bhatnagar@gmail.com
- Nidhir Bhavsar (Navrachana University, Gujarat, India): nidbhavsar989@gmail.com