Nid989/Experiments-on-DRSM-corpus

Introduction

This repository contains 5 models that were experimented with on the DRSM-Corpus. The models described in the sections below are variations of transformer architectures combined with neural network components. Our team (Team CUNI-NU) implemented a similar approach in Track 5 of the BioCreative VII challenge, a shared task on multi-label classification of COVID-19 literature annotations. As the DRSM-Corpus is also a medical dataset, we applied similar techniques to achieve optimal results.

We have also described the minimal setup required to run the attached notebooks.

These notebooks were built on Google Colab with the following configuration:

  • GPU: NVIDIA Tesla V100-SXM2
  • GPU Memory: 16160 MiB

Data

Data is retrieved from the DRSM-Corpus, an annotated literature corpus for NLP studies of 'Disease Research State' based on different categories of research (DRSM stands for Disease Research State Model). The initial gold-standard data has the following columns: ID_PAPER, TITLE, ABSTRACT, PRIMARY CATEGORY, SECONDARY CATEGORY, IRRELEVANT, DISEASE_NAME.

Descriptions

  • Size - 8919 records
  • Unique classes - clinical characteristics or disease pathology, other, disease mechanism, therapeutics in the clinic, irrelevant, patient-based therapeutics
  • Class distribution:

Class                                           Count
irrelevant                                        109
patient-based therapeutics                        342
other                                             342
therapeutics in the clinic                       1166
disease mechanism                                2801
clinical characteristics or disease pathology    4166
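As a quick sanity check, the distribution above can be reproduced with pandas. This is a minimal sketch, assuming the gold-standard annotations have been exported to a local CSV; the file name `initial_gold_standards.csv` and the exact label column are placeholders to verify against your copy of the corpus.

    import pandas as pd

    # Placeholder path: point this at your local copy of the gold-standard file.
    df = pd.read_csv("initial_gold_standards.csv")

    # Columns as listed above: ID_PAPER, TITLE, ABSTRACT, PRIMARY CATEGORY, ...
    print(df.shape)

    # Count the instances per class (the "irrelevant" label may instead live in
    # the IRRELEVANT column, depending on how the file is structured).
    print(df["PRIMARY CATEGORY"].value_counts())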

Methods

We have experimented with 5 different transformer-based models. These models combine various state-of-the-art BERT variants with additional neural network components. One major component shared by several of these models is the label-wise attention network (LWAN). The LWAN architecture improves per-label prediction by paying particular attention to the output labels: it learns a separate attention distribution over the input tokens for each label, allowing the model to focus on the words most relevant to that label rather than compressing all of the essential features into a single fixed-length vector.
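To make the mechanism concrete, below is a minimal PyTorch sketch of a label-wise attention head. This is illustrative, not the exact implementation from our notebooks; the class name, dimensions, and shared scorer are our own choices.

    import torch
    import torch.nn as nn

    class LabelWiseAttention(nn.Module):
        """One attention distribution over tokens per output label."""

        def __init__(self, hidden_size: int, num_labels: int):
            super().__init__()
            # One learned query vector per label.
            self.label_queries = nn.Parameter(torch.randn(num_labels, hidden_size))
            # Scoring layer applied to each label's document vector
            # (shared here; a per-label scorer is another common choice).
            self.output = nn.Linear(hidden_size, 1)

        def forward(self, token_states: torch.Tensor) -> torch.Tensor:
            # token_states: (batch, seq_len, hidden), e.g. BERT's last hidden states.
            # scores: (batch, num_labels, seq_len), one row of token scores per label.
            scores = torch.einsum("lh,bsh->bls", self.label_queries, token_states)
            attn = torch.softmax(scores, dim=-1)
            # label_docs: (batch, num_labels, hidden), one document vector per label.
            label_docs = torch.einsum("bls,bsh->blh", attn, token_states)
            # logits: (batch, num_labels)
            return self.output(label_docs).squeeze(-1)

In our setting, `token_states` would be the final hidden states of PubMedBERT or SPECTER, and `num_labels` would be 6 (one per DRSM class).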

  • BioBERT: In this method we have fine-tuned the base BioBERT model.
  • PubMedBERT-LWAN: In this method we have trained PubMedBERT along with LWAN.
  • Specter-LWAN: In this method we have trained SPECTER along with LWAN.
  • Specter dual-attention LWAN: In this method we have used SPECTER embeddings with a dual-attention module. The link for the corresponding paper can be found here.
Model                          Micro F1 score   Checkpoints   Notebooks
BioBERT                        0.8995           Link          Open in Colab
PubMedBERT-LWAN                0.9087           Link          Open in Colab
Specter-LWAN                   0.9011           Link          Open in Colab
Specter dual-attention LWAN    0.9109           Link          Open in Colab
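For reference, the pretrained encoders used above are available on the Hugging Face Hub. Below is a minimal loading sketch; the model IDs are our best understanding of the public checkpoints, so verify them against the notebooks before use.

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumed public checkpoints (verify against the notebooks):
    #   BioBERT:    dmis-lab/biobert-base-cased-v1.1
    #   PubMedBERT: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
    #   SPECTER:    allenai/specter
    model_name = "dmis-lab/biobert-base-cased-v1.1"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # 6 output labels, one per DRSM class.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

    inputs = tokenizer("Example title. Example abstract.", return_tensors="pt",
                       truncation=True, max_length=512)
    logits = model(**inputs).logits  # shape: (1, 6)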

Below we have also attached the label-wise scores from our best-performing model, i.e. Specter dual-attention LWAN.

[Image: label-wise scores for the Specter dual-attention LWAN model]

Here it is clearly visible that the disease mechanism, therapeutics in the clinic, and irrelevant classes have very few instances in the test dataset. Because of this large imbalance, especially in the case of disease mechanism, our model may not give optimal results. One way to solve this problem is to collect more annotated data and try to maintain an equal number of instances for each label. Another solution is a weighted approach to classification that can attenuate the problems caused by the imbalance; one such approach is sketched below.
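As one hedged example of such a weighted approach (not what the notebooks currently do), per-class weights inversely proportional to class frequency can be passed to the loss function so that rare classes contribute more per example:

    import torch
    import torch.nn as nn

    # Class counts from the distribution table above.
    counts = torch.tensor([109., 342., 342., 1166., 2801., 4166.])

    # Weight each class inversely to its frequency, normalized so the weights
    # sum to the number of classes (a common convention; other schemes work too).
    weights = counts.sum() / counts
    weights = weights * len(counts) / weights.sum()

    # Rare classes now incur a larger penalty when misclassified.
    criterion = nn.CrossEntropyLoss(weight=weights)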

Setup

As these notebooks were implemented using Google Colab, some basic setup is required to run them. We recommend using Google Colab to avoid any complications.

  • Step 1. Upload the notebook to Google Colab or use the link provided in the table above, and enable the GPU runtime.
  • Step 2. Install all necessary dependencies mentioned in the initial cell blocks.
  • Step 3. Connect the notebook to your Google Drive. You can see a tutorial here.
  • Step 4. Download the data using the "wget" command. There is one cell dedicated to this command (see the sketch after this list).
  • Step 5. The downloaded data will be saved in the content directory, which is the runtime folder. Make sure you copy this data to your Google Drive, as it will be deleted once the Colab session expires.
  • Step 6. For every implementation we have provided a model checkpoint link so that testing can be done easily. To use these checkpoints, download them from the table above and upload them to your connected Google Drive. After uploading them, you can enter their path in the notebook.
  • Step 7. After successfully completing the above steps, follow the instructions given in the notebook to get to the end result.
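A minimal sketch of the Colab cells for steps 3-5 follows. The corpus URL and file name are placeholders; copy the real ones from the notebook's dedicated download cell.

    # Mount your Google Drive at /content/drive (Colab will prompt for authorization).
    from google.colab import drive
    drive.mount("/content/drive")

    # Download the corpus into the ephemeral /content runtime directory.
    # Placeholder URL: use the one from the notebook's wget cell.
    !wget -O /content/drsm_corpus.csv https://example.com/path/to/drsm_corpus.csv

    # Copy it to Drive so it survives the end of the Colab session.
    !cp /content/drsm_corpus.csv /content/drive/MyDrive/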

Contributors

  1. Tirthankar Ghosal (Oak Ridge National Laboratory, US): tghosal@acm.org
  2. Aakash Bhatnagar (Navrachana University, Gujarat, India): akashbharat.bhatnagar@gmail.com
  3. Nidhir Bhavsar (Navrachana University, Gujarat, India): nidbhavsar989@gmail.com
