This repository contains 5 models that were experimented with on the DRSM-Corpus. The models described in the sections below are variations of transformers combined with neural networks. Our team (Team CUNI-NU) implemented a similar approach in Track 5 of the BioCreative VII conference, a shared task on multi-label classification of COVID-19 literature annotation. As the DRSM-Corpus is a medical dataset, we applied similar techniques to achieve optimal results.
We have also documented the minimum setup required for running the attached notebooks.
These notebooks were built on Google Colab with the following configuration:
- GPU: Nvidia Tesla V100sxm2
- GPU Memory: 16160MiB
Data is retrieved from the DRSM-Corpus, an annotated literature corpus for NLP studies of "Disease Research State" based on different categories of research (DRSM stands for Disease Research State Model).
The initial-gold-standards file has the following set of columns: `ID_PAPER`, `TITLE`, `ABSTRACT`, `PRIMARY CATEGORY`, `SECONDARY CATEGORY`, `IRRELEVANT`, `DISEASE_NAME`.
Descriptions:
- Size - 8919
- Unique classes - `clinical characteristics or disease pathology`, `other`, `disease mechanism`, `therapeutics in the clinic`, `irrelevant`, `patient-based therapeutics`
- Class-related statistics:

| Class | Count |
|---|---|
| irrelevant | 109 |
| patient-based therapeutics | 342 |
| other | 342 |
| therapeutics in the clinic | 1166 |
| disease mechanism | 2801 |
| clinical characteristics or disease pathology | 4166 |
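Counts like those above can be reproduced directly from the gold-standard file. A minimal sketch, assuming the corpus is loaded into a pandas DataFrame with the columns listed earlier (the miniature inline data below is hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical miniature of the initial-gold-standards file,
# using the column names documented above.
df = pd.DataFrame({
    "ID_PAPER": [1, 2, 3],
    "PRIMARY CATEGORY": ["disease mechanism", "other", "disease mechanism"],
})

# Per-class counts, as in the statistics table.
counts = df["PRIMARY CATEGORY"].value_counts()
print(counts)
```

With the real CSV you would replace the inline DataFrame with `pd.read_csv(path)` and count the same column.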
We experimented with 5 different variations of transformer models. These variations combine various state-of-the-art BERT models with neural networks. One major component of many of these models is the label-wise attention network (LWAN). The LWAN architecture improves per-label predictability by paying particular attention to the output labels: it uses an attention-mechanism-like strategy that lets the model focus on specific words in the input rather than memorizing all of the essential features in a fixed-length vector.
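The label-wise attention idea can be sketched as follows: each label gets its own learned query vector, which attends over the encoder's token representations to build a label-specific document vector that is then scored for that label. This is a minimal NumPy sketch of that mechanism with randomly initialized parameters, not the repository's actual implementation; all shapes and names here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def label_wise_attention(H, Q, W, b):
    """H: (T, d) token embeddings from the encoder;
    Q: (L, d) one learned query per label;
    W: (L, d), b: (L,) per-label classifier parameters."""
    scores = Q @ H.T               # (L, T): each label scores every token
    alpha = softmax(scores, axis=1)  # attention weights over tokens, per label
    V = alpha @ H                  # (L, d): label-specific document vectors
    logits = (V * W).sum(axis=1) + b
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid for multi-label output

T, d, L = 12, 8, 6  # tokens, hidden size, number of DRSM classes
H = rng.normal(size=(T, d))
probs = label_wise_attention(
    H, rng.normal(size=(L, d)), rng.normal(size=(L, d)), np.zeros(L)
)
print(probs)  # one independent probability per label
```

In the actual models, `H` would come from a BERT-family encoder and the parameters would be trained end-to-end with a binary cross-entropy loss per label.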
- BioBERT: In this method we trained the base BioBERT model.
- PubMedBERT-LWAN: In this method we trained PubMedBERT along with LWAN.
- Specter-LWAN: In this method we trained SPECTER along with LWAN.
- Specter dual-attention LWAN: In this method we used SPECTER embeddings with a dual-attention module. The link to this paper can be found here.
| Model | Micro F1 score | Checkpoints | Notebooks |
|---|---|---|---|
| BioBERT | 0.8995 | Link | |
| PubMedBERT-LWAN | 0.9087 | Link | |
| Specter-LWAN | 0.9011 | Link | |
| Specter dual-attention LWAN | 0.9109 | Link | |
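For reference, the micro F1 metric reported above pools true positives, false positives, and false negatives across all labels before computing precision and recall. A small self-contained sketch (the toy labels below are illustrative, not from the corpus):

```python
def micro_f1(y_true, y_pred):
    """y_true, y_pred: lists of per-document label-index sets."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))  # correct labels
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))  # spurious labels
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))  # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Toy example: two documents, labels as index sets.
score = micro_f1([{0, 1}, {2}], [{0}, {2, 3}])
print(score)  # 2/3: tp=2, fp=1, fn=1
```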
Below we have also attached the label-wise scores from our best-performing model, the Specter dual-attention LWAN.
Here it is clearly visible that the `disease mechanism`, `therapeutics in the clinic`, and `irrelevant` classes have very few instances in the test dataset. Because of this large imbalance, especially in the case of `disease mechanism`, our model may not give optimal results. One way to solve this problem is to obtain more annotated data and try to maintain an equal number of instances for each label. Another solution is to develop a weighted approach to classification that can attenuate the problems caused by the imbalance.
As these notebooks were implemented using Google Colab, there is a basic setup required to run them. We recommend using Google Colab to avoid any complications.
- Step 1. Upload the notebook to Google Colab or use the link provided in the table above, and enable the GPU configuration.
- Step 2. Install all necessary dependencies mentioned in the initial cell blocks.
- Step 3. Connect the notebook to your Google Drive. You can see the tutorial here.
- Step 4. Download the data using the `wget` command. There is one cell dedicated to this command.
- Step 5. The downloaded data will be saved in the `content` directory, which is the runtime folder. Make sure you save this data in your Google Drive, as it will be deleted once the Colab session expires.
- Step 6. For every implementation we have provided a model-checkpoint link so that testing can be done easily. To use these checkpoints, download them from the table above and upload them to your connected Google Drive. After uploading, you can enter their path in the notebook.
- Step 7. After successfully completing the above steps, follow the instructions given in the notebook to get the final result.
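The Drive-connection step can be guarded so the same cell also runs outside Colab. A minimal sketch using the standard `google.colab` mount API (the mount point `/content/drive` is Colab's default):

```python
import importlib.util

def in_colab() -> bool:
    # True only when the google.colab package is importable, i.e. on Colab.
    return importlib.util.find_spec("google.colab") is not None

if in_colab():
    from google.colab import drive
    drive.mount("/content/drive")  # triggers Colab's interactive auth prompt
else:
    print("Not running in Colab; skipping the Drive mount.")
```

After mounting, checkpoint paths in your Drive are visible under `/content/drive/MyDrive/`.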
- Tirthankar Ghosal (Oak Ridge National Laboratory, US): tghosal@acm.org
- Aakash Bhatnagar (Navrachana University, Gujarat, India): akashbharat.bhatnagar@gmail.com
- Nidhir Bhavsar (Navrachana University, Gujarat, India): nidbhavsar989@gmail.com