
BioNER

The BioNER code is adapted from WeLT: Improving Biomedical Fine-tuned Pre-trained Language Models with Cost-sensitive Learning.

Installation

Dependencies

  • Python (>=3.6)
  • PyTorch (>=1.2.0)
  1. Clone this GitHub repository.
  2. Navigate to the BioNER folder and install all necessary dependencies: python3 -m pip install -r requirements.txt
    Note: To install the appropriate PyTorch build, follow the official download instructions for your development environment.

Data Preparation

NER Datasets

Dataset sources:
  • NCBI-disease, BC5CDR-disease, and BC5CDR-chem are retrieved directly from BioBERT via this link.
  • BioRED-Dis and BioRED-Chem extend the aforementioned NER datasets to BioRED. To convert from BioC XML/JSON to CoNLL, we used bconv and filtered the chemical and disease entities; a rough conversion sketch is shown below.
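The conversion can look roughly as follows. This is a hedged sketch, not the repo's actual script: the input/output paths are placeholders, and the entity-type filtering (keeping only chemical and disease annotations) performed in the repo is omitted here.

```python
# Sketch: convert a BioC XML file to CoNLL with bconv.
# Paths are hypothetical placeholders; format names follow bconv's documentation.
import bconv

coll = bconv.load('BioRED/Train.BioC.XML', fmt='bioc_xml')
with open('biored_train.conll', 'w', encoding='utf8') as f:
    bconv.dump(coll, f, fmt='conll')
```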

Data & Evaluation code Download
To directly download the NER datasets for fine-tuning models from scratch, use download.sh, or download them manually via this link into the main directory, then unzip datasets.zip and remove the archive (rm datasets.zip). The same instructions apply to the evaluation code.

Data Pre-processing
We adapted preprocessing.sh from BioBERT to include BioRED.

Reproducing the Paper's Results

We conducted the experiments on two different BERT models using the WeLT weighting scheme and compared WeLT against the corresponding traditional fine-tuning approaches (i.e., standard BioBERT fine-tuning). Below, we explain the WeLT fine-tuning approach and provide all the fine-tuned models on Hugging Face, an example of fine-tuning from scratch using WeLT, and an example of predicting and evaluating disease entities.

1. Fine-tuning BERT Models

Our experimental work focused on BioBERT (a mixed-domain, continually pre-trained language model) and PubMedBERT (a domain-specific language model pre-trained from scratch); however, WeLT can be adapted to other transformers such as ELECTRA. A sketch of loading these models with the transformers library is shown after the table below.

Model | Used version in HF 🤗
BioBERT | model_name_or_path
PubMedBERT | model_name_or_path
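Loading one of the fine-tuned models for inference can look roughly as follows. This is a hedged sketch using the standard transformers API; the model identifier is a hypothetical placeholder, so substitute one of the models linked in the table above.

```python
# Sketch: load a fine-tuned NER model from the Hugging Face Hub.
# "your-user/welt-biobert-ncbi-disease" is a hypothetical placeholder identifier.
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "your-user/welt-biobert-ncbi-disease"  # replace with a real model name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Token-classification pipeline that merges sub-word pieces into entity mentions
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("Cystic fibrosis is caused by mutations in the CFTR gene."))
```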

2. WeLT fine-tuning

We adapted BioBERT's run_ner.py to develop a cost-sensitive trainer in run_weight_scheme.py, which extends the Trainer class as WeightedLossTrainer and overrides the compute_loss function to apply the WeLT weights in a weighted cross-entropy loss function. A minimal sketch of this idea is shown below.
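The following is a minimal sketch of such a cost-sensitive trainer, assuming the WeLT class weights have already been computed from the training-set label distribution. It illustrates the mechanism only; it is not the repo's run_weight_scheme.py.

```python
# Minimal sketch of a cost-sensitive trainer with per-label weights
# (e.g. WeLT weights) plugged into a weighted cross-entropy loss.
import torch
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        # class_weights: 1-D tensor of per-label weights (assumed precomputed)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Weighted cross-entropy over token-level NER labels;
        # -100 marks sub-word/padding positions ignored by the loss.
        loss_fct = nn.CrossEntropyLoss(weight=self.class_weights.to(logits.device),
                                       ignore_index=-100)
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```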

3. Building XML files

After fine-tuning the BERT models, we recognize chemical and disease entities via ner.py. The output files are written to the predicted path directory. An illustrative sketch of how token-level predictions can be collapsed into entity spans is shown below.
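As an illustration of the kind of post-processing such a prediction step involves (this is not the repo's ner.py, and the tag names are assumed), the sketch below collapses token-level BIO tags into entity spans that could then be written out, e.g. into BioC XML.

```python
# Illustrative helper: collapse BIO tags into (start_token, end_token, type) spans.
def bio_to_spans(tags):
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel "O" closes a trailing entity
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))     # end index is exclusive
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]               # tolerate an I- without a preceding B-
    return spans

# Example: two entities in a 7-token sentence
print(bio_to_spans(["O", "B-Chemical", "I-Chemical", "O", "B-Disease", "O", "O"]))
# -> [(1, 3, 'Chemical'), (4, 5, 'Disease')]
```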

Evaluation
We used the strict and approximate evaluation of BioCreative VII Track 2 (NLM-CHEM track: Full-text Chemical Identification and Indexing in PubMed articles). A toy illustration of the difference between the two matching modes is shown below.
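Roughly speaking, strict evaluation requires exact mention boundaries while approximate evaluation also credits overlapping spans. The snippet below is only a conceptual illustration with made-up offsets; the repo uses the official evaluation code downloaded above.

```python
# Toy illustration of strict vs. approximate span matching (offsets are hypothetical).
def strict_match(gold, pred):
    # exact boundary match
    return gold == pred

def approximate_match(gold, pred):
    # any character overlap counts
    g_start, g_end = gold
    p_start, p_end = pred
    return max(g_start, p_start) < min(g_end, p_end)

gold, pred = (10, 25), (12, 25)
print(strict_match(gold, pred))        # False: boundaries differ
print(approximate_match(gold, pred))   # True: spans overlap
```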

Citation

The manuscript is in preparation (TBD)

Authors

Ghadeer Mobasher*, Pedro Ruas, Francisco M. Couto, Olga Krebs, Michael Gertz and Wolfgang Müller

Acknowledgment

Ghadeer Mobasher is part of the PoLiMeR-ITN (http://polimer-itn.eu/) and is supported by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement PoLiMeR, No 81261
