Please press the ⭐ button and/or cite our papers if you find this repository helpful.
This repository contains scripts for automatic speech recognition (ASR) and named entity recognition (NER) using sequence-to-sequence (seq2seq) models and BERT-based models. The provided scripts cover model preparation, training, inference, and evaluation, based on the VietMed-NER dataset.
The HuggingFace and Papers with Code pages are coming soon!
- Python 3.8 or higher
- PyTorch
- Transformers
- Datasets
- tqdm
- Fire
- Loguru
1. Clone the repository:

   ```shell
   git clone <repository-url>
   cd <repository-directory>
   ```
2. Install the required packages:

   ```shell
   pip install -r requirements.txt
   ```
3. Download and prepare the datasets:
   - The scripts expect datasets to be loaded using the `datasets` library. Ensure you have access to the required datasets.
1. Prepare the model and tokenizer:
   - Update the `model_name` and other configurations in `seq2seq_models.py` as needed.
2. Run the training script:

   ```shell
   python seq2seq_models.py --train --model_name <model-name>
   ```
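For intuition, seq2seq models typically handle NER by generating a tagged version of the input text. The sketch below shows one common linearization of BIO-tagged tokens into a target string; this formatting is an illustration only, and the exact scheme used by `seq2seq_models.py` may differ:

```python
# Sketch: casting NER as text generation by wrapping each entity span in
# inline tags. Illustrative only -- not necessarily the target format used
# by seq2seq_models.py.
def linearize(tokens, bio_tags):
    out, i = [], 0
    while i < len(tokens):
        tag = bio_tags[i]
        if tag.startswith("B-"):
            etype = tag[2:]
            span = [tokens[i]]
            i += 1
            while i < len(tokens) and bio_tags[i] == f"I-{etype}":
                span.append(tokens[i])
                i += 1
            out.append(f"<{etype}> {' '.join(span)} </{etype}>")
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(linearize(["sot", "cao", "39", "do"],
                ["B-SYMPTOM", "I-SYMPTOM", "O", "O"]))
# -> <SYMPTOM> sot cao </SYMPTOM> 39 do
```

The model is then trained on (source text, linearized target) pairs like any translation task.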
1. Prepare the model and tokenizer:
   - Update the `model_name` and other configurations in `bert_based_models.py` as needed.
2. Run the training script:

   ```shell
   python bert_based_models.py --train --model_name <model-name>
   ```
- Note that we merged a slow tokenizer for PhoBERT into the main `transformers` branch. The process of merging a fast tokenizer for PhoBERT is under discussion, as mentioned in this pull request. If you would like to use the fast tokenizer, you can install `transformers` as follows:

  ```shell
  git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
  cd transformers
  pip3 install -e .
  ```
- Install `tokenizers` with pip:

  ```shell
  pip3 install tokenizers
  ```
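One reason fast tokenizers matter for token classification: they expose which subword came from which word, which is needed to align word-level NER labels to subword tokens. The sketch below illustrates the standard alignment scheme using a simulated `word_ids` sequence (as a fast tokenizer's `word_ids()` would return); the actual preprocessing in `bert_based_models.py` may differ:

```python
# Sketch: aligning word-level NER labels to subword tokens, the usual
# preprocessing for BERT-style token classification. `word_ids` mimics a
# fast tokenizer's word_ids() output; -100 marks positions the loss ignores
# (special tokens and non-first subwords).
def align_labels(word_ids, word_labels):
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # special token, e.g. [CLS] or [SEP]
            aligned.append(-100)
        elif wid != prev:          # first subword of a word: keep its label
            aligned.append(word_labels[wid])
        else:                      # later subwords: ignored in the loss
            aligned.append(-100)
        prev = wid
    return aligned

word_ids = [None, 0, 1, 1, 2, None]   # [CLS] w0 w1a w1b w2 [SEP]
print(align_labels(word_ids, [1, 2, 0]))  # [-100, 1, 2, -100, 0, -100]
```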
1. Prepare the model and tokenizer:
   - Update the `model_path` in `asr_infer_seq2seq.py` with the path to your seq2seq model.
2. Run the inference script:

   ```shell
   python asr_infer_seq2seq.py --model_path <path-to-model>
   ```
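If the generated output uses inline entity tags, recovering structured predictions is a simple parsing step. The sketch below assumes a `<TYPE> ... </TYPE>` format, which is an illustrative assumption, not necessarily what `asr_infer_seq2seq.py` produces:

```python
import re

# Sketch: recovering (entity text, entity type) pairs from a generated
# sequence with inline tags. The tag format is an assumption for
# illustration only.
def parse_entities(text):
    return [(m.group(2).strip(), m.group(1))
            for m in re.finditer(r"<(\w+)>\s*(.*?)\s*</\1>", text)]

print(parse_entities("<SYMPTOM> sot cao </SYMPTOM> uong <DRUG> paracetamol </DRUG>"))
# -> [('sot cao', 'SYMPTOM'), ('paracetamol', 'DRUG')]
```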
1. Prepare the model and tokenizer:
   - Update the `model_path` in `asr_infer_bert.py` with the path to your BERT-based model.
2. Run the inference script:

   ```shell
   python asr_infer_bert.py --model_path <path-to-model>
   ```
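After the forward pass, per-token class scores are typically mapped back to NER labels by taking the argmax over classes. A minimal sketch, with an illustrative `id2label` mapping (not the repository's actual label set):

```python
# Sketch: turning per-token class scores into NER labels after a BERT-style
# forward pass. The id2label mapping is illustrative only.
id2label = {0: "O", 1: "B-SYMPTOM", 2: "I-SYMPTOM"}

def decode(scores):
    # scores: one list of class scores per token; pick the argmax per token
    return [id2label[max(range(len(s)), key=s.__getitem__)] for s in scores]

scores = [[2.0, 0.1, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 1.9]]
print(decode(scores))  # ['O', 'B-SYMPTOM', 'I-SYMPTOM']
```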
- The evaluation metrics are computed using the `slue.py` and `modified_seqeval.py` scripts.
- Ensure these scripts are imported correctly in the inference scripts for evaluation.
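For reference, entity-level F1 in the style of seqeval counts a prediction as correct only when both the entity type and the exact span match. The sketch below illustrates that idea; the actual logic in `modified_seqeval.py` may differ:

```python
# Sketch of entity-level scoring: decode (type, start, end) spans from BIO
# tags and compute F1 over exact span matches. Illustrative only.
def spans(tags):
    found, start, etype = set(), None, None
    for i, t in enumerate(tags + ["O"]):      # sentinel closes open spans
        if etype and t != f"I-{etype}":
            found.add((etype, start, i))
            etype = None
        if t.startswith("B-"):
            etype, start = t[2:], i
    return found

def f1(gold_tags, pred_tags):
    g, p = spans(gold_tags), spans(pred_tags)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["B-SYMPTOM", "I-SYMPTOM", "O", "B-DRUG"]
pred = ["B-SYMPTOM", "I-SYMPTOM", "O", "O"]
print(f1(gold, pred))  # precision 1.0, recall 0.5 -> F1 ~0.667
```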
- To train a seq2seq model:

  ```shell
  python seq2seq_models.py --train --model_name facebook/mbart-large-50
  ```
- To train a BERT-based model:

  ```shell
  python bert_based_models.py --train --model_name bert-base-multilingual-cased
  ```
- To perform ASR inference using a seq2seq model:

  ```shell
  python asr_infer_seq2seq.py --model_path /path/to/seq2seq_model
  ```
- To perform ASR inference using a BERT-based model:

  ```shell
  python asr_infer_bert.py --model_path /path/to/bert_model
  ```
Core developers:

**Khai Le-Duc**
- University of Toronto, Canada
- Email: duckhai.le@mail.utoronto.ca
- GitHub: https://github.com/leduckhai

**Hung-Phong Tran**
- Hanoi University of Science and Technology, Vietnam
- GitHub: https://github.com/hungphongtrn