Please press the ⭐ button and/or cite our papers if you find this repository helpful.
This repository contains scripts for automatic speech recognition (ASR) and named entity recognition (NER) using sequence-to-sequence (seq2seq) models and BERT-based models. The provided scripts cover model preparation, training, inference, and evaluation, based on the VietMed-NER dataset.
The HuggingFace and Papers with Code pages are coming soon!
- Python 3.8 or higher
- PyTorch
- Transformers
- Datasets
- tqdm
- Fire
- Loguru
1. Clone the repository:

   ```shell
   git clone <repository-url>
   cd <repository-directory>
   ```
2. Install the required packages:

   ```shell
   pip install -r requirements.txt
   ```
3. Download and prepare the datasets:
   - The scripts expect datasets to be loaded using the `datasets` library. Ensure you have access to the required datasets.
1. Prepare the model and tokenizer:
   - Update the `model_name` and other configurations in `seq2seq_models.py` as needed.
2. Run the training script:

   ```shell
   python seq2seq_models.py --train --model_name <model-name>
   ```
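For intuition, seq2seq models typically handle NER by generating a tagged version of the input text. The sketch below shows one common linearization of BIO-tagged tokens into a target string; this formatting is an illustration only, and the exact scheme used by `seq2seq_models.py` may differ:

```python
# Sketch: casting NER as text generation by wrapping each entity span in
# inline tags. Illustrative only -- not necessarily the target format used
# by seq2seq_models.py.
def linearize(tokens, bio_tags):
    out, i = [], 0
    while i < len(tokens):
        tag = bio_tags[i]
        if tag.startswith("B-"):
            etype = tag[2:]
            span = [tokens[i]]
            i += 1
            while i < len(tokens) and bio_tags[i] == f"I-{etype}":
                span.append(tokens[i])
                i += 1
            out.append(f"<{etype}> {' '.join(span)} </{etype}>")
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(linearize(["sot", "cao", "39", "do"],
                ["B-SYMPTOM", "I-SYMPTOM", "O", "O"]))
# -> <SYMPTOM> sot cao </SYMPTOM> 39 do
```

The model is then trained on (source text, linearized target) pairs like any translation task.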
1. Prepare the model and tokenizer:
   - Update the `model_name` and other configurations in `bert_based_models.py` as needed.
2. Run the training script:

   ```shell
   python bert_based_models.py --train --model_name <model-name>
   ```
- Note that we merged a slow tokenizer for PhoBERT into the main `transformers` branch. The process of merging a fast tokenizer for PhoBERT is under discussion, as mentioned in this pull request. If you would like to use the fast tokenizer, you can install `transformers` as follows:

  ```shell
  git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
  cd transformers
  pip3 install -e .
  ```
- Install `tokenizers` with pip:

  ```shell
  pip3 install tokenizers
  ```
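One reason fast tokenizers matter for token classification: they expose which subword came from which word, which is needed to align word-level NER labels to subword tokens. The sketch below illustrates the standard alignment scheme using a simulated `word_ids` sequence (as a fast tokenizer's `word_ids()` would return); the actual preprocessing in `bert_based_models.py` may differ:

```python
# Sketch: aligning word-level NER labels to subword tokens, the usual
# preprocessing for BERT-style token classification. `word_ids` mimics a
# fast tokenizer's word_ids() output; -100 marks positions the loss ignores
# (special tokens and non-first subwords).
def align_labels(word_ids, word_labels):
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # special token, e.g. [CLS] or [SEP]
            aligned.append(-100)
        elif wid != prev:          # first subword of a word: keep its label
            aligned.append(word_labels[wid])
        else:                      # later subwords: ignored in the loss
            aligned.append(-100)
        prev = wid
    return aligned

word_ids = [None, 0, 1, 1, 2, None]   # [CLS] w0 w1a w1b w2 [SEP]
print(align_labels(word_ids, [1, 2, 0]))  # [-100, 1, 2, -100, 0, -100]
```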
1. Prepare the model and tokenizer:
   - Update the `model_path` in `asr_infer_seq2seq.py` with the path to your seq2seq model.
2. Run the inference script:

   ```shell
   python asr_infer_seq2seq.py --model_path <path-to-model>
   ```
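If the generated output uses inline entity tags, recovering structured predictions is a simple parsing step. The sketch below assumes a `<TYPE> ... </TYPE>` format, which is an illustrative assumption, not necessarily what `asr_infer_seq2seq.py` produces:

```python
import re

# Sketch: recovering (entity text, entity type) pairs from a generated
# sequence with inline tags. The tag format is an assumption for
# illustration only.
def parse_entities(text):
    return [(m.group(2).strip(), m.group(1))
            for m in re.finditer(r"<(\w+)>\s*(.*?)\s*</\1>", text)]

print(parse_entities("<SYMPTOM> sot cao </SYMPTOM> uong <DRUG> paracetamol </DRUG>"))
# -> [('sot cao', 'SYMPTOM'), ('paracetamol', 'DRUG')]
```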
1. Prepare the model and tokenizer:
   - Update the `model_path` in `asr_infer_bert.py` with the path to your BERT-based model.
2. Run the inference script:

   ```shell
   python asr_infer_bert.py --model_path <path-to-model>
   ```
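After the forward pass, per-token class scores are typically mapped back to NER labels by taking the argmax over classes. A minimal sketch, with an illustrative `id2label` mapping (not the repository's actual label set):

```python
# Sketch: turning per-token class scores into NER labels after a BERT-style
# forward pass. The id2label mapping is illustrative only.
id2label = {0: "O", 1: "B-SYMPTOM", 2: "I-SYMPTOM"}

def decode(scores):
    # scores: one list of class scores per token; pick the argmax per token
    return [id2label[max(range(len(s)), key=s.__getitem__)] for s in scores]

scores = [[2.0, 0.1, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 1.9]]
print(decode(scores))  # ['O', 'B-SYMPTOM', 'I-SYMPTOM']
```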
- The evaluation metrics are computed using the `slue.py` and `modified_seqeval.py` scripts.
- Ensure these scripts are imported correctly in the inference scripts for evaluation.
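For reference, entity-level F1 in the style of seqeval counts a prediction as correct only when both the entity type and the exact span match. The sketch below illustrates that idea; the actual logic in `modified_seqeval.py` may differ:

```python
# Sketch of entity-level scoring: decode (type, start, end) spans from BIO
# tags and compute F1 over exact span matches. Illustrative only.
def spans(tags):
    found, start, etype = set(), None, None
    for i, t in enumerate(tags + ["O"]):      # sentinel closes open spans
        if etype and t != f"I-{etype}":
            found.add((etype, start, i))
            etype = None
        if t.startswith("B-"):
            etype, start = t[2:], i
    return found

def f1(gold_tags, pred_tags):
    g, p = spans(gold_tags), spans(pred_tags)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["B-SYMPTOM", "I-SYMPTOM", "O", "B-DRUG"]
pred = ["B-SYMPTOM", "I-SYMPTOM", "O", "O"]
print(f1(gold, pred))  # precision 1.0, recall 0.5 -> F1 ~0.667
```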
- To train a seq2seq model:

  ```shell
  python seq2seq_models.py --train --model_name facebook/mbart-large-50
  ```
- To train a BERT-based model:

  ```shell
  python bert_based_models.py --train --model_name bert-base-multilingual-cased
  ```
- To perform ASR inference using a seq2seq model:

  ```shell
  python asr_infer_seq2seq.py --model_path /path/to/seq2seq_model
  ```
- To perform ASR inference using a BERT-based model:

  ```shell
  python asr_infer_bert.py --model_path /path/to/bert_model
  ```
Core developers:

**Khai Le-Duc**
- University of Toronto, Canada
- Email: duckhai.le@mail.utoronto.ca
- GitHub: https://github.com/leduckhai

**Hung-Phong Tran**
- Hanoi University of Science and Technology, Vietnam
- GitHub: https://github.com/hungphongtrn