Code source of the paper Evaluating Tokenizaer Impact on OOVs Representation with Transformers Models
to be presented at LREC 2022.
Transformer models have achieved significant improvements in multiple downstream tasks in recent years. One of the main contributions of Transformers is their ability to create new representations for out-of-vocabulary (OOV) words. In this paper, we have evaluated three categories of OOVs: (A) new domain-specific terms (e.g., "eucaryote" in microbiology), (B) misspelled words containing typos, and (C) cross-domain homographs (e.g., "arm" has different meanings in a clinical trial and in anatomy). We use three French domain-specific datasets on the legal, medical, and energetical domains to robustly analyze these categories. Our experiments have led to exciting findings that showed: (1) It is easier to improve the representation of new words (A and B) than it is for words that already exist in the vocabulary of the Transformer models (C), (2) To ameliorate the representation of OOVs, the most effective method relies on adding external morpho-syntactic context rather than improving the semantic understanding of the words directly (fine-tuning) and (3) We cannot foresee the impact of minor misspellings in words because similar misspellings have different impacts on their representation. We believe that tackling the challenges of processing OOVs regarding their specificities will significantly help the domain adaptation aspect of BERT.
Install the package with pip:
pip install .
Install the package with conda:
conda install -f environment.yml
Make sure all the dependencies have been installed using:
pip install -r requirements.txt
To extract journals from gallica, use the script src/data/gallica.py
. In the original paper, we used "Journal of Microbiology" that you can change in the script (line 45):
PRESSE_MEDICALE = [("journal_microbiologie", "http://gallica.bnf.fr/ark:/12148/cb34348753q/date", 1887, 1900)]
You can add other newspapers using the format ("name of the paper", "gallica link", "start date", "end date").
Two evaluation metrics are implemented in the file src/evaluating_tokenizers_oov/eval/eval_metrics.py
: Dice coefficient and Dice-SU coefficient, presented in the paper.