Evaluating Tokenizers Impact on OOVs Representation with Transformers Models

1. Description

Source code of the paper "Evaluating Tokenizers Impact on OOVs Representation with Transformers Models", to be presented at LREC 2022.

Abstract

Transformer models have achieved significant improvements in multiple downstream tasks in recent years. One of the main contributions of Transformers is their ability to create new representations for out-of-vocabulary (OOV) words. In this paper, we have evaluated three categories of OOVs: (A) new domain-specific terms (e.g., "eucaryote" in microbiology), (B) misspelled words containing typos, and (C) cross-domain homographs (e.g., "arm" has different meanings in a clinical trial and in anatomy). We use three French domain-specific datasets on the legal, medical, and energetical domains to robustly analyze these categories. Our experiments have led to exciting findings that showed: (1) It is easier to improve the representation of new words (A and B) than it is for words that already exist in the vocabulary of the Transformer models (C), (2) To ameliorate the representation of OOVs, the most effective method relies on adding external morpho-syntactic context rather than improving the semantic understanding of the words directly (fine-tuning) and (3) We cannot foresee the impact of minor misspellings in words because similar misspellings have different impacts on their representation. We believe that tackling the challenges of processing OOVs regarding their specificities will significantly help the domain adaptation aspect of BERT.

2. Installation

Install the package with pip:

pip install .

Alternatively, create the environment with conda:

conda env create -f environment.yml

Make sure all the dependencies have been installed using:

pip install -r requirements.txt

3. How to

Gallica Extractor

To extract journals from Gallica, use the script src/data/gallica.py. The original paper used the "Journal of Microbiology"; to extract a different journal, edit the list in the script (line 45):

PRESSE_MEDICALE = [("journal_microbiologie", "http://gallica.bnf.fr/ark:/12148/cb34348753q/date", 1887, 1900)]

You can add other newspapers as additional tuples of the form ("name of the paper", "gallica link", start year, end year), with the years given as integers as in the entry above.
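As an illustration of the tuple format, here is a minimal sketch of a list with a second entry and a small check of each entry. This is not the code of src/data/gallica.py: the second entry (its name and ark identifier) is a placeholder, the helper `check_entry` is hypothetical, and treating the end year as inclusive is an assumption.

```python
# Hypothetical sketch of the list format; not the actual content of src/data/gallica.py.
PRESSE_MEDICALE = [
    ("journal_microbiologie", "http://gallica.bnf.fr/ark:/12148/cb34348753q/date", 1887, 1900),
    # Placeholder entry: the name and the ark identifier below are made up for illustration.
    ("example_journal", "http://gallica.bnf.fr/ark:/12148/EXAMPLE_ID/date", 1890, 1892),
]


def check_entry(entry):
    """Validate one (name, link, start year, end year) tuple and list its years."""
    name, link, start_year, end_year = entry
    if not isinstance(name, str) or not isinstance(link, str):
        raise TypeError("name and link must be strings")
    if not (isinstance(start_year, int) and isinstance(end_year, int)):
        raise TypeError("start and end years must be integers")
    if start_year > end_year:
        raise ValueError("start year must not exceed end year")
    # Assumption: the end year is included in the extraction range.
    years = list(range(start_year, end_year + 1))
    return {"name": name, "link": link, "years": years}


entries = [check_entry(e) for e in PRESSE_MEDICALE]
for e in entries:
    print(e["name"], e["years"][0], e["years"][-1], len(e["years"]))
```

Running such a check before launching the extraction may help catch a malformed tuple early, for example years passed as strings.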

Preprocess Data

Finetune Language Model

Cosine Similarity

Evaluation Metrics

Two evaluation metrics presented in the paper are implemented in src/evaluating_tokenizers_oov/eval/eval_metrics.py: the Dice coefficient and the Dice-SU coefficient.
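For readers unfamiliar with the first metric, the sketch below computes the standard set-based Dice coefficient, 2|A ∩ B| / (|A| + |B|). It is only an illustration under assumptions: the function name `dice_coefficient`, the example token lists, and the convention for two empty sets are not taken from eval_metrics.py, and the Dice-SU variant defined in the paper is not reproduced here.

```python
def dice_coefficient(a, b):
    """Standard Dice coefficient between two collections, compared as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        # Convention chosen for this sketch: two empty sets are identical.
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Hypothetical subword splits of the same word, made up for illustration.
tokens_a = ["eu", "##cary", "##ote"]
tokens_b = ["eu", "##car", "##yote"]

score = dice_coefficient(tokens_a, tokens_b)
print(round(score, 3))  # one shared token out of 3 + 3, i.e. 2 * 1 / 6
```

The score ranges from 0 (no shared element) to 1 (identical sets).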
