BioBERTpt - A Portuguese Neural Language Model for Clinical Named Entity Recognition - Schneider et al. - 2020

📌 Paper

tl;dr

Our goal is to assess a deep contextual embedding model for Portuguese, called BioBERTpt, to support clinical and biomedical NER. We transfer the information encoded in a multilingual BERT model to a corpus of clinical narratives and biomedical scientific papers in Brazilian Portuguese.

[GitHub repository]

Introduction

In the clinical domain, NER can be used to identify clinical concepts, such as diseases, signs, procedures and drugs, supporting other data analyses such as prediction of future clinical events, summarization, and relation extraction between entities (e.g., drug-to-drug interaction).

General-purpose word representation models applied to healthcare text mining do not account for the characteristics of clinical texts, which are known to be noisy and to have a distinct vocabulary, expressions, and word distribution (Knake et al., 2016).

Contextual word embedding models such as BERT can therefore be fine-tuned, i.e., have their last layers updated to adapt to a specific domain such as the clinical or biomedical one, using domain-specific training data (Ranti et al., 2020).

Several models were trained on clinical and biomedical corpora:

word2vec model trained on biomedical corpora (Pyysalo et al., 2013)
BioBERT - initialized from BERT and further pre-trained on biomedical scientific texts (Lee et al., 2019)
Clinical BERT - pre-trained model with clinical data (Alsentzer et al., 2019)

All these studies used English corpora. There are few studies in lower-resource languages for the clinical domain.

In Portuguese:

fastText model trained with clinical texts (Lopes et al., 2019)
CRF algorithm for the NER task (de Souza et al., 2019)
Clinical word embedding model evaluated on Urinary Tract Infection disease identification (Oliveira et al., 2019)

The objective of this work is to assess the performance of a domain-specific attention-based model, BioBERTpt, to support NER tasks in Portuguese clinical narratives.

Methods

Development of BioBERTpt

We fine-tuned three BERT-based models on Portuguese clinical and biomedical corpora, initialized with multilingual BERT weights provided by Devlin et al. (2018).
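The notes do not state the training objective used for this domain adaptation. Assuming it is the standard BERT masked-language-model objective, the input corruption step can be sketched as below. The function name `mask_tokens`, the toy vocabulary, and the example sentence are hypothetical; the 15% selection rate and the 80/10/10 split are the defaults from the original BERT recipe, not figures reported in these notes.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"

def mask_tokens(
    tokens: List[str],
    vocab: List[str],
    rng: random.Random,
    select_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Corrupt a token sequence in the style of BERT's masked-LM objective.

    Returns the corrupted inputs and the labels: the original token at each
    selected position, None everywhere else.
    """
    inputs = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

# Hypothetical example sentence and vocabulary, for illustration only.
tokens = ["paciente", "com", "dor", "abdominal", "há", "dois", "dias"]
vocab = ["febre", "tosse", "dor", "paciente", "dias"]
inputs, labels = mask_tokens(tokens, vocab, random.Random(0))
```

Only the positions with a non-`None` label contribute to the loss, so the model keeps seeing mostly uncorrupted domain text while it adapts.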

1. BioBERTpt(clin)

A model trained on clinical data: narratives from Brazilian hospitals.

In total, the clinical notes contain 3.8 million sentences with 27.7 million words.

2. BioBERTpt(bio)

A model trained on biomedical data: titles and abstracts of scientific papers from the SciELO and PubMed databases on biological and health topics, totaling 16.4 million words.

3. BioBERTpt(all)

A full version, i.e., using both clinical and biomedical data.
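As a quick sanity check on the corpus sizes quoted above, the arithmetic can be sketched as follows. The combined figure assumes the full model simply concatenates the two corpora with no deduplication or filtering, which the notes do not confirm; the helper names are hypothetical.

```python
# Figures quoted in the notes above, in millions.
CLIN_SENTENCES_M = 3.8
CLIN_WORDS_M = 27.7
BIO_WORDS_M = 16.4

def words_per_sentence(words_m: float, sentences_m: float) -> float:
    """Average sentence length, given word and sentence counts in the same unit."""
    return words_m / sentences_m

def combined_words_m(*parts_m: float) -> float:
    """Total word count, assuming the parts are concatenated without overlap."""
    return sum(parts_m)

avg_len = words_per_sentence(CLIN_WORDS_M, CLIN_SENTENCES_M)
total_words_m = combined_words_m(CLIN_WORDS_M, BIO_WORDS_M)

print(f"clinical notes: about {avg_len:.1f} words per sentence")
print(f"clinical + biomedical: about {total_words_m:.1f} million words")
```

An average of roughly seven words per sentence is consistent with the short, telegraphic style of clinical narratives described in the Introduction.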

NER experiments

We ran two NER experiments, using the following corpora:

SemClinBr (Oliveira et al., 2020), a semantically annotated corpus for Portuguese clinical NER, containing 1,000 labeled clinical notes.
CLINpt (Lopes et al., 2019), a collection of 281 Neurology clinical case descriptions, with manually-annotated named entities.
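For token-classification models such as BERT, annotated entity spans are commonly converted into per-token BIO tags. The sketch below illustrates that conversion; whether either corpus is distributed in this format is an assumption, and the sentence, the label names (`Disorder`, `Drug`), and the function `spans_to_bio` are hypothetical.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]

def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping entity spans into one BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if not (0 <= start < end <= num_tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # continuation tokens
    return tags

# Invented example, not taken from SemClinBr or CLINpt.
tokens = ["Paciente", "com", "infecção", "urinária", "em", "uso", "de", "ciprofloxacino"]
spans = [(2, 4, "Disorder"), (7, 8, "Drug")]
tags = spans_to_bio(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```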

We compare BioBERTpt with existing contextual models:

BERT multilingual (uncased and cased)
Portuguese BERT (base and large) (Souza et al., 2019)

Discussion

Effect of domain

By evaluating BioBERTpt, we found that the domain can influence the performance of BERT-based models, particularly for domains with unique characteristics such as the medical domain. Our in-domain models achieved higher scores on the averaged metrics.
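The notes do not say which metrics were averaged or how. NER systems are usually scored with entity-level precision, recall, and F1, so a minimal sketch of that computation, assuming exact-match scoring over BIO tags, is given below. The function names and the example tag sequences are hypothetical and do not reproduce the paper's evaluation script.

```python
from typing import Dict, List, Set, Tuple

# (start_token, end_token_exclusive, label)
Entity = Tuple[int, int, str]

def bio_to_entities(tags: List[str]) -> Set[Entity]:
    """Extract entities from a BIO tag sequence; stray I- tags are ignored."""
    entities: Set[Entity] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities

def precision_recall_f1(gold: Set[Entity], pred: Set[Entity]) -> Dict[str, float]:
    """Exact-match scores: an entity counts only if span and label both match."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Invented tag sequences: the prediction truncates the first entity.
gold_tags = ["O", "O", "B-Disorder", "I-Disorder", "O", "O", "O", "B-Drug"]
pred_tags = ["O", "O", "B-Disorder", "O", "O", "O", "O", "B-Drug"]
scores = precision_recall_f1(bio_to_entities(gold_tags), bio_to_entities(pred_tags))
print(scores)
```

Under exact matching, the truncated `Disorder` span counts as both a false positive and a false negative, which is why partial boundary errors are costly in entity-level scores.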

Effect of the contextualized language model

The use of BERT-based models in our work had a positive impact on the results when compared to previous works that used traditional machine learning algorithms and word embeddings for NER in Portuguese clinical text (de Souza et al., 2019; Lopes et al., 2019).

Effect of language

The generic Portuguese BERT models (Souza et al., 2019) were outperformed by the BERT multilingual versions. This may be due to a local minimum or to catastrophic forgetting.

Catastrophic forgetting can happen during the fine-tuning step, when new, distinct knowledge overwrites the model's previous knowledge, leading to a loss of information in the lower layers (Xu et al., 2019).

This may have occurred because the linguistic characteristics of clinical texts are very different from those of the corpus used during the pre-training phase of Portuguese BERT, a web corpus drawn from 120,000 different Brazilian websites.

Future work

We would like to explore larger transformer-based models in the clinical Portuguese domain and evaluate our model on different clinical NLP tasks, such as negation detection, summarization and de-identification.
