wordpiece tokenization issue with named entity recognition #37

rajae-Bens · 2020-11-27T17:41:09Z

Hi,

As you know, BERT uses WordPiece tokenization. How did you handle this for named entity tagging, given that one word can be split into multiple pieces? I don't think you mention this in your notebook AraBERT-ANERCorp-CamelSplits. Also, could you add inference on a single instance to your code?

Thanks in advance

WissamAntoun (Maintainer) · 2020-11-29T05:36:55Z

For NER, we put the label only on the first token of a word that has been split. The remaining tokens of that word are labeled with the ignore index of the CrossEntropyLoss function, so they do not contribute to the loss, as you can see in the code below.

word_tokens = TOKENIZER.tokenize(clean_word)
if len(word_tokens) > 0:
    tokens.extend(word_tokens)
    # Use the real label id for the first token of the word, and padding ids for the remaining tokens
    label_ids.extend([self.label_map[label]] + [self.pad_token_label_id] * (len(word_tokens) - 1))

# Use the cross-entropy ignore_index as the padding label id so that only
# real label ids contribute to the loss later.
self.pad_token_label_id = nn.CrossEntropyLoss().ignore_index
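
To make the alignment concrete, here is a minimal, self-contained sketch of the same idea. It is not the notebook's code: `toy_tokenize` is a hypothetical stand-in for `TOKENIZER.tokenize`, the helper names and sample sentence are made up for illustration, and the ignore index is hard-coded to -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`) so that it runs without torch. The second helper shows one possible way to get back to one label per word for a single instance, by keeping only the predictions at first-piece positions.

```python
from typing import Callable, Dict, List, Tuple

# Default ignore_index of torch.nn.CrossEntropyLoss, hard-coded here so the
# sketch runs without torch installed.
IGNORE_INDEX = -100


def toy_tokenize(word: str, piece_len: int = 3) -> List[str]:
    """Hypothetical stand-in for TOKENIZER.tokenize: cuts a word into
    fixed-size pieces and marks continuation pieces with '##'."""
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [p if i == 0 else "##" + p for i, p in enumerate(pieces)]


def align_labels(
    words: List[str],
    labels: List[str],
    label_map: Dict[str, int],
    tokenize: Callable[[str], List[str]] = toy_tokenize,
) -> Tuple[List[str], List[int]]:
    """Give the real label id to the first piece of each word and
    IGNORE_INDEX to every remaining piece."""
    tokens: List[str] = []
    label_ids: List[int] = []
    for word, label in zip(words, labels):
        word_tokens = tokenize(word)
        if len(word_tokens) > 0:
            tokens.extend(word_tokens)
            label_ids.extend(
                [label_map[label]] + [IGNORE_INDEX] * (len(word_tokens) - 1)
            )
    return tokens, label_ids


def word_level_predictions(
    pred_ids: List[int],
    label_ids: List[int],
    id_to_label: Dict[int, str],
) -> List[str]:
    """Keep only the predictions at first-piece positions, i.e. where the
    aligned label id is not IGNORE_INDEX, giving one label per word."""
    return [
        id_to_label[pred]
        for pred, gold in zip(pred_ids, label_ids)
        if gold != IGNORE_INDEX
    ]


if __name__ == "__main__":
    label_map = {"O": 0, "B-PER": 1, "B-LOC": 2}
    words = ["Wissam", "lives", "in", "Beirut"]
    labels = ["B-PER", "O", "O", "B-LOC"]
    tokens, label_ids = align_labels(words, labels, label_map)
    print(list(zip(tokens, label_ids)))
```

At inference time there are no gold labels, so one option (an assumption, not something stated in this thread) is to pass a dummy label such as "O" for every word and use the resulting `label_ids` purely as a mask of first-piece positions.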
