wordpiece tokenization issue with named entity recognition #37

Answered by WissamAntoun
rajae-Bens asked this question in Q&A

For NER, we put the label only on the first token of a word that has been split. The remaining tokens of that word are labeled with the ignore index of the CrossEntropyLoss function, so they are skipped when the loss is computed, as you can see in the code below.

word_tokens = TOKENIZER.tokenize(clean_word)
if len(word_tokens) > 0:
    tokens.extend(word_tokens)
    # Use the real label id for the first token of the word, and padding ids for the remaining tokens
    label_ids.extend([self.label_map[label]] + [self.pad_token_label_id] * (len(word_tokens) - 1))

# Defined elsewhere in the same class, before the code above runs:
self.pad_token_label_id = nn.CrossEntropyLoss().ignore_index
# Use cross entropy ignore_index (-100 by default) as padding label id so that only
# real label ids contribute to the loss later.
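
To make the idea concrete, here is a minimal, dependency-free sketch of the same labeling scheme. It is not the repository's code: `toy_tokenize`, `align_labels`, and `masked_cross_entropy` are hypothetical helpers written for illustration, the toy tokenizer only imitates WordPiece by cutting words into fixed-size chunks, and `IGNORE_INDEX = -100` is assumed to match the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
import math

# Assumption: -100 mirrors the default ignore_index of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100


def toy_tokenize(word, max_piece_len=3):
    """Hypothetical stand-in for a WordPiece tokenizer.

    Cuts the word into fixed-size chunks and prefixes continuation pieces with '##'.
    """
    pieces = [word[i:i + max_piece_len] for i in range(0, len(word), max_piece_len)]
    return [p if i == 0 else "##" + p for i, p in enumerate(pieces)]


def align_labels(words, labels, label_map, tokenize=toy_tokenize):
    """Give the real label id to the first sub-token of each word, IGNORE_INDEX to the rest."""
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = tokenize(word)
        if len(word_tokens) > 0:
            tokens.extend(word_tokens)
            label_ids.extend([label_map[label]] + [IGNORE_INDEX] * (len(word_tokens) - 1))
    return tokens, label_ids


def masked_cross_entropy(logits, label_ids, ignore_index=IGNORE_INDEX):
    """Mean cross entropy over positions whose label is not ignore_index.

    Returns 0.0 when every position is ignored (a choice made for this sketch only).
    """
    losses = []
    for row, target in zip(logits, label_ids):
        if target == ignore_index:
            continue  # continuation sub-tokens do not contribute to the loss
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))  # stable log-sum-exp
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0


label_map = {"O": 0, "B-LOC": 1}
tokens, label_ids = align_labels(["Cairo", "is", "crowded"], ["B-LOC", "O", "O"], label_map)
print(tokens)     # ['Cai', '##ro', 'is', 'cro', '##wde', '##d']
print(label_ids)  # [1, -100, 0, 0, -100, -100]
```

Because the continuation pieces carry `IGNORE_INDEX`, whatever the model predicts at those positions has no effect on the loss. At prediction time, the same convention means you would read the tag of each word from its first sub-token only.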

Answer selected by WissamAntoun

This discussion was converted from issue #37 on December 09, 2020 13:36.