wordpiece tokenization issue with named entity recognition #37
rajae-Bens asked this question in Q&A
Hi. As you know, BERT uses WordPiece tokenization. How did you handle this for named entity tagging, given that one word can be split into multiple pieces? I don't think you mentioned this in your notebook AraBERT-ANERCorp-CamelSplits. Also, could you add inference on a single instance to your code? Thanks in advance.
Answered by
WissamAntoun
Nov 29, 2020
For NER, we put the label only on the first token of a word that has been split. The remaining tokens of that word are labeled with the ignore index of the CrossEntropyLoss function, as you can see in the code below.

    word_tokens = TOKENIZER.tokenize(clean_word)
    if len(word_tokens) > 0:
        tokens.extend(word_tokens)
        # Use the real label id for the first token of the word, and padding ids for the remaining tokens
        label_ids.extend([self.label_map[label]] + [self.pad_token_label_id] * (len(word_tokens) - 1))

    # Use cross entropy ignore_index as padding label id so that only
    # real label ids contribute to the loss later.
    self.pad_token_label_id = nn.CrossEntropyLoss().ignore_index
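To make the alignment concrete, here is a minimal, self-contained sketch of the same idea. It is not the notebook's code. `toy_tokenize` is a hypothetical stand-in for `TOKENIZER.tokenize`, `align_labels` and `word_level_predictions` are illustrative names, and the constant `-100` is hard-coded as the default `ignore_index` of `torch.nn.CrossEntropyLoss` so the snippet runs without PyTorch. The last helper shows one possible way to read word-level tags back at inference time, assuming you keep only the prediction at each word's first sub-token.

```python
# Default value of torch.nn.CrossEntropyLoss().ignore_index,
# hard-coded here so the sketch runs without PyTorch installed.
IGNORE_INDEX = -100


def toy_tokenize(word):
    """Hypothetical stand-in for TOKENIZER.tokenize: cuts a word into
    pieces of at most 3 characters and marks continuations with '##'."""
    pieces = [word[i:i + 3] for i in range(0, len(word), 3)]
    return [p if i == 0 else "##" + p for i, p in enumerate(pieces)]


def align_labels(words, labels, label_map, tokenize=toy_tokenize):
    """Give the real label id to the first sub-token of each word and
    IGNORE_INDEX to every remaining sub-token of that word."""
    tokens, label_ids = [], []
    for word, label in zip(words, labels):
        word_tokens = tokenize(word)
        if len(word_tokens) > 0:
            tokens.extend(word_tokens)
            label_ids.extend(
                [label_map[label]] + [IGNORE_INDEX] * (len(word_tokens) - 1)
            )
    return tokens, label_ids


def word_level_predictions(token_pred_ids, label_ids):
    """Keep only the predictions made at first-sub-token positions,
    which yields one predicted label id per original word."""
    return [p for p, l in zip(token_pred_ids, label_ids) if l != IGNORE_INDEX]


label_map = {"O": 0, "B-LOC": 1}
tokens, label_ids = align_labels(["in", "Beirut"], ["O", "B-LOC"], label_map)
print(tokens)     # ['in', 'Bei', '##rut']
print(label_ids)  # [0, 1, -100]
```

Because the continuation piece `##rut` carries the ignore index, it would contribute nothing to the loss during training, and its prediction is simply dropped when mapping token-level outputs back to words.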