You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
As discussed in issue #26, DBpedia Spotlight occasionally annotates entities which do not perfectly align with token spans. The fix discussed in issue #26 is incomplete, however.
Working with GermaParl, it became apparent that there are scenarios in which tokenization can be tricky for the left entity boundary as well. In phrases like "G-8-Gipfel" (which, in GermaParl is often tokenized in two tokens, "G" and "-8-Gipfel"), the entity identified by DBpedia Spotlight is "Gipfel" which starts in the middle of the token. This is an issue when we join tokens and entities based on their starting positions as the offset is different, thus leading to a "NA" value in the left corpus position of the entity.
Potential Solutions
If we want to address this, we could use the same approach as suggested for issue #26: Expand the span to the previous token boundary. For this, we could compare the starting positions of the entity and tokens and chose the previous token using an extended version of the expand_fun() auxiliary function introduced earlier:
if (isTRUE(drop_inexact_annotations) & any(is.na(tab[["cpos_right"]]))) {
Discussion
As with issue #26, this should be optional and comes with some conceptual considerations, in particular whether it always makes sense to expand the entity span to match the token span.
This might also be not very efficient as this is checked for each entity.
The text was updated successfully, but these errors were encountered:
Issue
As discussed in issue #26, DBpedia Spotlight occasionally annotates entities which do not perfectly align with token spans. The fix discussed in issue #26 is incomplete, however.
Working with GermaParl, it became apparent that there are scenarios in which tokenization can be tricky for the left entity boundary as well. In phrases like "G-8-Gipfel" (which, in GermaParl is often tokenized in two tokens, "G" and "-8-Gipfel"), the entity identified by DBpedia Spotlight is "Gipfel" which starts in the middle of the token. This is an issue when we join tokens and entities based on their starting positions as the offset is different, thus leading to a "NA" value in the left corpus position of the entity.
Potential Solutions
If we want to address this, we could use the same approach as suggested for issue #26: Expand the span to the previous token boundary. For this, we could compare the starting positions of the entity and tokens and chose the previous token using an extended version of the
expand_fun()
auxiliary function introduced earlier:This would make it necessary to adjust the following chunk as well:
The possibility that there are incomplete annotations for "cpos_left" should be considered here as well:
dbpedia/R/dbpedia.R
Line 693 in 4a8fd3c
Discussion
As with issue #26, this should be optional and comes with some conceptual considerations, in particular whether it always makes sense to expand the entity span to match the token span.
This might also be not very efficient as this is checked for each entity.
The text was updated successfully, but these errors were encountered: