Found taxon name has spurious characters #150

jbest · 2024-01-05T23:33:48Z

When a found taxon name is immediately preceded by a bracketed tag containing numbers, e.g. "[EX###]", the returned name is prefixed with the contents of the brackets, with the numbers replaced by "�" (Unicode U+FFFD, at least as displayed in my editors).

Example below:

"verbatim": "493.[SC493]Silybum marianum",
"name": "Sc��silybum marianum",

This problem does not arise if there is a space character after the closing bracket, e.g. "[SC493] Silybum marianum"
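A workaround on the input side is to insert the missing space before the text reaches the name finder. The following is a minimal sketch, not part of gnfinder: `padBracketTags` and `tagBeforeLetter` are hypothetical names, and the pattern assumes tags contain no spaces or nested brackets. It only touches a closing bracket that is directly followed by a letter.

```go
package main

import "regexp"

// tagBeforeLetter matches a bracketed tag such as "[SC493]" that is
// immediately followed by a letter. Hypothetical helper, not gnfinder code.
var tagBeforeLetter = regexp.MustCompile(`(\[[^\[\]\s]+\])(\p{L})`)

// padBracketTags inserts a single space between a bracketed tag and a
// letter that directly follows it. All other text is left unchanged.
func padBracketTags(s string) string {
	return tagBeforeLetter.ReplaceAllString(s, "${1} ${2}")
}
```

Calling `padBracketTags("493.[SC493]Silybum marianum")` returns `493.[SC493] Silybum marianum`, while input that already has the space is returned as-is.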

After further investigation, I found some new behavior. The example above used the API; the one below uses the web interface.
For the input:
493.[SC495]Quercus rubrum
493.[SC493]Silybum marianum
493.[SC495]Quercus alba
493.[SC493] Silybum marianum

For some reason some names were found, but Quercus alba was not. The results in JSON:
{
  "metadata": {
    "documentation": "",
    "date": "2024-01-06T00:33:46.215756806Z",
    "gnfinderVersion": "v1.1.3",
    "nameFindingSec": 0.000258374,
    "totalSec": 0.000258374,
    "wordsAround": 0,
    "language": "eng",
    "withUniqueNames": true,
    "withBayes": true,
    "totalWords": 9,
    "totalNameCandidates": 5,
    "totalNames": 3
  },
  "names": [
    {
      "cardinality": 2,
      "name": "Sc��quercus rubrum",
      "oddsLog10": 6.3452923554738145,
      "start": 0,
      "end": 25
    },
    {
      "cardinality": 2,
      "name": "Sc��silybum marianum",
      "oddsLog10": 5.617378305659413,
      "start": 27,
      "end": 54
    },
    {
      "cardinality": 2,
      "name": "Silybum marianum",
      "oddsLog10": 10.18840206871061,
      "start": 93,
      "end": 109
    }
  ]
}

dimus · 2024-01-10T15:04:28Z

Thank you for letting us know about the problem, @jbest. I think the problem is in the tokenizing stage. Currently, the following characters are treated as separators between tokens:

// space chars that indicate new line have value true
var spaceChr = map[rune]bool{
	'\n':     true,
	'\r':     true,
	'\v':     false,
	'\t':     false,
	'\uFEFF': false,
	' ':      false,
}
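
To illustrate why the bracketed tag ends up glued to the name, here is a minimal sketch that splits text on the characters in the map quoted above. It is not gnfinder's tokenizer: `splitTokens` is a hypothetical helper, and it ignores the true/false values (which mark new lines) and uses key membership only.

```go
package main

import "strings"

// spaceChr copies the map quoted above; true marks new-line characters.
var spaceChr = map[rune]bool{
	'\n':     true,
	'\r':     true,
	'\v':     false,
	'\t':     false,
	'\uFEFF': false,
	' ':      false,
}

// splitTokens is a hypothetical helper (an assumption, not gnfinder code).
// It splits text wherever a rune is a key of spaceChr, whatever its value.
func splitTokens(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		_, isSeparator := spaceChr[r]
		return isSeparator
	})
}
```

With this splitting rule, `493.[SC493]Silybum marianum` yields two tokens, `493.[SC493]Silybum` and `marianum`, so the tag and the genus share one token. With a space after the bracket, `Silybum` becomes a token of its own.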

I am a bit reluctant to add more characters without some thought, because I want to keep the number of false positives down. Can you describe in more detail what kind of text created these problems?

jbest · 2024-01-10T15:48:41Z

@dimus The examples I provided above are fabricated, but represent a rare scenario we encountered in our text. The text is human transcription of a botanist's field notebook. The brackets don't exist in the source material, they are added by transcribers to standardize the field number because the number written in the notebook sometimes omits the first digit (e.g. 234 should actually be 1234). We've instructed transcribers to make sure brackets have spaces before and after them to prevent this error in the future so we have a solution that works. But I'm curious about why Quercus rubrum is found (though with spurious characters added to the result), but Quercus alba is not, e.g.:
494.[SC494]Quercus rubrum
495.[SC495]Quercus alba

Below is some actual text (without an example that would generate this error):

[SC42] (Crucifer) – same –
rocky bluff facing
sea.
[SC43] (Amsinckia) – same.
Orange blotches
in throat.
[SC44] (Platystema) – same.
[SC45] (Ceanothus) bluff facing
sea – wind distorted
fan[?]
[SC46] (Viola) same location.
[SC47] (Vaccinium) same
location – wind
distorted fan.
[SC48] (Iris[?]) – same location
[SC49] Asteraceae – ligulae
billed[?] on windward
side – same
location –
small shrub – 8” high

Thanks for this incredible tool, we couldn't do our work without it!

dimus · 2024-01-10T16:28:26Z

@jbest, thank you for your kind words!

Hm, the text you provided should not create any problems, because there is a space between a [SC**] tag and the name. With the "Show ambiguous uninomials" flag I get

Index,Verbatim,Name,Start,End,OddsLog10,Cardinality,AnnotNomenType,WordsBefore,WordsAfter
0,(Amsinckia),Amsinckia,61,72,4.74,1,NO_ANNOT,,
1,(Ceanothus),Ceanothus,147,158,3.88,1,NO_ANNOT,,
2,(Viola),Viola,210,217,,1,NO_ANNOT,,
3,(Vaccinium),Vaccinium,241,252,5.42,1,NO_ANNOT,,
4,(Iris[?]),Iris,299,308,,1,NO_ANNOT,,
5,Asteraceae,Asteraceae,333,343,4.75,1,NO_ANNOT,,

The missing Crucifer and Platystema do not appear anywhere in the databases: https://verifier.globalnames.org/?capitalize=on&format=json&names=Crucifer%0D%0APlatystema

jbest · 2024-01-11T01:47:23Z

@dimus Right, this last sample had all the spaces correctly added before and after brackets and all of our text going forward will have that correction. The text we are transcribing is a challenge to read sometimes and has some mis-spellings so we're not expecting to find all names automatically. "Platystema" should be "Platystemma". "Crucifer" isn't a proper scientific name, just a common name/shorthand for Brassicaceae.

dimus · 2024-01-11T12:59:01Z

I think a solution for situations where names are not separated by spaces or () is to add an option to relax the tokenizer, so it can split not only by spaces, but also by other separators common in some documents, like ,.<>[]{}. I am going to close this issue and add another one instead. Please leave your comments there, if you have some ideas about implementation @jbest

#151
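
As a rough illustration of the proposed option, the sketch below splits on whitespace and, when a flag is set, also on the separators listed in the comment (,.<>[]{}). This is only an assumption about how such an option might behave, not the implementation planned for #151; `splitRelaxed`, `extraSeparators`, and the `relaxed` parameter are hypothetical names.

```go
package main

import (
	"strings"
	"unicode"
)

// extraSeparators lists the additional separators named in the comment.
const extraSeparators = ",.<>[]{}"

// splitRelaxed is a hypothetical sketch, not gnfinder code. It always
// splits on Unicode whitespace; when relaxed is true it also splits on
// any rune in extraSeparators.
func splitRelaxed(text string, relaxed bool) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		if unicode.IsSpace(r) {
			return true
		}
		return relaxed && strings.ContainsRune(extraSeparators, r)
	})
}
```

With `relaxed` set to true, `493.[SC493]Silybum marianum` splits into `493`, `SC493`, `Silybum`, and `marianum`. With it set to false, the tag stays attached to the genus. One trade-off to keep in mind is that splitting on "." also separates abbreviated genera such as "Q. alba" from their period, so such an option would probably need to stay opt-in.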

jbest changed the title ~~Found taxon name~~ Found taxon name has spurious characters Jan 5, 2024

dimus mentioned this issue Jan 11, 2024

Add an option that would relax token formation #151

Open

dimus closed this as completed Jan 11, 2024
