Found taxon name has spurious characters #150

Closed
jbest opened this issue Jan 5, 2024 · 5 comments

jbest commented Jan 5, 2024

When a found taxon name is immediately preceded by a bracketed tag containing numbers (e.g. "[EX###]") with no space in between, the returned name is prefixed with the contents of the brackets, and the numbers are replaced with "�" (Unicode U+FFFD), at least as rendered in my editors.

Example below:

"verbatim": "493.[SC493]Silybum marianum",
"name": "Sc����silybum marianum",

This problem does not arise if there is a space character after the closing bracket, e.g. "[SC493] Silybum marianum"

After further investigation, I found some additional behavior. The example above used the API; the one below uses the web interface.
For the input:
493.[SC495]Quercus rubrum
493.[SC493]Silybum marianum
493.[SC495]Quercus alba
493.[SC493] Silybum marianum

For some reason, some names were found but Quercus alba was not. The results in JSON:
{
  "metadata": {
    "documentation": "",
    "date": "2024-01-06T00:33:46.215756806Z",
    "gnfinderVersion": "v1.1.3",
    "nameFindingSec": 0.000258374,
    "totalSec": 0.000258374,
    "wordsAround": 0,
    "language": "eng",
    "withUniqueNames": true,
    "withBayes": true,
    "totalWords": 9,
    "totalNameCandidates": 5,
    "totalNames": 3
  },
  "names": [
    {
      "cardinality": 2,
      "name": "Sc����quercus rubrum",
      "oddsLog10": 6.3452923554738145,
      "start": 0,
      "end": 25
    },
    {
      "cardinality": 2,
      "name": "Sc����silybum marianum",
      "oddsLog10": 5.617378305659413,
      "start": 27,
      "end": 54
    },
    {
      "cardinality": 2,
      "name": "Silybum marianum",
      "oddsLog10": 10.18840206871061,
      "start": 93,
      "end": 109
    }
  ]
}
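
Until the underlying cause is fixed, output like the JSON above could be screened for affected names. The following is a minimal sketch that decodes only the fields visible in that output and keeps the entries whose "name" contains U+FFFD. The `foundName` and `output` types and the `suspectNames` function are hypothetical names for illustration, not part of gnfinder.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// foundName holds only the per-name fields that appear in the JSON above.
type foundName struct {
	Cardinality int     `json:"cardinality"`
	Name        string  `json:"name"`
	OddsLog10   float64 `json:"oddsLog10"`
	Start       int     `json:"start"`
	End         int     `json:"end"`
}

// output is a partial view of the result; the metadata section is ignored.
type output struct {
	Names []foundName `json:"names"`
}

// suspectNames (hypothetical helper) returns the names that contain the
// Unicode replacement character U+FFFD.
func suspectNames(data []byte) ([]foundName, error) {
	var out output
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	var suspects []foundName
	for _, n := range out.Names {
		if strings.ContainsRune(n.Name, '\uFFFD') {
			suspects = append(suspects, n)
		}
	}
	return suspects, nil
}

func main() {
	// Shortened sample; \ufffd is the JSON escape for U+FFFD.
	sample := []byte(`{"names":[
		{"cardinality":2,"name":"Sc\ufffd\ufffd\ufffd\ufffdsilybum marianum","start":27,"end":54},
		{"cardinality":2,"name":"Silybum marianum","start":93,"end":109}]}`)
	suspects, err := suspectNames(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, n := range suspects {
		fmt.Printf("suspect name at %d-%d: %q\n", n.Start, n.End, n.Name)
	}
}
```

The reported "start" and "end" offsets can then be used to look up the original text for each flagged entry and correct it by hand.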

jbest changed the title from "Found taxon name" to "Found taxon name has spurious characters" Jan 5, 2024
Member

dimus commented Jan 10, 2024

Thank you for letting us know about the problem, @jbest. I think the problem is in the tokenizing stage. Currently, the following characters are treated as splitting characters between tokens:

// space chars that indicate new line have value true
var spaceChr = map[rune]bool{
	'\n':     true,
	'\r':     true,
	'\v':     false,
	'\t':     false,
	'\uFEFF': false,
	' ':      false,
}
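
To illustrate why the missing space matters, here is a minimal sketch that assumes tokens are split only on the characters in the map above. `splitTokens` is a hypothetical helper written for this illustration, not gnfinder's actual tokenizer, which does more than splitting.

```go
package main

import (
	"fmt"
	"strings"
)

// spaceChr mirrors the map quoted above: the keys are the runes that
// separate tokens; the value is true when the rune also marks a new line.
var spaceChr = map[rune]bool{
	'\n':     true,
	'\r':     true,
	'\v':     false,
	'\t':     false,
	'\uFEFF': false,
	' ':      false,
}

// splitTokens (hypothetical) splits text on any rune present in spaceChr.
func splitTokens(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		_, isSeparator := spaceChr[r]
		return isSeparator
	})
}

func main() {
	fmt.Printf("%q\n", splitTokens("493.[SC493]Silybum marianum"))
	fmt.Printf("%q\n", splitTokens("493.[SC493] Silybum marianum"))
}
```

Under that assumption, "493.[SC493]Silybum marianum" yields the tokens "493.[SC493]Silybum" and "marianum", so the tag and the genus stay fused in one token. The version with a space after the bracket yields "Silybum" as a token of its own.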

I am a bit reluctant to add more characters without some thought, in order to keep the number of false positives down. Can you describe in more detail what kind of text created these problems?

Author

jbest commented Jan 10, 2024

@dimus The examples I provided above are fabricated, but represent a rare scenario we encountered in our text. The text is a human transcription of a botanist's field notebook. The brackets don't exist in the source material; transcribers add them to standardize the field number, because the number written in the notebook sometimes omits the first digit (e.g. 234 should actually be 1234). We've instructed transcribers to make sure brackets have spaces before and after them to prevent this error in the future, so we have a solution that works. But I'm curious why Quercus rubrum is found (though with spurious characters added to the result) but Quercus alba is not, e.g.:
494.[SC494]Quercus rubrum
495.[SC495]Quercus alba
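
For text that has already been transcribed without the spaces, the same workaround could be applied automatically before name finding. This is a minimal sketch, assuming the tags always look like an opening bracket, uppercase letters, digits, and a closing bracket (as in "[SC494]"). The pattern and the `padTags` function are assumptions made for illustration and are not part of gnfinder.

```go
package main

import (
	"fmt"
	"regexp"
)

// tagRe matches a field-number tag such as [SC494] together with any spaces
// or tabs around it. Markers like [?] are deliberately not matched.
var tagRe = regexp.MustCompile(`[ \t]*(\[[A-Z]+[0-9]+\])[ \t]*`)

// padTags (hypothetical) ensures exactly one space on each side of every tag.
func padTags(text string) string {
	return tagRe.ReplaceAllString(text, " ${1} ")
}

func main() {
	fmt.Println(padTags("494.[SC494]Quercus rubrum"))
	fmt.Println(padTags("495.[SC495]Quercus alba"))
}
```

Note that padding shifts character positions, so any start and end offsets in the results would refer to the padded text rather than to the original transcription.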

Below is some actual text (without an example that would generate this error):

  1. [SC42] (Crucifer) – same –
    rocky bluff facing
    sea.
  2. [SC43] (Amsinckia) – same.
    Orange blotches
    in throat.
  3. [SC44] (Platystema) – same.
  4. [SC45] (Ceanothus) bluff facing
    sea – wind distorted
    fan[?]
  5. [SC46] (Viola) same location.
  6. [SC47] (Vaccinium) same
    location – wind
    distorted fan.
  7. [SC48] (Iris[?]) – same location
  8. [SC49] Asteraceae – ligulae
    billed[?] on windward
    side – same
    location –
    small shrub – 8” high

Thanks for this incredible tool, we couldn't do our work without it!

Member

dimus commented Jan 10, 2024

@jbest, thank you for your kind words!

Hm, the text you provided should not create any problems, because there is a space between the [SC**] tag and the name. With the "Show ambiguous uninomials" flag I get:

Index,Verbatim,Name,Start,End,OddsLog10,Cardinality,AnnotNomenType,WordsBefore,WordsAfter
0,(Amsinckia),Amsinckia,61,72,4.74,1,NO_ANNOT,,
1,(Ceanothus),Ceanothus,147,158,3.88,1,NO_ANNOT,,
2,(Viola),Viola,210,217,,1,NO_ANNOT,,
3,(Vaccinium),Vaccinium,241,252,5.42,1,NO_ANNOT,,
4,(Iris[?]),Iris,299,308,,1,NO_ANNOT,,
5,Asteraceae,Asteraceae,333,343,4.75,1,NO_ANNOT,,

The missing Crucifer and Platystema do not appear anywhere in the databases: https://verifier.globalnames.org/?capitalize=on&format=json&names=Crucifer%0D%0APlatystema
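
A link like the one above can be assembled programmatically. The sketch below only builds the URL and sends no request. The query parameters (`capitalize`, `format`, `names`) are copied from that link rather than taken from the verifier's documentation, and `verifierURL` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// verifierURL (hypothetical) builds a verifier link for the given names,
// joining them with CRLF as in the link above.
func verifierURL(names []string) string {
	q := url.Values{}
	q.Set("capitalize", "on")
	q.Set("format", "json")
	q.Set("names", strings.Join(names, "\r\n"))
	u := url.URL{
		Scheme:   "https",
		Host:     "verifier.globalnames.org",
		Path:     "/",
		RawQuery: q.Encode(), // Encode sorts keys and percent-encodes values
	}
	return u.String()
}

func main() {
	fmt.Println(verifierURL([]string{"Crucifer", "Platystema"}))
}
```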

Author

jbest commented Jan 11, 2024

@dimus Right, this last sample had all the spaces correctly added before and after brackets, and all of our text going forward will have that correction. The text we are transcribing is sometimes a challenge to read and has some misspellings, so we're not expecting to find all names automatically. "Platystema" should be "Platystemma". "Crucifer" isn't a proper scientific name, just a common name/shorthand for Brassicaceae.

Member

dimus commented Jan 11, 2024

I think a solution for situations where names are not separated by spaces or () is to add an option to relax the tokenizer, so that it can split not only on spaces but also on other separators common in some documents, like ,.<>[]{}. I am going to close this issue and open another one instead. Please leave your comments there if you have ideas about the implementation, @jbest.
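
As a rough illustration of the idea, the sketch below splits on whitespace by default and additionally on the listed separators when a flag is set. It is an assumption about how such an option might behave, not the implementation planned for gnfinder. `splitRelaxed` and `extraSeparators` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// extraSeparators holds the additional separators listed in the comment above.
const extraSeparators = ",.<>[]{}"

// splitRelaxed (hypothetical) splits on whitespace-like runes and, when
// relaxed is true, also on the extra separators.
func splitRelaxed(text string, relaxed bool) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		switch r {
		case ' ', '\t', '\v', '\n', '\r', '\uFEFF':
			return true
		}
		return relaxed && strings.ContainsRune(extraSeparators, r)
	})
}

func main() {
	line := "495.[SC495]Quercus alba"
	fmt.Printf("%q\n", splitRelaxed(line, false))
	fmt.Printf("%q\n", splitRelaxed(line, true))
}
```

With the flag off, the line splits into "495.[SC495]Quercus" and "alba". With it on, the tokens are "495", "SC495", "Quercus", and "alba". One caveat of this naive version is that it discards the separators, including periods in abbreviated names such as "Q. alba", so a real option would probably need to keep that information and the original offsets.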

#151

dimus closed this as completed Jan 11, 2024