Found taxon name has spurious characters #150
Comments
Thank you for letting us know about the problem, @jbest. I think the problem is with the tokenizing stage. Currently the following characters are considered to be splitting characters between tokens:

// space chars that indicate new line have value true
var spaceChr = map[rune]bool{
    '\n': true,
    '\r': true,
    '\v': false,
    '\t': false,
    '\uFEFF': false,
    ' ': false,
}

I am a bit reluctant to add more characters without some thought (to decrease the number of false positives). Can you describe in more detail what kind of text this is that created such problems?
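To make the trade-off concrete, here is a minimal sketch of what treating square brackets as token separators could look like. This is not gnfinder's tokenizer: the `delimiters` map and the `splitTokens` helper are hypothetical names invented for illustration, and the sketch only assumes that tokens are cut at a fixed set of runes.

```go
package main

import (
	"fmt"
	"strings"
)

// delimiters is a hypothetical splitting set: common whitespace plus
// square brackets. It is NOT the spaceChr map quoted above.
var delimiters = map[rune]bool{
	' ': true, '\n': true, '\r': true, '\t': true, '\v': true,
	'[': true, ']': true,
}

// splitTokens cuts text at every rune in delimiters and drops empty tokens.
func splitTokens(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return delimiters[r]
	})
}

func main() {
	// With brackets as separators, the genus is no longer glued to the tag.
	fmt.Println(splitTokens("493.[SC493]Silybum marianum"))
}
```

Under these assumptions "Silybum" becomes its own token, but every added separator also splits legitimate text elsewhere, which is the false-positive risk mentioned above.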
@dimus The examples I provided above are fabricated, but represent a rare scenario we encountered in our text. The text is a human transcription of a botanist's field notebook. The brackets don't exist in the source material; they are added by transcribers to standardize the field number, because the number written in the notebook sometimes omits the first digit (e.g. 234 should actually be 1234). We've instructed transcribers to make sure brackets have spaces before and after them to prevent this error in the future, so we have a solution that works.

But I'm curious about why Quercus rubrum is found (though with spurious characters added to the result), but Quercus alba is not, e.g.:

Below is some actual text (without an example that would generate this error):
Thanks for this incredible tool, we couldn't do our work without it!
@jbest, thank you for your kind words! Hm, the text you provided should not create any problems, because there is a space between the [SC**] tag and the name. With the "Show ambiguous uninomials" flag I get
The missing Crucifer and Platystema do not appear anywhere in the databases: https://verifier.globalnames.org/?capitalize=on&format=json&names=Crucifer%0D%0APlatystema
@dimus Right, this last sample had all the spaces correctly added before and after brackets, and all of our text going forward will have that correction. The text we are transcribing is sometimes a challenge to read and has some misspellings, so we're not expecting to find all names automatically. "Platystema" should be "Platystemma". "Crucifer" isn't a proper scientific name, just a common name/shorthand for Brassicaceae.
I think a solution for situations where names are not separated by spaces or |
When a found taxon name is immediately preceded by a bracketed tag containing numbers (e.g. "[EX###]"), the returned name is prefixed with the contents of the brackets, with the numbers replaced by "�" (Unicode U+FFFD), at least as displayed in my editors.
Example below:
"verbatim": "493.[SC493]Silybum marianum",
"name": "Sc����silybum marianum",
This problem does not arise if there is a space character after the closing bracket, e.g. "[SC493] Silybum marianum"
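Since a space after the closing bracket avoids the problem, one possible workaround is to pad bracketed tags with spaces before sending text to the name finder. The sketch below is only an illustration of that idea: `padBrackets` and `bracketTag` are hypothetical names, and the pattern assumes tags are single-line `[...]` spans with no nested brackets.

```go
package main

import (
	"fmt"
	"regexp"
)

// bracketTag matches a single-line [...] tag together with any spaces or
// tabs directly around it. The pattern is an assumption about the tag
// format, not something defined by gnfinder.
var bracketTag = regexp.MustCompile(`[ \t]*(\[[^\[\]\n]*\])[ \t]*`)

// padBrackets rewrites every bracketed tag so that exactly one space
// precedes and follows it. A tag at the very start or end of the text
// therefore gains a leading or trailing space.
func padBrackets(text string) string {
	return bracketTag.ReplaceAllString(text, " ${1} ")
}

func main() {
	fmt.Println(padBrackets("493.[SC493]Silybum marianum"))
	fmt.Println(padBrackets("493.[SC493] Silybum marianum"))
}
```

Both inputs come out in the same padded form, so text that already follows the spacing convention is left effectively unchanged.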
After further investigation, I found some new behavior. The above was using the API, below is using the web interface:
for the input:
493.[SC495]Quercus rubrum
493.[SC493]Silybum marianum
493.[SC495]Quercus alba
493.[SC493] Silybum marianum
For some reason some names were found, but Quercus alba was not. The results in JSON:
{
  "metadata": {
    "documentation": "",
    "date": "2024-01-06T00:33:46.215756806Z",
    "gnfinderVersion": "v1.1.3",
    "nameFindingSec": 0.000258374,
    "totalSec": 0.000258374,
    "wordsAround": 0,
    "language": "eng",
    "withUniqueNames": true,
    "withBayes": true,
    "totalWords": 9,
    "totalNameCandidates": 5,
    "totalNames": 3
  },
  "names": [
    {
      "cardinality": 2,
      "name": "Sc����quercus rubrum",
      "oddsLog10": 6.3452923554738145,
      "start": 0,
      "end": 25
    },
    {
      "cardinality": 2,
      "name": "Sc����silybum marianum",
      "oddsLog10": 5.617378305659413,
      "start": 27,
      "end": 54
    },
    {
      "cardinality": 2,
      "name": "Silybum marianum",
      "oddsLog10": 10.18840206871061,
      "start": 93,
      "end": 109
    }
  ]
}