Use "mihi" to enhance scientific name finding and parsing #230

Closed
Archilegt opened this issue Jun 10, 2022 · 23 comments

Comments

@Archilegt

The Latin word "mihi" was used by authors when proposing new scientific names, with the meaning of "me". The word could be used as a marker for "scientific name ends here", and could enhance scientific name finding if coupled to "search for scientific name 1, 2, 3 words ahead".
The word could also be used for adding "interpreted authorship" (author+date) to scientific name instances if coupled to the metadata of the publication (book, article) where the scientific name instance is matched, therefore potentially helping to disambiguate homonyms.
A quick glance at the occurrence of the word in BHL: https://www.biodiversitylibrary.org/search?searchTerm=mihi&stype=F#/titles
Maybe it would be worth trying at least the "scientific name ends here" suggestion? :)
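
A minimal Python sketch of the "scientific name ends here" suggestion, assuming whitespace tokenization and a look-back window of up to three words before each "mihi". The function name `candidates_before_mihi` and the capitalized-word heuristic for the genus are illustrative assumptions, not part of gnfinder or gnparser.

```python
def candidates_before_mihi(text: str, max_words: int = 3) -> list[str]:
    """Return candidate name strings that end right before a 'mihi' token."""
    tokens = text.split()
    results = []
    for i, tok in enumerate(tokens):
        # Treat "mihi", "Mihi," and "mihi." alike.
        if tok.strip(".,;:").lower() != "mihi":
            continue
        window = tokens[max(0, i - max_words):i]
        # Assumption: the name starts at the last capitalized word in the
        # window (the presumed genus) and runs up to the terminator.
        for j in range(len(window) - 1, -1, -1):
            if window[j][:1].isupper():
                results.append(" ".join(window[j:]))
                break
    return results


print(candidates_before_mihi("Fig. 3. Characium obovatum mihi"))
```

A real name finder would still have to verify each candidate, since "mihi" also occurs in ordinary Latin prose.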

@dimus
Member

dimus commented Jun 10, 2022

Searching the gnverifier database returned 20 names with mihi:

Anisochaeta kiwi mihi Blakemore 2012
 Aeolesthes inhirsutus mihi
 Bruchus nongoniermani Mihi,
 Anisochaeta kiwi mihi
 Anisochaeta kiwi mihi Blakemore, 2013
 Chyphononyx simulator mihi
 Chimila tinguana mihi
 Cobosidea mihi
 Lithobius leostygis mihi
 Conferva geminata var. mihi
 Eucyclops serrulatus mihi
 Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953
 Conferva geminata var. mihi Schwabe
 Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
 Lithobius (Polybothrus) leostygis subsp. mihi
 Quexua alinella mihi
 Lithobius (Polyrbothrus) caesar subsp. mihi
 Odonthophagus var. c mihi
 Scutella agassizi mihi
 Trochus patholatus mihi

@dimus
Member

dimus commented Jun 10, 2022

Looks like the word mihi has several meanings:


Conferva geminata var. mihi
Conferva geminata var. mihi Schwabe
AlgaeBase
Eukaryota unassigned phylum|Eukaryota unassigned class|Eukaryota unassigned order||Conferva|Conferva geminata mihi

Conferva geminata var. mihi Schwabe
Conferva geminata var. mihi Schwabe
AlgaeBase
Eukaryota unassigned phylum|Eukaryota unassigned class|Eukaryota unassigned order|Conferva|Conferva geminata mihi

Eucyclops serrulatus mihi
Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
Catalogue of Life
Biota|Animalia|Arthropoda|Hexanauplia|Copepoda|Neocopepoda|Podoplea|Cyclopoida|Cyclopida|Cyclopidae|Eucyclops|Eucyclops serrulatus serrulatus|Eucyclops serrulatus mihi

Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
Catalogue of Life
Biota|Animalia|Arthropoda|Hexanauplia|Copepoda|Neocopepoda|Podoplea|Cyclopoida|Cyclopida|Cyclopidae|Eucyclops|Eucyclops serrulatus serrulatus|Eucyclops serrulatus mihi

Aeolesthes inhirsutus mihi
Aeolesthes inhirsutus mihi
EOL

Chyphononyx simulator mihi
Chyphononyx simulator mihi
EOL

Chimila tinguana mihi
Chimila tinguana mihi
EOL

Quexua alinella mihi
Quexua alinella mihi
EOL

Cobosidea mihi
Cobosidea mihi
ION

Odonthophagus var. c mihi
Odonthophagus
ION

Scutella agassizi mihi
Scutella agassizi mihi
ION

Trochus patholatus mihi
Trochus patholatus mihi
ION

Lithobius leostygis mihi
Lithobius (Polybothrus) leostygis subsp. mihi
Plazi

Lithobius (Polybothrus) leostygis subsp. mihi
Lithobius (Polybothrus) leostygis subsp. mihi
Plazi

Lithobius (Polyrbothrus) caesar subsp. mihi
Lithobius (Polyrbothrus) caesar subsp. mihi
Plazi

Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953
Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953
Union 4
|Cellular life|Eukaryota|Opisthokonts|Fungi|Fungi|Ascomycota|Sordariomycetes|Hypocreales|Hypocreaceae|Hypomyces|Hypomyces chrysospermus edulis-mihi

Anisochaeta kiwi mihi Blakemore 2012
Anisochaeta kiwi mihi Blakemore, 2013
WoRMS
Biota|Animalia|Annelida|Clitellata|Oligochaeta|Crassiclitellata|Megascolecida|Megascolecidae|Anisochaeta|Anisochaeta kiwi|Anisochaeta kiwi mihi

Anisochaeta kiwi mihi
Anisochaeta kiwi mihi Blakemore 2013
WoRMS
Biota|Animalia|Annelida|Clitellata|Oligochaeta|Crassiclitellata|Megascolecida|Megascolecidae|Anisochaeta|Anisochaeta kiwi|Anisochaeta kiwi mihi

Anisochaeta kiwi mihi Blakemore, 2013
Anisochaeta kiwi mihi Blakemore, 2013
WoRMS
Biota|Animalia|Annelida|Clitellata|Oligochaeta|Crassiclitellata|Megascolecida|Megascolecidae|Anisochaeta|Anisochaeta kiwi|Anisochaeta kiwi mihi

Bruchus nongoniermani Mihi
Bruchus nongoniermani Mihi
uBio NameBank
Bruchus nongoniermani

@dimus
Member

dimus commented Jun 10, 2022

I don't worry about Union, uBio, ION, and EOL, as they are not human-curated, but AlgaeBase, CoL and WoRMS seem to have names with legitimate use of mihi as an epithet. So the parser should treat at least these names as exceptions to the rule.

@Archilegt
Author

Many thanks, Dima!
Good to know that if "mihi" is applied, it may give "false positives" in a very small subset of names, compared to the "true positives" for which it does represent a terminal element.

Name deduplication: I believe that for the sake of counting potentially affected names, the 20 name instances that you found can be deduplicated down to 15, as follows:

  • Minus two instances of three Anisochaeta kiwi mihi
  • Minus one instance of two Conferva geminata var. mihi
  • Minus one instance of two Eucyclops serrulatus mihi
  • Minus one instance of Lithobius leostygis mihi and Lithobius (Polybothrus) leostygis subsp. mihi because the gnparser will cut the subgeneric name off.

Deduplicated list of names:

  1. Aeolesthes inhirsutus mihi
  2. Anisochaeta kiwi mihi Blakemore 2012
  3. Bruchus nongoniermani Mihi,
  4. Chimila tinguana mihi
  5. Chyphononyx simulator mihi
  6. Cobosidea mihi
  7. Conferva geminata var. mihi Schwabe
  8. Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
  9. Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953
  10. Lithobius leostygis mihi and Lithobius (Polybothrus) leostygis subsp. mihi
  11. Lithobius (Polyrbothrus) caesar subsp. mihi
  12. Odonthophagus var. c mihi
  13. Quexua alinella mihi
  14. Scutella agassizi mihi
  15. Trochus patholatus mihi
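
The count of 15 can be reproduced with a rough Python sketch. It assumes a deduplication key built by dropping the parenthesized subgenus and the rank markers, cutting everything after the "mihi" token (or a token ending in "-mihi"), and lowercasing. The helper `dedup_key` and the marker set are hypothetical; this is not how gnparser or gnverifier normalize names.

```python
import re

# Rank markers dropped when building the key (assumed set, from the list above).
RANK_MARKERS = {"var.", "subsp.", "f."}

NAMES = [
    "Anisochaeta kiwi mihi Blakemore 2012",
    "Aeolesthes inhirsutus mihi",
    "Bruchus nongoniermani Mihi,",
    "Anisochaeta kiwi mihi",
    "Anisochaeta kiwi mihi Blakemore, 2013",
    "Chyphononyx simulator mihi",
    "Chimila tinguana mihi",
    "Cobosidea mihi",
    "Lithobius leostygis mihi",
    "Conferva geminata var. mihi",
    "Eucyclops serrulatus mihi",
    "Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953",
    "Conferva geminata var. mihi Schwabe",
    "Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966",
    "Lithobius (Polybothrus) leostygis subsp. mihi",
    "Quexua alinella mihi",
    "Lithobius (Polyrbothrus) caesar subsp. mihi",
    "Odonthophagus var. c mihi",
    "Scutella agassizi mihi",
    "Trochus patholatus mihi",
]


def dedup_key(name: str) -> str:
    """Build a crude key: no subgenus, no rank marker, nothing after mihi."""
    name = re.sub(r"\([^)]*\)", " ", name)  # drop "(Subgenus)"
    words = []
    for tok in name.split():
        if tok.lower() in RANK_MARKERS:
            continue
        word = tok.strip(",.;").lower()
        words.append(word)
        if word == "mihi" or word.endswith("-mihi"):
            break  # authorship and year after mihi are ignored
    return " ".join(words)


unique_keys = {dedup_key(n) for n in NAMES}
print(len(NAMES), "->", len(unique_keys))
```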

@Archilegt
Author

Archilegt commented Jun 10, 2022

Names by Plazi:

Scientific name: Lithobius (Polyrbothrus) caesar mihi
https://tb.plazi.org/GgServer/html/299583C14F747A72E86065049FDE3C22
A misspelling for Polybothrus, plus a digitization artifact which should not have included "mihi".
Published string is spelled and styled correctly, as "4. Lithobius (Polybothrus) caesar mihi."
See https://www.biodiversitylibrary.org/page/13294205

Scientific name: Lithobius (Polybothrus) leostygis subsp. mihi
https://tb.plazi.org/GgServer/html/CCEB9C62C87766E980DD858BC13468C8
A digitization artifact which should not have included "mihi".
Published string is styled correctly, as "1. Lithobius (Polyhothrus) leostygis mihi".
See https://www.biodiversitylibrary.org/page/13294201

Scientific name: Lithobius leostygis mihi
This instance points to the one above and I could not find a URL for it.

Result: The three (two when deduplicated) scientific name instances contributed by Plazi are false-positive digitization artifacts, including a misspelling.

Deduplicated list of names v.2 (Plazi names cleared):

  1. Aeolesthes inhirsutus mihi
  2. Anisochaeta kiwi mihi Blakemore 2012
  3. Bruchus nongoniermani Mihi,
  4. Chimila tinguana mihi
  5. Chyphononyx simulator mihi
  6. Cobosidea mihi
  7. Conferva geminata var. mihi Schwabe
  8. Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
  9. Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953
  10. Odonthophagus var. c mihi
  11. Quexua alinella mihi
  12. Scutella agassizi mihi
  13. Trochus patholatus mihi

@Archilegt
Author

Archilegt commented Jun 10, 2022

Anomaly: The name "Odonthophagus var. c mihi", coming from ION has so many anomalies that it seems irrelevant to GNA for name finding.
Source anomalies: The generic name is given as both "Onthophagus" (https://www.biodiversitylibrary.org/page/8222096) and "Odonthophagus" (https://www.biodiversitylibrary.org/page/8221999) in the "Enumeratio Insectorum Norvegicorum. Fasciculus ii." which ION points to. Additionally, it is not a scientific name in itself, i.e., it is the name of a variety designated by a single letter.
Digitization anomalies: Name digitized with the genus "Odonthophagus" instead of "Onthophagus". Name not including the specific epithet, supposedly "fracticornis", to which "var. c" is to be ascribed. The "mihi" seems to be a false positive, added by the recorder, as it is not a text string in the referred publication.

Overall, the name can be considered a false positive for mihi and can be deleted from the list.

Deduplicated list of names v.3:

  1. Aeolesthes inhirsutus mihi
  2. Anisochaeta kiwi mihi Blakemore 2012 (true positive, a regrettable alternative to mihiensis)
  3. Bruchus nongoniermani Mihi,
  4. Chimila tinguana mihi
  5. Chyphononyx simulator mihi
  6. Cobosidea mihi
  7. Conferva geminata var. mihi Schwabe (AlgaeBase, taxon?)
  8. Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
  9. Hypomyces chrysospermus f. edulis-mihi K. Bitner 1953 (Union 4, Fungi: Ascomycota)
  10. Quexua alinella mihi
  11. Scutella agassizi mihi
  12. Trochus patholatus mihi

@dimus, could someone check the "algal" and fungal names for you, so that we can know if they are true or false positives? A copy of the original publication would be desirable.

@dimus
Member

dimus commented Jun 10, 2022

The word mihi occurs 192254 times in BHL.

Conferva geminata var. mihi Schwabe:
https://verifier.globalnames.org/?capitalize=on&format=html&names=Conferva+geminata+var.+mihi+Schwabe
https://www.algaebase.org/search/species/detail/?species_id=93703

edulis-mihi is not a problem, so I do not worry about it.

@Archilegt
Author

"Conferva geminata var. mihi Schwabe" may be hard to match. The combination is uncurated in AlgaeBase and there is no guarantee that it is an original combination. There are no recorded references for that combination.
The original combination may be Oscillatoria geminata Schwabe. When searched for that combination and author, AlgaeBase returns "Oscillatoria geminata Schwabe ex Gomont 1892" (https://www.algaebase.org/search/species/detail/?species_id=51094), which is also not the original treatment.
The original treatment for Oscillatoria geminata Schwabe can be found at:
Linnaea 11 (1), year 1837
Page 118: https://www.biodiversitylibrary.org/page/35312749
Tab. 1, Fig. 7: https://www.biodiversitylibrary.org/page/35313360

Confirming whether these are two combinations of the same name and whether the "mihi" is an artifact would require consulting with specialists familiar with the historical literature on Conferva and Oscillatoria. However, that is likely the case, as the author matches and there are currently combinations under both genera for a few species.

@dimus
Member

dimus commented Jun 10, 2022

So my understanding is that really we have only these known exceptions for the parsing rule:

Anisochaeta kiwi mihi Blakemore 2012 
Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966
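
As an illustration of such a rule with exceptions, here is a hedged Python sketch: "mihi" is dropped as an annotation unless the name up to and including "mihi" is in a small exception set. The set contains only the two names listed above; the function `split_on_mihi` is hypothetical and does not reflect gnparser's actual implementation.

```python
# Names in which "mihi" is kept as an epithet (the two exceptions above).
MIHI_EPITHET_EXCEPTIONS = {
    "anisochaeta kiwi mihi",
    "eucyclops serrulatus mihi",
}


def split_on_mihi(name: str) -> tuple[str, bool]:
    """Return (canonical form, True if 'mihi' was kept as an epithet)."""
    tokens = name.split()
    for i, tok in enumerate(tokens):
        if tok.strip(".,;").lower() != "mihi":
            continue
        head = tokens[:i] + ["mihi"]
        if " ".join(head).lower() in MIHI_EPITHET_EXCEPTIONS:
            return " ".join(head), True  # known exception: keep mihi
        return " ".join(tokens[:i]), False  # default: name ends before mihi
    return name, False  # no mihi in the string


print(split_on_mihi("Anisochaeta kiwi mihi Blakemore 2012"))
print(split_on_mihi("Scutella agassizi mihi"))
```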

@Archilegt
Author

Aeolesthes inhirsutus mihi seems to be another false positive.
The name string "Aeolesthes inhirsutus subsp. mihi M.Matsushita, 1932" is deleted from GBIF (https://www.gbif.org/species/8885942).
The string may have reached GBIF via JBIF (Japan). See entry for holotype of "Aeolesthes inhirsutus subsp. mihi M.Matsushita, 1932" at https://www.gbif.jp/gbif_search/detail?id=1_sehu-cole_urn:catalog:SEHU:COLE:0000000191

@Archilegt
Author

Archilegt commented Jun 10, 2022

About "Eucyclops serrulatus mihi Dussart, Graf & Husson, 1966"
Dussart, Bernard; François Graf; and Roger Husson. 1966. Les Crustacés du réservoir de la Fontaine des Suisses à Dijon. International Journal of Speleology, 2: 269-281. http://dx.doi.org/10.5038/1827-806X.2.3.2

The "author" is only Dussart, as he is the sole responsible for Copepoda in that publication. The name string "Eucyclops serrulatus var. mihi" is apparently styled correctly (pages 270 and 278). However, this is a printing artifact which became a database artifact. Dussart stated on pp. 270-271 (translated): "The differences existing between these two forms are not sufficient to give a name to the variety with the spine of P5 slender. I need only mention its existence...". Also, as per the first edition of the International Code of Zoological Nomenclature (1961), "Article 15. Names published after 1960. — After 1960, a new name proposed conditionally, or one proposed explicitly as the name of a "variety" or "form" [Art. 45e], is not available." (https://www.biodiversitylibrary.org/page/34584570). This further points at an unnamed form by Dussart (1966), the "mihi" in this case also being a false positive that does not need to be added to the exceptions, at least from the nomenclatural point of view.

@dimus
Member

dimus commented Aug 19, 2022

Hmm, looks like the situation is even more interesting with mihi:

https://www.biodiversitylibrary.org/item/181042#page/535/mode/1up

Characium obovatum mihi. b. var. longipes mihi

I wonder if a better approach to mihi is to ignore it, instead of considering it the end of a name. But for gnfinder, the use of mihi as a name terminator word might work.

@dimus
Member

dimus commented Aug 20, 2022

Thank you @Archilegt for the interesting information about Eucyclops serrulatus mihi, I'll pass it along to the CoL guys. Do I understand correctly, that in zoology old names with var. or f. sometimes are promoted to subspecies rank? I would still add Eucyclops serrulatus mihi as an exception, because the parser is not a nomenclatural authority and deals with data on a lexical level.

@dimus dimus closed this as completed in c3d6832 Aug 20, 2022
@Archilegt
Author

Hi @dimus
I reported the issue with E. s. mihi to T. Chad Walter (https://www.marinespecies.org/copepoda/index.php) on 13.vi.2022 but I did not receive a reply. Maybe the COL will be able to reach him or someone else. Thanks!

@Archilegt
Author

Hi @dimus
The case of Characium obovatum mihi. b. var. longipes mihi (https://www.biodiversitylibrary.org/page/47100016) is interesting. There you don't have one name but two. The string would be parsed by a human reader as:
Tab. VII.
Fig. 3. Characium obovatum mihi
Fig. 3b. Characium obovatum var. longipes mihi
where "b" is not part of the name but the explanation of an illustration (https://www.biodiversitylibrary.org/page/47100082).
The two mihi are indeed to be parsed as terminators but the first one could be also recognized as a connector. Detecting and reconstructing two names and recognizing a "b" as a figure indication might be too much to ask from a parser and could be left to a layer of annotations.
For strings less complex (e.g., without the "b") and containing two mihi, where Genus specificEpithet mihi [var., f.] subspecificEpithet mihi the parsing would be:

if 2 mihi, 
parse mihi 1, 
connect specificEpithet to subspecificEpithet, 
terminate before mihi 2
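
The rule above could be sketched in Python as follows. This is an illustrative reading of the pseudocode under simple assumptions (whitespace tokens, "mihi" possibly followed by punctuation), with a hypothetical function name; it is not gnparser's code.

```python
def is_mihi(token: str) -> bool:
    """True for 'mihi', 'mihi.' or 'Mihi,'."""
    return token.strip(".,;").lower() == "mihi"


def canonical_from_mihi(name: str) -> str:
    """Apply the one-mihi and two-mihi rules to a single name string."""
    tokens = name.split()
    positions = [i for i, tok in enumerate(tokens) if is_mihi(tok)]
    if not positions:
        return name
    if len(positions) == 1:
        # One mihi: it terminates the name.
        return " ".join(tokens[:positions[0]])
    first, second = positions[0], positions[1]
    # Two mihi: skip the first (connector), terminate before the second.
    return " ".join(tokens[:first] + tokens[first + 1:second])


print(canonical_from_mihi("Characium obovatum mihi. var. longipes mihi"))
```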

"...in zoology old names with var. or f. sometimes are promoted to subspecies rank?"
Yes, you are correct. The ZooCode has Article 45, "The species group", which contains Article 45.5, "Infrasubspecific names". The references therein will guide you to other articles.

@Archilegt
Author

Hi @dimus
Shall we keep this issue open for some preliminary reporting on improved parsing? Or shall we do that via email or GoogleDocs? It would be great to have some stats on the actual improvement of the parser! :D

@dimus
Member

dimus commented Aug 22, 2022

I do not yet have b. var. as a possible rank (I am not yet sure how common it is, to justify adding it to parsing). The parsing of Characium obovatum mihi. var. longipes mihi is now Characium obovatum var. longipes:

https://github.com/gnames/gnparser/blob/master/testdata/test_data.md#names-with-mihi

I think it is reasonable to close the ticket for now, especially because the parser does not deal with names as they occur in biological texts, and it is extremely rare to have mihi in prepared lists of names.

If more concerns about mihi appear, we can make a new ticket and link it to this one.

@Archilegt
Author

Dima, please note that b. var. is not a rank.
b refers to figure 3b
var. is a rank

@dimus
Member

dimus commented Aug 22, 2022

Ah thank you for spotting it @Archilegt!

"Dima, please note that b. var. is not a rank. b refers to figure 3b; var. is a rank."

Making a gnfinder ticket about it: gnames/gnfinder#125

@Archilegt
Author

Ok. If the parsing of Characium obovatum mihi. var. longipes mihi is now Characium obovatum var. longipes, we can mention it as a special case of limitation of the parser, in which one string representing two names (one species, one subspecies) is parsed only to the subspecific name. We don't have to solve all the parsing problems in this round. ;-)

@dimus
Member

dimus commented Aug 22, 2022

@Archilegt, do you think it makes better sense to parse Characium obovatum mihi. var. longipes mihi as Characium obovatum with var. longipes mihi as an unparseable tail? The parser does assume that a string must have only one name.

I tend to think about this string as an indication of implicit authorship in two places, kind of similar to Aus bus L. cus K.

@Archilegt
Author

Archilegt commented Aug 23, 2022

"do you think it makes better sense to parse Characium obovatum mihi. var. longipes mihi as Characium obovatum with var. longipes mihi as an unparseable tail? The parser does assume that a string must have only one name."
No, I think that when choosing between two name strings, one should aim at retrieving the longest and most informative string along with the shortest unparseable tail. As it is now.

"I tend to think about this string as an indication of implicit authorship in two places, kind of similar to Aus bus L. cus K."
Yes, that would be the case for Characium obovatum mihi. var. longipes mihi.
However, here we have Fig 3. Characium obovatum mihi. b. var. longipes mihi
In an ideal world, the parser would:

  1. Execute a first parsing, with Fig. #langEn or Abb. #langDE followed by Arabic or Roman numerals ranking higher than scientificName. If Fig. or Abb. and numerals are detected, parse accordingly and wrap the whole string or substrings as explanationOfFigure
  2. Execute a second parsing for #ordered letters where #a can be omitted and scoring letters higher if they are #letters enclosed by periods. Wrap resulting explanationOfSubfigure.
  3. Trigger name detection within each explanationOfSubfigure, with allowed values for single words specificEpithet and subspecificEpithet. Increase posterior score for explanationOfSubfigure wrappers if mihi terminators or authorName co-occur with periods of #ordered letters.
  4. Trigger name reconnection for explanationOfSubfigure values b to z if single word values specificEpithet and subspecificEpithet exist. Match subspecificEpithet to nearest anterior specificEpithet, match both to nearest anterior genus in order to assemble scientificName.

Example for Fig 3. Characium obovatum mihi. b. var. longipes mihi:
1.
<explanationOfFigure>Fig 3. Characium obovatum mihi. b. var. longipes mihi</explanationOfFigure> #langEn #numeralArabic

<explanationOfFigure>Fig 3.
<explanationOfsubfigure>Characium obovatum mihi.</explanationOfsubfigure> #aOmmitted #wrapperScore = 0.25
<explanationOfsubfigure>b. var. longipes mihi</explanationOfsubfigure> #bFirstLetter #wrapperScore = 0.25
</explanationOfFigure>

<explanationOfFigure>Fig 3.
<explanationOfsubfigure>Characium obovatum mihi.</explanationOfsubfigure> #mihi #wrapperPostScore = 0.50
<explanationOfsubfigure>b. var. longipes mihi</explanationOfsubfigure> #mihi #wrapperPostScore = 0.50 #subspecificEpithet = true
</explanationOfFigure>

<explanationOfFigure>Fig 3.
<explanationOfsubfigure>Characium obovatum mihi.</explanationOfsubfigure> #scientificName = Characium obovatum
<explanationOfsubfigure>b. var. longipes mihi</explanationOfsubfigure> #bFirstLetter #scientificNameAssembled = Characium obovatum var. longipes
</explanationOfFigure>

Does it make sense?
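
For illustration only, the four steps could be approximated in Python as below. The sketch assumes captions that start with "Fig" or "Abb" plus a numeral, subfigure letters that appear in alphabetical order starting at "b", and a fixed set of rank markers; it leaves out the wrapper scores entirely. All function names and regular expressions are hypothetical and belong to neither gnfinder nor gnparser.

```python
import re

# Step 1: figure label (Fig./Abb.) followed by an Arabic or Roman numeral.
FIGURE_RE = re.compile(r"^(?:Fig|Abb)\.?\s*(\d+|[IVXLCDM]+)\.\s*(.*)$")
RANKS = {"var.", "f.", "subsp."}


def split_subfigures(body: str) -> list[tuple[str, str]]:
    """Step 2: split on ordered letters 'b.', 'c.', ...; 'a' is implicit."""
    subfigs = []
    label, start, expected = "a", 0, "b"
    for m in re.finditer(r"\s([b-z])\.\s", body):
        if m.group(1) != expected:
            continue  # e.g. "f." used as a rank marker, not a subfigure
        subfigs.append((label, body[start:m.start()].strip()))
        label, start = m.group(1), m.end()
        expected = chr(ord(expected) + 1)
    subfigs.append((label, body[start:].strip()))
    return subfigs


def tokens_before_mihi(text: str) -> list[str]:
    """Step 3: keep the tokens that precede the mihi terminator."""
    kept = []
    for tok in text.split():
        if tok.strip(".,;").lower() == "mihi":
            break
        kept.append(tok)
    return kept


def assemble_names(caption: str) -> dict[str, str]:
    """Step 4: reconnect infraspecific epithets to the preceding binomial."""
    m = FIGURE_RE.match(caption)
    if m is None:
        return {}
    names, genus, species = {}, None, None
    for label, text in split_subfigures(m.group(2)):
        tokens = tokens_before_mihi(text)
        if len(tokens) >= 2 and tokens[0][0].isupper():
            genus, species = tokens[0], tokens[1]
            names[label] = " ".join(tokens)
        elif tokens and tokens[0] in RANKS and genus and species:
            names[label] = " ".join([genus, species] + tokens)
    return names


print(assemble_names("Fig 3. Characium obovatum mihi. b. var. longipes mihi"))
```

On the example caption this yields "Characium obovatum" for the implicit subfigure a and "Characium obovatum var. longipes" for subfigure b, matching the last markup stage above.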

@dimus
Member

dimus commented Aug 23, 2022

I think what you describe is more of a job for gnfinder, because gnparser is designed to work with lists of already processed scientific names, such as personal checklists, databases, and already extracted names. Adding constraints on what gnparser can do helps decrease the number of false positives.

Let's say Characium obovatum mihi. b. var. longipes mihi is in a database. The parser would return:

http://parser.globalnames.org/?format=html&names=Characium+obovatum+mihi.+b.+var.+longipes+mihi&with_details=on

with the lowest parsing quality, 4, and 2 warnings: unparsed tail and ignored annotation. This would allow a database or checklist curator to detect the problem, look at it, and fix it by hand:

{
  "parsed": true,
  "quality": 4,
  "qualityWarnings": [
    {
      "quality": 4,
      "warning": "Unparsed tail"
    },
    {
      "quality": 3,
      "warning": "Ignored annotation `mihi`"
    }
  ],
  "verbatim": "Characium obovatum mihi. b. var. longipes mihi",
  "normalized": "Characium obovatum",
  "canonical": {
    "stemmed": "Characium obouat",
    "simple": "Characium obovatum",
    "full": "Characium obovatum"
  },
  "cardinality": 2,
  "tail": " b. var. longipes mihi",
  "details": {
    "species": {
      "genus": "Characium",
      "species": "obovatum"
    }
  },
  "words": [
    {
      "verbatim": "Characium",
      "normalized": "Characium",
      "wordType": "GENUS",
      "start": 0,
      "end": 9
    },
    {
      "verbatim": "obovatum",
      "normalized": "obovatum",
      "wordType": "SPECIES",
      "start": 10,
      "end": 18
    }
  ],
  "id": "e65f7279-c3f1-5719-9058-a3c024719fde",
  "parserVersion": "v1.6.7"
}
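
A curator-side check could look roughly like the Python sketch below. It reads only fields visible in the output above ("parsed", "quality", "qualityWarnings", "tail"). The threshold and the function names are assumptions, and the sketch takes quality 1 to be a clean parse, which should be confirmed against the gnparser documentation.

```python
import json


def needs_review(record: dict, max_quality: int = 1) -> bool:
    """Flag records that failed to parse, have low quality, or left a tail."""
    if not record.get("parsed", False):
        return True
    has_tail = bool(record.get("tail", "").strip())
    # Assumption: higher quality numbers mean worse parses (4 is the lowest).
    return record.get("quality", 0) > max_quality or has_tail


def warnings_of(record: dict) -> list[str]:
    """Collect the warning messages attached to a parsed record."""
    return [w["warning"] for w in record.get("qualityWarnings", [])]


# Trimmed copy of the output shown above.
SAMPLE = json.loads("""
{
  "parsed": true,
  "quality": 4,
  "qualityWarnings": [
    {"quality": 4, "warning": "Unparsed tail"},
    {"quality": 3, "warning": "Ignored annotation `mihi`"}
  ],
  "verbatim": "Characium obovatum mihi. b. var. longipes mihi",
  "tail": " b. var. longipes mihi"
}
""")

if needs_review(SAMPLE):
    print(SAMPLE["verbatim"], "->", warnings_of(SAMPLE))
```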
