formatted names not recognized #53

Open
Adafede opened this issue Jun 2, 2020 · 5 comments

Adafede commented Jun 2, 2020

Hi again,

As I use your amazing tool more and more, I ran into the following issue:

When names are formatted, they do not get recognized.

Here are the different inputs:
input1.txt
input2.txt
input3.txt

and resulting outputs:

1

{
  "metadata": {
    "date": "2020-06-02T12:34:06.580524+02:00",
    "gnfinderVersion": "v0.11.0",
    "withBayes": true,
    "tokensAround": 0,
    "language": "eng",
    "detectLanguage": false,
    "totalWords": 3,
    "totalCandidates": 1,
    "totalNames": 0
  },
  "names": null
}

2

{
  "metadata": {
    "date": "2020-06-02T12:35:54.584728+02:00",
    "gnfinderVersion": "v0.11.0",
    "withBayes": true,
    "tokensAround": 0,
    "language": "eng",
    "detectLanguage": false,
    "totalWords": 3,
    "totalCandidates": 1,
    "totalNames": 0
  },
  "names": null
}

3

{
  "metadata": {
    "date": "2020-06-02T12:36:02.972624+02:00",
    "gnfinderVersion": "v0.11.0",
    "withBayes": true,
    "tokensAround": 0,
    "language": "eng",
    "detectLanguage": false,
    "totalWords": 3,
    "totalCandidates": 2,
    "totalNames": 1
  },
  "names": [
    {
      "cardinality": 2,
      "verbatim": "Zea mays",
      "name": "Zea mays",
      "odds": 1.0719384060700208,
      "start": 0,
      "end": 8,
      "annotationNomenType": "NO_ANNOT",
      "annotation": "",
      "verification": {
        "bestResult": {
          "dataSourceId": 1,
          "dataSourceTitle": "Catalogue of Life",
          "taxonId": "42981044",
          "matchedName": "Zea mays L.",
          "matchedCardinality": 2,
          "matchedCanonicalSimple": "Zea mays",
          "matchedCanonicalFull": "Zea mays",
          "classificationPath": "Plantae|Tracheophyta|Liliopsida|Poales|Poaceae|Zea|Zea mays",
          "classificationRank": "kingdom|phylum|class|order|family|genus|species",
          "classificationIds": "54767868|54767869|54770228|54770238|54770244|55061565|42981044",
          "matchType": "ExactCanonicalMatch"
        },
        "preferredResults": [
          {
            "dataSourceId": 1,
            "dataSourceTitle": "Catalogue of Life",
            "taxonId": "42981044",
            "matchedName": "Zea mays L.",
            "matchedCardinality": 2,
            "matchedCanonicalSimple": "Zea mays",
            "matchedCanonicalFull": "Zea mays",
            "classificationPath": "Plantae|Tracheophyta|Liliopsida|Poales|Poaceae|Zea|Zea mays",
            "classificationRank": "kingdom|phylum|class|order|family|genus|species",
            "classificationIds": "54767868|54767869|54770228|54770238|54770244|55061565|42981044",
            "matchType": "ExactCanonicalMatch"
          },
          {
            "dataSourceId": 11,
            "dataSourceTitle": "GBIF Backbone Taxonomy",
            "taxonId": "5290052",
            "matchedName": "Zea mays L.",
            "matchedCardinality": 2,
            "matchedCanonicalSimple": "Zea mays",
            "matchedCanonicalFull": "Zea mays",
            "classificationPath": "Plantae|Tracheophyta|Liliopsida|Poales|Poaceae|Zea|Zea mays",
            "classificationRank": "kingdom|phylum|class|order|family|genus|species",
            "classificationIds": "6|7707728|196|1369|3073|2705049|5290052",
            "matchType": "ExactCanonicalMatch"
          }
        ],
        "dataSourcesNum": 25,
        "dataSourceQuality": "HasCuratedSources",
        "retries": 1
      }
    }
  ]
}

Do you think it would be easy to recognize them?
Otherwise I'll have to find a way of stripping the <i> </i> tags and similar markup before submitting the text to gnfinder.

Member

dimus commented Jun 15, 2020

1. [<i>Zea mays</i> Linné]
2. <i>Zea mays</i> Linné
3. Zea mays Linné

Inputs 1 and 2 are not found, while 3 is found.

Hm, this is a grey area to me. I see gnfinder as a tool that finds names in plain text; other types of text need to be converted to plain text before use.

For example, it definitely does not support PDF, MS Doc, Excel spreadsheets, etc. Following this logic, XML, HTML, and JSON are marked-up text and need to be converted to plain text first.
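
For simple inline markup such as the `<i>` tags in inputs 1 and 2, that conversion step could look roughly like the sketch below. It uses only Python's standard `html.parser` module; the helper name `strip_markup` is hypothetical and not part of gnfinder, and the sketch assumes the input is HTML-like text rather than PDF or office formats.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of HTML-like input, dropping all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(text: str) -> str:
    """Hypothetical helper: return `text` with tags removed (not a gnfinder API)."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    for sample in ("[<i>Zea mays</i> Linné]", "<i>Zea mays</i> Linné", "Zea mays Linné"):
        print(strip_markup(sample))
```

The stripped string can then be passed to gnfinder in place of the original. Note that any `start` and `end` offsets in the result would then refer to the stripped text, not to the original marked-up input.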

Author

Adafede commented Jun 15, 2020

Hmmm ok...sad...

I thought rich text would have been ok...my bad then

Thank you for your answer!

Member

dimus commented Jun 15, 2020

On the other hand, <i> tags in biological texts often indicate scientific names, so they might be a good thing to support.
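
To illustrate the idea only, here is a rough sketch of how italicized spans could be collected as hints for possible names. This is not how gnfinder works or is planned to work; the function `italic_candidates` and the choice to treat both `<i>` and `<em>` as italics are assumptions made for the example.

```python
from html.parser import HTMLParser


class _ItalicCollector(HTMLParser):
    """Gathers the text found inside <i> or <em> elements."""

    ITALIC_TAGS = ("i", "em")  # assumption: both tags mark italics

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0        # nesting level of italic tags
        self.current = []     # text pieces of the italic span being read
        self.candidates = []  # finished italic spans

    def handle_starttag(self, tag, attrs):
        if tag in self.ITALIC_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.ITALIC_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.current).strip()
                if text:
                    self.candidates.append(text)
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def italic_candidates(text: str) -> list:
    """Hypothetical helper: list the italicized spans in HTML-like text."""
    collector = _ItalicCollector()
    collector.feed(text)
    collector.close()
    return collector.candidates


if __name__ == "__main__":
    print(italic_candidates("<i>Zea mays</i> Linné"))
```

Italics are also used for emphasis, titles, and foreign words, so spans collected this way would only be hints that still need checking by the name finder.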

Author

Adafede commented Feb 22, 2022

Hi @dimus!
I'm trying to clean up all my old issues... has this been addressed by the work you did lately?

Shall I keep it open?

Member

dimus commented Feb 26, 2022

Yes, please keep it open. I have not gotten to it yet; I was focused on gnverifier for a while. I do want to find a good solution for this.
