Option to retain diacritics? #208

Closed
tobymarsden opened this issue Nov 15, 2021 · 8 comments

Comments

@tobymarsden

Would you be open to a PR which adds an option, disabled by default, to disable transliteration of diacritics? For my use case I'd strongly prefer that they were all retained, at least in genera and epithets.

@abubelinha

abubelinha commented Nov 17, 2021

@tobymarsden could you post an example of names where you would use this option, and its effect in results?

I guess I could be interested in using it but I don't really understand Go code.
@abubelinha

@dimus
Member

dimus commented Nov 17, 2021

@tobymarsden I would also be interested to understand the use case. ICZN does not allow diacritics, while ICN has a quite obscure permission to use 'é' in some very specific cases.

@tobymarsden
Author

@dimus @abubelinha

I'm building two related things:

  1. a programmatic interface to multiple "trusted" (but sometimes conflicting!) sources of plant data -- for example, Plants of the World Online, World Flora Online, Red List, CITES.
  2. a system for maintaining data on living botanical collections, which accepts names as input from the user in order to, for example, accession new material and associate it with a taxon and then a name.

For (1), we have two issues:

a) We have some names from a data source such as World Flora Online which contain diaereses, e.g. Hieracium kalsoeënse. The use of diaereses is permitted under the ICN. Transliterating it to e is reasonable as the mark doesn't change the spelling, and this is useful for matching purposes. But leaving it as-is (particularly for display purposes) is also reasonable because a reliable source included it and the ICN allows it. A flag allows the user to make the judgement according to their use case.

b) There are other names such as Anthurium gudiñoi which are not valid under the ICN, but they were still referenced somewhere notable -- in this case, on a type specimen sheet in the herbarium at Missouri. Normalizing the name in every respect other than transliteration would still be useful, though I can't get particularly exercised about it. Similarly with Senecio nordenskjöldii -- there are many more sources which reference Senecio nordenskjoldii than Senecio nordenskjoeldii, so the transliteration to oe doesn't help here when matching names.

The main thing here is that we're working with "trusted" data, not cleaning up junk. When getting names from a somewhat authoritative source, we want to avoid providing an interpretation as far as possible. There may be a few problem names parsed, but that needs to be solved further up the stack, so to speak, and not in our parsing phase.

Use case (2) is similar -- in our system, the user input is to be respected, at least where diaereses are concerned. If they want to refer to names like Hieracium kalsoeënse, that's fine and we need to be able to normalize that without removing the diaeresis, which would be overstepping.

To sum up, I'm ambivalent about having an option to disable transliteration entirely, though this is simplest to implement and there's an argument that it's helpful when dealing with names from normally-reliable sources. However I do really need to be able to retain diaereses from the source data. (We'll actually end up using both -- matching on a version with no diaereses but normalizing for display with the original marks.)
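
The "use both" approach in the last sentence could be sketched roughly as below. This is only an illustration, not gnparser code: the `Name` type, the `NewName` helper, and the replacement table are hypothetical, and the input is assumed to use precomposed characters (a single code point for ë rather than e plus a combining mark).

```go
package main

import (
	"fmt"
	"strings"
)

// diaeresisToPlain maps the diaeresis-bearing letters discussed in this
// thread to their unmarked forms. Illustrative only, not exhaustive.
var diaeresisToPlain = strings.NewReplacer("ë", "e", "ï", "i")

// Name is a hypothetical record that carries both spellings:
// Display keeps the source spelling, MatchKey drops the diaereses.
type Name struct {
	Display  string
	MatchKey string
}

// NewName builds both forms from the spelling found in the source data.
func NewName(source string) Name {
	return Name{
		Display:  source,
		MatchKey: diaeresisToPlain.Replace(source),
	}
}

func main() {
	n := NewName("Hieracium kalsoeënse")
	fmt.Println(n.Display)  // Hieracium kalsoeënse
	fmt.Println(n.MatchKey) // Hieracium kalsoeense
}
```

Keeping the two forms side by side means matching against other sources never depends on whether they retained the mark, while the displayed name stays exactly as the source gave it.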

@dimus
Member

dimus commented Nov 17, 2021

@tobymarsden I think I understood, so yes, let's add the flag.

So in this case you would need to keep diacritics in the normalized version, the canonical forms (stemmed included), and the details?

Your mention of different transliterations is also a valid concern. I think we can talk about it at #201

@dimus
Member

dimus commented Nov 18, 2021

> The use of diaereses is permitted under the ICN.

My bad, I did not double-check the code and trusted my memory incorrectly: it is not 'é' but the diaeresis 'ë'.

@tobymarsden
Author

@dimus Thanks!

I think it does need to be everywhere, yes.

On reflection I'm thinking that making the option "preserve diaereses" would be more conservative and less of a departure for gnparser as these are referenced in the ICN. More complex of course because it's transliterating everything that doesn't match (I think) [aeiou][ëï] but I could give it a shot and you can see what you think.
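
A minimal sketch of that "preserve diaereses" rule, assuming the `[aeiou][ëï]` pattern guessed at above: ë and ï are kept only when they directly follow a plain lowercase vowel, and everything else in the table is transliterated. The `PreserveDiaereses` function and the `translit` table are hypothetical and deliberately tiny; this is not how gnparser implements it, and precomposed characters are assumed.

```go
package main

import (
	"fmt"
	"strings"
)

// translit is a small illustrative table, not gnparser's actual one.
// The ö -> oe entry follows the behaviour described earlier in this thread.
var translit = map[rune]string{
	'ë': "e", 'ï': "i", 'é': "e", 'ñ': "n", 'ö': "oe",
}

// isPlainVowel reports whether r is an unaccented lowercase vowel.
func isPlainVowel(r rune) bool {
	return strings.ContainsRune("aeiou", r)
}

// PreserveDiaereses keeps ë and ï when they directly follow a plain
// vowel and transliterates every other rune found in the table.
func PreserveDiaereses(s string) string {
	var b strings.Builder
	var prev rune
	for _, r := range s {
		keep := (r == 'ë' || r == 'ï') && isPlainVowel(prev)
		if repl, ok := translit[r]; ok && !keep {
			b.WriteString(repl)
		} else {
			b.WriteRune(r)
		}
		prev = r
	}
	return b.String()
}

func main() {
	fmt.Println(PreserveDiaereses("Hieracium kalsoeënse"))   // Hieracium kalsoeënse
	fmt.Println(PreserveDiaereses("Anthurium gudiñoi"))      // Anthurium gudinoi
	fmt.Println(PreserveDiaereses("Senecio nordenskjöldii")) // Senecio nordenskjoeldii
}
```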

I can only find a tiny handful of examples where other diacritics have been used in non-junk names anyway.

@dimus
Member

dimus commented Nov 18, 2021

There are definitely legacy names with diacritics, and other inconsistencies. For example, Algaebase has a few names with capitalized epithets for "patronyms".

One possible solution is to keep diaereses in normalized and canonical full and, maybe, canonical simple, but remove them from canonical stemmed. Also keep them in details.

I used to preserve ë for all parsed names, but it is not compatible with ICZN names, so now I remove it.

@dimus
Member

dimus commented Nov 18, 2021

Another complication is names with Latinized German words, where the umlaut characters ö, ä, ü are supposed to change the spelling during transliteration, so I think only ë is safe.

And of course, some people follow the rules of transliteration and others don't, so we have several alternative spellings for legacy names with diacritics.

tobymarsden added a commit to amazingplants/gnparser that referenced this issue Nov 19, 2021
@dimus dimus closed this as completed in 403deab Nov 19, 2021
3 participants