Option to retain diacritics? #208
Comments
@tobymarsden could you post an example of names where you would use this option, and its effect on the results? I think I could be interested in using it, but I don't really understand Go code.
@tobymarsden I would also be interested to understand the use case. ICZN does not allow diacritics, while in ICN there is a rather obscure permission to use 'é' in some very specific cases.
I'm building two related things:
For (1), we have two issues:

a) We have some names from a data source such as World Flora Online which contain diaereses, e.g.

b) There are other names such as

The main thing here is that we're working with "trusted" data, not cleaning up junk. When getting names from a somewhat authoritative source, we want to avoid providing an interpretation as far as possible. There may be a few problem names parsed, but that needs to be solved further up the stack, so to speak, and not in our parsing phase.

Use case (2) is similar -- in our system, the user input is to be respected, at least where diaereses are concerned. If they want to refer to names like

To sum up, I'm ambivalent about having an option to disable transliteration entirely, though this is simplest to implement and there's an argument that it's helpful when dealing with names from normally-reliable sources. However, I do really need to be able to retain diaereses from the source data. (We'll actually end up using both -- matching on a version with no diaereses but normalizing for display with the original marks.)
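The "use both" approach in the last paragraph could be sketched roughly as follows. This is only an illustration, not gnparser code: `nameRecord`, `newNameRecord` and the `foldDiaereses` table are hypothetical names, the set of folded characters is deliberately small, the input is assumed to use precomposed (NFC) characters, and "Isoëtes" is just a convenient sample name.

```go
package main

import (
	"fmt"
	"strings"
)

// foldDiaereses replaces a small, illustrative set of diaeresis-bearing
// vowels with their plain counterparts. Hypothetical table, assumes
// precomposed (NFC) input.
var foldDiaereses = strings.NewReplacer(
	"ë", "e", "ï", "i", "ö", "o", "ü", "u",
	"Ë", "E", "Ï", "I", "Ö", "O", "Ü", "U",
)

// nameRecord keeps the spelling from the trusted source for display and a
// folded key that is used only for matching.
type nameRecord struct {
	Display  string
	MatchKey string
}

// newNameRecord builds both forms from the source spelling.
func newNameRecord(source string) nameRecord {
	return nameRecord{
		Display:  source,
		MatchKey: foldDiaereses.Replace(source),
	}
}

func main() {
	index := map[string]nameRecord{}

	rec := newNameRecord("Isoëtes")
	index[rec.MatchKey] = rec

	// A query without the diaeresis still finds the record, and the
	// original marks are available for display.
	hit, ok := index[foldDiaereses.Replace("Isoetes")]
	fmt.Println(hit.Display, ok) // prints "Isoëtes true"
}
```

Keeping the two forms side by side means the matching step never has to decide which spelling is "right"; it only decides which spellings are equivalent.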
@tobymarsden I think I understood, so yes, let's add the flag. So in this case you would need to keep diacritics in the normalized version, the canonical forms (stemmed included), and the details? Your mention of different transliterations is also a valid concern. I think we can talk about it at #201.
My bad, I did not double-check the code and trusted my memory incorrectly: it is not 'é' but the diaeresis 'ë'.
@dimus Thanks! I think it does need to be everywhere, yes. On reflection, I'm thinking that making the option "preserve diaereses" would be more conservative and less of a departure for gnparser, as these are referenced in the ICN. It is more complex, of course, because it still transliterates everything that doesn't match. I think I can only find a tiny handful of examples where other diacritics have been used in non-junk names anyway.
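To make the "preserve diaereses" idea concrete, here is a minimal sketch in Go. It is not how gnparser implements transliteration: the `transliterate` function, the `preserveDiaereses` flag and both tables are hypothetical, the character set is illustrative rather than complete, and the input is assumed to be in precomposed (NFC) form.

```go
package main

import (
	"fmt"
	"strings"
)

// translit maps a few diacritic characters to ASCII replacements.
// Illustrative only; a real table would be larger and might choose
// different replacements.
var translit = map[rune]string{
	'ë': "e", 'ï': "i", 'ö': "o", 'ü': "u",
	'é': "e", 'è': "e", 'ñ': "n", 'ç': "c",
}

// diaereses marks the subset of translit keys that carry a diaeresis.
var diaereses = map[rune]bool{'ë': true, 'ï': true, 'ö': true, 'ü': true}

// transliterate replaces diacritics with ASCII. When preserveDiaereses is
// true, characters listed in diaereses are kept and everything else in the
// table is still transliterated.
func transliterate(name string, preserveDiaereses bool) string {
	var b strings.Builder
	for _, r := range name {
		repl, known := translit[r]
		switch {
		case !known:
			b.WriteRune(r) // not in the table: copy unchanged
		case preserveDiaereses && diaereses[r]:
			b.WriteRune(r) // diaeresis kept on request
		default:
			b.WriteString(repl)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(transliterate("Isoëtes", true))  // prints "Isoëtes"
	fmt.Println(transliterate("Isoëtes", false)) // prints "Isoetes"
	fmt.Println(transliterate("é ë", true))      // prints "e ë"
}
```

With the flag off, the behaviour is the same as a plain transliteration pass, which is why a default of `false` would leave existing output unchanged in this sketch.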
There are definitely legacy names with diacritics, and other inconsistencies. For example, Algaebase has a few names with capitalized epithets for "patronyms". One possible solution is to keep diaereses in I used to preserve
Another complication is names with latinized German words where And of course, some people follow the rules of transliteration and others don't, so we have several alternative spellings for legacy names with diacritics.
Would you be open to a PR which adds an option, disabled by default, to disable transliteration of diacritics? For my use case I'd strongly prefer that they were all retained, at least in genera and epithets.