Web: Locale aware search of IdP #193
Comments
I'm not sure this is actually a code question. You probably do not have admin-level access to CAT yourself? Then you wouldn't know that the CAT admin is free to add a description of their IdP in any language they want. Our admin side is UTF-8 clean, so all the exact spellings anyone could care about can be provided. I'm attaching a screenshot. I see that Goettingen/Göttingen has done that and provided their name in English and German.

Consequently, I fail to see a problem in your concrete example of Goe/öttingen. Setting my browser to German and searching for Göttingen, I do find Göttingen. Setting it to English, I find the IdP by searching for Goettingen. IOW, we are doing what a good web service is supposed to do: deliver localised content based on your indicated language preference. I believe your issue would only surface if you set your browser language preference to non-German and look for "Göttingen", or set it to German and look for "Goettingen". My question would then be... why do you do that?

The reason why the issue looks more severe for Münster is that their administrators chose NOT to provide a name variant in English, so the literal "Münster" is all you can search for. In this case, my primary reaction would be that someone approaches the admins of Münster to provide an English (or, as a catch-all, "default/other language") alternative.

With all that said, there is of course nothing wrong with providing users with yet another fallback; let's call it "cross-language search". But, with all due respect: your suggestion is very locale-specific. eduroam is present in more than 100 countries, and CAT alone is localised into approx. 15 languages. All of those have their own characters and transformation rules. I would be very much against introducing hacks specific to the German language. So, if you have a suggestion which is more general than what you suggested in the issue, I think we'd be happy to consider it.
You are right, I'm not a CAT admin myself, but in cooperation with our admin I had a look at this issue and he told me about the possibility to provide different names for different languages. Furthermore, we are aware that one can change the language on the site to get different versions of the names, but apparently not all end users are aware of this mechanism. There have been cases where users were not able to find our organization due to a mismatched language/search query.
To be honest: I don't know. In first-level support I hear about a lot of weird questions and issues. Not all users are as tech-savvy as we might be, so this fallback would be an improvement to the overall UX and might prevent some avoidable questions from users. I am well aware that my suggestion is highly German-specific, while eduroam is an international project. That's why I opened this issue and wanted to hear opinions from you people, who are more experienced with the codebase and might know a better place to implement this cross-language search. In my opinion this would benefit all languages with special characters.
Okay... right now we send a list of institution names /in the language best matching the browser settings/ to JS, which then does the search. An alternative would be to push /all/ languages the admin has defined and search in that bigger list. This would solve the issue for Göttingen, but not for Münster. But if it is annoying to Münster, then they have an easy way to fix it... Would you feel that this kind of solution is good enough?
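A minimal sketch of that alternative (the data shape here is an assumption for illustration, not CAT's actual feed format): each entry carries every name variant the admin has defined, and the filter matches against all of them.

```javascript
// Hypothetical IdP entries; "names" holds all language variants defined
// by the admin, not just the one matching the browser locale.
const idps = [
  { id: 1, names: ['Universität Göttingen', 'University of Goettingen'] },
  { id: 2, names: ['Universität Münster'] }, // no English variant defined
];

// Case-insensitive substring match against ANY language variant.
function findIdPs(query, list) {
  const q = query.toLowerCase();
  return list.filter(idp =>
    idp.names.some(name => name.toLowerCase().includes(q)));
}
```

With such a list, 'Goettingen' finds entry 1 regardless of the browser locale, while 'Muenster' still finds nothing, which mirrors the Münster limitation described above.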
The side effect of this approach would be more data being sent to the browser (probably twice what we send now), since this is the whole list of all CAT IdPs. The current size is about 550 kB. This also means loading more into browser memory (I'm not sure how significant that is).
I like the simplicity of your idea @restena-sw, but the impact on the data size is also my biggest concern with it. To avoid this it would be best to do the filtering on the server, i.e. to send the current search input to the server (e.g. on an 'onChange' event or the like), filter it in the database, and send the matching entries to the client.
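A sketch of how the client side of that could look; the endpoint path and `renderResults` helper are placeholders, not part of any real CAT API, and the debounce keeps a request from firing on every keystroke.

```javascript
// Delay invocation until the user has paused typing for `waitMs`.
function debounce(fn, waitMs) {
  let timer = null;
  return function (...args) {
    clearTimeout(timer);
    timer = setTimeout(() => fn.apply(this, args), waitMs);
  };
}

// Hypothetical display helper: would update the result dropdown.
function renderResults(entries) { /* ... */ }

// On each input change, wait for a pause, then ask the server for the
// already-filtered matching IdP entries (endpoint name is assumed).
const onSearchInput = debounce(function (query) {
  fetch('/search?q=' + encodeURIComponent(query))
    .then(r => r.json())
    .then(renderResults);
}, 250);
```

This trades the larger initial download for one small round-trip per pause in typing, at the cost of requiring server-side search support.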
It would solve our problem and we would happily click on "close the issue successfully". :) The idea from @gitschneider and me was basically to have a more 'fuzzy' search, like it is common nowadays.
The fuzzy search that monitor.eduroam.org presents for login is making me crazy; I can never guess where the results come from. So one needs to be careful not to overdo things.
Good point. But usually it is sorted best fit first and the fuzzier matches later. And fuzzy doesn't mean 'not stable' ;).
And that is great. And also this is not the problem :).
The problem is that foreign students/scientists usually can't type 'ö' on their keyboard. But some do, even English-speaking people with an English locale. Then some try to use 'o', others use 'oe', as there is no internationally accepted normalization for this. The only possibility to solve this at CAT-admin level at the moment would be to write "University of Göttingen, Gottingen, Goettingen" into the IdP name, which looks pretty stupid. :)
Okay, that sounds like a problem some people might have :-) So I've looked for a JS way of doing this kind of fuzzy string search ( https://en.wikipedia.org/wiki/Approximate_string_matching ), and one of the references of the Wikipedia article is a JavaScript library doing that with Levenshtein string distance: https://github.com/NaturalNode/natural#approximate-string-matching We could first do a literal search and, if there are no matches, compute the 1-distance strings, display those, and so on. An ö -> oe edit scores two. It has the advantage of not needing to send more data over the wire, but the disadvantage of using some (significant?) compute on the client device. What I don't know is how well this can be integrated into the overall discovery search & display - the code is originally an external library which we would need to modify locally (DiscoJuice).
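The two-stage idea above could be sketched like this. The distance function is a self-contained textbook implementation rather than the "natural" library, and `tieredSearch` with its `maxDist` parameter is an illustration of the approach, not proposed DiscoJuice code. As noted, an ö -> oe edit costs two, so the cutoff must be at least 2 for that case.

```javascript
// Classic dynamic-programming Levenshtein edit distance.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 },
    (_, i) => [i, ...Array(b.length).fill(0)]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Literal substring matches win; only if there are none, fall back to
// names containing a word within maxDist edits of the query.
function tieredSearch(query, names, maxDist = 2) {
  const q = query.toLowerCase();
  const literal = names.filter(n => n.toLowerCase().includes(q));
  if (literal.length > 0) return literal;
  return names.filter(n =>
    n.toLowerCase().split(/\s+/).some(w => levenshtein(q, w) <= maxDist));
}
```

The compute concern mentioned above is real: the distance function is O(|a|·|b|) per name, so on a list of hundreds of IdPs the fuzzy tier should only run when the literal tier comes up empty.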
I have tested adding keywords (as another-language name versions). This actually works quite wonderfully (thanks again, Andreas, for DJ). I will push the update after doing some optimization. This is not fuzzy matching; it only compares against the stored values in all languages.
I think that adding the extra library would be doable, but whether it would save us data would depend on its size. DJ caches its feed, therefore it just gets it twice. If instead of extra data it needs to pull extra code, we might come out even.
This would solve our problem. We could add Göttingen, Gottingen and Goettingen to other languages and come out fine.
We opened this ticket and offered help because we see exactly this happening quite often (especially to VIP people...). And as Stefan found: we already experimented with different names.
I think this is what people are used to nowadays, as Google et al. work that way.
I wonder if these mapping rules for umlauts work the same for the Nordic and Hungarian languages.
The world would be significantly simpler without VIPs :) |
That's the fun thing... ...it differs. The "official" rule for German is to change ö to 'oe'. In Swedish, afaik, they simply leave out the diacritics, so å -> a. We should never have left the ASCII domain ;).
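That difference is exactly why a single global rule table cannot work; any per-language mapping like the sketch below is an assumption for illustration, not an official standard.

```javascript
// Assumed per-language folding rules: German uses the 'oe'-style
// expansion, Swedish simply drops the diacritic.
const translit = {
  de: { 'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss' },
  sv: { 'å': 'a', 'ä': 'a', 'ö': 'o' },
};

// Fold a string according to one particular language's convention.
function fold(text, lang) {
  const map = translit[lang] || {};
  return [...text.toLowerCase()].map(ch => map[ch] ?? ch).join('');
}
```

The same character folds differently depending on the language, e.g. 'ö' becomes 'oe' under the German table but 'o' under the Swedish one, so the search would have to know (or try) the right convention.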
With the Levenshtein approach this should all work OK:

```javascript
let lev = require("js-levenshtein")
lev('Høgskolen', 'Hogskolen')
// 1
lev('Malmö', 'Malmo')
// 1
lev('Skåne', 'Skane')
// 1
lev('Šiauliai', 'Siauliai')
// 1
```

Another approach would be to normalize such diacritics/accents away. According to this SO post, a combination of String.prototype.normalize() and a regex replace strips such characters, which could be used in the first step, the literal search:

```javascript
'Göttingen'.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
// 'Gottingen'
'Malmö'.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
// 'Malmo'
'Střední škola'.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
// 'Stredni skola'
```
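Wiring that normalization trick into the literal search step could look like this sketch (function names are illustrative, not existing code): both the stored names and the query are folded before comparison.

```javascript
// NFD splits a character like 'ö' into 'o' + combining diaeresis;
// the regex then removes the combining marks (U+0300..U+036F).
const deaccent = s =>
  s.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();

// Accent- and case-insensitive literal substring filter.
function accentInsensitiveFilter(query, names) {
  const q = deaccent(query);
  return names.filter(n => deaccent(n).includes(q));
}
```

Note the limitation: this makes 'Gottingen' match 'Göttingen', but NOT 'Goettingen', because the fold maps 'ö' to 'o' rather than 'oe'; the German-style expansion is a separate convention.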
Actually the whole thing is much wider than just the accents.
I just want to point out that this issue was tackled in the past using techniques such as Soundex (https://en.wikipedia.org/wiki/Soundex). I haven't personally used any of those algorithms (Soundex, Daitch–Mokotoff, Metaphone, Levenshtein, etc.), but maybe they are worth trying, especially if there are existing libraries implementing them. Also: "As Stefan has pointed out to me, a visiting student standing in front of a sign with the Polish name". AFAIK, a visiting student (not an exchange student) has no reason to use any guide or profile from the visited organization.
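For reference, a simplified American Soundex looks like this (the full algorithm has extra rules for 'h'/'w' between consonants, and it only handles ASCII letters, so accented characters would need folding first; this is a sketch of the idea, not a recommendation):

```javascript
// Simplified Soundex: keep the first letter, encode the rest as digits,
// collapse adjacent equal codes, pad/truncate to 4 characters.
function soundex(name) {
  const codes = { b:'1', f:'1', p:'1', v:'1',
                  c:'2', g:'2', j:'2', k:'2', q:'2', s:'2', x:'2', z:'2',
                  d:'3', t:'3', l:'4', m:'5', n:'5', r:'6' };
  const letters = name.toLowerCase().replace(/[^a-z]/g, '');
  if (!letters) return '';
  let out = letters[0].toUpperCase();
  let prev = codes[letters[0]] || '';
  for (const ch of letters.slice(1)) {
    const code = codes[ch] || '';
    if (code && code !== prev) out += code;
    prev = code;
  }
  return (out + '000').slice(0, 4);
}
```

Phonetic codes group similar-sounding names ('Robert' and 'Rupert' both encode to R163), which is a different trade-off from edit distance: it is cheap to precompute and index, but tuned to English pronunciation.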
As to: "AFAIK, a visiting student (not an exchange student) has no reason to use any guide or profile from the visited organization."
Yes. Actually the visiting students here get local accounts most of the time, as not all lecturing systems are eduGAIN-enabled... and then they naturally also try to use eduroam with that account (e.g. because only with a local account, because of $REASON, are they allowed to read eJournals in our WiFi). But this only as a side note ;).
Can you take a look at https://cat.eduroam.pl/tmw-rel_2_0/ ?
Thanks, this looks very good. We also think that the increased size is justified, so we're good with this solution. |
Issue type
Defect/Feature description
Due to different locales and special characters, some organisations cannot be found with every language setting of the website.
For example: when the site is set to English, there is the 'University Goettingen', which cannot be found by the actually correct spelling 'Göttingen' (note the umlaut).
Likewise, the 'Universität Münster' cannot be found by entering 'Muenster'.
It would be great if the different spellings of the umlauts (e.g. ü => ue, ö => oe) did not distort the search results.
Detail of issue
For various reasons this decreases the UX of the site. Not all participants can easily enter the special characters; others assume a spelling which is not actually provided.
There are mechanisms built into JS to do a locale-aware string comparison with localeCompare.
An alternative could be to normalize all inputs to the so-called German phone book collation (ü => ue, etc...) and compare those entries.
In either case the outcome would probably benefit other languages too.
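As a sketch of the localeCompare idea (assumed usage, not existing site code): an Intl.Collator with base sensitivity treats 'ö' and 'o' as equal, though it does not by itself equate 'ö' with 'oe'; a phone-book-style mapping would still be needed for that spelling.

```javascript
// Base sensitivity ignores diacritics and case, so 'ö' compares
// equal to 'o' under the German locale.
const collator = new Intl.Collator('de', { sensitivity: 'base' });

function baseEqual(a, b) {
  return collator.compare(a, b) === 0;
}

// The German phone book collation can be requested explicitly; whether
// 'ö' then sorts/matches as 'oe' depends on the runtime's ICU data.
const phonebook = new Intl.Collator('de-DE-u-co-phonebk',
                                    { sensitivity: 'base' });
```

So baseEqual('Göttingen', 'Gottingen') holds, while 'Muenster' still does not match 'Münster' this way, which is why the collation and the ü => ue normalization are complementary rather than alternatives.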
As I am not really familiar with the codebase, I would like to hear your opinion on how it could be best accomplished.