Web: Locale aware search of IdP #193

Open
2 of 4 tasks
gitschneider opened this issue Jul 21, 2020 · 23 comments

@gitschneider

Issue type

  • Defect - Crash/memory corruption.
  • Defect - Non-compliance with a standards document or incorrect OS API usage.
  • Defect - Unexpected behaviour (obvious or has been verified by a project member).
  • New feature request.

Defect/Feature description

Due to different locales and special characters, some organisations cannot be found with every language setting of the website.
For example: when the site is set to English, there is the 'University Goettingen', which cannot be found by the actually correct spelling 'Göttingen' (notice the umlaut).
Likewise, the 'Universität Münster' cannot be found by entering 'Muenster'.
It would be great if the different spellings of the umlauts (i.e. ü => ue, ö => oe) did not distort the search results.

Detail of issue

For various reasons this decreases the UX of the site. Not all participants can easily enter the special characters; others assume a spelling which is not actually provided.

There are mechanisms built into JS to do a locale aware string comparison with localeCompare.

'oe'.localeCompare('ö', [ 'de-DE-u-co-phonebk'], { sensitivity: 'base', usage: 'search'})
// 0
'ö'.localeCompare('ö', [ 'de-DE-u-co-phonebk'], { sensitivity: 'base', usage: 'search'})
// 0

An alternative could be to normalize all inputs to the so-called German phonebook collation (ü => ue, etc.) and compare those entries.
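A minimal sketch of this normalization alternative, assuming a small hand-written folding table. The names `GERMAN_FOLDS`, `foldGerman` and `matchesGerman` are hypothetical and not part of CAT or DiscoJuice, and the table only covers German.

```javascript
// Hypothetical folding table: German umlauts and ß mapped to their digraphs.
const GERMAN_FOLDS = { 'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss' };

// Fold a string into a lower-case, phonebook-style form.
function foldGerman(text) {
  return text
    .normalize('NFC')   // compose "o" + combining diaeresis into a single "ö"
    .toLowerCase()
    .replace(/[äöüß]/g, (ch) => GERMAN_FOLDS[ch]);
}

// Substring match on the folded forms of both the name and the query.
function matchesGerman(name, query) {
  return foldGerman(name).includes(foldGerman(query));
}
```

With this, `matchesGerman('Universität Münster', 'Muenster')` and `matchesGerman('University Goettingen', 'Göttingen')` both hold, while a query of 'Munster' still does not match, because the plain "u" spelling is not covered by the digraph rule.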

In either case, the outcome would probably benefit other languages too.

As I am not really familiar with the codebase, I would like to hear your opinion on how it could be best accomplished.

@restena-sw
Contributor

I'm not sure this is actually a code question. You probably do not have admin level access to CAT yourself? Then you wouldn't know that the CAT admin is free to add a description of their IdP in any language they want. Our admin-side is UTF-8 clean, so all the exact spellings anyone can care about can be provided. I'm attaching a screenshot.

Screenshot_20200722_091745

I see that Goettingen/Göttingen has done that and provided their name in English and German.

Consequently, I fail to see a problem in your concrete example of Goe/öttingen. Setting my browser to German, and searching for Göttingen, I do find Göttingen. Setting it to English, I find the IdP by searching for Goettingen.

IOW, we are doing what a good web service is supposed to do: deliver you localised content based on your indicated language preference.

I believe your issue would only surface if you set your browser language preference to non-German and look for "Göttingen" or set it to German and look for "Goettingen". My question would then be... why do you do that?

The reason why the issue looks more severe for Münster is because their administrators chose NOT to provide a name variant in English, so the literal "Münster" is all you can search for. In this case, my primary reaction would be that someone approaches the admins of Münster to provide an English (or as a catch-all "default/other language") alternative.

With all that said, there is of course nothing wrong with providing users with yet another fallback, let's call it "cross-language search".

But, with all due respect: your suggestion is very locale-specific. eduroam is present in more than 100 countries and CAT alone is localised into approx. 15 languages. All of those have their own characters and transformation rules. I would be very much against introducing hacks specific to the German language. So, if you have a suggestion which is more general than what you suggested in the issue, I think we'd be happy to consider.

@gitschneider
Author

You are right, I'm not a CAT admin myself, but in cooperation with our admin I had a look at this issue and he told me about the possibility to provide different names for different languages.

Furthermore, we are aware that one can change the language on the site to get different versions of the names, but apparently not all end-users are aware of this mechanism. There had been issues with this, where users were not able to find our organization due to a wrong language/search query.

My question would then be... why do you do that?

To be honest: I don't know. In first-level-support I hear about a lot of weird questions and issues. Not all users are as tech savvy as we might be, so this fallback would be an improvement to the overall UX and might prevent some avoidable questions from the users.

I am well aware that my suggestion is highly German-specific, while eduroam is an international project. That's why I opened this issue and wanted to hear opinions from you people, who are more experienced with the codebase and might know a better place to implement this cross-language search. In my opinion this would benefit all languages with special characters.

@restena-sw
Contributor

Okay... right now we send a list of institution names /in the language best matching the browser settings/ to JS which then does the search.

An alternative would be to push /all/ languages the admin has defined and search in that bigger list.

This would solve the issue for Göttingen, but would not for Münster. But if it is annoying to Münster then they have an easy way to fix...

Would you feel that this kind of solution is good enough?
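A rough sketch of what searching the bigger list could look like, assuming each IdP entry carries a map of language code to name. The entry shape and the function name `searchAllLanguages` are assumptions for illustration, not the actual DiscoJuice feed format.

```javascript
// Hypothetical entry shape: { id, names: { <language code>: <display name> } }.
// Returns every IdP for which ANY language variant contains the query.
function searchAllLanguages(idps, query) {
  const needle = query.toLowerCase();
  return idps.filter((idp) =>
    Object.values(idp.names).some((name) => name.toLowerCase().includes(needle))
  );
}

// Example data mirroring the two cases discussed in this thread.
const exampleIdps = [
  { id: 'goe', names: { de: 'Universität Göttingen', en: 'University Goettingen' } },
  { id: 'mue', names: { de: 'Universität Münster' } }, // no English variant provided
];
```

Here both 'Göttingen' and 'Goettingen' find the first entry regardless of browser language, while 'Muenster' still finds nothing, because no stored variant contains that spelling.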

@twoln
Contributor

twoln commented Jul 22, 2020

The side effect of this approach would be more data being sent to the browser (probably twice what we send now), since this is the whole list of all CAT IdPs. The current size is about 550k. This also means loading more into browser memory (not sure how significant that is).

@gitschneider
Author

I like the simplicity of your idea @restena-sw, but the impact on the data size is also my biggest concern with it.

To avoid this, it would be best to do the filtering on the server, i.e. to send the current search input to the server (e.g. on an 'onChange' event or the like), filter it in the database, and send the matching entries to the client.
This would reduce the amount of data transferred, but would of course mean a slightly higher server load.

@sklemer1
Contributor

Would you feel that this kind of solution is good enough?

It would solve our problem and we would happily click on "close the issue successfully". :)

The idea from @gitschneider and me was basically to have a more 'fuzzy' search like it is common nowadays.

@twoln
Contributor

twoln commented Jul 22, 2020

The fuzzy search that monitor.eduroam.org presents for login is driving me crazy; I can never guess where the results come from, so one needs to be careful not to overdo things.

@sklemer1
Contributor

Good point. But usually it is sorted best fit first and the fuzzier things later. And fuzzy doesn't mean 'not stable' ;).

@sklemer1
Contributor

IOW, we are doing what a good web service is supposed to do: deliver you localised content based on your indicated language preference.

And that is great. And also this is not the problem :).

I believe your issue would only surface if you set your browser language preference to non-German and look for
"Göttingen" or set it to German and look for "Goettingen". My question would then be... why do you do that?

The problem is that foreign students/scientists usually can't type 'ö' on their keyboard. But some do, even English-speaking people with an English locale. Then some try to use 'o' and others use 'oe', as there is no internationally accepted normalization for this. The only possibility to solve this at CAT-admin level at the moment would be to write "University of Göttingen, Gottingen, Goettingen" into the IdP name, which looks pretty stupid. :)

@restena-sw
Contributor

Okay, that sounds like a problem some people might have :-)

So I've looked for a JS way of doing this kind of fuzzy string search ( https://en.wikipedia.org/wiki/Approximate_string_matching ) and one of the references of the Wikipedia article is a JavaScript library doing that with Levenshtein string distance:

https://github.com/NaturalNode/natural#approximate-string-matching

We could first do a literal search, and if there are no matches, compute the 1-distance strings, display those, and so on. An ö -> oe edit scores two.

It has the advantage of not needing to send more data over the wire, but the disadvantage of using some (significant?) compute on the client device.

What I don't know is how well this can be integrated in the overall discovery search & display - the code is originally an external library which we would need to modify locally (DiscoJuice).
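The literal-first, then widening-distance search described in this comment could be sketched roughly as follows. This is a self-contained illustration with a hand-written Levenshtein function rather than the linked library; `tieredSearch`, the word-by-word comparison, and the default `maxDistance` of 2 are assumptions, not anything DiscoJuice does today.

```javascript
// Classic dynamic-programming Levenshtein distance, computed per code point.
function levenshtein(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[t.length];
}

// Literal substring search first; only if that finds nothing, accept names
// containing a word within edit distance 1, then 2, ... up to maxDistance.
function tieredSearch(names, query, maxDistance = 2) {
  const needle = query.normalize('NFC').toLowerCase();
  const prepared = names.map((name) => ({
    name,
    lower: name.normalize('NFC').toLowerCase(),
  }));
  const literal = prepared.filter((p) => p.lower.includes(needle));
  if (literal.length > 0) return literal.map((p) => p.name);
  for (let d = 1; d <= maxDistance; d++) {
    const hits = prepared.filter((p) =>
      p.lower.split(/\s+/).some((word) => levenshtein(word, needle) <= d)
    );
    if (hits.length > 0) return hits.map((p) => p.name);
  }
  return [];
}
```

Comparing the query against individual words rather than the whole name is a design choice of this sketch: the distance between 'Muenster' and the full string 'Universität Münster' would be dominated by the missing first word. The cost is one distance computation per word per IdP whenever the literal pass comes up empty.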

@twoln
Contributor

twoln commented Jul 23, 2020

I have tested adding keywords (as another language name version). This actually works quite wonderfully (thanks again Andreas for DJ). I will push the update after doing some optimization. This is not fuzzy matching; it only compares against the stored values in all languages.
However, I do not quite buy the argument about people not being able to add accents. Such people are most likely using their own or an English locale, and it should be up to the IdP admin to provide a name in a version fitting an English locale, i.e. no accents. In such a situation the search should work fine also in the current implementation. If the admin does not provide such names, then the new addition will not be able to help either.

@twoln
Contributor

twoln commented Jul 23, 2020

I think that adding the extra library would be doable, but whether it would save us data would depend on its size. DJ caches its feed, therefore it just gets it twice. If instead of extra data it needs to pull extra code, we might come out even.

@sklemer1
Contributor

sklemer1 commented Jul 23, 2020

I have tested adding keywords (as another language name version)

This would solve our problem. We could add Göttingen, Gottingen and Goettingen to other languages and come out fine.

However, I do not quite buy the argument about people not being able to add accents. Such people are most likely using their own or an English locale, and it should be up to the IdP admin to provide a name in a version fitting an English locale, i.e. no accents. In such a situation the search should work fine also in the current implementation.

We opened this ticket and offered help because we see exactly this happening quite often (especially to ViP people...). And as Stefan found: We already experimented with different names.

We could first do a literal search, and if there are no matches, compute the 1-distance strings, display those, and so on. An ö -> oe edit scores two.

I think this is what people are used to nowadays, as Google et al. work that way.

@twoln
Contributor

twoln commented Jul 23, 2020

I wonder if these mapping rules for umlauts work the same for Nordic, Hungarian languages

@twoln
Contributor

twoln commented Jul 23, 2020

The world would be significantly simpler without VIPs :)

@sklemer1
Contributor

I wonder if these mapping rules for umlauts work the same for Nordic, Hungarian languages

That's the fun thing... ...it differs. The "official" rule for German is to change ö to 'oe'. In Swedish afaik they simply leave out the diacritics. So å -> a.

We should never have left the ASCII domain ;).

@gitschneider
Author

I wonder if these mapping rules for umlauts work the same for Nordic, Hungarian languages

With the Levenshtein approach this should all work OK.

let lev = require("js-levenshtein")
lev('Høgskolen','Hogskolen')
// 1
lev('Malmö','Malmo')
// 1
lev('Skåne', 'Skane')
// 1
lev('Šiauliai', 'Siauliai')
// 1

Another approach would be to normalize such diacritics/accents. According to this SO post, a combination of String.prototype.normalize() and a regex replace normalizes such characters away, which could be used in the first step, the literal search.
If the IdP names are also normalized, we could find many entries without the Levenshtein fallback. For example:

'Göttingen'.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
// 'Gottingen'
'Malmö'.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
// 'Malmo'
'Střední škola'.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
// 'Stredni skola'
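Applying the same normalization to both the stored names and the query could look like the sketch below. The helper names `stripDiacritics`, `buildIndex` and `searchIndex` are hypothetical; only `normalize("NFD")` plus the regex come from the snippet above.

```javascript
// Remove combining marks after canonical decomposition, then lower-case.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
}

// Precompute a normalized search key once per IdP name.
function buildIndex(names) {
  return names.map((name) => ({ name, key: stripDiacritics(name) }));
}

// Normalize the query the same way and do a plain substring match.
function searchIndex(index, query) {
  const needle = stripDiacritics(query);
  return index.filter((entry) => entry.key.includes(needle)).map((entry) => entry.name);
}
```

One caveat: letters such as 'ł' and 'ø' have no canonical decomposition, so NFD leaves them untouched and a query like 'Mikolaja' would not match 'Mikołaja' with this step alone. Digraph spellings such as 'Goettingen' are not covered either; those would still need the Levenshtein fallback or an explicit mapping.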

@twoln
Contributor

twoln commented Jul 24, 2020

Actually, the whole thing is much wider than just the accents.
Taking my university as an example, in Polish it is called "Uniwersytet Mikołaja Kopernika", in English "Nicolas Copernicus University". As Stefan has pointed out to me, a visiting student standing in front of a sign with the Polish name might try "Uniwersytet Mikolaja Kopernika" in his/her browser, set to a locale naturally other than PL. This would not work even if just "Kopernika" is entered. There are plenty of other examples like that in Poland and, I am sure, in other countries too.
The only way to get this handled is by providing all name variants from the server. I am now experimenting with generating strings taken from all names entered into CAT, applying iconv to ASCII (using the selected locale), deleting the duplicates, and sending the resulting strings as keywords. This works quite well and probably covers all search problems, but of course does require the extra data.
I have yet to test what this would do in size to the current production feed and how efficiently the matching will work.
Of course, the middle ground would be to take the normalizing approach, i.e. normalize the keywords and normalize the search string. This would lower the data size significantly while preserving the multi-language variants.
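The keyword generation described here runs on the server with iconv; the following is only a JavaScript approximation of the idea, under stated assumptions. The mapping tables, the `toAscii` and `buildKeywords` names, and the rule that only the 'de' locale expands umlauts to digraphs are illustrative guesses, not the actual CAT code or iconv's exact behaviour.

```javascript
// Assumed German-only rule: umlauts become digraphs.
const GERMAN_MAP = { 'ä': 'ae', 'ö': 'oe', 'ü': 'ue' };
// Letters that NFD cannot split into base letter + combining mark.
const NO_DECOMPOSITION = { 'ł': 'l', 'ø': 'o', 'đ': 'd', 'ß': 'ss' };

// Approximate a locale-dependent transliteration to lower-case ASCII.
function toAscii(text, locale) {
  let out = text.normalize('NFC').toLowerCase();
  if (locale === 'de') {
    out = out.replace(/[äöü]/g, (ch) => GERMAN_MAP[ch]);
  }
  out = out.replace(/[łøđß]/g, (ch) => NO_DECOMPOSITION[ch]);
  return out.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Collect every name variant plus its ASCII form; the Set drops duplicates.
function buildKeywords(namesByLanguage, locale) {
  const keywords = new Set();
  for (const name of Object.values(namesByLanguage)) {
    keywords.add(name.toLowerCase());
    keywords.add(toAscii(name, locale));
  }
  return [...keywords];
}
```

For `{ de: 'Universität München', en: 'University of Munich' }` this yields 'universitaet muenchen' as a keyword under the 'de' locale and 'universitat munchen' under any other, and the English name contributes only one keyword because its ASCII form is identical to its lower-cased original.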

@Turin86

Turin86 commented Jul 27, 2020

I just want to point out that this issue was tackled in the past using techniques such as Soundex (https://en.wikipedia.org/wiki/Soundex). I haven't personally used any of those algorithms (Soundex, Daitch–Mokotoff, Metaphone, Levenshtein, etc.), but they may be worth trying, especially if there are existing libraries implementing them.

Also: "As Stefan has pointed out to me a visiting student standing in front of a sign with the Polish name". AFAIK, a visiting student (not an exchange student) has no reason to use any guide or profile from the visited organization.

@twoln
Contributor

twoln commented Jul 28, 2020

As to: "As Stefan has pointed out to me a visiting student standing in front of a sign with the Polish name". AFAIK, a visiting student (not an exchange student) has no reason to use any guide or profile from the visited organization.
In an ideal world, yes. In reality there still exist students who come from institutions not participating in eduroam. These students get full access rights as students of our university and therefore may want to use it as their home IdP.

@sklemer1
Contributor

Yes. Actually, the visiting students here get local accounts most of the time, as not all lecturing systems are eduGAIN-enabled... and then they naturally also try to use eduroam with that account (e.g. because, for $REASON, only a local account allows them to read eJournals in our Wifi). But this is only a side note ;).

@twoln
Contributor

twoln commented Aug 10, 2020

Can you take a look at https://cat.eduroam.pl/tmw-rel_2_0/ ?
This is an entirely server-based enhanced search using the keywords concept from DiscoJuice. With duplicate elimination the file growth is 43% from 557k to 826k.
This also takes care of the fact that institution names vary in different languages, as I have mentioned in an earlier comment.
This CAT instance does not have all languages working; however, keywords will behave differently depending on the language you select. If you choose German, then Muenchen will show matches but Munchen will not. If you select English or Polish, or probably anything else, the opposite will be true: you will get matches for Munchen but not for Muenchen. I guess people will not enter Muenchen in other language variants, putting Munich instead.
Selecting the German language will still allow you to use Munich as a search term.
I think that this functionality justifies the extra bytes of the load.

@gitschneider
Author

Thanks, this looks very good. We also think that the increased size is justified, so we're good with this solution.
