WikiCorpus: select which texts fields will be tokenized: only the body, or only the title, or title and body. #3604
Comments
I'm not fully clear what you mean. Would your (2) "only the title" option not include the body text at all? Does (3) in practice mean just a tokenized title (& a newline or 2 as the "full stop" you mention) prepended to the body?
A compact PR with the new capability is usually the clearest way to show intent & impact. If it's of sufficient simplicity (clear, matches existing style, low risk of future maintenance costs), & of potentially wider use, it could be integrated in a future release. OTOH, if it's something people can easily do for themselves with a few lines of wrapping code over what is already returned, such functionality might be better kept as an example usage or outside recipe.
As a more-general note: it's been a few years since I was feeding Wikipedia text to Gensim, but when I did, I was sufficiently frustrated with the limits of
To illustrate, see the example below:
<page>
<title>Alves to leave FC Barcelona</title>
....
<revision>
....
<text ...>{{date|June 4, 2016}}
{{infobox|FC Barcelona}}
On Thursday, [[FC Barcelona]] technical secretary {{w|Roberto Fernández Bonillo|Robert Fernández}} announced [[Brazil]]lian [[football (soccer)|football]] defender [[Dani Alves]] would leave the club this summer as a free agent.
[[File:2015 UEFA Super Cup 107.jpg|left|thumb|File photo of Dani Alves{{image|Олег Дубина}}]]
Alves spent eight seasons with the [[Catalonia]]ns when the club signed him from {{w|Sevilla FC}}. Making his debut for Barça in 2008, Alves played 391 matches and he played the second most games of any foreign player in the Catalan jersey after [[Lionel Messi]]. In his last league appearance for Barça, he provided an assist to [[Luis Suárez]] which was Alves's 100th assist in [[La Liga]].
....
FC Barcelona has invited him to address club supporters at the 2016&ndash;17 season start in farewell.
</text>
</revision>
</page>
I need to extract and tokenize the title and text together (or just the title). So, a proposed solution would be to make the modification below:
fields = tuple(fields.split()) if isinstance(fields, str) else fields
fields = set(fields) if fields else None
TEXT_OR_BODY = set(("text", "body"))
text, title, pageid = args
text_content = filter_wiki(text) if (fields is None) or (fields.intersection(TEXT_OR_BODY)) else ""
text_content = f"{title} . {text_content}" if (fields and 'title' in fields) else text_content
text_content = text_content.strip()
result = tokenizer_func(text_content, token_min_len, token_max_len, lower)
return result, title, pageid
This change is in the forked repo https://github.com/LINE-PESC/gensim
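To make the intended behavior concrete, here is a minimal, self-contained sketch of the field selection above. It does not touch gensim: `strip_markup` and `simple_tokenize` are hypothetical stand-ins for `filter_wiki` and the tokenizer function, and `select_text` is a name made up for this illustration. It also treats `fields=None` as "body only", which is an assumption about the desired default.

```python
import re

TEXT_OR_BODY = {"text", "body"}


def strip_markup(raw):
    """Hypothetical stand-in for filter_wiki: drops {{...}} templates and [[...]] link brackets."""
    raw = re.sub(r"\{\{[^{}]*\}\}", "", raw)
    # [[target|label]] -> label, [[target]] -> target
    raw = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]", r"\1", raw)
    return raw


def simple_tokenize(text, lower=True):
    """Hypothetical stand-in for the tokenizer: runs of word characters."""
    if lower:
        text = text.lower()
    return re.findall(r"\w+", text)


def select_text(title, raw_text, fields=None):
    """Compose the string to tokenize from the requested fields.

    fields=None is assumed to mean the current behavior (body only).
    """
    if isinstance(fields, str):
        fields = fields.split()
    fields = set(fields) if fields else None

    body = strip_markup(raw_text) if fields is None or fields & TEXT_OR_BODY else ""
    if fields and "title" in fields:
        # Title first, then a full stop as separator, then the body (if any).
        body = f"{title} . {body}"
    return body.strip()
```

Calling `simple_tokenize(select_text(title, raw_text, "title body"))` then gives the title tokens followed by the body tokens, while `"title"` alone leaves the body out entirely.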
Currently, WikiCorpus does not tokenize the article title, yet for articles from WikiNews this field is the main text about the article.
So, one proposal would be to make the tokenized text field selectable, with at least 3 options (this can be flexible):
1. only the body;
2. only the title;
3. title and body.
In the case of the last option (title and body), the text to be tokenized must be composed of the title first, followed by a separator (full stop) and, finally, the body.
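For comparison, the "title and body" option can be approximated without changing gensim, by wrapping what the corpus already yields. The sketch below assumes an iterable of `(tokens, (pageid, title))` pairs, which is the shape `WikiCorpus.get_texts()` is understood to yield when metadata is enabled; the function name `prepend_title_tokens` and the regex tokenizer are illustrative assumptions, not part of gensim.

```python
import re


def prepend_title_tokens(docs, lower=True):
    """Yield (tokens, (pageid, title)) with the tokenized title placed before the body tokens.

    `docs` is assumed to be an iterable of (tokens, (pageid, title)) pairs.
    """
    for tokens, (pageid, title) in docs:
        text = title.lower() if lower else title
        # Simple word-character tokenizer; an assumption, not gensim's tokenizer.
        title_tokens = re.findall(r"\w+", text)
        yield title_tokens + list(tokens), (pageid, title)
```

Note that the title tokens here bypass the corpus's own tokenizer settings (such as minimum and maximum token length), and the full-stop separator is not represented because the regex drops punctuation, so the output may differ from what an in-library option would produce.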