This website hosts the Thai Corpus, which currently contains about (…) million tokens. The texts collected for the corpus, mostly news articles, carry user-friendly morphological markup in the form of tags assigned to individual tokens.
The Thai Corpus is designed both for linguists who study the Thai language and for anyone interested in it.
The Thai Corpus employs the search engine of the Eastern Armenian National Corpus (EANC). It is being developed by a team of students of the HSE School of Linguistics in Moscow under the guidance of Professor Boris Orekhov. The team consisted of Ignatyev Grigory, Ershova Alexandra, Kuznetsova Anna, Shalganova Tatyana, Kolomeytsev Daniil and Mikulin Nikolai. Nadezhda Motina provided consulting on the Thai language.
Natalia Filippova, Elizaveta Kuzmenko, Tatyana Gavrilova, Elena Krotova, Elmira Mustakimova, Olga Sozinova, Aleksandra Martynova, Maria Sheyanova, Marina Kustova and Julia Badryzlova also contributed to the project.
***
This website gives access to the HSE Thai Corpus, a corpus of modern texts written in the Thai language. The texts, 50 million tokens in total, were collected from various Thai websites (mostly news websites). Each token was assigned its English translation and a part-of-speech tag. Other grammatical tags were also assigned where suitable.
The HSE Thai Corpus can be used both by native speakers of Thai and by English-speaking users, since every recognized word is given its English translation. It is a useful tool for linguists and for anyone interested in the Thai language. The corpus is suitable for lexical, syntactic and other synchronic studies and, owing to its size, can provide researchers with a large amount of data.
The corpus employs the search engine of the Eastern Armenian National Corpus (EANC) (http://eanc.net/). The user-friendly and flexible search system lets users gather material by grammatical and POS tags as well as by translations and, of course, actual wordforms. To make the texts easier for non-Thai speakers to read and use, we separated the words in each sentence with spaces.
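As a rough illustration of what searching by wordform, translation or POS tag means, here is a minimal Python sketch. It is not the EANC search engine and does not reflect its query syntax: the Token class, the search() function, the tag names (PRON, VERB, NOUN) and the three-word sample sentence are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Token:
    """One annotated token: hypothetical structure, not the corpus format."""
    wordform: str
    translation: str
    pos: str


def search(
    tokens: Iterable[Token],
    wordform: Optional[str] = None,
    translation: Optional[str] = None,
    pos: Optional[str] = None,
) -> Iterator[Token]:
    """Yield tokens matching every criterion that was supplied."""
    for tok in tokens:
        if wordform is not None and tok.wordform != wordform:
            continue
        if translation is not None and tok.translation != translation:
            continue
        if pos is not None and tok.pos != pos:
            continue
        yield tok


# Hypothetical sample sentence ("I eat rice"), already split into words.
SENTENCE = [
    Token("ฉัน", "I", "PRON"),
    Token("กิน", "eat", "VERB"),
    Token("ข้าว", "rice", "NOUN"),
]

# Words shown separated by spaces, as in the corpus display.
spaced_sentence = " ".join(tok.wordform for tok in SENTENCE)

verbs = list(search(SENTENCE, pos="VERB"))
by_translation = list(search(SENTENCE, translation="rice"))
```

Criteria left as None are ignored, so the same function covers a search by tag only, by translation only, or by any combination of the three fields.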
The Thai Corpus is being developed by a team of students of the HSE School of Linguistics in Moscow under the guidance of Professor Boris Orekhov. The team consisted of Ignatyev Grigory, Ershova Alexandra, Kuznetsova Anna, Shalganova Tatyana, Kolomeytsev Daniil and Mikulin Nikolai. Nadezhda Motina provided consulting on the Thai language.
Natalia Filippova, Elizaveta Kuzmenko, Tatyana Gavrilova, Elena Krotova, Elmira Mustakimova, Olga Sozinova, Aleksandra Martynova, Maria Sheyanova, Marina Kustova and Julia Badryzlova also contributed to the project.
The data for the corpus was collected with Scrapy (http://scrapy.org/). The texts were tokenized with a special pythai() module. The tagging was based on two English-Thai dictionaries: the online Thai dictionary (http://www.thai-language.com/) and Thai dictionary 2 (https://github.com/veer66/Yaitron/tree/master/data).
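The dictionary-based tagging step can be pictured roughly as follows. This is a sketch under stated assumptions, not the project's actual script: the LEXICON dictionary, its three entries, the tag names and the tag_tokens() function are hypothetical, and the input is assumed to be already tokenized (the real pipeline used the pythai() module for that).

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class TaggedToken(NamedTuple):
    wordform: str
    translation: Optional[str]  # None when the word is not in the lexicon
    pos: Optional[str]          # None when the word is not in the lexicon


# Hypothetical three-entry lexicon standing in for the real dictionaries.
LEXICON: Dict[str, Tuple[str, str]] = {
    "ฉัน": ("I", "PRON"),
    "กิน": ("eat", "VERB"),
    "ข้าว": ("rice", "NOUN"),
}


def tag_tokens(
    tokens: List[str],
    lexicon: Dict[str, Tuple[str, str]] = LEXICON,
) -> List[TaggedToken]:
    """Attach a translation and a POS tag to each token found in the lexicon."""
    tagged = []
    for token in tokens:
        # Unrecognized words keep their wordform but get no annotation.
        translation, pos = lexicon.get(token, (None, None))
        tagged.append(TaggedToken(token, translation, pos))
    return tagged


# Assumed to be the output of a tokenizer; "???" stands for an unknown word.
tagged_sentence = tag_tokens(["ฉัน", "กิน", "ข้าว", "???"])
```

Words missing from the lexicon are kept without a translation or tag, which matches the description's wording that only recognized words receive a translation.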
All materials and scripts connected to this project are available on GitHub (https://github.com/nevmenandr/thai-language) (in Russian).
In total we have downloaded and tagged texts containing 200 million tokens (the corpus contains only 50 million of them). The texts will be accessible on ().