Awesome Kyrgyz NLP

A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. A listed repository should be tagged as deprecated if:

The repository's owners explicitly say that "this library is not maintained".
It has not been committed to for a long time (2-3 years).

Datasets

Corpora

Manas-UdS: 1.2M words, 84 literary texts, 5 genres: novel, novelette, epic, minor epic, and fairy tale; lemmata, PoS tags, rich per-text metadata.
kyWaC: Kyrgyz corpus from the web, 19M words, Jan 2012 [not open]
Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
TilCorpusu: Kyrgyz corpus, 100M words, news+fiction, made public in July 2023 (just the News part due to legal restrictions)
TurkLang-7: parallel corpora mentioned in the 2020 paper "First Results of the 'TurkLang-7' Project: Creating Russian-Turkic Parallel Corpora and MT Systems" by Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., Abdurakhmonova, N. [status?]

Character recognition

Kyrgyz language hand-written letters (Kyrgyz MNIST): hand-written Kyrgyz alphabet letters collection for machine learning applications; original images (a total of 80213) have been transformed to 50x50 images, then to CSV format
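Since the images are distributed as CSV, a row can be turned back into a 50x50 grid with the standard library alone. The sketch below assumes each row holds a label followed by 2500 pixel values in row-major order; that column layout is an assumption, not something stated above, so check it against the actual files. The sample row is synthetic.

```python
import csv
import io

SIDE = 50  # images are 50x50 according to the dataset description


def row_to_image(row, side=SIDE):
    """Split one CSV row into (label, grid); assumes a label-first layout."""
    label = row[0]
    pixels = [int(float(v)) for v in row[1:]]
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(pixels)}")
    # Cut the flat pixel list into `side` rows of `side` values each.
    grid = [pixels[r * side:(r + 1) * side] for r in range(side)]
    return label, grid


# Synthetic stand-in for one line of the real file (hypothetical content).
fake_csv = io.StringIO("7," + ",".join(["0"] * (SIDE * SIDE - 1) + ["255"]) + "\n")
label, grid = row_to_image(next(csv.reader(fake_csv)))
print(label, len(grid), len(grid[0]), grid[-1][-1])
```

For the real data, replace the `io.StringIO` object with `open(path, newline="")`, and skip a header row if the file has one.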

Raw text

kloop corpus: 16'826 articles (sqlite3 DB file) + crawler code
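The schema of the sqlite3 file is not described here, so a reasonable first step is to list its tables and columns before querying. A minimal sketch using only the standard library; the `articles` table and its columns are hypothetical and exist only in the in-memory stand-in database.

```python
import sqlite3


def list_tables(conn):
    """Return {table_name: [column names]} for every table in the database."""
    tables = {}
    names = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (name,) in names:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        cols = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        tables[name] = [c[1] for c in cols]
    return tables


# In-memory stand-in; for the real corpus, connect to the downloaded DB file instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
print(list_tables(conn))
conn.close()
```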

Morphology & Syntax

UD project comments on difficulties in Turkish language processing; they may shed light on why parsing Kyrgyz is hard as well
KTMU's UD Treebank, 781 sentences; UPD: now even more sentences! + some fixes in the previous version of the dataset
Small UD Treebank: 145 sentences (incl. 20 Cairo sentences), and ~ 100 sentences suggested by UD Turkic Group; a part of UD Turkic Treebank; also note that the translations to English, Azerbaijani and Turkish are available
Verbal paradigms for Kyrgyz (100 Kyrgyz verbs fully conjugated in all tenses) by Aytnatova Alima, annotation for Unimorph by E. Chodroff
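UD treebanks are distributed in the CoNLL-U format (ten tab-separated columns per token, blank lines between sentences), so they can be read without extra dependencies. The sketch below is a simplified reader, and the sample sentence is a hand-made illustration, not an excerpt from either treebank.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Yield sentences as lists of token dicts from CoNLL-U formatted text."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        else:
            cols = line.split("\t")
            # Keep plain tokens only; skip multiword ranges (1-2) and empty nodes (1.1).
            if len(cols) == 10 and cols[0].isdigit():
                sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Hand-made toy sentence ("I read a book."), annotated only for illustration.
SAMPLE = "\n".join([
    "# text = Мен китеп окудум.",
    "\t".join(["1", "Мен", "мен", "PRON", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "китеп", "китеп", "NOUN", "_", "_", "3", "obj", "_", "_"]),
    "\t".join(["3", "окудум", "оку", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"]),
    "",
])

for sent in read_conllu(SAMPLE):
    print([(t["form"], t["upos"], t["deprel"]) for t in sent])
```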

Named Entity Recognition

WikiANN has a Kyrgyz language part
KyrgyzNER: [not published yet]

Text Classification

Kyrgyz Multi-Label News Classification: [not published yet]

Word Similarity Data

Kyrgyz Word Embedding Evaluation: [not published yet]; the 2 best models have been released

Instructions

Machine-Translated Alpaca: Stanford Alpaca instructions translated into Kyrgyz using ChatGPT and Google Translate

Machine-readable dictionaries

Country names table: Kyrgyz-Russian-English
Thesaurus KyrSpell (however, unpacking it seems to break the license)
Tatu Ylonen's enwiktionary-based dictionary (also please see the derived En-Ky Anki deck for language learners)

Pretrained models

Polyglot morfessor — pretrained morfessor model, number 6
fastText — 300-dimensional fastText vectors provided by the authors: bin, txt.
compressed fastText — fasttext-ky-mini prepared by Liebl Bernhard in 2021.
fastText trained on Leipzig Corpora — best-performing 100/300-dimensional fastText vectors provided by the authors of the HJ-Ky-0.1 paper.
fastText from Kuriyozov et al.'2020 — trained on SketchEngine's KyWaC
BERT-based NER — bert-base-multilingual-cased fine-tuned on Wikiann for NER on Kyrgyz. The author warns that this model is not usable and is built just as a proof of concept. Will be updated later.
Manas-GPT — Janar Osmonaliev's fun personal project: training nanoGPT on Sayakbai Karalaev's version of Epic of Manas
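The txt variant of fastText vectors uses a simple text layout: a header line with the vocabulary size and dimensionality, then one word and its values per line. A minimal standard-library sketch for loading such a file and comparing two words; the toy 3-dimensional vectors are made up for illustration, and `limit` is a convenience parameter introduced here.

```python
import io


def load_vec(handle, limit=None):
    """Read fastText text-format vectors into {word: [floats]}."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm if norm else 0.0


# Tiny made-up file; the released vectors are 100- or 300-dimensional.
toy = io.StringIO("2 3\nкитеп 1.0 0.0 0.0\nкалем 0.0 1.0 0.0\n")
vecs = load_vec(toy)
print(len(vecs), cosine(vecs["китеп"], vecs["калем"]))
```

For a downloaded file, pass `open(path, encoding="utf-8")` instead of the `io.StringIO` object, and consider setting `limit` to keep memory use down.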

Methods/Software

spaCy basic support: tokenization, stopwords, like_num
stanza-ky pipeline called 'ktmu'; use with care, its handling of brackets looks suspicious
kyrgyz-nlp/disambiguator project studies the ability of popular embedding models to select word senses based on the word hints (anchor words)

Morphology

Kyrgyz for Apertium: morphological analysis and generation, PoS-tagging; installation script: install_apertium_kir.sh. A much, much easier way: import apertium; apertium.installer.install_module("kir").
[DEPRECATED] kymopl: Kyrgyz morphology in Prolog

Hate Speech detection

Jupyter Notebook for hate speech detection

Other

Tilchi electronic Russian-Kyrgyz dictionary, open source desktop application
ӨҮҢизатор: a proof-of-concept letter replacement Telegram bot demo code, fixes incorrect usages of 'О','У', 'Н' => 'Ө', 'Ү','Ң'
Number-to-words conversion (JavaScript) by @AzamatSooldaev
Number-to-words conversion (TypeScript) by @timursaurus
Telegram bot for Kyrgyz morphological analysis by @sasha-kir based on Apertium data for Kyrgyz
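To illustrate what the number-to-words converters listed above do, here is an independent minimal sketch in Python. It is not code from either listed library. It covers 0 to 999999 and, as a simplification, always spells out the leading "бир" in "бир жүз" and "бир миң", although the bare forms "жүз" and "миң" are also in use.

```python
ONES = ["", "бир", "эки", "үч", "төрт", "беш", "алты", "жети", "сегиз", "тогуз"]
TENS = ["", "он", "жыйырма", "отуз", "кырк", "элүү",
        "алтымыш", "жетимиш", "сексен", "токсон"]


def _below_thousand(k):
    """Return the list of words for 0 < k < 1000 (empty list for 0)."""
    words = []
    hundreds, rest = divmod(k, 100)
    if hundreds:
        words += [ONES[hundreds], "жүз"]
    tens, ones = divmod(rest, 10)
    if tens:
        words.append(TENS[tens])
    if ones:
        words.append(ONES[ones])
    return words


def number_to_words(n):
    """Spell out an integer in the range 0..999999 in Kyrgyz."""
    if not 0 <= n <= 999_999:
        raise ValueError("supported range is 0..999999")
    if n == 0:
        return "нөл"
    thousands, rest = divmod(n, 1000)
    words = []
    if thousands:
        words += _below_thousand(thousands) + ["миң"]
    words += _below_thousand(rest)
    return " ".join(words)


for n in (0, 25, 342, 2023):
    print(n, number_to_words(n))
```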

Online Demos

Cyrillic-to-Latin online converter based on this resource.

Miscellaneous

Kyrgyz NLP bibliography: kyrgyznlp.github.io
Turkic Interlingua community and SIGTURK (ACL Turkic languages special interest group)
Apertium's useful list of tools and other resources
Online dictionaries and other useful resources on el-sozduk.kg
Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
Data prepared by CSLT: 128h of speech, 163 speakers (100m/63f), transcriptions of the speech audio, word-level lexicon; link (requires extra steps; quote: "You should ask for license before you can download the datasets. Please send Email to shiying@cslt.org or lilt@cslt.org to get the license.")

Contributions to this list

@golden-ratio
