Normatex - Russian text normalization

Normatex is a set of finite-state transducers (FSTs) that normalize Russian text for speech synthesis, machine translation and other natural language processing tasks.

The FSTs are developed using Unitex, a corpus processor.

To normalize a Russian text:

Copy your text (e.g. example.txt) to the Corpus folder, open it in Unitex and preprocess it with the following resources:

apply Graphs/Preprocessing/Sentence/SentenceUniver.grf in MERGE mode
apply Graphs/Preprocessing/Replace/replace.grf in REPLACE mode

Apply lexical resources:

the full version of the Russian computational morphological dictionary developed at CIS, Munich: CISLEXru.bin, CISLEXru_disamb-.bin and CISLEXru_EN.bin
Dela/univer.bin
Dela/univer_disamb-.bin

Create a cascade (menu Text > Apply CasSys Cascade..., then New) that applies the following FSTs to your text sequentially, each in REPLACE mode:

Graphs/numbers.fst2
Graphs/abbr/abbr_w.fst2
Graphs/abbr/acronyms_w.fst2
Graphs/Postprocessing/replace.fst2
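
Before building the cascade, it can help to confirm that the four compiled graphs listed above are actually present in your checkout. The following is a minimal sketch and not part of Normatex itself: the helper name `missing_graphs` and the `repo_root` parameter are hypothetical, and only the four paths come from this README.

```python
from pathlib import Path

# Cascade order as listed in this README. The graphs are applied
# sequentially, so the order of this list is significant.
CASCADE = [
    "Graphs/numbers.fst2",
    "Graphs/abbr/abbr_w.fst2",
    "Graphs/abbr/acronyms_w.fst2",
    "Graphs/Postprocessing/replace.fst2",
]


def missing_graphs(repo_root):
    """Return the cascade graphs that are absent under repo_root.

    Hypothetical helper: it only checks that the files exist and does
    not call Unitex or validate the .fst2 contents.
    """
    root = Path(repo_root)
    return [g for g in CASCADE if not (root / g).is_file()]
```

Calling `missing_graphs(".")` from the repository root should return an empty list if all four graphs are in place.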

Launch the cascade of FSTs.
The normalized text is in Corpus/example_csc/example_4_0.snt.
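
For scripting around the result, the output location can be derived from the input file name. This is a sketch under an assumption: the only documented case is example.txt with a four-graph cascade, and the code assumes that the number in the file name matches the number of graphs in the cascade. The function name `cascade_output_path` and its parameters are hypothetical.

```python
from pathlib import PurePosixPath


def cascade_output_path(text_name, n_graphs, corpus_dir="Corpus"):
    """Build the expected path of the normalized text.

    Assumption: the pattern <name>_csc/<name>_<n_graphs>_0.snt is
    generalized from the single example given in this README.
    """
    stem = PurePosixPath(text_name).stem
    return PurePosixPath(corpus_dir) / f"{stem}_csc" / f"{stem}_{n_graphs}_0.snt"


print(cascade_output_path("example.txt", 4))
# prints Corpus/example_csc/example_4_0.snt
```

If your cascade contains a different number of graphs, check the contents of the `_csc` folder rather than relying on this pattern.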

Slides
