This is a set of Finite-State Transducers (FSTs) for normalization of Russian texts for speech synthesis, machine translation and other natural language processing tasks.
The FSTs are developed using Unitex, a corpus processor.
To normalize a Russian text:
- Copy your text (e.g. `example.txt`) to the `Corpus` folder, open it in Unitex and preprocess it with the following resources:
  - apply `Graphs/Preprocessing/Sentence/SentenceUniver.grf` in MERGE mode
  - apply `Graphs/Preprocessing/Replace/replace.grf` in REPLACE mode
- Apply lexical resources:
  - the full version of the Russian computational morphological dictionary developed at CIS, Munich: `CISLEXru.bin`, `CISLEXru_disamb-.bin` and `CISLEXru_EN.bin`
  - `Dela/univer.bin`
  - `Dela/univer_disamb-.bin`
- Create a cascade (`Text\Apply CasSys Cascade...` menu, `New`) to sequentially apply the following FSTs to your text in REPLACE mode:
  - `Graphs/numbers.fst2`
  - `Graphs/abbr/abbr_w.fst2`
  - `Graphs/abbr/acronyms_w.fst2`
  - `Graphs/Postprocessing/replace.fst2`
- Launch the cascade of FSTs.
- The normalized text is in `Corpus/example_csc/example_4_0.snt`.
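
The file handling around these steps can be scripted. The following is a minimal Python sketch, not part of this repository: it copies a text into the `Corpus` folder and guesses where the cascade will leave its result. The helper names `stage_text` and `expected_output` are hypothetical, and the output naming rule is an assumption inferred from the single example above (a 4-graph cascade on `example.txt` producing `example_4_0.snt`). The Unitex steps themselves still have to be run in Unitex.

```python
from pathlib import Path
import shutil

# Cascade order as listed in the steps above.
CASCADE = [
    "Graphs/numbers.fst2",
    "Graphs/abbr/abbr_w.fst2",
    "Graphs/abbr/acronyms_w.fst2",
    "Graphs/Postprocessing/replace.fst2",
]


def stage_text(source: Path, corpus_dir: Path) -> Path:
    """Copy the input text into the Corpus folder and return the new path."""
    corpus_dir.mkdir(parents=True, exist_ok=True)
    target = corpus_dir / source.name
    shutil.copyfile(source, target)
    return target


def expected_output(text_path: Path, n_graphs: int = len(CASCADE)) -> Path:
    """Guess where the cascade leaves its final result.

    Assumption: the naming seen in the example above
    (Corpus/example_csc/example_4_0.snt for a 4-graph cascade)
    generalizes to <stem>_csc/<stem>_<n_graphs>_0.snt.
    """
    stem = text_path.stem
    return text_path.parent / f"{stem}_csc" / f"{stem}_{n_graphs}_0.snt"
```

For instance, `expected_output(Path("Corpus/example.txt"))` gives `Corpus/example_csc/example_4_0.snt`, matching the path in the last step. If the cascade is changed to a different number of graphs, check the contents of the `_csc` folder rather than relying on this guess.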