A library for transcribing strings of Chinese characters to their readings in Mandarin.
An example JavaScript application: http://bdon.org/hanzireader/
- Disambiguates multiple-reading characters based on a dictionary.
- Defines a binary format for dictionaries that can be loaded at runtime.
- The dictionary format is designed to be as compact as possible.
- Dictionaries are agnostic to Traditional/Simplified script and to transliteration format; they store pronunciations as 2-byte syllable sequences based on Zhuyin.
- A typical dictionary such as CC-CEDICT is around 700 kB in this format, or less than 300 kB when Brotli-compressed, so it is practical to load the entire dictionary once over the web and then perform transcription without any further network communication.
- The library and dictionary can be shared across multiple programming languages. Python and JavaScript are supported right now.
JavaScript: npm install hanzi2reading
Python: pip install hanzi2reading
- CC-CEDICT. Licensed CC-BY-SA.
- Moedict. Licensed CC-BY-ND. https://github.com/g0v/moedict-data/blob/master/README.md
- Unihan database, which contains 1-grams (single characters) only. Licensed under the Unicode License.
- This library only does dictionary-based lookups of character sequences. It does not attempt to disambiguate readings based on parts of speech, which is necessary for transcribing complete sentences.
- Word segmentation and proper-noun handling for formatted Pinyin are not supported, but may be in the future.
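To illustrate what a dictionary-based lookup of character sequences can look like, here is a minimal sketch. It is not this library's implementation or API: the `transcribe` function, the greedy longest-match strategy, the `max_len` parameter, and the toy dictionary are all assumptions made for illustration. Readings are written as Pinyin strings for readability, whereas the real dictionary format stores Zhuyin-based syllables.

```python
# Hypothetical toy dictionary: each key is a character sequence and each
# value lists one reading per character. Not taken from any real dataset.
TOY_DICT = {
    "行": ["xíng"],
    "银": ["yín"],
    "银行": ["yín", "háng"],  # 行 reads "háng" inside this word
    "不": ["bù"],
}


def transcribe(text, dictionary, max_len=4):
    """Greedy longest-match lookup (an assumed strategy, for illustration).

    Returns one reading per input character; characters with no
    dictionary entry yield None.
    """
    readings = []
    i = 0
    while i < len(text):
        # Try the longest candidate sequence first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in dictionary:
                readings.extend(dictionary[chunk])
                i += length
                break
        else:
            # No entry at any length: record a gap and move on.
            readings.append(None)
            i += 1
    return readings


print(transcribe("银行不行", TOY_DICT))
```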
Part | Bits |
---|---|
Initial | 5 |
Medial | 2 |
Final | 4 |
Tone | 3 |
Er | 1 |
A syllable is serialized in a dictionary as a 2-byte little-endian sequence; the five fields above occupy 15 of the 16 bits. When loaded at runtime, a syllable is a tuple or array of five integers. Example: the syllable kiāng ㄎㄧㄤ corresponds to the array [10,1,11,1,0] or the byte sequence 0b 1011 0010 0010 1001 (bytes 0xB2 0x29).
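The packing can be sketched in Python as follows. The field order (initial in the highest bits, er in the lowest bit) is an assumption inferred from the kiāng example rather than from a formal specification, and the function names `pack_syllable` and `unpack_syllable` are hypothetical, not part of the library's API.

```python
import struct

# Field widths from the table, most significant field first.
# The bit order is inferred from the kiāng example (an assumption).
FIELDS = (("initial", 5), ("medial", 2), ("final", 4), ("tone", 3), ("er", 1))


def pack_syllable(parts):
    """Pack (initial, medial, final, tone, er) into 2 little-endian bytes."""
    if len(parts) != len(FIELDS):
        raise ValueError("expected five integers")
    value = 0
    for part, (name, width) in zip(parts, FIELDS):
        if not 0 <= part < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        value = (value << width) | part
    return struct.pack("<H", value)  # unsigned 16-bit, little-endian


def unpack_syllable(data):
    """Inverse of pack_syllable: 2 bytes back to a 5-tuple of integers."""
    (value,) = struct.unpack("<H", data)
    parts = []
    # Peel fields off from the least significant end.
    for _, width in reversed(FIELDS):
        parts.append(value & ((1 << width) - 1))
        value >>= width
    return tuple(reversed(parts))


packed = pack_syllable((10, 1, 11, 1, 0))
print(packed.hex())             # b229
print(unpack_syllable(packed))  # (10, 1, 11, 1, 0)
```

Under this layout the example array packs to the 16-bit value 0x29B2, which is stored low byte first as 0xB2 0x29, matching the bit pattern given above.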
- https://github.com/mozillazg/python-pinyin (SC only, data embedded in code)
- https://github.com/tsroten/dragonmapper (data is in large CSV files, Python only)
- https://github.com/g0v/moedict-data
- https://cc-cedict.org/editor/editor.php
- https://chrome.google.com/webstore/detail/zhongwen-chinese-english/kkmlkkjojmombglmlpbpapmhcaljjkde
- https://github.com/skishore/makemeahanzi