#Standards used in Language Technology and Linguistics

##Language-related ISO standards

##Language and Language Family Identification

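  • ISO 639-1: two-letter language codes

  • ISO 639-2: three-letter language codes

  • ISO 639-3: three-letter codes aiming at comprehensive coverage of individual languages

  • ISO 639-4: general principles and guidelines for language coding

  • ISO 639-5: three-letter codes for language families and groups

  • ISO 639-6: four-letter codes for language variants (since withdrawn)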

  • Language tags as defined by the Internet Engineering Task Force (IETF)

    • BCP 47: Best Current Practice 47, which currently comprises RFC 5646 (tag syntax) and RFC 4647 (tag matching)

    • RFC 5646, which superseded RFC 4646, which in turn superseded RFC 3066. RFC 5646 added ISO 639-3 codes to the language subtag registry, so standards that reference BCP 47 can identify languages by ISO 639-3 codes.
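
To make the shape of a language tag concrete, here is a minimal Python sketch that splits a tag into language, script, region, and variant subtags. It is a simplification, not a conforming RFC 5646 parser: it ignores extended language subtags, extensions, private-use subtags, and grandfathered tags, and it does not check subtags against the IANA registry. The function name `parse_language_tag` is hypothetical.

```python
import re

# Simplified shape: language[-script][-region][-variant...]
# Assumption: extlang, extension, private-use and grandfathered tags are
# deliberately left out, and no registry lookup is performed.
_TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"                 # ISO 639 code (2 or 3 letters)
    r"(?:-(?P<script>[a-z]{4}))?"                # ISO 15924 script code
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"       # ISO 3166-1 or UN M.49 region
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)$",
    re.IGNORECASE,
)


def parse_language_tag(tag):
    """Split a well-formed, simple language tag into its subtags."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"not a simple language tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    variants = [v.lower() for v in match.group("variants").split("-") if v]
    return {
        "language": match.group("language").lower(),   # conventional casing
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
        "variants": variants,
    }


if __name__ == "__main__":
    for example in ("en", "zh-Hant-TW", "de-CH-1901", "sl-rozaj-biske"):
        print(example, parse_language_tag(example))
```

The casing applied here (lowercase language, title-case script, uppercase region) follows the convention recommended for tags; tags themselves are case-insensitive.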

##Character Encoding

  • Unicode
    • UTF-8
    • UTF-16
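
As a small illustration of how the two encoding forms differ, the following Python sketch reports how many bytes the same text occupies in UTF-8 and in UTF-16. The helper name `encoded_lengths` is hypothetical, and the UTF-16 figure assumes the little-endian form without a byte order mark.

```python
def encoded_lengths(text):
    """Return the size of `text` in code points, UTF-8 bytes and UTF-16 bytes."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is counted.
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


if __name__ == "__main__":
    samples = {
        "ASCII letter": "a",
        "Latin letter with accent": "\u00e9",
        "CJK ideograph": "\u8a9e",
        "Gothic letter (outside the BMP)": "\U00010348",
    }
    for label, text in samples.items():
        print(label, encoded_lengths(text))
```

UTF-8 uses one to four bytes per code point, while UTF-16 uses two bytes for code points in the Basic Multilingual Plane and four bytes (a surrogate pair) for the rest.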

##Script Identification Standards

##Metadata Standards

###i18n / Locale data

  • Unicode's CLDR (Common Locale Data Repository): uses several hundred codes from ISO 639-3 that are not included in ISO 639-2.

##Text Markup Formats

###Documents

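  • HTML5: identifies the language of content with the `lang` attribute, whose values are BCP 47 language tags.
  • Text Encoding Initiative (TEI): identifies the language of content with the `xml:lang` attribute, whose values are BCP 47 language tags.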
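
The sketch below shows one way to collect the language tags declared in a document, using only Python's standard `html.parser` module. The names `LangCollector` and `collect_lang_tags` are hypothetical. Note that `html.parser` is a lenient HTML tokenizer, not an XML parser, so for real TEI files a proper XML parser would be the more appropriate tool; the `xml:lang` handling here is only for illustration.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Record (element name, language tag) pairs for lang / xml:lang attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases element and attribute names for us.
        for name, value in attrs:
            if name in ("lang", "xml:lang") and value:
                self.found.append((tag, value))


def collect_lang_tags(markup):
    """Return every declared language tag in document order."""
    collector = LangCollector()
    collector.feed(markup)
    collector.close()
    return collector.found


if __name__ == "__main__":
    html = '<html lang="en"><body><p lang="fr-CA">Bonjour</p></body></html>'
    print(collect_lang_tags(html))
```

The values returned are plain strings; checking that they are well-formed BCP 47 tags is a separate step.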

###Corpora

###Lexicons

  • Lexical Markup Framework (LMF): ISO specification (ISO 24613) for representing machine-readable dictionaries.