KOI character encodings - AbsoluteAstronomy.com

KOI is a family of several code pages for the Cyrillic alphabet.
The name stands for Kod Obmena Informatsiey which means "Code for Information Exchange".

A particular feature of the KOI code pages is that the text remains human-readable when the leftmost bit is stripped, should it inadvertently pass through equipment or software that can only deal with 7-bit-wide characters. This is because the characters are placed in a special order: each Cyrillic letter sits 128 code points apart from the Latin letter it most closely corresponds to. This order, however, does not match the alphabetic order of any language written in Cyrillic, so lookup tables are needed to perform sorting.
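The two effects described above can be sketched in Python, assuming the standard library's `koi8_r` codec as a representative KOI code page. The helper names `strip_high_bit` and `alphabetical_key`, and the sample words, are illustrative and not part of any standard.

```python
def strip_high_bit(text, encoding="koi8_r"):
    """Encode text, clear the top bit of every byte, and read the result as ASCII."""
    raw = text.encode(encoding)
    return bytes(b & 0x7F for b in raw).decode("ascii")


# Hypothetical lookup table: rank of each letter in the Russian alphabet.
RUSSIAN_ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RANK = {ch: i for i, ch in enumerate(RUSSIAN_ALPHABET)}


def alphabetical_key(word):
    """Sort key that follows alphabet order instead of raw byte values."""
    return [RANK[ch] for ch in word.lower()]


print(strip_high_bit("Русский Текст"))  # prints "rUSSKIJ tEKST"

words = ["яблоко", "арбуз", "юла"]
by_bytes = sorted(words, key=lambda w: w.encode("koi8_r"))
by_alphabet = sorted(words, key=alphabetical_key)
print(by_bytes)     # byte order puts "юла" first
print(by_alphabet)  # alphabet order puts "арбуз" first
```

Note that letter case is inverted in the stripped text: in KOI8-R, lowercase Cyrillic letters land on uppercase Latin letters and vice versa.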

These encodings are derived from ASCII on the basis of a nearly phonetic correspondence between Latin and Cyrillic letters, which was already in use in the Russian dialect of Morse code and in the MTK-2 telegraph code.

KOI8

Modern KOI code pages are 8-bit extensions of ASCII.
This family of encodings is also known as KOI8, KOI 8 and KOI-8.
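
A quick way to check the "extension of ASCII" property is to decode all 128 seven-bit values and compare the result with plain ASCII. The sketch below assumes Python's built-in `koi8_r` and `koi8_u` codecs; the function name `is_ascii_superset` is illustrative, and the EBCDIC codec `cp037` is included only as a contrasting case.

```python
def is_ascii_superset(codec):
    """Return True if bytes 0x00-0x7F decode exactly as they do in ASCII."""
    low_half = bytes(range(128))
    return low_half.decode(codec) == low_half.decode("ascii")


for codec in ("koi8_r", "koi8_u", "cp037"):
    print(codec, is_ascii_superset(codec))
```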

The family members are:

KOI8-R for Russian and Bulgarian
KOI8-U and KOI8-RU for Ukrainian and Belarusian
KOI8-T for Tajik
KOI8-CS for Czech and Slovak (ČSN (Czech technical standard) 369103, devised by the Comecon. This encoded Latin with diacritics, as used in Czech and Slovak, rather than Cyrillic, but the basic idea was the same: text ought to remain legible with the 8th bit cleared, so that, for example, Č became C.)
KOI8-O for Old Russian
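
The difference between two family members can be illustrated with Python's built-in `koi8_r` and `koi8_u` codecs. The sketch below assumes the commonly documented layout in which KOI8-U adds the Ukrainian letters Ґ, Є, І and Ї (in both cases) while keeping the Russian letters at their KOI8-R positions. The helper name `encodable` is illustrative.

```python
# Ukrainian letters Ґ Є І Ї ґ є і ї, written as escapes to avoid
# confusion with look-alike Latin letters.
UKRAINIAN_LETTERS = "\u0490\u0404\u0406\u0407\u0491\u0454\u0456\u0457"


def encodable(text, encoding):
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for ch in UKRAINIAN_LETTERS:
    code = ch.encode("koi8_u")[0]
    print(ch, hex(code), "in KOI8-R:", encodable(ch, "koi8_r"))

# Russian text yields the same bytes in both code pages.
print("Привет".encode("koi8_r") == "Привет".encode("koi8_u"))
```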

KOI7

There is also an obsolete 7-bit KOI7 code page, which does not contain lowercase letters.
The codes of its 31 Russian uppercase letters are simply their KOI8 codes with the most significant bit cleared. The other code points are the same as in ASCII.
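
The relationship between KOI7 and KOI8 can be sketched by deriving a KOI7 table from Python's `koi8_r` codec. This is an illustration under stated assumptions, not a reference implementation: it assumes the 31 letters are the KOI8-R uppercase code points 0xE0 to 0xFE (Ъ at 0xFF is left out because clearing its top bit would give 0x7F, the DEL control code). The names `build_koi7_uppercase_table` and `encode_koi7` are hypothetical.

```python
def build_koi7_uppercase_table():
    """Map each Russian uppercase letter to its KOI8-R code with the top bit cleared."""
    table = {}
    for koi8_code in range(0xE0, 0xFF):  # 0xE0..0xFE, i.e. 31 code points (assumption)
        letter = bytes([koi8_code]).decode("koi8_r")
        table[letter] = koi8_code & 0x7F
    return table


KOI7 = build_koi7_uppercase_table()


def encode_koi7(text):
    """Encode text as 7-bit codes; input is uppercased since KOI7 has no lowercase."""
    out = bytearray()
    for ch in text.upper():
        if ch in KOI7:
            out.append(KOI7[ch])
        else:
            out += ch.encode("ascii")  # raises for letters outside the table, e.g. Ъ
    return bytes(out)


print(len(KOI7))              # 31
print(encode_koi7("привет"))  # b'priwet'
```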

External links

http://koi8.pp.ru/main.html
http://www.orwell.ru/info/cyrsoup
http://czyborra.com/charsets/cyrillic.html
http://www.iis.ru/cyrillic/resource/tables.en.html

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.