Code page
Code page is another term for character encoding. It consists of a table of values that describes the character set for a particular language. The term code page originated from IBM's EBCDIC-based mainframe systems, but many vendors use the term, including Microsoft, SAP, and Oracle Corporation. Vendors often allocate their own code page number to a character encoding, even if it is better known by another name (for example, the UTF-8 character encoding has code page number 1208 at IBM, 65001 at Microsoft, and 4110 at SAP).
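
Because each vendor numbers encodings independently, a code page number is only meaningful together with the vendor that assigned it. The following Python sketch illustrates this with a hypothetical lookup table; only the three UTF-8 numbers come from the text, and the names `CODE_PAGES` and `resolve` are invented for illustration.

```python
from typing import Optional

# Hypothetical lookup table: (vendor, code page number) -> encoding name.
# Only these three UTF-8 entries are taken from the article.
CODE_PAGES = {
    ("IBM", 1208): "utf-8",
    ("Microsoft", 65001): "utf-8",
    ("SAP", 4110): "utf-8",
}

def resolve(vendor: str, number: int) -> Optional[str]:
    """Return the encoding name for a vendor's code page number, or None if unknown."""
    return CODE_PAGES.get((vendor, number))

print(resolve("IBM", 1208))        # IBM's number for UTF-8
print(resolve("Microsoft", 1208))  # the same number is not in this table for Microsoft
```

Keying the table on the vendor as well as the number avoids silently treating one vendor's assignment as another's.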

The code page numbering system

IBM introduced the concept of systematically assigning a small but globally unique 16-bit number to each character encoding that a computer system or collection of computer systems might encounter. The IBM origin of the numbering scheme is reflected in the fact that the smallest (first) numbers are assigned to variations of IBM's EBCDIC encoding, and slightly larger numbers refer to variations of IBM's extended ASCII encoding as used in its PC hardware.

With the release of PC-DOS version 3.3 (and the nearly identical MS-DOS 3.3), IBM introduced the code page numbering system to regular PC users: the code page numbers (and the phrase "code page") were used in new commands that allowed the character encoding used by all parts of the OS to be set in a systematic way.

After IBM and Microsoft ceased to cooperate in the 1990s, the two companies maintained their lists of assigned code page numbers independently of each other, resulting in some conflicting assignments. At least one third-party vendor (Oracle) also has its own, different list of numeric assignments. IBM's current assignments are listed in its CCSID repository. Microsoft's assignments seem not to be documented anywhere, but a list of the names and approximate IANA abbreviations for the installed code pages on any given Windows machine can be found in the Registry on that machine (this information is used by Microsoft programs such as Internet Explorer).

Most well-known code pages, excluding those for the CJK languages and Vietnamese, fit all their code points into 8 bits and involve nothing more than mapping each code point to a single character; techniques such as combining characters and complex scripts are not involved.
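
The one-byte-to-one-character mapping can be seen by building the full 256-entry table for a code page. The sketch below uses the `cp437` and `cp1252` codecs bundled with Python, on the assumption that they faithfully reflect the vendor tables; the helper name `decoding_table` is invented here.

```python
def decoding_table(codec: str) -> list:
    """Return the 256 characters that a single-byte codec assigns to bytes 0..255."""
    # errors="replace" substitutes U+FFFD for any byte the codec leaves undefined
    return [bytes([value]).decode(codec, errors="replace") for value in range(256)]

table = decoding_table("cp437")
print(len(table))                    # one entry per possible byte value
print(all(len(ch) == 1 for ch in table))

# The same byte means different things under different code pages.
raw = bytes([0x82])
for codec in ("cp437", "cp1252"):
    print(codec, repr(raw.decode(codec)))
```

Since the whole mapping is a fixed table indexed by byte value, the number of bytes in a text equals its number of characters under such a code page.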

The text mode of standard (VGA-compatible) PC graphics hardware is built around an 8-bit code page, though it is possible to use two at once at the cost of some color depth, and up to eight may be stored in the display adapter for easy switching (http://www.osdever.net/FreeVGA/vga/vgatext.htm). A selection of third-party code page fonts could be loaded into such hardware. It is now commonplace, however, for operating system vendors to provide their own character encoding and rendering systems that run in a graphics mode and bypass this hardware limitation entirely. The system of referring to character encodings by a code page number nevertheless remains applicable, as an efficient alternative to string identifiers such as those specified by the IETF and IANA for use in protocols such as e-mail and web pages.

Relationship to ASCII

The vast majority of code pages in current use are supersets of ASCII, a 7-bit code representing 128 control codes and printable characters. In the distant past, 8-bit implementations of the ASCII code set the top bit to zero or used it as a parity bit in network data transmissions. When the top bit was made available for representing character data, a total of 256 characters and control codes could be represented. Most vendors (including IBM) used this extended range to encode characters used by various languages, as well as graphical elements that allowed the imitation of primitive graphics on text-only output devices. No formal standard existed for these ‘extended character sets’, and vendors referred to the variants as code pages, as IBM had always done for variants of EBCDIC encodings.
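
Whether a given code page is an ASCII superset can be checked by decoding the 128 ASCII byte values and comparing the result with plain ASCII. The sketch below assumes Python's bundled codecs as stand-ins for the vendor tables (`cp037` is Python's name for one EBCDIC variant), and the function name is invented for illustration.

```python
ASCII_BYTES = bytes(range(128))

def is_ascii_superset(codec: str) -> bool:
    """True if bytes 0x00-0x7F decode to the same characters as in ASCII."""
    try:
        return ASCII_BYTES.decode(codec) == ASCII_BYTES.decode("ascii")
    except UnicodeDecodeError:
        # The codec rejects some of the 128 byte values outright.
        return False

for codec in ("cp437", "cp850", "cp1252", "cp037"):
    print(codec, is_ascii_superset(codec))
```

The PC and Windows code pages pass this check, while the EBCDIC variant does not, since EBCDIC arranges its letters and digits differently from ASCII.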

Relationship to Unicode

Unicode is an effort to include all characters from previous code pages in a single character enumeration that can be used with a number of encoding schemes. In the process, duplicate characters are eliminated and new variants are introduced, such as Fullwidth ASCII. While consistent use of any single Unicode encoding would theoretically eliminate the need to keep track of different code pages or character encodings, multiple encodings of Unicode exist, and there remains a need to stay compatible with existing documents and systems that use the older encodings. In practice, the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode.
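
Treating every code page as an encoding of a subset of Unicode makes conversion between code pages a two-step process: decode the bytes to Unicode characters, then encode those characters in the target encoding. A minimal sketch in Python, assuming its bundled `cp437` and `cp1252` codecs; the helper name `transcode` is invented here.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another by going through Unicode."""
    text = data.decode(source)   # code page bytes -> Unicode characters
    return text.encode(target)   # Unicode characters -> target bytes

legacy = b"caf\x82"  # the word "café" as stored under code page 437

print(transcode(legacy, "cp437", "utf-8"))
print(transcode(legacy, "cp437", "cp1252"))
```

The second step raises `UnicodeEncodeError` if the target code page lacks one of the characters, which is the practical consequence of each code page covering only a subset of Unicode.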

IBM PC (OEM) code pages

These code pages were originally embedded directly in the text mode hardware of the graphics adapters used with the IBM PC and its clones, including the original MDA and CGA adapters, whose character sets could only be changed by physically replacing a ROM chip that contained the font. The interface of those adapters (emulated by all later adapters such as VGA) was typically limited to single-byte character sets with only 256 characters in each font/encoding (although VGA added partial support for slightly larger character sets). Since the original IBM PC code page (number 437) was not really designed for international use, several partially compatible country- or region-specific variants emerged. Microsoft refers to these as the OEM code pages because they were defined by the OEMs who licensed MS-DOS for distribution with their hardware, not by Microsoft or a standards body. Examples include:
  • 437 — The original IBM PC code page
  • 720 — Arabic
  • 737 — Greek
  • 775 — Estonian, Lithuanian and Latvian
  • 850 — "Multilingual (Latin-1)" (Western European languages)
  • 852 — "Slavic (Latin-2)" (Central and Eastern European languages)
  • 855 — Cyrillic
  • 857 — Turkish
  • 858 — "Multilingual" with euro symbol
  • 860 — Portuguese
  • 861 — Icelandic
  • 862 — Hebrew
  • 863 — French (Quebec French)
  • 865 — Danish/Norwegian; differs from 437 mainly in having the letters Ø and ø in place of ¥ and ¢
  • 866 — Cyrillic
  • 869 — Greek
  • 874 — Thai

When dealing with older hardware, protocols and file formats, it is often necessary to support these code pages, but the use of newer code pages, in particular Unicode, is encouraged for new designs.
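
When supporting several of the OEM code pages, it can help to see exactly where two of them disagree. The sketch below lists the byte values at which two single-byte code pages decode to different characters. It assumes Python's bundled `cp437`, `cp865`, `cp850` and `cp858` codecs match the vendor tables, and the function name `code_page_diff` is invented for illustration.

```python
def code_page_diff(a: str, b: str) -> dict:
    """Return {byte value: (char under a, char under b)} where the two codecs differ."""
    diff = {}
    for value in range(256):
        raw = bytes([value])
        # errors="replace" yields U+FFFD for bytes a codec leaves undefined
        ch_a = raw.decode(a, errors="replace")
        ch_b = raw.decode(b, errors="replace")
        if ch_a != ch_b:
            diff[value] = (ch_a, ch_b)
    return diff

# Compare the original IBM PC code page with the Danish/Norwegian variant.
for value, (old, new) in code_page_diff("cp437", "cp865").items():
    print(f"0x{value:02X}: {old!r} -> {new!r}")

# Code page 858 was derived from 850 by placing the euro sign at code point 213.
print(code_page_diff("cp850", "cp858").get(213))
```

A difference table like this also shows which bytes are at risk of being misread when the code page of a legacy file is guessed incorrectly.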

Code pages for DBCS character sets

These code pages represent DBCS (double-byte character set) character encodings for various CJK languages. In Microsoft operating systems, these are used as both the "OEM" and "ANSI" code page for the applicable locale.
  • 932 — Supports Japanese
  • 936 — GBK; supports Simplified Chinese
  • 949 — Supports Korean
  • 950 — Supports Traditional Chinese
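
The difference from the single-byte code pages shows up in the encoded length of a character. The sketch below encodes one sample character per code page using Python's bundled `cp932`, `cp936`, `cp949` and `cp950` codecs, which are assumed here to match Microsoft's tables; the sample characters are arbitrary choices.

```python
# One arbitrary sample character per code page (Japanese, Chinese, Korean, Chinese).
SAMPLES = {
    "cp932": "\u3042",  # HIRAGANA LETTER A
    "cp936": "\u4e2d",  # CJK ideograph "middle"
    "cp949": "\ud55c",  # HANGUL SYLLABLE HAN
    "cp950": "\u4e2d",  # the same ideograph, Traditional Chinese code page
}

def encoded_length(codec: str, text: str) -> int:
    """Number of bytes the codec uses to represent the text."""
    return len(text.encode(codec))

for codec, ch in SAMPLES.items():
    # CJK characters take two bytes, while ASCII letters still take one.
    print(codec, ch, encoded_length(codec, ch), encoded_length(codec, "A"))
```

As the output suggests, these code pages mix one-byte and two-byte sequences in practice, so byte counts and character counts no longer coincide.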

Microsoft code page numbers for various other character encodings

The following code page numbers are specific to Microsoft Windows. IBM may use different numbers for these code pages.
  • 1200 — UTF-16LE Unicode little-endian
  • 1201 — UTF-16BE Unicode big-endian
  • 65000 — UTF-7 Unicode
  • 65001 — UTF-8 Unicode
  • 10000 — Macintosh Roman encoding (followed by several other Mac character sets)
  • 10007 — Macintosh Cyrillic encoding
  • 10029 — Macintosh Central European encoding
  • 20127 — US-ASCII, the classic US 7-bit character set with no character larger than 127
  • 28591 — ISO-8859-1 (followed by ISO-8859-2 to ISO-8859-15)
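
A program that receives one of these numbers typically has to translate it into whatever encoding name its own runtime understands. The sketch below maps a subset of the numbers listed above to Python codec names. The table `WINDOWS_CODE_PAGES` and the function `decode_with_code_page` are hypothetical, and the pairing of numbers with Python codec names is an assumption of this example rather than an official correspondence.

```python
# Hypothetical mapping from Microsoft code page numbers to Python codec names.
WINDOWS_CODE_PAGES = {
    1200: "utf_16_le",
    1201: "utf_16_be",
    65000: "utf_7",
    65001: "utf_8",
    10000: "mac_roman",
    10007: "mac_cyrillic",
    20127: "ascii",
    28591: "iso8859_1",
}

def decode_with_code_page(data: bytes, number: int) -> str:
    """Decode bytes using the codec registered for a Microsoft code page number."""
    try:
        codec = WINDOWS_CODE_PAGES[number]
    except KeyError:
        raise LookupError(f"no codec registered for code page {number}") from None
    return data.decode(codec)

# The same letter "A" is laid out differently in the two UTF-16 byte orders.
print(decode_with_code_page(b"A\x00", 1200))
print(decode_with_code_page(b"\x00A", 1201))
```

Raising `LookupError` for unknown numbers keeps unassigned or vendor-specific values from being silently decoded with the wrong table.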

Miscellaneous

  • (number missing) — ASMO449+; supports Arabic
  • (number missing) — MIK; supports Bulgarian, and Russian as well

Windows (ANSI) code pages

Microsoft defined a number of code pages known as the ANSI code pages (as the first one, 1252, was based on an apocryphal ANSI draft of what became ISO 8859-1). Code page 1252 is built on ISO 8859-1 but uses the range 0x80-0x9F for extra printable characters rather than the C1 control codes used in ISO 8859-1. Some of the others are based in part on other parts of ISO 8859 but are often rearranged to make them closer to 1252.
  • 1250
    Windows-1250
    Windows-1250 is a code page used under Microsoft Windows to represent texts in Central European and Eastern European languages that use Latin script, such as Polish, Czech, Slovak, Hungarian, Slovene, Bosnian, Croatian, Serbian , Romanian and Albanian...

     — Central
    Central Europe
    Central Europe or alternatively Middle Europe is a region of the European continent lying between the variously defined areas of Eastern and Western Europe...

     and East European
    Eastern Europe
    Eastern Europe is the eastern part of Europe. The term has widely disparate geopolitical, geographical, cultural and socioeconomic readings, which makes it highly context-dependent and even volatile, and there are "almost as many definitions of Eastern Europe as there are scholars of the region"...

     Latin
  • 1251
    Windows-1251
    Windows-1251 is a popular 8-bit character encoding, designed to cover languages that use the Cyrillic alphabet such as Russian, Bulgarian, Serbian Cyrillic and other languages...

     — Cyrillic
    Cyrillic alphabet
    The Cyrillic script or azbuka is an alphabetic writing system developed in the First Bulgarian Empire during the 10th century AD at the Preslav Literary School...

  • 1252
    Windows-1252
    Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages...

     — West European
    Western Europe
    Western Europe is a loose term for the collection of countries in the western most region of the European continents, though this definition is context-dependent and carries cultural and political connotations. One definition describes Western Europe as a geographic entity—the region lying in the...

     Latin
  • 1253
    Windows-1253
    Windows-1253 is a Windows code page used to write modern Greek. It is not capable of supporting the older polytonic Greek. It is not fully compatible with ISO 8859-7 because the letters like Ά are located at different byte values....

     — Greek
    Greek alphabet
    The Greek alphabet is the script that has been used to write the Greek language since at least 730 BC . The alphabet in its classical and modern form consists of 24 letters ordered in sequence from alpha to omega...

  • 1254
    Windows-1254
    Windows-1254 is a code page used under Microsoft Windows to write Turkish. Characters with codepoints A0 through FF are compatible with ISO 8859-9.Unicode is preferred to windows 1254 for modern applications- Code page layout :...

     — Turkish
    Turkish alphabet
    The Turkish alphabet is a Latin alphabet used for writing the Turkish language, consisting of 29 letters, seven of which have been modified from their Latin originals for the phonetic requirements of the language. This alphabet represents modern Turkish pronunciation with a high degree of accuracy...

  • 1255
    Windows-1255
    Windows-1255 is a codepage used under Microsoft Windows to write Hebrew. It is an almost compatible superset of ISO 8859-8 — the symbols are in the same positions...

     — Hebrew
    Hebrew alphabet
    The Hebrew alphabet, known variously by scholars as the Jewish script, square script, block script, or more historically, the Assyrian script, is used in the writing of the Hebrew language, as well as other Jewish languages, most notably Yiddish, Ladino, and Judeo-Arabic. There have been two...

  • 1256
    Windows-1256
    Windows-1256 is a code page used to write Arabic under Microsoft Windows.  This code page is not compatible with ISO 8859-6 and MacArabic encodings....

     — Arabic
    Arabic alphabet
    The Arabic alphabet or Arabic abjad is the Arabic script as it is codified for writing the Arabic language. It is written from right to left, in a cursive style, and includes 28 letters. Because letters usually stand for consonants, it is classified as an abjad...

  • 1257
    Windows-1257
    Windows-1257 is a single byte code page used to support the Estonian, Latvian and Lithuanian languages under Microsoft Windows. This code page is similar in layout to ISO 8859-13, but they differ in codepoints A1, A5, B4, FF, and of course in the range 80–9F, which is typically allocated with...

     — Baltic
    Baltic languages
    The Baltic languages are a group of related languages belonging to the Balto-Slavic branch of the Indo-European language family and spoken mainly in areas extending east and southeast of the Baltic Sea in Northern Europe...

  • 1258
    Windows-1258
    Windows-1258 is a codepage used in Microsoft Windows to represent Vietnamese texts. It makes use of combining diacritical marks. Windows-1258 is not compatible with VISCII...

     — Vietnamese
    Vietnamese alphabet
    The Vietnamese alphabet, called Chữ Quốc Ngữ, usually shortened to Quốc Ngữ, is the modern writing system for the Vietnamese language...

  • 874 — Thai
    Thai alphabet
    Thai script is used to write the Thai language and other minority languages in Thailand. It has forty-four consonants, fifteen vowel symbols that combine into at least twenty-eight vowel forms, and four tone marks...



Microsoft recommends applications use UTF-8 or UCS-2/UTF-16 instead of these code pages.
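
As an illustration of moving text from one of these code pages to UTF-8, the following minimal Python sketch decodes a byte string from Windows-1251 and re-encodes it. The sample bytes and the helper name transcode_to_utf8 are assumptions made for this example; "cp1251" is the name of Python's built-in codec for Windows-1251.

```python
# Hypothetical sample: the Russian word "Привет" stored as Windows-1251 bytes.
legacy_bytes = bytes([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2])


def transcode_to_utf8(data: bytes, source_code_page: str) -> bytes:
    """Decode bytes from a legacy code page and re-encode them as UTF-8."""
    text = data.decode(source_code_page)  # bytes -> Unicode string
    return text.encode("utf-8")           # Unicode string -> UTF-8 bytes


utf8_bytes = transcode_to_utf8(legacy_bytes, "cp1251")

print(utf8_bytes.decode("utf-8"))          # the text, now read back from UTF-8
print(len(legacy_bytes), len(utf8_bytes))  # one byte per letter vs. two per letter
```

The conversion only works if the source code page is known in advance. Decoding the same bytes with the wrong code page name would typically succeed without an error and silently produce different characters.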

Criticism

Many older character encodings, unlike Unicode
Unicode
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...

, suffer from several problems.
  1. Some code page vendors do not fully document the meaning of every code point value, which makes it harder to handle text consistently across different computer systems.
  2. Some vendors add proprietary extensions to some code pages to add or change certain code point values. For example, byte \x5C in Shift JIS can represent either a backslash or a yen currency symbol depending on the platform.
  3. To support several languages in a program that does not use Unicode, the code page used for each string or document must be stored along with it.


Because Unicode is extensively documented, has a vast repertoire of characters, and follows a character stability policy, these problems are rarely a concern for it.
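
The third problem can be illustrated with a short Python sketch. It shows that one byte value decodes to different characters under three Windows code pages, and that a program without Unicode therefore has to carry a code page label next to each piece of text. The TaggedText class is a hypothetical structure invented for this example; the codec names are those of Python's standard library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedText:
    """Hypothetical container: raw bytes plus the code page they were written in."""
    data: bytes
    code_page: str

    def to_str(self) -> str:
        # Without the stored code page, the bytes alone are ambiguous.
        return self.data.decode(self.code_page)


raw = b"\xe4"
for code_page in ("cp1252", "cp1251", "cp1253"):
    # Same byte, three different characters: Latin, Cyrillic, Greek.
    print(code_page, raw.decode(code_page))

documents = [TaggedText(b"\xe4", "cp1252"), TaggedText(b"\xe4", "cp1251")]
decoded = [doc.to_str() for doc in documents]
print(decoded)
```

Once text is held as Unicode, the label is no longer needed, because each character has a single code point regardless of language.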

Applications may also mislabel text in Windows-1252
Windows-1252
Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages...

 as ISO-8859-1. Fortunately, the only difference between these code pages is that the byte values 0x80 through 0x9F, which ISO-8859-1 uses for control characters, are instead used for additional printable characters in Windows-1252. Since these control characters have no function in HTML, web browsers tend to decode text labeled ISO-8859-1 as Windows-1252.
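
The difference between the two code pages can be checked with a small Python sketch. It decodes a hypothetical sample string both ways and then lists every byte value on which Python's "cp1252" and "latin-1" codecs disagree. Note that Python's cp1252 table leaves a few byte values unassigned; the sketch treats those as differences too, which is an assumption of this example rather than part of either standard.

```python
# Hypothetical sample: curly quotes and a euro sign as Windows-1252 bytes.
sample = b"\x93caf\xe9\x94 \x80 5"

as_cp1252 = sample.decode("cp1252")   # printable quotes and euro sign
as_latin1 = sample.decode("latin-1")  # the same bytes become C1 control characters

print(repr(as_cp1252))
print(repr(as_latin1))


def differing_bytes() -> list[int]:
    """Return every byte value that the two codecs decode differently."""
    diffs = []
    for value in range(256):
        single = bytes([value])
        latin1_char = single.decode("latin-1")
        try:
            cp1252_char = single.decode("cp1252")
        except UnicodeDecodeError:
            cp1252_char = None  # unassigned in Python's cp1252 table
        if cp1252_char != latin1_char:
            diffs.append(value)
    return diffs


diffs = differing_bytes()
print(hex(min(diffs)), hex(max(diffs)))  # → 0x80 0x9f
```

Every differing value falls in the range 0x80 through 0x9F, so text that contains none of these bytes decodes identically under both labels.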

Private code pages

Early in the history of personal computers, users whose character encoding requirements were not met created private or local code pages using Terminate and Stay Resident
Terminate and Stay Resident
Terminate and Stay Resident is a computer system call in DOS computer operating systems that returns control to the system as if the program has quit, but keeps the program in memory...

 utilities or by re-programming BIOS
BIOS
In IBM PC compatible computers, the basic input/output system, also known as the System BIOS or ROM BIOS, is a de facto standard defining a firmware interface...

 EPROM
EPROM
An EPROM, or erasable programmable read only memory, is a type of memory chip that retains its data when its power supply is switched off. In other words, it is non-volatile. It is an array of floating-gate transistors individually programmed by an electronic device that supplies higher voltages...

s. In some cases, unofficial code page numbers were invented (e.g., cp895).

When more diverse character set support became available, most of those code pages fell into disuse, with some exceptions such as the Kamenický
Kamenický encoding
The Kamenický encoding, named for the brothers Jiří and Marian Kamenický, was a code page for personal computers running MS-DOS, very popular in Czechoslovakia around 1985–1995...

 or KEYBCS2 encoding for the Czech
Czech alphabet
The Czech alphabet is a version of the Latin script, used when writing Czech. Its basic principles are "one sound, one letter" and the addition of diacritical marks above letters to represent sounds alien to Latin...

 and Slovak
Slovak alphabet
The Slovak alphabet uses a modification of the Latin alphabet. The modifications include the four diacriticals placed above certain letters. Therefore the Slovak alphabet has 46 graphemes...

 alphabets. Another example is the Iran System encoding standard
Iran System encoding standard
Iran System encoding standard was an 8-bit character encoding scheme and was created by Iran System corporation for Persian language support. This standard was in use in Iran in DOS-based programs and after introduction of Microsoft codepage 1256 this standard became obsolete...

 that was created by Iran System corporation for Persian language
Persian language
Persian is an Iranian language within the Indo-Iranian branch of the Indo-European languages. It is primarily spoken in Iran, Afghanistan, Tajikistan and countries which historically came under Persian influence...

 support. The standard was used in Iran in DOS-based programs and became obsolete after the introduction of Microsoft code page 1256. However, some Windows and DOS programs that use this encoding are still in use, and some Windows fonts with this encoding exist.

See also

  • Windows code page
    Windows code page
    Windows code pages are sets of characters or code pages used in Microsoft Windows from the 1980s and 1990s...

  • Character encoding
    Character encoding
    A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data through telecommunication networks or storage of text in...

  • CCSID
    CCSID
    CCSID is an abbreviation used by IBM to mean "Coded Character Set Identifier". It is a 16-bit number that represents a specific encoding of a specific code page...

    IBM's official "code page" definitions and assignments.

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 