Code page is another term for
character encodingA character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data through telecommunication networks or storage of text in...
. It consists of a table of values that describes the character set for a particular language. The term code page originated from
IBMInternational Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
's
EBCDICExtended Binary Coded Decimal Interchange Code is an 8-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems....
-based mainframe systems, but many vendors use this term including
MicrosoftMicrosoft Corporation is an American public multinational corporation headquartered in Redmond, Washington, USA that develops, manufactures, licenses, and supports a wide range of products and services predominantly related to computing through its various product divisions...
,
SAPSAP AG is a German software corporation that makes enterprise software to manage business operations and customer relations. Headquartered in Walldorf, Baden-Württemberg, with regional offices around the world, SAP is the market leader in enterprise application software...
, and
Oracle CorporationOracle Corporation is an American multinational computer technology corporation that specializes in developing and marketing hardware systems and enterprise software products – particularly database management systems...
. Vendors often allocate their own code page number to a
character encodingA character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data through telecommunication networks or storage of text in...
, even if it is better known by another name (for example
UTF-8UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...
character encoding has code page numbers 1208 at IBM, 65001 at Microsoft, 4110 at SAP).
The code page numbering system
IBMInternational Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
introduced the concept of systematically assigning a small, but globally unique, 16 bit number to each
character encodingA character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data through telecommunication networks or storage of text in...
that a computer system or collection of computer systems might encounter. The IBM origin of the numbering scheme is reflected in the fact that the smallest (first) numbers are assigned to variations of IBM's EBCDIC encoding and slightly larger numbers refer to variations of IBM's extended ASCII encoding as used in its PC hardware.
With the release of
PC-DOSIBM PC DOS is a DOS system for the IBM Personal Computer and compatibles, manufactured and sold by IBM from the 1980s to the 2000s....
version 3.3 (and the near identical
MS-DOSMS-DOS is an operating system for x86-based personal computers. It was the most commonly used member of the DOS family of operating systems, and was the main operating system for IBM PC compatible personal computers during the 1980s to the mid 1990s, until it was gradually superseded by operating...
3.3) IBM introduced the code page numbering system to regular PC users, as the code page numbers (and the phrase "code page") were used in new commands to allow the character encoding used by all parts of the OS to be set in a systematic way.
After IBM and
MicrosoftMicrosoft Corporation is an American public multinational corporation headquartered in Redmond, Washington, USA that develops, manufactures, licenses, and supports a wide range of products and services predominantly related to computing through its various product divisions...
ceased to cooperate in the 1990-s the two companies have maintained the list of assigned code page numbers independently from each other, resulting in some conflicting assignments. At least one 3rd party vendor (
OracleOracle Corporation is an American multinational computer technology corporation that specializes in developing and marketing hardware systems and enterprise software products – particularly database management systems...
) also has its own different list of numeric assignments. IBM's current assignments are listed in their
CCSIDCCSID is an abbreviation used by IBM to mean "Coded Character Set Identifier". It is a 16-bit number that represents a specific encoding of a specific code page...
repository. Microsoft's assignments seem not to be documented anywhere, but a list of the names and approximate IANA abbreviations for the installed code pages on any given Windows machine can be found in the Registry on that machine (this information is used by Microsoft programs such as
Internet ExplorerWindows Internet Explorer is a series of graphical web browsers developed by Microsoft and included as part of the Microsoft Windows line of operating systems, starting in 1995. It was first released as part of the add-on package Plus! for Windows 95 that year...
).
Most well-known code pages, excluding those for the
CJKCJK is a collective term for Chinese, Japanese, and Korean, which is used in the field of software and communications internationalization.The term CJKV means CJK plus Vietnamese, which constitute the main East Asian languages.- Characteristics :...
languages and
VietnameseVietnamese is the national and official language of Vietnam. It is the mother tongue of 86% of Vietnam's population, and of about three million overseas Vietnamese. It is also spoken as a second language by many ethnic minorities of Vietnam...
, fit all their code-points into 8 bits and do not involve anything more than mapping each code-point to a single character; furthermore, techniques such as combining characters, complex scripts, etc., are not involved.
The text mode of standard (VGA-compatible) PC graphics hardware is built around using an 8-bit code page, though it is possible to use two at once with some color depth sacrifice, and up to 8 may be stored in the display adaptor for easy switching
http://www.osdever.net/FreeVGA/vga/vgatext.htm. There were a selection of 3rd party code page fonts that could be loaded into such hardware. However, it is now commonplace for operating system vendors to provide their own character encoding and rendering systems that run in a graphics mode and bypass this hardware limitation entirely. However the system of referring to character encodings by a code page number remains applicable, as an efficient alternative to string identifiers such as those specified by the IETF and IANA for use in various protocols such as e-mail and web pages.
Relationship to ASCII
The vast majority of code pages in current use are supersets of
ASCIIThe American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text...
, a 7-bit code representing 128 control codes and printable characters. In the distant past, 8-bit implementations of the ASCII code set the top bit to zero or used it as a
parity bitA parity bit is a bit that is added to ensure that the number of bits with the value one in a set of bits is even or odd. Parity bits are used as the simplest form of error detecting code....
in network data transmissions. When the top bit was made available for representing character data, a total of 256 characters and control codes could be represented. Most vendors (including IBM) used this extended range to encode characters used by various languages and graphical elements that allowed the imitation of primitive graphics on text-only output devices. No formal standard existed for these ‘
extended character setsThe term extended ASCII describes eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others...
’ and vendors referred to the variants as code pages, as IBM had always done for variants of
EBCDICExtended Binary Coded Decimal Interchange Code is an 8-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems....
encodings.
Relationship to Unicode
UnicodeUnicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...
is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes. In the process, duplicate characters are eliminated and new variants are introduced, like
Fullwidth ASCII. While consistent use of any single Unicode encoding would theoretically eliminate the need to keep track of different code pages or character encodings, the existence of multiple encodings of Unicode as well as the need to remain compatible with existing documents and systems that use the older encodings remains. In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all
the other code pages have been technically redefined as encodings for various subsets of Unicode.
IBM PC (OEM) code pages
These code pages were originally embedded directly in the
text modeText mode is a kind of computer display mode in which the content of the screen is internally represented in terms of characters rather than individual pixels. Typically, the screen consists of a uniform rectangular grid of character cells, each of which contains one of the characters of a...
hardware of the graphic adapters used with the
IBM PCThe IBM Personal Computer, commonly known as the IBM PC, is the original version and progenitor of the IBM PC compatible hardware platform. It is IBM model number 5150, and was introduced on August 12, 1981...
and its clones, including the original MDA and CGA adapters whose character sets could only be changed by physically replacing a ROM chip that contained the font. The interface of those adapters (emulated by all later adapters such as VGA) was typically limited to single byte character sets with only 256 characters in each font/encoding (although VGA added partial support for slightly larger character sets). Since the original IBM PC code page (
number 437IBM PC or MS-DOS code page 437 is the character set of the original IBM PC. It is also known as CP 437, OEM 437, PC-8, MS-DOS Latin US or sometimes misleadingly referred to as the OEM font, High ASCII or Extended ASCII....
) was not really designed for international use, several partially compatible country or region specific variants emerged. Microsoft refers to these as the OEM code pages because they were defined by the
OEMOEM means the original manufacturer of a component for a product, which may be resold by another company.OEM may also refer to:-Computing:* OEM font, or OEM-US, the original character set of the IBM PC, circa 1981...
's who licensed MS-DOS for distribution with their hardware, not by Microsoft or a standard body. Examples include:
- 437
IBM PC or MS-DOS code page 437 is the character set of the original IBM PC. It is also known as CP 437, OEM 437, PC-8, MS-DOS Latin US or sometimes misleadingly referred to as the OEM font, High ASCII or Extended ASCII....
— The original IBM PC code page
- 720
Code page 720 is a code page used under MS-DOS to write Arabic. The Windows code page for Arabic is Windows-1256.- Codepage layout :...
— ArabicThe Arabic alphabet or Arabic abjad is the Arabic script as it is codified for writing the Arabic language. It is written from right to left, in a cursive style, and includes 28 letters. Because letters usually stand for consonants, it is classified as an abjad.-Consonants:The Arabic alphabet has...
- 737
Code page 737 is a code page used under MS-DOS to write Greek language. It was much more popular than code page 869.-Code page layout:...
— GreekThe Greek alphabet is the script that has been used to write the Greek language since at least 730 BC . The alphabet in its classical and modern form consists of 24 letters ordered in sequence from alpha to omega...
- 775
Code page 775 is a code page used under MS-DOS to write the Estonian, Lithuanian and Latvian languages.-Code page layout:...
— EstonianThe Estonian alphabet is used for writing the Estonian language and is based on the Latin alphabet, with German influence. As such, the Estonian alphabet has the letters Ä, Ö, and Ü , which represent the vowel sounds , and , respectively...
, LithuanianLithuanian employs a modified Roman script. It is composed of 32 letters. The collation order presents one surprise: "Y" is moved to occur between I Ogonek and J....
and LatvianThe Latvian alphabet is based on the Latin alphabet and consists of 33 letters. 22 of them are from the Latin alphabet; the remaining 11 are obtained from Latin letters by using diacritic marks...
- 850
Code page 850 is a code page used under MS-DOS in Western Europe. It is the code page commonly used by the version of MS-DOS underlying Windows ME...
— "MultilingualMultilingualism is the act of using, or promoting the use of, multiple languages, either by an individual speaker or by a community of speakers. Multilingual speakers outnumber monolingual speakers in the world's population. Multilingualism is becoming a social phenomenon governed by the needs of...
(Latin-1)" (Western EuropeWestern Europe is a loose term for the collection of countries in the western most region of the European continents, though this definition is context-dependent and carries cultural and political connotations. One definition describes Western Europe as a geographic entity—the region lying in the...
an languages)
- 852
Code page 852 is a code page used under MS-DOS to write Central European languages that use Latin script ....
— "SlavicThe Slavic languages , a group of closely related languages of the Slavic peoples and a subgroup of Indo-European languages, have speakers in most of Eastern Europe, in much of the Balkans, in parts of Central Europe, and in the northern part of Asia.-Branches:Scholars traditionally divide Slavic...
(Latin-2)" (CentralCentral Europe or alternatively Middle Europe is a region of the European continent lying between the variously defined areas of Eastern and Western Europe...
and Eastern EuropeEastern Europe is the eastern part of Europe. The term has widely disparate geopolitical, geographical, cultural and socioeconomic readings, which makes it highly context-dependent and even volatile, and there are "almost as many definitions of Eastern Europe as there are scholars of the region"...
an languages)
- 855
Code page 855 is a code page used under MS-DOS to write Cyrillic script. This code page is not used much.-Code page layout:...
— CyrillicThe Cyrillic script or azbuka is an alphabetic writing system developed in the First Bulgarian Empire during the 10th century AD at the Preslav Literary School...
- 857
Code page 857 is a code page used under MS-DOS to write Turkish.Code page 857 is based on code page 850, but with many changes. It includes all characters from ISO 8859-9.-Code page layout:...
— TurkishThe Turkish alphabet is a Latin alphabet used for writing the Turkish language, consisting of 29 letters, seven of which have been modified from their Latin originals for the phonetic requirements of the language. This alphabet represents modern Turkish pronunciation with a high degree of accuracy...
- 858
Code page 858 is a code page used under MS-DOS to write Western European languages.Code page 858 was created from code page 850 in 1998 by changing code point 213 from dotless I ⟨ı⟩ to the euro sign ⟨€⟩....
— "Multilingual" with euroThe euro is the official currency of the eurozone: 17 of the 27 member states of the European Union. It is also the currency used by the Institutions of the European Union. The eurozone consists of Austria, Belgium, Cyprus, Estonia, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg,...
symbol
- 860
Code page 860 is a code page used under MS-DOS to write Portuguese.-Code page layout:...
— PortugueseThe Portuguese alphabet, , consists of the following 23 or 26 Latin letters:In addition, the following characters with diacritics are used: Áá, Ââ, Ãã, Àà, Çç, Éé, Êê, Íí, Óó, Ôô, Õõ, Úú. These are not, however, treated as independent letters in collation, nor do they have entries of their own in...
- 861
Code page 861 is a code page used under MS-DOS to write the Icelandic language .-Code page layout:...
— IcelandicThe modern Icelandic alphabet consists of the following 32 letters:It is a Latin alphabet with diacritics, in addition it includes the character eth Ðð and the runic letter thorn Þþ...
- 862
Code page 862 is a code page used under MS-DOS for Hebrew.Like ISO 8859-8, it encodes only letters, not vowel-points or cantillation marks...
— HebrewThe Hebrew alphabet , known variously by scholars as the Jewish script, square script, block script, or more historically, the Assyrian script, is used in the writing of the Hebrew language, as well as other Jewish languages, most notably Yiddish, Ladino, and Judeo-Arabic. There have been two...
- 863
Code page 863 is a code page used under MS-DOS to write French language .-Code page layout:...
— FrenchThe French alphabet is based on the 26 letters of the Latin alphabet, uppercase and lowercase, with five diacritics and two orthographic ligatures.-Letter names:- Diacritics :...
(Quebec FrenchQuebec French , or Québécois French, is the predominant variety of the French language in Canada, in its formal and informal registers. Quebec French is used in everyday communication, as well as in education, the media, and government....
)
- 865
Code page 865 is a code page used under MS-DOS to write Nordic languages ....
— DanishDanish is a North Germanic language spoken by around six million people, principally in the country of Denmark. It is also spoken by 50,000 Germans of Danish ethnicity in the northern parts of Schleswig-Holstein, Germany, where it holds the status of minority language...
/NorwegianNorwegian is a North Germanic language spoken primarily in Norway, where it is the official language. Together with Swedish and Danish, Norwegian forms a continuum of more or less mutually intelligible local and regional variants .These Scandinavian languages together with the Faroese language...
Differs from 437 only in the letter Ø (ø) in place of ¥ and ¢
- 866
Code page 866 is a code page used under MS-DOS to write Cyrillic script. It is based on the "alternative character set" of GOST 19768-87...
— CyrillicThe Cyrillic script or azbuka is an alphabetic writing system developed in the First Bulgarian Empire during the 10th century AD at the Preslav Literary School...
- 869
Code page 869 is a code page used under MS-DOS to write Greek language. It is also called MS-DOS Greek 2. It was designed to include all characters from ISO 8859-7.Code page 869 was not as popular as code page 737....
— GreekThe Greek alphabet is the script that has been used to write the Greek language since at least 730 BC . The alphabet in its classical and modern form consists of 24 letters ordered in sequence from alpha to omega...
- 874 — Thai
Thai script , is used to write the Thai language and other, minority, languages in Thailand. It has forty-four consonants , fifteen vowel symbols that combine into at least twenty-eight vowel forms, and four tone marks ....
When dealing with older hardware, protocols and file formats, it is often necessary to support these code pages, but use of newer code pages, in particular
UnicodeUnicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...
, is encouraged for new designs.
Code pages for DBCS character sets
These code pages represent
DBCSA double-byte character set is a character set that represents each character with 2 bytes. The DBCS supports national languages that contain a large number of unique characters or symbols...
character encodings for various
CJKCJK is a collective term for Chinese, Japanese, and Korean, which is used in the field of software and communications internationalization.The term CJKV means CJK plus Vietnamese, which constitute the main East Asian languages.- Characteristics :...
languages. In Microsoft operating systems, these are used as both the "OEM" and "ANSI" code page for the applicable locale.
- 932
Code page 932 is Microsoft's extension of Shift JIS to include NEC special characters , NEC selection of IBM extensions , and IBM extensions . The coded character sets are JIS X0201:1997, JIS X0208:1997, and these extensions...
— Supports JapaneseThe modern Japanese writing system uses three main scripts:*Kanji, adopted Chinese characters*Kana, a pair of syllabaries , consisting of:...
- 936
Code page 936 is Microsoft's character encoding for simplified Chinese, one of the four DBCSs for East Asian languages. Originally it was identical to GB 2312, and expanded to cover most part of GBK with the release of Windows 95; now superseded by Code page 54936 .-External links:**...
— GBKGBK is an extension of the GB2312 character set for simplified Chinese characters, used in the People's Republic of China.GB abbreviates Guojia Biaozhun , which means national standard in Chinese, while K stands for Extension...
Supports Simplified Chinese
- 949
Code page 949 is Microsoft's implementation that appears similar to EUC-KR. This code page supports the Korean language. The code page is not registered with IANA, and hence, is not a standard to communicate information over the Internet, although it's often used for that. UTF-8 is much preferred...
— Supports KoreanHangul,Pronounced or ; Korean: 한글 Hangeul/Han'gŭl or 조선글 Chosŏn'gŭl/Joseongeul the Korean alphabet, is the native alphabet of the Korean language. It is a separate script from Hanja, the logographic Chinese characters which are also sometimes used to write Korean...
- 950
Code page 950 is Microsoft's implementation of the de facto standard Big5. The code page is not registered with IANA, and hence, is not a standard to communicate information over the internet. The major difference between code page 950 and Big5 is the incorporation of some ETEN characters at...
— Supports Traditional Chinese
Microsoft code page numbers for various other character encodings
The following code page numbers are specific to Microsoft Windows. IBM may use different numbers for these code pages.
- 1200 — UTF-16LE Unicode
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...
little-endian
- 1201 — UTF-16BE Unicode
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...
big-endian
- 65000 — UTF-7
UTF-7 is a variable-length character encoding that was proposed for representing Unicode text using a stream of ASCII characters...
UnicodeUnicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...
- 65001 — UTF-8
UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...
UnicodeUnicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...
- 10000 — Macintosh Roman encoding (followed by several other Mac character sets)
- 10007 — Macintosh Cyrillic encoding
The Macintosh Cyrillic encoding is used in Apple Macintosh computers to represent texts in the Cyrillic script.Each character is shown with its equivalent Unicode code point and its decimal code point. Only the second half of the table is shown, the first half being the same as ASCII....
- 10029 — Macintosh Central European encoding
Macintosh Central European encoding is used in Apple Macintosh computers to represent texts in Central European and Southeastern European languages that use the Latin script....
- 20127 — US-ASCII The classic US 7 bit character set with no char larger than 127
- 28591 — ISO-8859-1 (followed by ISO-8859-2 to ISO-8859-15)
Miscellaneous
- (number missing) — ASMO449+ Supports Arabic
The Arabic alphabet or Arabic abjad is the Arabic script as it is codified for writing the Arabic language. It is written from right to left, in a cursive style, and includes 28 letters. Because letters usually stand for consonants, it is classified as an abjad.-Consonants:The Arabic alphabet has...
- (number missing) — MIK
MIK is a Cyrillic code page used with MS-DOS. It is based on the character set used in the Bulgarian IBM PC compatible system.This is the most widespread DOS/OEM code page used in Bulgaria, rather than CP 855, CP 866 or CP 872....
Supports Bulgarian and RussianThe Russian alphabet is a form of the Cyrillic script, developed in the First Bulgarian Empire during the 10th century AD at the Preslav Literary School...
as well
Windows (ANSI) code pages
MicrosoftMicrosoft Corporation is an American public multinational corporation headquartered in Redmond, Washington, USA that develops, manufactures, licenses, and supports a wide range of products and services predominantly related to computing through its various product divisions...
defined a number of code pages known as the ANSI code pages (as the first one, 1252 was based on an
apocryphaThe term apocrypha is used with various meanings, including "hidden", "esoteric", "spurious", "of questionable authenticity", ancient Chinese "revealed texts and objects" and "Christian texts that are not canonical"....
l ANSI draft of what became ISO 8859-1). Code page 1252 is built on ISO 8859-1 but uses the range 0x80-0x9F for extra printable characters rather than the C1 control codes used in ISO-8859-1. Some of the others are based in part on other parts of ISO 8859 but often rearranged to make them closer to 1252.
- 1250
Windows-1250 is a code page used under Microsoft Windows to represent texts in Central European and Eastern European languages that use Latin script, such as Polish, Czech, Slovak, Hungarian, Slovene, Bosnian, Croatian, Serbian , Romanian and Albanian...
— CentralCentral Europe or alternatively Middle Europe is a region of the European continent lying between the variously defined areas of Eastern and Western Europe...
and East EuropeanEastern Europe is the eastern part of Europe. The term has widely disparate geopolitical, geographical, cultural and socioeconomic readings, which makes it highly context-dependent and even volatile, and there are "almost as many definitions of Eastern Europe as there are scholars of the region"...
Latin
- 1251
Windows-1251 is a popular 8-bit character encoding, designed to cover languages that use the Cyrillic alphabet such as Russian, Bulgarian, Serbian Cyrillic and other languages...
— CyrillicThe Cyrillic script or azbuka is an alphabetic writing system developed in the First Bulgarian Empire during the 10th century AD at the Preslav Literary School...
- 1252
Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages...
— West EuropeanWestern Europe is a loose term for the collection of countries in the western most region of the European continents, though this definition is context-dependent and carries cultural and political connotations. One definition describes Western Europe as a geographic entity—the region lying in the...
Latin
- 1253
Windows-1253 is a Windows code page used to write modern Greek. It is not capable of supporting the older polytonic Greek. It is not fully compatible with ISO 8859-7 because the letters like Ά are located at different byte values....
— GreekThe Greek alphabet is the script that has been used to write the Greek language since at least 730 BC . The alphabet in its classical and modern form consists of 24 letters ordered in sequence from alpha to omega...
- 1254
Windows-1254 is a code page used under Microsoft Windows to write Turkish. Characters with codepoints A0 through FF are compatible with ISO 8859-9.Unicode is preferred to windows 1254 for modern applications- Code page layout :...
— TurkishThe Turkish alphabet is a Latin alphabet used for writing the Turkish language, consisting of 29 letters, seven of which have been modified from their Latin originals for the phonetic requirements of the language. This alphabet represents modern Turkish pronunciation with a high degree of accuracy...
- 1255
Windows-1255 is a codepage used under Microsoft Windows to write Hebrew. It is an almost compatible superset of ISO 8859-8 — the symbols are in the same positions Windows-1255 is a codepage used under Microsoft Windows to write Hebrew. It is an almost compatible superset of ISO 8859-8 — the symbols...
— HebrewThe Hebrew alphabet , known variously by scholars as the Jewish script, square script, block script, or more historically, the Assyrian script, is used in the writing of the Hebrew language, as well as other Jewish languages, most notably Yiddish, Ladino, and Judeo-Arabic. There have been two...
- 1256
Windows-1256 is a code page used to write Arabic under Microsoft Windows. This code page is not compatible with ISO 8859-6 and MacArabic encodings....
— ArabicThe Arabic alphabet or Arabic abjad is the Arabic script as it is codified for writing the Arabic language. It is written from right to left, in a cursive style, and includes 28 letters. Because letters usually stand for consonants, it is classified as an abjad.-Consonants:The Arabic alphabet has...
- 1257
Windows-1257 is a single byte code page used to support the Estonian, Latvian and Lithuanian languages under Microsoft Windows. This code page is similar in layout to ISO 8859-13, but they differ in codepoints A1, A5, B4, FF, and of course in the range 80–9F, which is typically allocated with...
— BalticThe Baltic languages are a group of related languages belonging to the Balto-Slavic branch of the Indo-European language family and spoken mainly in areas extending east and southeast of the Baltic Sea in Northern Europe...
- 1258
Windows-1258 is a codepage used in Microsoft Windows to represent Vietnamese texts. It makes use of combining diacritical marks. Windows-1258 is not compatible with VISCII...
— VietnameseThe Vietnamese alphabet, called Chữ Quốc Ngữ , usually shortened to Quốc Ngữ , is the modern writing system for the Vietnamese language...
- 874 — Thai
Thai script , is used to write the Thai language and other, minority, languages in Thailand. It has forty-four consonants , fifteen vowel symbols that combine into at least twenty-eight vowel forms, and four tone marks ....
Microsoft recommends applications use UTF-8 or UCS-2/UTF-16 instead of these code pages.
Criticism
Many older character encodings, except
UnicodeUnicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...
, suffer from several problems.
- Some code page vendors insufficiently document the meaning of all code point values. This decreases the reliability of handling textual data through various computer systems consistently.
- Some vendors add proprietary extensions to some code pages to add or change certain code point values. For example, byte \x5C in Shift JIS can represent either a back slash or a yen currency symbol depending on the platform.
- In order to support several languages in a program that does not use Unicode, the code page used for each string/document needs to be stored.
Due to Unicode's extensive documentation, vast repertoire of characters and stability policy of characters, these problems are rarely a concern for Unicode.
Applications may also mislabel text in
Windows-1252Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages...
as ISO-8859-1. Fortunately, the only difference between these code pages is that the code point values used by ISO-8859-1 for control characters are instead used as additional printable characters in Windows-1252. Since control characters have no function in HTML, web browsers tend to use Windows-1252 rather than ISO-8859-1.
Private code pages
When, early in the history of personal computers, users didn't find their character encoding requirements met, private or local code pages were created using
Terminate and Stay ResidentTerminate and Stay Resident is a computer system call in DOS computer operating systems that returns control to the system as if the program has quit, but keeps the program in memory...
utilities or by re-programming
BIOSIn IBM PC compatible computers, the basic input/output system , also known as the System BIOS or ROM BIOS , is a de facto standard defining a firmware interface....
EPROMAn EPROM , or erasable programmable read only memory, is a type of memory chip that retains its data when its power supply is switched off. In other words, it is non-volatile. It is an array of floating-gate transistors individually programmed by an electronic device that supplies higher voltages...
s. In some cases, unofficial code page numbers were invented (
e.g., cp895).
When more diverse character set support became available most of those code pages fell into disuse, with some exceptions such as the
KamenickýThe Kamenický encoding , named for the brothers Jiří and Marian Kamenický, was a code page for personal computers running MS-DOS, very popular in Czechoslovakia around 1985–1995...
or KEYBCS2 encoding for the
CzechThe Czech alphabet is a version of the Latin script, used when writing Czech. Its basic principles are "one sound, one letter" and the addition of diacritical marks above letters to represent sounds alien to Latin...
and
SlovakThe Slovak alphabet uses a modification of the Latin alphabet. The modifications include the four diacriticals placed above certain letters. Therefore the Slovak alphabet has 46 graphemes.- Vowels :- Consonants :Notes...
alphabets. Another character set is
Iran System encoding standardIran System encoding standard was an 8-bit character encoding scheme and was created by Iran System corporation for Persian language support. This standard was in use in Iran in DOS-based programs and after introduction of Microsoft codepage 1256 this standard became obsolete...
that was created by Iran System corporation for
Persian languagePersian is an Iranian language within the Indo-Iranian branch of the Indo-European languages. It is primarily spoken in Iran, Afghanistan, Tajikistan and countries which historically came under Persian influence...
support. This standard was in use in Iran in DOS-based programs and after introduction of Microsoft code page 1256 this standard became obsolete. However some Windows and DOS programs using this encoding are still in use and some Windows fonts with this encoding exist.
See also
- Windows code page
Windows code pages are sets of characters or code pages used in Microsoft Windows from the 1980s and 1990s...
- Character encoding
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data through telecommunication networks or storage of text in...
- CCSID
CCSID is an abbreviation used by IBM to mean "Coded Character Set Identifier". It is a 16-bit number that represents a specific encoding of a specific code page...
IBM's official "code page" definitions and assignments.
External links