Character encoding - AbsoluteAstronomy.com

A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or storage of text in computers.

The terms character encoding, character set (charset), and sometimes character map or code page have often been used almost interchangeably, but they now have related but distinct meanings. See general terminology.

A character code may be represented as a bit pattern, octets, or a sequence of electrical pulses.

History

Common examples of character encoding systems include Morse code, the Baudot code, the American Standard Code for Information Interchange (ASCII) and Unicode.

Morse code was introduced in the 1840s and is used to encode each letter of the Latin alphabet and each Arabic numeral as a series of long and short presses of a telegraph key. Representations of characters encoded using Morse code vary in length.
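The variable-length property can be sketched in Python. The table below is a small excerpt of International Morse code, which is an assumption of this example (the original 1840s code differed in places), and the helper names `encode_morse` and `code_lengths` are hypothetical:

```python
# Excerpt of International Morse code: "." is a short press, "-" a long press.
MORSE = {
    "E": ".",
    "T": "-",
    "A": ".-",
    "S": "...",
    "O": "---",
    "Q": "--.-",
    "1": ".----",
}


def encode_morse(text):
    """Encode text using the excerpt table, separating characters with spaces."""
    return " ".join(MORSE[ch] for ch in text.upper())


def code_lengths(table):
    """Return the number of presses used for each character."""
    return {ch: len(code) for ch, code in table.items()}


if __name__ == "__main__":
    print(encode_morse("sos"))
    print(code_lengths(MORSE))
```

Frequent letters such as E and T take a single press, while digits take five, so a receiver cannot assume a fixed number of presses per character.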

The Baudot code was created by Émile Baudot in 1870, patented in 1874, modified by Donald Murray in 1901, and standardized by CCITT as International Telegraph Alphabet No. 2 (ITA2) in 1930.

ASCII was introduced in 1963 and is a 7-bit encoding scheme used to encode letters, numerals, symbols, and device control codes as fixed-length codes using integers.
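As a rough illustration of the fixed-length, 7-bit property, the following Python sketch lists the integer code and bit pattern of each character. The helper name `ascii_codes` is an assumption made for this example; only the standard `ascii` codec is used:

```python
def ascii_codes(text):
    """Return (character, integer code, 7-bit pattern) for each ASCII character.

    Raises UnicodeEncodeError if the text contains a non-ASCII character.
    """
    data = text.encode("ascii")
    return [(chr(b), b, format(b, "07b")) for b in data]


if __name__ == "__main__":
    for ch, code, bits in ascii_codes("Az0 "):
        print(repr(ch), code, bits)
```

Every pattern is exactly seven bits long, in contrast to the variable-length Morse representations.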

IBM's Extended Binary Coded Decimal Interchange Code (usually abbreviated EBCDIC) is an 8-bit encoding scheme developed in 1963.
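To see that ASCII and EBCDIC assign different codes to the same characters, one can compare the two in Python. This sketch assumes that Python's `cp037` codec is an acceptable stand-in for EBCDIC (it is one of several EBCDIC code pages), and the helper name `compare_ascii_ebcdic` is hypothetical:

```python
def compare_ascii_ebcdic(text):
    """Return (character, ASCII code, EBCDIC code page 037 code) per character."""
    return [
        (ch, ch.encode("ascii")[0], ch.encode("cp037")[0])
        for ch in text
    ]


if __name__ == "__main__":
    for ch, ascii_code, ebcdic_code in compare_ascii_ebcdic("Aa0"):
        print(f"{ch!r}: ASCII 0x{ascii_code:02X}, EBCDIC 0x{ebcdic_code:02X}")
```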

The limitations of such sets soon became apparent, and a number of ad hoc methods were developed to extend them. The need to support more writing systems for different languages, including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding rather than the previous ad hoc approaches.

Early binary repertoires include:

Braille
International maritime signal flags
Chinese telegraph code (Hans Schjellerup, 1869, modified 1872 and following): encoding of Chinese characters as 4-digit decimals.

Common character encodings

ISO 646
- ASCII
EBCDIC
- CP37
- CP930
- CP1047
ISO 8859:
- ISO 8859-1 – Western Europe
- ISO 8859-2 – Western and Central Europe
- ISO 8859-3 – Western and Southern Europe (Turkish, Maltese, plus Esperanto)
- ISO 8859-4 – Western Europe and Baltic countries (Lithuania, Estonia, Latvia), plus Sami (Lapp)
- ISO 8859-5 – Cyrillic alphabet
- ISO 8859-6 – Arabic
- ISO 8859-7 – Greek
- ISO 8859-8 – Hebrew
- ISO 8859-9 – Western Europe with amended Turkish character set
- ISO 8859-10 – Western Europe with rationalised character set for Nordic languages, including complete Icelandic set
- ISO 8859-11 – Thai
- ISO 8859-13 – Baltic languages plus Polish
- ISO 8859-14 – Celtic languages (Irish Gaelic, Scottish Gaelic, Welsh)
- ISO 8859-15 – Added the Euro sign and other rationalisations to ISO 8859-1
- ISO 8859-16 – Central, Eastern and Southern European languages (Albanian, Croatian, Hungarian, Polish, Romanian, Serbian and Slovenian, but also French, German, Italian and Irish Gaelic)
CP437, CP737, CP850, CP852, CP855, CP857, CP858, CP860, CP861, CP862, CP863, CP865, CP866, CP869
MS-Windows character sets:
- Windows-1250 for Central European languages that use Latin script (Polish, Czech, Slovak, Hungarian, Slovene, Serbian, Croatian, Romanian and Albanian)
- Windows-1251 for Cyrillic alphabets
- Windows-1252 for Western languages
- Windows-1253 for Greek
- Windows-1254 for Turkish
- Windows-1255 for Hebrew
- Windows-1256 for Arabic
- Windows-1257 for Baltic languages
- Windows-1258 for Vietnamese
Mac OS Roman
KOI8-R, KOI8-U, KOI7
MIK
ISCII
TSCII
VISCII
JIS X 0208 is a widely deployed standard for Japanese character encoding that has several encoding forms.
- Shift JIS (Microsoft Code page 932 is a dialect of Shift JIS)
- EUC-JP
- ISO-2022-JP
JIS X 0213 is an extended version of JIS X 0208.
- Shift JIS-2004
- EUC-JIS-2004
- ISO-2022-JP-2004
Chinese Guobiao
- GB 2312
- GBK (Microsoft Code page 936)
- GB 18030
Taiwan Big5 (a more famous variant is Microsoft Code page 950)
Hong Kong HKSCS
Korean
- KS X 1001 is a Korean double-byte character encoding standard
- EUC-KR
- ISO-2022-KR
Unicode (and subsets thereof, such as the 16-bit 'Basic Multilingual Plane'). See UTF-8.
ANSEL or ISO/IEC 6937
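
One practical consequence of having so many 8-bit encodings in the list above is that the same byte value can stand for different characters depending on which encoding is assumed. A minimal Python sketch, assuming the standard-library codec names `iso8859_1`, `iso8859_15`, `cp437`, `cp1251` and `koi8_r` correspond to the encodings listed; the helper name `decode_under` is hypothetical:

```python
def decode_under(byte_value, encodings):
    """Decode a single byte under each named encoding.

    Returns a dict mapping the encoding name to the decoded character.
    """
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in encodings}


if __name__ == "__main__":
    names = ["iso8859_1", "iso8859_15", "cp437", "cp1251", "koi8_r"]
    for name, ch in decode_under(0xA4, names).items():
        print(f"{name:>10}: {ch!r} (U+{ord(ch):04X})")
```

For instance, byte 0xA4 is the generic currency sign in ISO 8859-1 but the Euro sign in ISO 8859-15, which is why text must be labelled with its encoding to be read back reliably.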

Character encoding translation

Because many character encoding methods are in use (and archived data must remain readable), many computer programs have been developed to translate data between encoding schemes. Some of these are listed below.

Cross-platform


Web browsers – most modern web browsers feature automatic character encoding detection. On Firefox 3, for example, see the View/Character Encoding submenu.
iconv – program and standardized API to convert encodings
convert_encoding.py – Python based utility to convert text files between arbitrary encodings and line endings.
decodeh.py – algorithm and module to heuristically guess the encoding of a string.
International Components for Unicode – A set of C and Java libraries to perform charset conversion. uconv can be used from ICU4C.
chardet – a translation of the Mozilla automatic-encoding-detection code into the Python programming language.
file – newer versions of the Unix file command attempt a basic detection of character encoding (also available on Cygwin and Mac).

Linux


recode – convert file contents from one encoding to another
utrac – convert file contents from one encoding to another.
cstocs – convert file contents from one encoding to another
convmv – convert a filename from one encoding to another.
enca – analyzes encodings for given text files.

Windows


Encoding.Convert – .NET API
MultiByteToWideChar/WideCharToMultiByte – Convert from ANSI to Unicode & Unicode to ANSI
cscvt – character set conversion tool
enca – analyzes encodings for given text files.
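
The kind of conversion these tools perform can be sketched in a few lines of Python using only the built-in codecs. This is an illustration of the general decode-then-encode approach, not how any of the listed programs is implemented; the function name `transcode` and its parameters are assumptions of this example:

```python
def transcode(data: bytes, source: str, target: str, errors: str = "strict") -> bytes:
    """Convert bytes from one encoding to another.

    The bytes are first decoded to abstract characters using the source
    encoding, then re-encoded using the target encoding. With the default
    errors="strict", a character missing from the target encoding raises
    UnicodeEncodeError; errors="replace" substitutes "?" instead.
    """
    text = data.decode(source)
    return text.encode(target, errors)


if __name__ == "__main__":
    original = "Привет".encode("koi8_r")
    converted = transcode(original, "koi8_r", "cp1251")
    print(original.hex(" "))
    print(converted.hex(" "))
    print(converted.decode("cp1251"))
```

Conversion is only lossless when every character of the input also exists in the target encoding, which is one reason a universal repertoire is attractive.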

Unicode encoding model

Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a modern, unified character encoding. Rather than mapping characters directly to octets (bytes), they separately define what characters are available, their numbering, how those numbers are encoded as a series of "code units" (limited-size numbers), and finally how those units are encoded as a stream of octets. The idea behind this decomposition is to establish a universal set of characters that can be encoded in a variety of ways. To correctly describe this model one needs more precise terms than "character set" and "character encoding". The terms used in the modern model follow:

A character repertoire is the full set of abstract characters that a system supports. The repertoire may be closed, i.e. no additions are allowed without creating a new standard (as is the case with ASCII and most of the ISO-8859 series), or it may be open, allowing additions (as is the case with Unicode and to a limited extent the Windows code pages). The characters in a given repertoire reflect decisions that have been made about how to divide writing systems into linear information units. The basic variants of the Latin, Greek, and Cyrillic alphabets can be broken down into letters, digits, punctuation, and a few special characters like the space, which can all be arranged in simple linear sequences that are displayed in the same order they are read. Even with these alphabets, however, diacritics pose a complication: they can be regarded either as part of a single character containing a letter and diacritic (known in modern terminology as a precomposed character), or as separate characters. The former allows a far simpler text handling system, but the latter allows any letter/diacritic combination to be used in text. Other writing systems, such as Arabic and Hebrew, are represented with more complex character repertoires due to the need to accommodate things like bidirectional text and glyphs that are joined together in different ways for different situations.
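
The two ways of treating a diacritic can be seen in a short Python sketch. It uses the standard library's unicodedata module; Unicode normalization (NFC and NFD) is not discussed in the text above and is brought in here only as an assumption-free way to convert between the precomposed and the separate-character view.

```python
import unicodedata

# "e with acute" as one precomposed character (U+00E9) ...
precomposed = "\u00e9"
# ... and as a base letter followed by a combining acute accent (U+0065 U+0301).
decomposed = "e\u0301"

# Both strings are displayed alike but are different character sequences.
print(len(precomposed), len(decomposed))  # 1 2
print(precomposed == decomposed)          # False

# Normalization converts between the two views.
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Software that compares or searches text usually has to pick one of the two forms first, which is the extra handling cost the paragraph above alludes to.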

A coded character set (CCS) specifies how to represent a repertoire of characters using a number of non-negative integer codes called code points. For example, in a given repertoire, a character representing the capital letter "A" in the Latin alphabet might be assigned to the integer 65, the character for "B" to 66, and so on. A complete set of characters and corresponding integers is a coded character set. Multiple coded character sets may share the same repertoire; for example, ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover the same repertoire but map its characters to different codes. In a coded character set, each code point represents only one character, i.e., a coded character set is a function from code points to characters.
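
As a rough illustration, the following Python sketch looks up the code that three coded character sets assign to the same character. It relies on the codecs that ship with Python under the names latin-1, cp037 and cp500; the helper code_of is a hypothetical name introduced only for this example.

```python
# Unicode (like ASCII and ISO/IEC 8859-1) assigns "A" the code 65 and "B" 66.
print(ord("A"), ord("B"))  # 65 66

codecs_to_compare = ("latin-1", "cp037", "cp500")


def code_of(char: str, codec: str) -> int:
    """Return the single-byte code that `codec` assigns to `char`."""
    encoded = char.encode(codec)
    if len(encoded) != 1:
        raise ValueError("expected a single-byte encoding")
    return encoded[0]


# The same character lands on different codes in different sets.
for ch in ("A", "["):
    codes = {name: hex(code_of(ch, name)) for name in codecs_to_compare}
    print(repr(ch), codes)
```

The letter "A" sits at 0x41 in ISO/IEC 8859-1 but at 0xC1 in the EBCDIC-based code page 037, and the two IBM code pages in turn disagree on where to place "[".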

A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits (i.e. practically any computer system). For example, a system that stores numeric information in 16-bit units would only be able to directly represent integers from 0 to 65,535 in each unit, but larger integers could be represented if more than one 16-bit unit could be used. This is what a CEF accommodates: it defines a way of mapping a single code point from a range of, say, 0 to 1.4 million, to a series of one or more code values from a range of, say, 0 to 65,535.
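
The mapping from one large code point to several 16-bit values can be sketched by hand. The function below follows the UTF-16 rules (surrogate pairs) as one concrete CEF; the function name is an assumption of this example, and the range limit used is Unicode's actual code space (0 to 0x10FFFF) rather than the round figure given above.

```python
def to_utf16_code_units(code_point: int) -> list[int]:
    """Map one Unicode code point to one or two 16-bit code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        return [code_point]          # fits directly in one 16-bit unit
    offset = code_point - 0x10000    # a 20-bit value
    high = 0xD800 + (offset >> 10)   # carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # carries the bottom 10 bits
    return [high, low]


print([hex(u) for u in to_utf16_code_units(0x41)])     # ['0x41']
print([hex(u) for u in to_utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
```

Every returned value fits in the range 0 to 65,535, which is exactly the constraint the 16-bit storage unit imposes.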

The simplest CEF is to choose units large enough that the values from the coded character set can be encoded directly (one code point to one code value). This works well for coded character sets that fit in 8 bits (as most legacy non-CJK encodings do) and reasonably well for coded character sets that fit in 16 bits (such as early versions of Unicode). However, as the size of the coded character set increases (e.g. modern Unicode requires at least 21 bits per character), this becomes less and less efficient, and it is difficult to adapt existing systems to use larger code values. Therefore, most systems working with later versions of Unicode use either UTF-8, which maps Unicode code points to variable-length sequences of octets, or UTF-16, which maps Unicode code points to variable-length sequences of 16-bit words.
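
The variable length of both forms is easy to observe with Python's built-in codecs. This is only a small sketch; unit_counts is a hypothetical helper name, and the sample characters are arbitrary choices.

```python
def unit_counts(char: str) -> tuple[int, int]:
    """Return (UTF-8 octets, UTF-16 16-bit words) needed for one character."""
    utf8_octets = len(char.encode("utf-8"))
    utf16_words = len(char.encode("utf-16-be")) // 2  # two octets per word
    return utf8_octets, utf16_words


# LATIN CAPITAL A, E WITH ACUTE, EURO SIGN, GRINNING FACE
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    octets, words = unit_counts(ch)
    print(f"U+{ord(ch):04X}: {octets} octet(s) in UTF-8, {words} word(s) in UTF-16")
```

The four samples need 1, 2, 3 and 4 octets in UTF-8, while UTF-16 uses a single word for the first three and two words for the last.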

Next, a character encoding scheme (CES) specifies how the fixed-size integer code values should be mapped into an octet sequence suitable for saving on an octet-based file system or transmitting over an octet-based network. With Unicode, a simple character encoding scheme is used in most cases, simply specifying whether the bytes for each integer should be in big-endian or little-endian order (even this isn't needed with UTF-8). However, there are also compound character encoding schemes, which use escape sequences to switch between several simple schemes (such as ISO/IEC 2022), and compressing schemes, which try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode). See comparison of Unicode encodings for a detailed discussion.
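
The byte-order choice can be made visible with a single character. The sketch below uses Python's utf-16-be, utf-16-le and utf-8 codecs; the variable names are illustrative only.

```python
text = "\u20ac"  # EURO SIGN: a single 16-bit code unit, 0x20AC, in UTF-16

big_endian = text.encode("utf-16-be")     # most significant octet first
little_endian = text.encode("utf-16-le")  # least significant octet first
utf8 = text.encode("utf-8")               # octet-based, so no byte order to choose

print(big_endian.hex())     # 20ac
print(little_endian.hex())  # ac20
print(utf8.hex())           # e282ac
```

The code unit is identical in the first two cases; only its serialization into octets differs, which is the step a CES is responsible for.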

Finally, there may be a higher level protocol which supplies additional information that can be used to select the particular variant of a Unicode character, particularly where there are regional variants that have been 'unified' in Unicode as the same character. An example is the XML attribute xml:lang.

The Unicode model reserves the term character map for historical systems which directly assign a sequence of characters to a sequence of bytes, covering all of CCS, CEF and CES layers.

General terminology

In computer science, the terms character encoding, character map, character set or code page were historically synonymous, as the same standard would specify a repertoire of characters and how they were to be encoded into a stream of code units – usually with a single character per code unit. The terms now have related but distinct meanings, reflecting the efforts of standards bodies to use precise terminology when writing about and unifying many different encoding systems. Regardless, the terms are still used interchangeably, with character set being nearly ubiquitous.

A code page usually means a byte-oriented encoding, but with emphasis on some suite of encodings (covering different scripts), where many characters share the same codes in all (or most) of these code pages. Well-known code page suites are "Windows" (based on Windows-1252) and "IBM"/"DOS" (based on code page 437); see Windows code page for details. Most encodings referred to as code pages, but not all of them, are single-byte encodings.

IBM's Character Data Representation Architecture (CDRA) designates entities with coded character set identifiers (CCSIDs), each of which is variously called a charset, character set, code page, or CHARMAP.

The term code page does not occur in Unix or Linux, where charmap is preferred, usually in the larger context of locales.

In contrast to a CCS as described above, a character encoding is a map from abstract characters to code words. A character set in HTTP (and MIME) parlance is the same as a character encoding (but not the same as CCS).
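
In practice this means the charset parameter of a MIME or HTTP Content-Type header names the encoding needed to turn the received octets back into characters. A minimal sketch using Python's standard email.message module follows; the header value and body bytes are made-up sample data.

```python
from email.message import Message

# A made-up header such as an HTTP response or MIME part might carry.
msg = Message()
msg["Content-Type"] = "text/plain; charset=ISO-8859-1"

# The "charset" label identifies a complete character encoding.
charset = msg.get_content_charset()
print(charset)  # iso-8859-1

# Made-up body octets, decoded with the encoding the label names.
body_octets = b"caf\xe9"
decoded = body_octets.decode(charset)
print(len(decoded))  # 4
```

The label goes straight from the header to the decoder, so it must identify everything from octets to characters, not just a numbering of a repertoire.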

Legacy encoding is a term sometimes used to characterize old character encodings, but with an ambiguity of sense. Most of its use is in the context of Unicodification, where it refers to encodings that fail to cover all Unicode code points, or, more generally, that use a somewhat different character repertoire: several code points representing one Unicode character, or vice versa (see e.g. code page 437). Some sources refer to an encoding as legacy only because it preceded Unicode. All Windows code pages are usually referred to as legacy, both because they antedate Unicode and because they are unable to represent all 1,114,112 possible Unicode code points.
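
The coverage gap is simple to demonstrate with Python's cp1252 codec, which implements Windows-1252. The helper can_encode is a hypothetical name used only for this sketch.

```python
def can_encode(text: str, codec: str) -> bool:
    """Report whether every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("\u20ac", "cp1252"))  # True:  EURO SIGN is in Windows-1252
print(can_encode("\u65e5", "cp1252"))  # False: a CJK ideograph is not
print(can_encode("\u65e5", "utf-8"))   # True:  UTF-8 covers all of Unicode
```

Converting text into a legacy encoding therefore needs an explicit policy for characters that have no code in the target, such as rejecting the text or substituting a placeholder.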

External links

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
