Code point - AbsoluteAstronomy.com

In character encoding terminology, a code point or code position is any of the numerical values that make up the code space (or code page). For example, ASCII comprises 128 code points in the range 0_hex to 7F_hex, Extended ASCII comprises 256 code points in the range 0_hex to FF_hex, and Unicode comprises 1,114,112 code points in the range 0_hex to 10FFFF_hex. The Unicode code space is divided into seventeen planes (the basic multilingual plane and 16 supplementary planes), each with 65,536 (= 2¹⁶) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.
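The arithmetic above can be checked with a short Python sketch. The constant names and the `plane_of` helper are illustrative assumptions, not part of any standard API; only the built-in `ord` is used to obtain a character's code point.

```python
# Sizes and ranges quoted in the text, expressed as code point arithmetic.
PLANE_SIZE = 2 ** 16          # 65,536 code points per plane
PLANE_COUNT = 17              # plane 0 (basic multilingual plane) plus 16 supplementary planes
MAX_CODE_POINT = 0x10FFFF     # last code point of the Unicode code space


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) containing the code point (hypothetical helper)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside the Unicode code space: {code_point:#x}")
    return code_point // PLANE_SIZE


code_space_size = PLANE_COUNT * PLANE_SIZE
print(code_space_size)                 # 1114112
print(ord("A"), plane_of(ord("A")))    # 65 0
print(plane_of(0x1F600))               # 1
print(plane_of(MAX_CODE_POINT))        # 16
```

Integer division by the plane size is enough here because every plane has the same number of code points.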

Definition

The notion of a code point is used for abstraction, to distinguish both:

the number from an encoding as a sequence of bits, and
the abstract character from a particular graphical representation (glyph).

This is because one may wish to make these distinctions:

encode a particular code space in different ways, or
display a character via different glyphs.

For Unicode, the particular sequence of bits is called a code value. In the UCS-4 encoding, each code point is encoded as a 4-byte (octet) binary number, which is fixed-width and simple but inefficient. In the UTF-8 encoding, each code point is encoded as a sequence of 1 to 4 bytes, which is variable-width, hence usually more space-efficient but more complex, and backwards compatible with ASCII.
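As a rough illustration of the difference between a fixed-width and a variable-width encoding, the Python sketch below encodes the same four code points both ways. It assumes Python's `utf-32-be` codec as a stand-in for UCS-4; the sample characters are chosen arbitrarily.

```python
# Encode the same code points under a fixed-width and a variable-width scheme.
# Escapes are used so the source itself stays pure ASCII.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]   # U+0041, U+00E9, U+20AC, U+1F600

for ch in samples:
    fixed = ch.encode("utf-32-be")   # always 4 bytes per code point
    variable = ch.encode("utf-8")    # 1 to 4 bytes per code point
    print(f"U+{ord(ch):04X}  utf-32: {fixed.hex()}  utf-8: {variable.hex()}")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf32_lengths = [len(ch.encode("utf-32-be")) for ch in samples]
print(utf8_lengths)    # [1, 2, 3, 4]
print(utf32_lengths)   # [4, 4, 4, 4]

# ASCII compatibility: an ASCII character has the same single byte in UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))   # True
```

The code point itself (the value returned by `ord`) is the same in both cases; only the byte sequence that represents it changes.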

Code points are normally assigned to abstract characters. An abstract character is not a graphical glyph but a unit of textual data. The precise appearance of the character depends on the font. However, code points may also be left reserved for future assignment (most of the Unicode code space is unassigned), or given other designated functions.
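One way to see that not every code point is an assigned character is to look up its general category. The sketch below uses Python's standard `unicodedata` module; the `describe` helper is a hypothetical name introduced for illustration, and the exact set of assigned code points depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Return the code point, its general category and its name, if it has one."""
    ch = chr(code_point)
    category = unicodedata.category(ch)
    name = unicodedata.name(ch, "<no name>")   # default avoids ValueError
    return f"U+{code_point:04X} {category} {name}"


print(describe(0x0041))   # U+0041 Lu LATIN CAPITAL LETTER A
print(describe(0xE000))   # category Co: private use
print(describe(0xD800))   # category Cs: surrogate
print(describe(0xFFFF))   # category Cn: noncharacter (Cn also covers unassigned code points)
```

The last three code points have designated functions rather than assigned abstract characters, which is why they carry no character name.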

Unicode text

A Unicode text file is not necessarily a sequence of code points encoded into 4-byte blocks. Instead, an encoding scheme is used to serialize a sequence of code points into a sequence of bytes. A number of such schemes exist, and they trade off space efficiency against ease of encoding. A variable number of bytes can be used for each character. For example, UTF-8 maintains some compatibility with ASCII: code points in the ASCII range are encoded as the same single bytes. Encoding schemes whose units are larger than one byte, such as UTF-16 and UTF-32, also take into account endianness. A scheme may also have the property of being a self-synchronizing code, meaning character boundaries can be found without having to read from the beginning of the string.
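The two properties mentioned above can be sketched in Python. The `char_start` helper is a hypothetical function written for this illustration; it assumes the input is valid UTF-8 and relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def char_start(data: bytes, index: int) -> int:
    """Back up from index to the first byte of the UTF-8 sequence containing it."""
    # Continuation bytes look like 10xxxxxx; lead bytes and ASCII bytes do not.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


text = "a\u20acb"               # 'a', EURO SIGN (U+20AC), 'b'
data = text.encode("utf-8")     # b'a\xe2\x82\xacb'

# Starting from any byte offset, the character boundary is found locally.
starts = [char_start(data, i) for i in range(len(data))]
print(starts)                   # [0, 1, 1, 1, 4]

# Endianness: the same code point serialized in both byte orders.
big = "\u20ac".encode("utf-16-be")
little = "\u20ac".encode("utf-16-le")
print(big.hex(), little.hex())  # 20ac ac20
```

The boundary search never looks further back than the start of the current character, which is what makes it independent of the beginning of the string.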

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.