Code point
Encyclopedia
In character encoding
Character encoding
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data through telecommunication networks or storage of text in...

 terminology, a code point or code position is any of the numerical values that make up the code space (or code page
Code page
Code page is another term for character encoding. It consists of a table of values that describes the character set for a particular language. The term code page originated from IBM's EBCDIC-based mainframe systems, but many vendors use this term including Microsoft, SAP, and Oracle Corporation...

). For example, ASCII
ASCII
The American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text...

 comprises 128 code points in the range 0hex
Hexadecimal
In mathematics and computer science, hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen...

to 7Fhex, Extended ASCII
Extended ASCII
The term extended ASCII describes eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others...

 comprises 256 code points in the range 0hex
Hexadecimal
In mathematics and computer science, hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen...

to FFhex, and Unicode
Unicode
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...

 comprises 1,114,112 code points in the range 0hex
Hexadecimal
In mathematics and computer science, hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen...

to 10FFFFhex. The Unicode code space is divided into seventeen planes
Mapping of Unicode character planes
In the Unicode system, planes are groups of numerical values that point to specific characters. Unicode code points are logically divided into 17 planes, each with 65,536 code points. Planes are identified by the numbers 0 to 16decimal, which corresponds with the possible values 00-10hexadecimal...

 (the basic multilingual plane, and 16 supplementary planes), each with 65,536 (= 216) code points. Thus the total size of the Unicode code space is 17 × 65,536 = 1,114,112.

Definition

The notion of a code point is used for abstraction, to distinguish both:
  • the number from an encoding as a sequence of bits, and
  • the abstract character from a particular graphical representation (glyph
    Glyph
    A glyph is an element of writing: an individual mark on a written medium that contributes to the meaning of what is written. A glyph is made up of one or more graphemes....

    ).


This is because one may wish to make these distinctions:
  • encode a particular code space in different ways, or
  • display a character via different glyphs.


For Unicode, the particular sequence of bits is called a code value – for the UCS-4 encoding, characters/code points are encoded as 4-byte (octet
Octet
-Music:* Octet , ensemble consisting of eight instruments or voices, or composition written for such an ensemble* Octet , 1793 composition by Ludwig van Beethoven* Octet , 1825 composition by Felix Mendelssohn...

) binary numbers (which is fixed width and simple, but inefficient), while in the UTF-8
UTF-8
UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...

 encoding, characters are encoded as 1 to 4 byte numbers (which is variable-width
Variable-width encoding
A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set for representation in a computer...

, hence more efficient but more complex, and backwards compatible with ASCII).

Code points are normally assigned to abstract characters. An abstract character is not a graphical glyph
Glyph
A glyph is an element of writing: an individual mark on a written medium that contributes to the meaning of what is written. A glyph is made up of one or more graphemes....

 but a unit of textual data. The precise appearance of the character depends on the font. However code points may also be left reserved for future assignment (most of the Unicode code space is unassigned), or given other designated functions.

Unicode text

A Unicode text file is not necessarily merely a sequence of code points encoded into 4 byte blocks. Instead, an encoding scheme is used to serialize a sequence of code points into a sequence of bytes. A number of such schemes exist, and these trade between space efficiency and ease of encoding. A variable number of bytes can be used for each character. For example, UTF-8
UTF-8
UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...

 maintains some compatibility with ASCII. Encoding schemes also take into account endianness
Endianness
In computing, the term endian or endianness refers to the ordering of individually addressable sub-components within the representation of a larger data item as stored in external memory . Each sub-component in the representation has a unique degree of significance, like the place value of digits...

, and may have the property of being a self-synchronizing code
Self-synchronizing code
In telecommunications, a self-synchronizing code is a line code in which the symbol stream formed by a portion of one code word, or by the overlapped portion of any two adjacent code words, is not a valid code word...

, meaning character boundaries can be found without having to read from the beginning of the string.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK