Precomposed character - AbsoluteAstronomy.com

A precomposed character is a Unicode entity that can be defined as a combination of two or more other characters. A precomposed character may typically represent a letter with a diacritical mark, such as é (Latin small letter e with acute accent). Technically, é (U+00E9) is a character that can be decomposed into an equivalent string of the base letter e (U+0065) and combining acute accent (U+0301). Similarly, ligatures are precompositions of their constituent letters or graphemes.
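The decomposition described above can be reproduced with a short sketch. It assumes Python and its standard `unicodedata` module, neither of which is mentioned in the article; the helper name `code_points` is purely illustrative.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single precomposed code point
decomposed = "e\u0301"   # base letter e followed by combining acute accent

def code_points(text):
    """Return the code points of text in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# NFD (canonical decomposition) splits the precomposed character.
nfd = unicodedata.normalize("NFD", precomposed)
# NFC (canonical composition) recombines the decomposed sequence.
nfc = unicodedata.normalize("NFC", decomposed)

print(code_points(nfd))  # ['U+0065', 'U+0301']
print(code_points(nfc))  # ['U+00E9']

# The character database also records the mapping directly.
print(unicodedata.decomposition(precomposed))  # 0065 0301
```

Both normalization forms are defined by Unicode equivalence, so either direction yields text that is considered canonically equivalent to the input.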

Precomposed characters are the legacy solution for representing many special letters in various character sets. In Unicode they are included primarily to aid computer systems with incomplete Unicode support, where equivalent decomposed characters may render incorrectly.

Comparing precomposed and decomposed characters

In the following example, the common Swedish surname Åström is written in two alternative ways: the first with a precomposed Å (U+00C5) and ö (U+00F6), and the second using a decomposed base letter A (U+0041) with a combining ring above (U+030A) and an o (U+006F) with a combining diaeresis (U+0308). To illustrate the difference, the precomposed characters are here displayed in green and the decomposed base letters in black; depending on your browser, the decomposed combining diacritics may be shown in orange or black.
Except for the different colors, the two solutions are equivalent and should render identically. In practice, however, some Unicode implementations still have difficulties with decomposed characters. In the worst case, combining diacritics may be disregarded or rendered as unrecognized characters after their base letters, as they are not included in all fonts. To overcome the problems, some applications may simply attempt to replace the decomposed characters with the equivalent precomposed characters.
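The two spellings of Åström can be compared programmatically. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the function name `to_precomposed` is hypothetical, and NFC normalization is only one plausible way an application might perform the replacement mentioned above.

```python
import unicodedata

# Precomposed: Å (U+00C5) and ö (U+00F6) are single code points.
precomposed = "\u00c5str\u00f6m"
# Decomposed: A + combining ring above, o + combining diaeresis.
decomposed = "A\u030astro\u0308m"

# As raw code point sequences the two strings differ.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 6 8

def to_precomposed(text):
    """Replace decomposed sequences with precomposed characters where they exist."""
    return unicodedata.normalize("NFC", text)

# After composition the strings are identical.
print(to_precomposed(decomposed) == precomposed)  # True
```

Normalizing both sides before comparing is a common way to keep canonically equivalent strings from being treated as different.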

With an incomplete font, however, precomposed characters may also be problematic, especially if they are more exotic, as in the following example (showing the reconstructed Proto-Indo-European word for 'dog'):

In some situations, the precomposed green k, u and o with diacritics may render as unrecognized characters, or their typographical appearance may be very different from the final letter n with no diacritic. On the second line, the base letters should at least render correctly even if the combining diacritics could not be recognized.
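The fallback behaviour of decomposed text can be imitated in code. The sketch below assumes Python's standard `unicodedata` module and uses the hypothetical helper `base_letters` to mimic a renderer that ignores every combining mark; it reuses the surname Åström rather than the Proto-Indo-European word, whose code points the text does not list.

```python
import unicodedata

def base_letters(text):
    """Decompose text and drop all combining marks, keeping only base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for characters that are not combining marks.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

print(base_letters("\u00c5str\u00f6m"))    # Astrom
print(base_letters("A\u030astro\u0308m"))  # Astrom
```

A precomposed character missing from a font has no such fallback unless the software first decomposes it, which is the asymmetry the paragraph above describes.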

OpenType

OpenType has the ccmp "feature tag" to define glyphs that are compositions or decompositions involving combining characters.

Chinese characters

In theory, most Chinese characters as encoded by Han unification and similar schemes could be treated as precomposed characters, since they can be reduced (decomposed) to their constituent strokes and ideograph descriptions. Unicode does not take this approach, which would be on the cutting edge of text storage and layout. Such an approach could potentially reduce the number of characters in the character set from tens of thousands to just a few hundred. On the other hand, a character set encoded in this way would also produce documents roughly tenfold larger in bytes to represent the same characters as Unicode.
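The size trade-off can be illustrated with a small sketch in Python. The character 好 and its description ⿰女子 are chosen here for illustration and do not come from the article. The description works at the level of components rather than individual strokes, so a true stroke-level encoding would be longer still.

```python
# A single unified Han character: 好 (U+597D).
single = "\u597d"
# An ideographic description of the same character:
# ⿰ (U+2FF0, left-to-right composition) + 女 (U+5973) + 子 (U+5B50).
described = "\u2ff0\u5973\u5b50"

single_bytes = len(single.encode("utf-8"))
described_bytes = len(described.encode("utf-8"))

print(single_bytes)     # 3
print(described_bytes)  # 9
```

Even this shallow component-level description triples the storage for one character in UTF-8, which suggests how a stroke-level scheme could arrive at a much larger multiple.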

External links

Free Idg Serif, a derivative of the FreeSerif font with added declarations of precomposed characters.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
