Unicode equivalence
Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting standard character sets, which often included similar or identical characters.

Unicode provides two such notions, canonical equivalence and compatibility. Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase 'n') followed by U+0303 (the combining tilde '◌̃') is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter 'ñ' of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.
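
Canonical equivalence of these two encodings of 'ñ' can be checked with, for example, Python's standard unicodedata module (a minimal sketch; any Unicode-aware library would serve equally well):

    import unicodedata

    decomposed = "n\u0303"    # 'n' followed by COMBINING TILDE
    precomposed = "\u00f1"    # 'ñ' as a single code point

    print(decomposed == precomposed)                                # False: different code points
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True: canonically equivalent
    print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True: decomposition round-trips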

Sequences that are defined as compatible are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the typographic ligature 'ff') is defined to be compatible, but not canonically equivalent, to the sequence U+0066 U+0066 (two Latin 'f' letters). Compatible sequences may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
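
The difference between the two notions can be observed directly: canonical normalization leaves the ligature intact, while compatibility normalization replaces it with the plain letters (a short sketch, again using Python's unicodedata):

    import unicodedata

    ligature = "\ufb00"   # LATIN SMALL LIGATURE FF

    print(unicodedata.normalize("NFC", ligature) == ligature)   # True: no canonical equivalent
    print(unicodedata.normalize("NFKC", ligature))              # 'ff': two separate letters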

The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one fully composed (where multiple code points are replaced by single points whenever possible), and one fully decomposed (where single points are split into multiple ones). Each of these four normal forms can be used in text processing.

Character duplication

For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the character 'Å' can be encoded as U+00C5 (standard name 'LATIN CAPITAL LETTER A WITH RING ABOVE', a letter of the alphabet in Swedish and several other languages) or as U+212B ('ANGSTROM SIGN'). Yet the symbol for angstrom is defined to be that Swedish letter, and most other symbols that are letters (like 'V' for volt) do not have a separate code point for each usage. In general, the code points of truly identical characters (which can be rendered in the same way in Unicode fonts) are defined to be canonically equivalent.
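
Because the two code points are canonically equivalent, normalization collapses them into a single representation (a short sketch with Python's unicodedata):

    import unicodedata

    angstrom_sign = "\u212b"   # ANGSTROM SIGN
    a_with_ring   = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE

    print(unicodedata.normalize("NFC", angstrom_sign) == a_with_ring)   # True
    print(unicodedata.normalize("NFD", angstrom_sign) == "A\u030a")     # True: 'A' + combining ring above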

Combining and precomposed characters

For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for 'ñ' or U+00C5 for 'Å') or as combinations of two or more characters (such as U+FB00 for the ligature 'ff' or U+0132 for the Dutch letter 'IJ').

For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding base character. Examples of these combining characters are the combining tilde and the Japanese diacritic dakuten ('◌゛', U+3099).
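
For instance, the hiragana syllable 'が' (ga) can be written either as the precomposed character U+304C or as 'か' (U+304B) followed by the combining dakuten; normalization converts between the two forms (a short sketch with Python's unicodedata):

    import unicodedata

    combined = "\u304b\u3099"   # か followed by combining dakuten

    print(unicodedata.normalize("NFC", combined) == "\u304c")   # True: composes to が
    print(unicodedata.normalize("NFD", "\u304c") == combined)   # True: decomposes back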

In the context of Unicode, character composition is the process of replacing the code points of a base letter followed by one or more combining characters with a single precomposed character; and character decomposition is the opposite process.

In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.

Typographic conventions

Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as ligatures, the half-width katakana characters, or the double-width Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in subscript or superscript positions, or the circled digits ① inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.
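
Compatibility normalization strips these presentational variants down to the plain characters, while canonical normalization leaves them untouched (a short sketch with Python's unicodedata):

    import unicodedata

    for ch in ["\u2460", "\u2075", "\uff76"]:   # ①, superscript ⁵, half-width katakana カ
        print(unicodedata.normalize("NFC", ch),    # unchanged
              unicodedata.normalize("NFKC", ch))   # '1', '5', 'カ'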

Normalization

The implementation of Unicode string searches and comparisons in text processing software must take into account the presence of equivalent code points. In the absence of this feature, users searching for a particular code point sequence would be unable to find other visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.

Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the representative element of an equivalence class, multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a canonical ordering on the code point sequence, which is necessary for the normal forms to be unique.

In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance, some typographic ligatures like U+FB03 (ffi), Roman numerals like U+2168 (Ⅸ), and even subscripts and superscripts, e.g. U+2075 (⁵), have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into its constituent letters, so a search for U+0066 (f) as a substring would succeed in an NFKC normalization of U+FB03 but not in an NFC normalization of U+FB03. Likewise when searching for the Latin letter I (U+0049) in the precomposed Roman numeral Ⅸ (U+2168). Similarly, the superscript '⁵' (U+2075) is transformed to '5' (U+0035) by compatibility mapping.
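
The effect on substring searching can be reproduced directly (a minimal sketch with Python's unicodedata):

    import unicodedata

    ligature = "\ufb03"   # LATIN SMALL LIGATURE FFI

    print("f" in unicodedata.normalize("NFC", ligature))    # False: the ligature is left intact
    print("f" in unicodedata.normalize("NFKC", ligature))   # True: decomposed to 'f', 'f', 'i'
    print("I" in unicodedata.normalize("NFKC", "\u2168"))   # True: Ⅸ becomes 'I' followed by 'X'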

Transforming superscripts into baseline equivalents may not be appropriate, however, for rich text software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains compatibility formatting tags that provide additional details on the compatibility transformation. In the case of typographic ligatures, this tag is simply <compat>, while for the superscript it is <super>. Rich text standards like HTML take the compatibility tags into account. For instance, HTML uses its own markup to position a U+0035 in a superscript position.
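
The raw decomposition field, including any formatting tag, can be read from the Unicode character database; Python exposes it through unicodedata.decomposition (a short sketch):

    import unicodedata

    print(unicodedata.decomposition("\ufb03"))   # '<compat> 0066 0066 0069'
    print(unicodedata.decomposition("\u2075"))   # '<super> 0035'
    print(unicodedata.decomposition("\u00f1"))   # '006E 0303' (canonical decomposition: no tag)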

Normal forms

The four Unicode normalization forms and the algorithms (transformations) for obtaining them are listed below.

  • NFD (Normalization Form Canonical Decomposition): characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
  • NFC (Normalization Form Canonical Composition): characters are decomposed and then recomposed by canonical equivalence.
  • NFKD (Normalization Form Compatibility Decomposition): characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
  • NFKC (Normalization Form Compatibility Composition): characters are decomposed by compatibility, then recomposed by canonical equivalence.


All these algorithms are idempotent transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm.
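
Idempotence is easy to verify for any particular string (a short sketch with Python's unicodedata):

    import unicodedata

    s = "n\u0303 \ufb03 \u212b"   # mixed sample: combining tilde, ffi ligature, angstrom sign
    for form in ("NFD", "NFC", "NFKD", "NFKC"):
        once = unicodedata.normalize(form, s)
        twice = unicodedata.normalize(form, once)
        print(form, once == twice)   # True for every form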

However, none of them is injective due to the unification of equivalent symbols and canonical reordering of the combining symbols. For example, the distinct Unicode strings "U+212B" (the angstrom sign 'Å') and "U+00C5" (the Swedish letter 'Å') are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter 'A' and combining ring above '◌̊'), which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter 'Å').

Also, none of the normal forms are closed under string concatenation, meaning that the concatenation of two strings in the same normal form may not itself be in that normal form. This happens, for example, when a base character at the end of the first string is modified by combining characters at the beginning of the second string.
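
For example, a base letter and a lone combining accent are each already in NFC, yet their concatenation is not (a short sketch with Python's unicodedata):

    import unicodedata

    a = "e"        # already in NFC
    b = "\u0301"   # COMBINING ACUTE ACCENT, also in NFC on its own

    joined = a + b
    print(unicodedata.normalize("NFC", joined) == joined)   # False: NFC of the result is 'é' (U+00E9)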

A single character that will get replaced by another under normalization can be identified in the Unicode tables by having a non-empty decomposition field but lacking a compatibility tag.
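
Such singleton mappings can be spotted programmatically: the decomposition field is non-empty but carries no angle-bracket tag (a short sketch with Python's unicodedata):

    import unicodedata

    print(unicodedata.decomposition("\u212b"))   # '00C5': ANGSTROM SIGN, canonical singleton, no tag
    print(unicodedata.decomposition("\u2126"))   # '03A9': OHM SIGN, canonical singleton, no tag
    print(unicodedata.decomposition("\ufb00"))   # '<compat> 0066 0066': tagged, so not a singleton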

Canonical ordering

The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be diacritics, even though in general some diacritics are not combining characters, and some combining characters are not diacritics.

Unicode assigns each character a combining class, which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a stable sorting algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are not considered equivalent.
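
The combining class of each character, and the effect of the reordering step, can be inspected directly. In the sketch below (Python's unicodedata), U+0307 (dot above) has class 230 and U+0323 (dot below) has class 220, so canonical ordering places the dot below first:

    import unicodedata

    s = "q\u0307\u0323"   # q + dot above (class 230) + dot below (class 220)

    print([unicodedata.combining(c) for c in s])                 # [0, 230, 220]
    print(unicodedata.normalize("NFD", s) == "q\u0323\u0307")    # True: lower class is sorted first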

For example, the character U+1EBF (ế), used in Vietnamese, has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent). The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.

Since not all combining sequences have a precomposed equivalent (the last one in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by combining characters' behavior.
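
Both points can be verified (a short sketch with Python's unicodedata):

    import unicodedata

    print(unicodedata.normalize("NFD", "\u1ebf") == "e\u0302\u0301")        # True: full decomposition
    print(unicodedata.normalize("NFC", "e\u0302\u0301") == "\u1ebf")        # True: composes back to ế
    print(unicodedata.normalize("NFC", "e\u0301\u0302") == "\u00e9\u0302")  # True: only é + circumflex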

Errors due to normalization differences

When two applications share Unicode data, but use different normal forms or use them incorrectly, errors and data loss can result. For example, Mac OS X has many components that prefer or require only decomposed characters (thus decomposed-only Unicode encoded with UTF-8 is also known as "UTF8-MAC"). In one specific instance, the combination of errors in OS X's handling of composed characters and the Samba file- and printer-sharing software (which replaces decomposed letters with composed ones when copying file names) has led to confusing and data-destroying interoperability problems. Applications may avoid such errors by preserving input code points, and only normalizing them to the application's preferred normal form for internal use.
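
A common defensive pattern is to keep code points exactly as received and normalize only transient copies used for comparison (a minimal sketch; the helper name is illustrative, not part of any standard API):

    import unicodedata

    def same_filename(a: str, b: str) -> bool:
        # Compare under one normal form without altering the stored strings.
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    decomposed  = "A\u030a" + "ngstro\u0308m.txt"   # Ångström.txt built with combining marks
    precomposed = "\u00c5ngstr\u00f6m.txt"          # Ångström.txt with precomposed letters
    print(same_filename(decomposed, precomposed))   # True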

See also

  • Unicode
  • Ligature (typography)
  • Diacritic
  • Precomposed character
  • Unicode compatibility characters
  • Complex text layout
