Unicode collation algorithm
The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Standard #10 (formerly Unicode Technical Report #10) that provides a customizable method for comparing two strings. These comparisons can then be used to collate or sort text in any writing system and language that can be represented with Unicode.
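The comparison can be illustrated with a minimal sketch in Python. It assumes a toy table that maps each character to three weights (primary for the base letter, secondary for accents, tertiary for case) and builds a sort key by listing all primary weights first, then all secondary, then all tertiary. The table, its weights, and the names sort_key and compare are invented for illustration and are not taken from the standard or its data files. The sketch also leaves out normalization, contractions, and variable weighting, which the full algorithm handles.

```python
# Toy collation element table: character -> (primary, secondary, tertiary).
# These weights are hypothetical and are NOT real Unicode collation data.
TOY_TABLE = {
    "a": (1, 1, 1),
    "A": (1, 1, 2),       # same letter as "a", differs only in case
    "\u00e1": (1, 2, 1),  # a with acute: same letter, differs in accent
    "b": (2, 1, 1),
    "B": (2, 1, 2),
    "c": (3, 1, 1),
}


def sort_key(text, table=TOY_TABLE):
    """Build a multi-level key: all primary weights, then secondary, then tertiary."""
    elements = [table[ch] for ch in text]
    key = []
    for level in range(3):
        key.extend(element[level] for element in elements)
        key.append(0)  # level separator, lower than any weight in the table
    return tuple(key)


def compare(left, right):
    """Return -1, 0, or 1 depending on how the two sort keys compare."""
    left_key, right_key = sort_key(left), sort_key(right)
    return (left_key > right_key) - (left_key < right_key)


words = ["b", "ab", "Ab", "\u00e1b", "a", "ac"]
print(sorted(words, key=sort_key))
```

Because every primary weight is compared before any secondary weight, a difference in base letters always outweighs a difference in accents, which in turn outweighs a difference in case.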

The same specification also defines the Default Unicode Collation Element Table (DUCET), a data file that specifies the default collation ordering. The DUCET can be customized, or tailored, for different languages. Some such tailorings can be found in the Common Locale Data Repository (CLDR).
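The idea of customizing a default table can be sketched as follows. This is a simplified illustration, not the DUCET or CLDR data format: the weights are hypothetical, only one level of weights is used, and the helper names tailor and make_key are assumptions introduced for the example. It models a Swedish-style ordering, in which "ä" is a separate letter sorted after "z", as an override applied to a default ordering in which "ä" sorts next to "a".

```python
# Hypothetical default weights (not real DUCET values): "ä" sorts next to "a".
DEFAULT_WEIGHTS = {"a": 10, "\u00e4": 11, "b": 20, "z": 30}


def tailor(base, overrides):
    """Return a copy of the base table with language-specific overrides applied."""
    table = dict(base)
    table.update(overrides)
    return table


def make_key(table):
    """Return a sort-key function that looks up each character in the table."""
    return lambda text: tuple(table[ch] for ch in text)


# Illustrative Swedish-style tailoring: move "ä" after "z".
SWEDISH_LIKE = tailor(DEFAULT_WEIGHTS, {"\u00e4": 31})

words = ["z", "\u00e4", "b", "a"]
default_order = sorted(words, key=make_key(DEFAULT_WEIGHTS))
tailored_order = sorted(words, key=make_key(SWEDISH_LIKE))
print(default_order)
print(tailored_order)
```

The default table is left unchanged, so several tailorings can be derived from the same base.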

An important open-source implementation of the UCA is included with International Components for Unicode (ICU). ICU also supports tailoring, and the collation tailorings from CLDR are included in ICU. The effects of tailoring, along with a large number of language-specific tailorings, can be seen in the online ICU Locale Explorer.

See also

  • Collation
  • ISO/IEC 14651
  • European ordering rules (EOR)
  • Common Locale Data Repository (CLDR)

External links and references


Tools

  • ICU Locale Explorer An online demonstration of the Unicode Collation Algorithm using International Components for Unicode
  • msort A sort program that provides an unusual level of flexibility in defining collations and extracting keys.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.