Collation



 
 


Collation is the assembly of written information into a standard order. One common type of collation is called alphabetisation, though collation is not limited to ordering letters of the alphabet. Collating lists of words or names into alphabetical order is the basis of most office filing systems, library catalogs and reference books.

Collation differs from classification in that classification is concerned with arranging information into logical categories, while collation is concerned with the ordering of those categories.

Advantages of sorted lists include:
  • one can easily find the first n elements (e.g. the 5 smallest countries) and the last n elements (e.g. the 3 largest countries)
  • one can easily find the elements in a given range (e.g. countries with an area between .. and .. square km)
  • one can easily search for an element, and conclude whether it is in the list, e.g. with the binary search algorithm or interpolation search, either automatically, or, roughly and perhaps unconsciously, manually.
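
The advantages listed above can be illustrated with a small sketch. It assumes Python and its standard bisect module; the list of areas holds made-up numbers used only for illustration, and the helper names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical areas in square km, already sorted in ascending order.
areas = [2, 61, 160, 316, 468, 2586, 9251, 41285, 83879]

def smallest(sorted_items, n):
    """First n elements of an ascending list."""
    return sorted_items[:n]

def largest(sorted_items, n):
    """Last n elements of an ascending list."""
    return sorted_items[-n:] if n else []

def in_range(sorted_items, low, high):
    """All elements x with low <= x <= high, located by two binary searches."""
    start = bisect_left(sorted_items, low)
    stop = bisect_right(sorted_items, high)
    return sorted_items[start:stop]

def contains(sorted_items, value):
    """Binary search: report whether value occurs in the list."""
    i = bisect_left(sorted_items, value)
    return i < len(sorted_items) and sorted_items[i] == value
```

None of these helpers scans the whole list; each relies on the list already being in order.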


A collation algorithm, e.g. the Unicode collation algorithm, differs from a sorting algorithm: the first is a process to define the order, which corresponds to the process of just comparing two values, while a sorting algorithm is a procedure to put a list of items in this order.
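
The distinction can be made concrete with a minimal sketch in Python. The comparison function below is a deliberately simple stand-in for a collation (it merely ignores case; it is not the Unicode collation algorithm), and the insertion sort is just one possible sorting procedure that accepts any comparison.

```python
from functools import cmp_to_key

def compare_ignoring_case(a, b):
    """Collation step: decide the relative order of two values.
    Returns a negative number, zero, or a positive number."""
    x, y = a.lower(), b.lower()
    return (x > y) - (x < y)

def insertion_sort(items, compare):
    """Sorting step: arrange a list using whatever comparison it is given."""
    result = []
    for item in items:
        pos = len(result)
        # Move left past every element that should come after the new item.
        while pos > 0 and compare(result[pos - 1], item) > 0:
            pos -= 1
        result.insert(pos, item)
    return result

words = ["pear", "Apple", "banana"]
by_hand = insertion_sort(words, compare_ignoring_case)
built_in = sorted(words, key=cmp_to_key(compare_ignoring_case))
```

Swapping in a different comparison changes the resulting order without touching the sorting procedure, and vice versa.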

Collation defines a total preorder on the set of possible items, typically by defining a total order on a sort key. Note however that in the case of e.g. numerical sorting of strings representing numbers, the strings are only totally preordered rather than totally ordered, because e.g. 2e3 and 2000 have the same ranking, as do 2 and 2.0. The numbers represented by the strings are totally ordered.
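
A short Python sketch of this point, assuming float() as the sort key: distinct strings can share a key, so the order among them is not determined by the collation itself.

```python
strings = ["2000", "17", "2e3", "2.0", "2"]

# float() is the sort key: distinct strings can map to the same number.
ordered = sorted(strings, key=float)

# Pairs of different strings that receive the same ranking.
ties = [(a, b) for a in strings for b in strings
        if a < b and float(a) == float(b)]
```

Python's sorted() is stable, so tied strings keep their input order; another sorting procedure could legitimately place them the other way round.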

Collation systems


Numerical sorting, sorting of single characters

One collation system is numerical sorting. For example, the list of numbers 4 · 17 · 3 · -5 collates to -5 · 3 · 4 · 17.

While this might appear to work only for numbers, computers can use this method for any textual information since computers internally use character sets which assign a numeric code point to each letter or glyph. For example, a computer using ASCII code (or any of its supersets such as Unicode) and numerical sorting would collate the list of characters a · b · C · d · $ to $ · C · a · b · d.

The numerical values that ASCII uses are $ = 36, a = 97, b = 98, C = 67, and d = 100, resulting in what is called "ASCIIbetical order".

This style of collation is commonly used, often with the refinement of converting uppercase letters to lowercase before comparing ASCII values, since most people do not expect capitalised words to jump to the head of the list.
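
As an illustration, the following Python sketch sorts the same five characters by code point and then with the lowercase refinement. It uses only built-in functions; the variable names are arbitrary.

```python
chars = ["a", "b", "C", "d", "$"]

# Sorting by numeric code point; ord() gives the code, e.g. ord("$") == 36.
by_code = sorted(chars, key=ord)

# Refinement: compare lowercase forms so capitals do not jump ahead.
case_folded = sorted(chars, key=str.lower)
```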

Alphabetical order

A collation system for multiple-character words is alphabetical order, based on the conventional order of letters in an alphabet or abjad (most of which have a single conventional order).

Each nth letter is compared with the nth letter of other words in the list, starting at the first letter of each word and advancing to the second, third, fourth, and so on, until the order is established.

The order of the Latin alphabet is A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z.



The principle behind extending alphabetical order to words (lexicographical order) is that all words in a list beginning with the same letter should be grouped together; within a grouping starting with a single letter, all words beginning with the same two letters should be grouped together; and so on, maximizing the number of common initial letters between adjacent words. The ordering principle is applied at the point where the letters differ. For instance, in the sequence:

Astrolabe
Astronomy
Astrophysics


The order of the words is determined by the first letter that differs between them, here the sixth letter (l, n and p respectively). Since n follows l in the alphabet, but precedes p, Astronomy comes after Astrolabe, but before Astrophysics.
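
The letter-by-letter comparison can be sketched in Python as follows. This is a minimal illustration that assumes unaccented Latin letters and compares lowercase forms; the function names are hypothetical.

```python
def compare_words(a, b):
    """Compare two words letter by letter; return -1, 0 or 1."""
    for x, y in zip(a.lower(), b.lower()):
        if x != y:
            # The first differing position decides the order.
            return -1 if x < y else 1
    # One word is a prefix of the other: the shorter one comes first.
    return (len(a) > len(b)) - (len(a) < len(b))

def first_difference(a, b):
    """Index of the first position where the two words differ, or None."""
    for i, (x, y) in enumerate(zip(a.lower(), b.lower())):
        if x != y:
            return i
    return None
```

For Astrolabe and Astronomy, first_difference reports index 5, i.e. the sixth letter.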

There has historically been some variation in the application of these rules. For instance, the prefixes Mc and M' in Irish and Scottish surnames were taken to be abbreviations for Mac, and alphabetized as if they were spelled out as Mac in full. Thus one might find in a catalog the sequence:

McKinley
Mackintosh


with McKinley preceding Mackintosh, as if it had been spelled "MacKinley". Since the advent of computer-sorted lists, this type of alphabetization is less frequently encountered, though it is still used in British phone books. A variation in alphabetical principles applies to names consisting of two words. In some cases, names with identical first words are all alphabetized together under the first word, e.g., grouping together all names beginning with San, all those beginning with Santa, and those beginning with Santo:

San
San Cristobal
San Juan
San Teodoro
San Tomas
Santa Barbara
Santa Clara
Santa Cruz
Santo Domingo


But in another system, the names are alphabetized as if they had no spaces, e.g. as follows:

San
San Cristobal
San Juan
Santa Barbara
Santa Clara
Santa Cruz
San Teodoro
Santo Domingo
San Tomas
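
Both systems can be reproduced with two different sort keys. The following Python sketch assumes case is ignored and that names contain only unaccented letters and spaces.

```python
names = ["Santo Domingo", "San Tomas", "Santa Cruz", "San", "San Juan",
         "Santa Clara", "San Teodoro", "Santa Barbara", "San Cristobal"]

# System 1: compare word by word, so every "San ..." precedes "Santa ...".
word_by_word = sorted(names, key=lambda s: s.lower().split())

# System 2: compare letter by letter as if the names had no spaces.
letter_by_letter = sorted(names, key=lambda s: s.lower().replace(" ", ""))
```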


The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, the digraph rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, also incorrect when using pre-1994 alphabetization.
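
A minimal Python sketch of the ñ problem follows. It assumes the post-1994 rule (ch and ll handled as ordinary two-letter sequences), precomposed ñ characters, and no accented vowels; the alphabet string and function name are illustrative choices, not a standard API.

```python
SPANISH_ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
RANK = {letter: i for i, letter in enumerate(SPANISH_ALPHABET)}

def spanish_key(word):
    """Rank each letter by its position in the alphabet string above.
    Characters not in the alphabet sort after all letters."""
    return [RANK.get(ch, len(SPANISH_ALPHABET)) for ch in word.lower()]

words = ["ola", "zorro", "ñu", "nube"]
by_code_point = sorted(words)                 # ñ (U+00F1) lands after z
by_alphabet = sorted(words, key=spanish_key)  # ñ follows n
```

A production system would normally rely on a locale-aware collation library rather than a hand-written table like this one.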

Similar differences between computer numeric sorting and alphabetic sorting occur in Danish and Norwegian (aa is ordered at the end of the alphabet when it is pronounced like å, and at the start of the alphabet when it is pronounced like a), German (ß is ordered as s + s; ä, ö, ü are ordered as a + e, o + e, u + e in phone books, but as the base letters a, o, u elsewhere, and after the base letter in Austria), Icelandic (ð follows d), Dutch (ij is sometimes ordered as y), English (æ is ordered as a + e), and many other languages.

Usually the spaces or hyphens between words are ignored.

Languages that use a syllabary or abugida instead of an alphabet (for example, Cherokee) can use approximately the same system if there is a set ordering for the symbols.

Radical-and-stroke sorting


Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as Chinese hanzi and Japanese kanji, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character for "mother" is sorted as a thirteen-stroke character under the three-stroke primary radical.
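
The grouping described above amounts to a two-part sort key. The Python sketch below uses a tiny hand-entered table (radical number in the traditional Kangxi list, then total stroke count); the choice of Kangxi numbering to order the radicals themselves is an assumption of this sketch, and a real system would draw on a full character database.

```python
# Hand-entered illustrative data: character -> (Kangxi radical number,
# total stroke count).
CHARACTER_DATA = {
    "休": (9, 6),    # radical 人 (person)
    "好": (38, 6),   # radical 女 (woman)
    "媽": (38, 13),  # radical 女 (woman); "mother"
    "林": (75, 8),   # radical 木 (tree)
}

def radical_stroke_key(character):
    """Group by primary radical, then order by stroke count."""
    return CHARACTER_DATA[character]

ordered = sorted(["林", "媽", "休", "好"], key=radical_stroke_key)
```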

The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are a few characters, all unambiguous. The choice of which components of a logograph constitute separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word Tōkyō, the Japanese name of Tokyo, can be sorted as if it were spelled out in the Japanese characters of the hiragana syllabary as "to-u-ki-yo-u", using the conventional sorting order for these characters.

Nevertheless, the radical-and-stroke system is the only practical method for constructing dictionaries that someone may use to look up a logograph whose pronunciation is unknown.

In addition, in Greater China, surname stroke ordering is a convention in some official documents where people's names are listed without hierarchy.

Multilingual ordering

When lists of names or words need to be ordered, but the context does not define a particular single language or alphabet, the Unicode Collation Algorithm provides a way to put them in sequence.

Complications


Conventions in typography and in sorting systems

In typography and in the writing of scientific articles etc., such things as headers, sections, lists and pages might use alphabetical numbering instead of numerical numbering. However, this does not always mean that the full alphabet of a particular language is used. Often alphabetical numbering, or enumeration, uses only a subset of the full alphabet. For example, the Russian alphabet has 33 letters, but typically only 28 are used in typographical enumeration (and, for instance, Ukrainian, Belarusian and Bulgarian Cyrillic enumeration shows similar features). Two Russian letters, Ъ and Ь, are only used for modifying the preceding consonants, so they naturally fall out. The last three could have been used, but mostly are not: Ы never begins a Russian word; Й almost never begins a word either, is perhaps too similar to И, and is also a relatively new character; Ё is also relatively new and much debated, and in proper alphabetical sorting words beginning with Ё are sometimes listed under Е. (These "rules" are of course relaxed, e.g. in phone catalogs, where foreign (non-Russian) names may frequently begin with Й or Ы.)

This alludes to a simple fact: alphabets are not only tools for writing. Letters are often kept in the alphabet of a language even though they are not used in writing, not least because they are used in alphabetical enumeration. For instance, X, W and Z are not used in writing the Norwegian language, except in loanwords and names. Still they are kept in the Norwegian alphabet, and used in alphabetical lists. Likewise, earlier versions of the Russian alphabet contained letters which had only two purposes: they were good for writing Greek words and for using the Greek counting system in its Cyrillic form.

Compound words and special characters

A complication in alphabetical sorting can arise due to disagreements over how groups of words (separated compound words, names, titles, etc.) should be ordered. One rule is to remove spaces for purposes of ordering, another is to consider a space as a character that is ordered before numbers and letters (this method is consistent with ordering by ASCII or Unicode code point), and a third is to order a space after numbers and letters. Given the strings "catch", "cattle" and "cat food" to alphabetize, the first rule produces "catch" "cat food" "cattle", the second "cat food" "catch" "cattle", and the third "catch" "cattle" "cat food". The first rule is used in many (but not all) dictionaries, the second in telephone directories (so that Wilson, Jim K appears with other people named Wilson, Jim and not after Wilson, Jimbo). The third rule is rarely used.
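
The three rules can be expressed as three sort keys. This Python sketch assumes plain ASCII words; mapping the space to the highest Unicode code point for the third rule is just one convenient way to make it compare after every letter and digit.

```python
words = ["catch", "cattle", "cat food"]

# Rule 1: remove spaces before comparing.
rule_1 = sorted(words, key=lambda s: s.replace(" ", ""))

# Rule 2: a space sorts before digits and letters (plain code point order).
rule_2 = sorted(words)

# Rule 3: a space sorts after digits and letters; here it is mapped to the
# highest code point so that it compares greater than any other character.
rule_3 = sorted(words, key=lambda s: s.replace(" ", "\U0010FFFF"))
```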

A similar complication arises when special characters such as hyphens or apostrophes appear in words or names. Any of the same rules as above can be used in this case as well; however, the strict ASCII sorting no longer corresponds exactly to any of the rules.

Name/surname ordering

The telephone directory example sheds light on another complication. In cultures where family names are written after given names, it is usually still desired to sort by family name first. In this case, names need to be reordered to be sorted properly. For example, Juan Hernandes and Brian O'Leary should be sorted as Hernandes, Juan and O'Leary, Brian even if they are not written this way. Capturing this rule in a computer collation algorithm is difficult, and simple attempts will necessarily fail. For example, unless the algorithm has at its disposal an extensive list of family names, there is no way to decide if "Gillian Lucille van der Waal" is "van der Waal, Gillian Lucille", "Waal, Gillian Lucille van der", or even "Lucille van der Waal, Gillian".
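
To show why simple attempts fall short, here is a deliberately naive Python sketch that assumes the last word of a name is the family name. The function names are hypothetical. It handles the first two examples but can only ever produce one of the three readings of the third.

```python
def naive_family_name_key(full_name):
    """Assume the last word is the family name (often wrong)."""
    *given, family = full_name.split()
    return (family.lower(), " ".join(given).lower())

def as_listed(full_name):
    """Render the name the way the naive rule would file it."""
    *given, family = full_name.split()
    return f"{family}, {' '.join(given)}"

people = ["Brian O'Leary", "Juan Hernandes"]
ordered = sorted(people, key=naive_family_name_key)
```

Applied to "Gillian Lucille van der Waal", as_listed always files the name under "Waal", whether or not that is the intended family name.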

Abbreviations and common words

When abbreviations are used, it is sometimes desired to expand the abbreviations for sorting. In this case, "St. Paul" (sorted as "Saint Paul") comes before "Shanghai". To capture this behavior in a collation algorithm, a list of abbreviations is needed. It may be more practical in some cases to store two sets of strings, one for sorting and one for display. A similar problem arises when letters are replaced by numbers or special symbols in an irregular manner, for example 1337 for leet or the movie Se7en. In this case, proper sorting necessitates keeping two sets of strings.
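
One way to keep the two sets of strings is to derive a sort string from each display string. The Python sketch below is illustrative only: the abbreviation table and the override table are hypothetical and contain a single entry each, and "Serpico" is added to the examples from the text purely to make the reordering visible.

```python
# Hypothetical expansion table; a real one would be much longer.
ABBREVIATIONS = {"St.": "Saint"}
# Hypothetical per-item overrides for irregular spellings.
SORT_OVERRIDES = {"Se7en": "Seven"}

def sort_string(display):
    """Derive the string used for sorting from the displayed string."""
    if display in SORT_OVERRIDES:
        return SORT_OVERRIDES[display].lower()
    words = [ABBREVIATIONS.get(w, w) for w in display.split()]
    return " ".join(words).lower()

titles = ["Shanghai", "St. Paul", "Se7en", "Serpico"]
plain = sorted(titles)
expanded = sorted(titles, key=sort_string)
```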

In certain contexts, very common words (such as articles) at the beginning of a sequence of words are not considered for ordering, or are moved to the end. So "The Shining" is considered "Shining" or "Shining, The" when alphabetizing and therefore is ordered before "Summer of Sam". This rule is fairly easy to capture in an algorithm, but many programs rely instead on simple lexicographic ordering. One fairly quaint exception to this rule is the flying of the flag of The Former Yugoslav Republic of Macedonia at the United Nations between those of Thailand and Timor Leste.
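
A minimal Python sketch of the article rule follows. It assumes English articles only, and the function names are hypothetical.

```python
ARTICLES = ("the", "a", "an")  # English only; an assumption of this sketch

def split_article(title):
    """Return (article, rest); article is '' when there is none."""
    first, _, rest = title.partition(" ")
    if rest and first.lower() in ARTICLES:
        return first, rest
    return "", title

def title_key(title):
    """Sort key that ignores a leading article."""
    return split_article(title)[1].lower()

def filed_form(title):
    """Move a leading article to the end, e.g. for an index entry."""
    article, rest = split_article(title)
    return f"{rest}, {article}" if article else rest

titles = ["Summer of Sam", "The Shining"]
ordered = sorted(titles, key=title_key)
```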

Sorting of numbers

Ascending order of numbers differs from alphabetical order, e.g. 11 comes alphabetically before 2. This can be fixed with leading zeros: 02 comes alphabetically before 11. See e.g. ISO 8601.

Also -13 comes alphabetically after -12 although it is less. With negative numbers, to make ascending order correspond with alphabetical sorting, more drastic measures are needed such as adding a constant to all numbers to make them all positive.
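
Both fixes can be sketched in a few lines of Python. The width and offset values are arbitrary choices for this illustration and must be large enough for the data at hand.

```python
def padded(n, width=3):
    """Zero-pad a non-negative integer to a fixed width."""
    return str(n).zfill(width)

def shifted(n, offset=500, width=3):
    """Add a constant so every value is non-negative, then zero-pad.
    Assumes -offset <= n < 10**width - offset."""
    return str(n + offset).zfill(width)

as_text = sorted(["11", "2"])                 # alphabetical order
fixed = sorted(padded(n) for n in [11, 2])    # leading zeros restore it
negatives_as_text = sorted(["-13", "-12"])    # alphabetical order
negatives_fixed = sorted([-12, 11, -13, 2], key=shifted)
```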

Numerical sorting of strings

Sometimes, it is desired to order text with embedded numbers using proper numerical order. For example, "Figure 7b" goes before "Figure 11a", even though '7' comes after '1' in Unicode. This can be extended to Roman numerals. This behavior is not particularly difficult to produce as long as only integers are to be sorted, although it can slow down sorting significantly. For example, Windows XP does this when sorting file names.
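
For the integer case, a common approach is to split each string into text and digit runs and compare the digit runs as numbers. The Python sketch below is one such key function; it is not a description of how any particular operating system implements this.

```python
import re

def natural_key(text):
    """Split into alternating text and digit runs; digit runs become ints.
    re.split with a capturing group always alternates text, digits, text,
    so equal positions in two keys hold values of the same type."""
    parts = re.split(r"(\d+)", text)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

labels = ["Figure 11a", "Figure 7b", "Figure 7a"]
by_code_point = sorted(labels)
natural = sorted(labels, key=natural_key)
```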

Sorting decimals properly is a bit more difficult, due to the fact that different locales use different symbols for a decimal point, and sometimes the same character used as a decimal point is also used as a separator, for example "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application dependent.

Alphabetical sorting of numbers

When numbers are used as names, rather than for their numerical properties, it is common to sort them alphabetically as they would be spelled. For example, the movie 1776 would be between Seve Ballesteros and Severus Snape. If a number is in a foreign term, it is alphabetized as it would be spelled in that language; for example, 24 heures du Mans would be between Vinge's Singularity and Vinh Airport, reflecting the French "vingt quatre".
