Punycode - AbsoluteAstronomy.com

In computing, Punycode is an instance of a general encoding syntax (Bootstring) by which a string of Unicode

Unicode

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...

characters is transformed uniquely and reversibly into a smaller, restricted character set

Character encoding

A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data through telecommunication networks or storage of text in...

.

Punycode is intended for the encoding of labels in the Internationalized Domain Names in Applications

Internationalized domain name

An internationalized domain name is an Internet domain name that contains at least one label that is displayed in software applications, in whole or in part, in a language-specific script or alphabet, such as Arabic, Chinese, Russian, Hindi or the Latin alphabet-based characters with diacritics,...

(IDNA) framework, such that these domain names may be represented in the ASCII

ASCII

The American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text...

character set allowed in the Domain Name System

Domain name system

The Domain Name System is a hierarchical distributed naming system for computers, services, or any resource connected to the Internet or a private network. It associates various information with domain names assigned to each of the participating entities...

of the Internet

Internet

The Internet is a global system of interconnected computer networks that use the standard Internet protocol suite to serve billions of users worldwide...

. The encoding syntax is defined in IETF document RFC 3492.

The IDNA methodology encodes only select label components of domain names with a procedure called ToASCII. The procedure ToUnicode decodes the DNS label into Unicode representation.

Encoding procedure

This section demonstrates the procedure for Punycode encoding, using the example of the string "bücher" (German

German language

German is a West Germanic language, related to and classified alongside English and Dutch. With an estimated 90 – 98 million native speakers, German is one of the world's major languages and is the most widely-spoken first language in the European Union....

for books), which is translated into the label "bcher-kva".

Separation of ASCII characters

First, all basic (ASCII

ASCII

) characters in the string are copied directly from input to output, skipping over other characters (e.g., "bücher" → "bcher"). If one or more non-basic characters were skipped, an ASCII hyphen is added to the output next (e.g., "bücher" → "bcher-"). Since it is a basic character, the ASCII hyphen may still appear in the string before this additional character, but the addition does not cause ambiguity—no later part of the encoding process will introduce another "-", so the last "-" (if any) is always the one that signifies the end of the basic characters.

Encoding of non-ASCII character insertions as code numbers

The next part of the encoding process first requires an understanding of the decoder, which is a finite-state machine with two state variables i and n. i is an index into the string ranging from zero (representing a potential insertion at the start) to the current length of the extended string (representing a potential insertion at the end).

i starts at zero while n starts at 128 (the first non-ASCII code point). The state progression is a monotonic function

Monotonic function

In mathematics, a monotonic function is a function that preserves the given order. This concept first arose in calculus, and was later generalized to the more abstract setting of order theory....

. A state change either increments i or if i is at its maximum resets i to zero and increments n. At each state change either the code point denoted by n is inserted or it is not inserted.

The code numbers generated by the encoder represent how many possibilities the decoder should skip before an insertion is made. "ü" has code point 252. So before we get to the possibility of inserting ü in position one it is necessary to skip over six (there are five characters in "bcher" giving six insertion positions) potential insertions of each of the 124 preceding non-ASCII code points (252 - 128, the upper limit of ASCII) and one possible insertion (at position zero) of code point 252. That is why it is necessary to tell the decoder to skip a total of (6 × 124) + 1 = 745 possible insertions before getting to the one required.

Re-encoding of code numbers as ASCII sequences

Punycode uses generalized variable-length integers to represent these values. For example, this is how "kva" is used to represent the code number 745:

A number system with little-endian ordering is used which allows variable-length codes without separate delimiters: a digit lower than a threshold value marks that it is the most-significant digit, hence the end of the number. The threshold value depends on the position in the number and also on previous insertions, to increase efficiency. Correspondingly the weights of the digits vary.

In this case a number system with 36 digits is used, with the case-insensitive 'a' through 'z' equal to the numbers 0 through 25, and '0' through '9' equal to 26 through 35. Thus "kva", corresponds to "10 21 0".

To decode this string of digits, the threshold starts out as 1 and the weight is 1. The first digit is the units digit; 10 with a weight of 1 equals 10. After this, the threshold value is adjusted. For the sake of simplicity, let's assume it is now 2. The second digit has a weight of 36 minus the previous threshold value, in this case, 35. Therefore the sum of the first two "digits" is 10 × 1 + 21 × 35. Since the second "digit" is not less than the threshold value of 2, there is more to come. The weight for the third "digit" is the previous weight times 36 minus the new threshold value; 35 × 34. The third "digit" in this example is 0, which is less than 2, meaning that it is the last (most significant) part of the number. Therefore "kva" represents the number 10 × 1 + 21 × 35 + 0 × 35 × 34 = 745.

The threshold itself is determined by an algorithm keeping it between 1 and 26 inclusive, meaning the last character of an encoding will always be alphabetic. The case can then be used to provide information about the original case of the string.

For the insertion of a second special character in "bücher", the first possibility is "büücher" with code "bcher-kvaa", the second "bücüher" with code "bcher-kvab", etc. After "bücherü" with code "bcher-kvae" comes "ýbücher" with code "bcher-kvaf", etc.

To make the encoding and decoding algorithms simple, no attempt has been made to prevent some encoded values from encoding inadmissible Unicode values: however, these should be checked for and detected during decoding.

Punycode is designed to work across all scripts, and to be self-optimizing by attempting to adapt to the character set ranges within the string as it operates. It is optimized for the case where the string is composed of zero or more ASCII characters and in addition characters from only one other script system, but will cope with any arbitrary Unicode string. Note that for DNS use, the domain name string is assumed to have been normalized using Nameprep

Nameprep

Nameprep is the process of case-folding to lowercase and removal of some generally invisible code points before it is suitable to represent a domain name, or other such canonical name...

and (for top-level domain

Top-level domain

A top-level domain is one of the domains at the highest level in the hierarchical Domain Name System of the Internet. The top-level domain names are installed in the root zone of the name space. For all domains in lower levels, it is the last part of the domain name, that is, the last label of a...

s) filtered against an officially registered language table before being punycoded, and that the DNS protocol sets limits on the acceptable lengths of the output Punycode string.

External links

Punycode encoding and decoding
C source code from above paper
ICU IDNA Demonstration An online demonstration of how ICU
International Components for Unicode
International Components for Unicode is an open source project of mature C/C++ and Java libraries for Unicode support, software internationalization and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all...

performs IDN operations
List of TLDs considered by the Mozilla developers to have an effective anti-spoofing policy for name registration
IDN and Punycode in IE7
Simple Punycode converter

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.