Percent-encoding
Percent-encoding, also known as URL encoding, is a mechanism for encoding information in a Uniform Resource Identifier (URI) under certain circumstances. Although it is known as URL encoding, it is in fact used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). As such, it is also used in the preparation of data of the "application/x-www-form-urlencoded" media type, as is often used in the submission of HTML form data in HTTP requests.

Types of URI characters

The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding). Reserved characters are those characters that sometimes have special meaning. For example, forward slash characters are used to separate different parts of a URL (or more generally, a URI). Unreserved characters have no such meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each revision of specifications that govern URIs and URI schemes.
RFC 3986 section 2.2 Reserved Characters (January 2005)
! * ' ( ) ; : @ & = + $ , / ? # [ ]


RFC 3986 section 2.3 Unreserved Characters (January 2005)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~


Other characters in a URI must be percent-encoded.

Percent-encoding reserved characters

When a character from the reserved set (a "reserved character") has special meaning (a "reserved purpose") in a certain context, and a URI scheme says that it is necessary to use that character for some other purpose, then the character must be percent-encoded. Percent-encoding a reserved character involves converting the character to its corresponding byte value in ASCII and then representing that value as a pair of hexadecimal digits. The digits, preceded by a percent sign ("%"), are then used in the URI in place of the reserved character. (A non-ASCII character is typically converted to its byte sequence in UTF-8, and then each byte value is represented as above.)

The reserved character "/", for example, if used in the "path" component of a URI, has the special meaning of being a delimiter between path segments. If, according to a given URI scheme, "/" needs to be in a path segment, then the three characters "%2F" or "%2f" must be used in the segment instead of a raw "/".
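The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `percent_encode` and its `keep` parameter are hypothetical, and the sketch always uses UTF-8 to obtain bytes, which coincides with ASCII for the reserved characters.

```python
import string

# Unreserved set as listed in RFC 3986 section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")


def percent_encode(text: str, keep: str = "") -> str:
    """Percent-encode every character that is neither unreserved nor in `keep`.

    Hypothetical helper for illustration: `keep` names reserved characters
    that should stay literal because they serve their reserved purpose.
    """
    out = []
    for ch in text:
        if ch in UNRESERVED or ch in keep:
            out.append(ch)
        else:
            # Convert the character to bytes, then write each byte
            # as "%" followed by two hexadecimal digits.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


# A "/" that is data inside a single path segment must not stay raw.
segment = percent_encode("a/b")
print(segment)                         # a%2Fb
print("/".join(["docs", segment]))     # docs/a%2Fb
print(percent_encode("é"))             # %C3%A9
```

Encoding each path segment separately and only then joining the segments with a raw "/" keeps the delimiter and the data distinguishable.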
Reserved characters after percent-encoding
!    *    '    (    )    ;    :    @    &    =    +    $    ,    /    ?    #    [    ]
%21  %2A  %27  %28  %29  %3B  %3A  %40  %26  %3D  %2B  %24  %2C  %2F  %3F  %23  %5B  %5D


Reserved characters that have no reserved purpose in a particular context may also be percent-encoded but are not semantically different from those that are not.

In the "query" component of a URI (the part after a ? character), for example, "/" is still considered a reserved character but it normally has no reserved purpose, unless a particular URI scheme says otherwise. The character does not need to be percent-encoded when it has no reserved purpose.

URIs that differ only by whether a reserved character is percent-encoded or appears literally are normally not considered equivalent (that is, denoting the same resource) unless it can be determined that the reserved characters in question have no reserved purpose. This determination depends on the rules established for reserved characters by individual URI schemes.

Percent-encoding unreserved characters

Characters from the unreserved set never need to be percent-encoded.

URIs that differ only by whether an unreserved character is percent-encoded or appears literally are equivalent by definition, but URI processors, in practice, may not always recognize this equivalence. For example, URI consumers shouldn't treat "%41" differently from "A" or "%7E" differently from "~", but some do. For maximum interoperability, URI producers are discouraged from percent-encoding unreserved characters.
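A consumer that wants to recognize this equivalence can decode only those percent-encoded octets that correspond to unreserved characters before comparing URIs. The following is a minimal sketch of that idea; the function name `decode_unreserved` is hypothetical, and other percent-encoded octets are deliberately left untouched.

```python
import re
import string

# Unreserved set as listed in RFC 3986 section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.~")


def decode_unreserved(uri: str) -> str:
    """Replace %XX with its literal character only when that character is unreserved."""

    def replace(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        # Reserved characters and non-ASCII bytes stay percent-encoded.
        return ch if ch in UNRESERVED else match.group(0)

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)


print(decode_unreserved("/%7Euser/%41%2Fb"))   # /~user/A%2Fb
```

Note that "%2F" is preserved, since decoding a reserved character could change the meaning of the URI.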

Percent-encoding the percent character

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI.
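As a brief illustration, Python's standard `urllib.parse` module follows this rule. The sample strings are made up for the example, and `safe=""` is passed so that no character other than the unreserved ones is left literal.

```python
from urllib.parse import quote, unquote

literal = "100% sure"
encoded = quote(literal, safe="")
print(encoded)             # 100%25%20sure
print(unquote(encoded))    # 100% sure

# Encoding text that is already percent-encoded escapes its "%" again,
# so the result decodes back to the text "%2F" rather than to "/".
print(quote("%2F", safe=""))   # %252F
```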

Percent-encoding arbitrary data

Most URI schemes involve the representation of arbitrary data, such as an IP address or file system path, as components of a URI. URI scheme specifications should, but often don't, provide an explicit mapping between URI characters and all possible data values being represented by those characters.

Binary data

Since the publication of RFC 1738 in 1994 it has been specified that schemes that provide for the representation of binary data in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above. Byte value 0F (hexadecimal), for example, should be represented by "%0F", but byte value 41 (hexadecimal) can be represented by "A", or "%41". The use of unencoded characters for alphanumeric and other unreserved characters is typically preferred as it results in shorter URLs.
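The two byte values mentioned above can be checked with Python's standard library, used here only as one convenient illustration. `quote_from_bytes` leaves unreserved characters unencoded, and `unquote_to_bytes` accepts either spelling.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

data = bytes([0x0F, 0x41])   # byte values 0F and 41 (hexadecimal)

# 0x0F has no unreserved representation, 0x41 is the letter "A".
print(quote_from_bytes(data))         # %0FA

# Both spellings of byte 41 decode to the same data.
print(unquote_to_bytes("%0FA"))       # b'\x0fA'
print(unquote_to_bytes("%0F%41"))     # b'\x0fA'
```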

Character data

The procedure for percent-encoding binary data has often been extrapolated, sometimes inappropriately or without being fully specified, to apply to character-based data. In the World Wide Web's formative years, when dealing with data characters in the ASCII repertoire and using their corresponding bytes in ASCII as the basis for determining percent-encoded sequences, this practice was relatively harmless; it was just assumed that characters and bytes mapped one-to-one and were interchangeable. The need to represent characters outside the ASCII range, however, grew quickly, and URI schemes and protocols often failed to provide standard rules for preparing character data for inclusion in a URI. Web applications consequently began using different multi-byte, stateful, and other non-ASCII-compatible encodings as the basis for percent-encoding, leading to ambiguities and difficulty interpreting URIs reliably.

For example, many URI schemes and protocols based on RFCs 1738 and 2396 presume that the data characters will be converted to bytes according to some unspecified character encoding before being represented in a URI by unreserved characters or percent-encoded bytes. If the scheme does not allow the URI to provide a hint as to what encoding was used, or if the encoding conflicts with the use of ASCII to percent-encode reserved and unreserved characters, then the URI cannot be reliably interpreted. Some schemes fail to account for encoding at all, and instead just suggest that data characters map directly to URI characters, which leaves it up to implementations to decide whether and how to percent-encode data characters that are in neither the reserved nor unreserved sets.
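The ambiguity is easy to reproduce. In the sketch below, the character "é" and the choice of ISO 8859-1 (Latin-1) as the competing encoding are assumptions picked for illustration; the same character yields different percent-encoded sequences, and a consumer that guesses the wrong encoding cannot recover it.

```python
from urllib.parse import quote, unquote

text = "é"  # U+00E9, outside the ASCII range

as_utf8 = quote(text, encoding="utf-8")
as_latin1 = quote(text, encoding="latin-1")
print(as_utf8)     # %C3%A9
print(as_latin1)   # %E9

# A consumer assuming UTF-8 cannot interpret the Latin-1 form:
# the lone byte E9 is invalid UTF-8 and becomes U+FFFD.
print(unquote(as_latin1, encoding="utf-8", errors="replace") == "\ufffd")   # True

# Only the matching encoding recovers the original character.
print(unquote(as_latin1, encoding="latin-1") == text)                       # True
```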
Common characters after percent-encoding (ASCII or UTF-8 based)

Character    Percent-encoded form
<            %3C
>            %3E
~            %7E
.            %2E
"            %22
{            %7B
}            %7D
|            %7C
\            %5C
-            %2D
`            %60
_            %5F
^            %5E
%            %25
space        %20
newline      %0A or %0D or %0D%0A
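
The values in this table can be reproduced with a few lines of Python. This is a minimal sketch: the helper name percent_encode_every_byte is hypothetical, and it deliberately encodes every byte, whereas library encoders typically leave unreserved characters such as "~", ".", "-" and "_" untouched.

```python
def percent_encode_every_byte(text):
    """Percent-encode each UTF-8 byte of text, including unreserved characters."""
    return "".join("%{:02X}".format(byte) for byte in text.encode("utf-8"))

samples = ["<", ">", "~", ".", '"', "{", "}", "|", "\\", "-",
           "`", "_", "^", "%", " ", "\n", "\r", "\r\n"]

# Map each sample character (or newline sequence) to its encoded form.
table = {ch: percent_encode_every_byte(ch) for ch in samples}

for ch, encoded in table.items():
    print(repr(ch), encoded)
```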

Arbitrary character data is sometimes percent-encoded and used in non-URI situations, such as for password obfuscation programs, or other system-specific translation protocols.

Current standard

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
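
The rule can be sketched in Python as follows. The function name encode_data is hypothetical, and the sketch treats its input as a single data component, so reserved characters such as "/" and "?" are encoded as well.

```python
import string
from urllib.parse import quote

# Unreserved set from RFC 3986: ASCII letters, digits, "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def encode_data(text):
    """Keep unreserved characters; UTF-8 encode and percent-encode the rest."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)

text = "caf\u00e9 au lait"  # contains U+00E9 and two spaces
encoded = encode_data(text)
print(encoded)  # caf%C3%A9%20au%20lait

# The standard library produces the same result for this input
# when no characters are marked as safe.
print(encoded == quote(text, safe=""))  # True
```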

The current specification does not address what to do with character data that is already encoded. In computers, character data always exists in some encoded form at some level, so it could be treated either as binary data or as character data when it is mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice few, if any, actually do.

Non-standard implementations

There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The third edition of ECMA-262 still includes an escape(string) function that uses this syntax, but also an encodeURI(uri) function that converts to UTF-8 and percent-encodes each octet.
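
The difference between the two forms can be illustrated in Python. The helper legacy_u_escape is a hypothetical approximation of the legacy escape behavior, written for illustration only; the set of characters it leaves unescaped is an assumption based on the ECMA-262 description of escape.

```python
from urllib.parse import quote

# Characters assumed to be left untouched by the legacy escape() function.
UNESCAPED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./"
)

def legacy_u_escape(text):
    """Approximate the non-standard %uXXXX scheme, one UTF-16 code unit at a time."""
    out = []
    data = text.encode("utf-16-be")  # two bytes per code unit, no byte order mark
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        ch = chr(unit)
        if ch in UNESCAPED:
            out.append(ch)
        elif unit < 256:
            out.append("%{:02X}".format(unit))
        else:
            out.append("%u{:04X}".format(unit))
    return "".join(out)

euro = "\u20ac"  # EURO SIGN
print(legacy_u_escape(euro))  # %u20AC
print(quote(euro, safe=""))   # %E2%82%AC (UTF-8 octets, percent-encoded)
```

A decoder that follows RFC 3986 expects two hexadecimal digits after each "%", so it has no defined way to read the %u20AC form.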

The application/x-www-form-urlencoded type

When data that has been entered into HTML forms is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method GET or POST, or, historically, via email. The encoding used by default is based on a very early version of the general URI percent-encoding rules, with a number of modifications such as newline normalization and replacing spaces with "+" instead of "%20". The MIME type of data encoded this way is application/x-www-form-urlencoded, and it is currently defined (still in a very outdated manner) in the HTML and XForms specifications. In addition, the CGI specification contains rules for how web servers decode data of this type and make it available to applications.
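
A minimal Python sketch of this encoding follows. The helper encode_form and the field names are hypothetical; normalizing line breaks to CR LF is an assumption taken from the HTML 4.01 description of form content types, since the standard library's urlencode does not perform that step itself.

```python
import re
from urllib.parse import urlencode, parse_qs

def encode_form(fields):
    """Normalize line breaks to CR LF, then encode as x-www-form-urlencoded."""
    normalized = {
        name: re.sub(r"\r\n|\r|\n", "\r\n", value)
        for name, value in fields.items()
    }
    return urlencode(normalized)  # spaces become "+" by default

body = encode_form({"name": "Jane Doe", "note": "line 1\nline 2", "q": "a&b=c"})
print(body)
# name=Jane+Doe&note=line+1%0D%0Aline+2&q=a%26b%3Dc

# Decoding on the receiving side reverses both modifications.
decoded = parse_qs(body)
print(decoded["name"])  # ['Jane Doe']
```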

When sent in an HTTP GET request, application/x-www-form-urlencoded data is included in the query component of the request URI. When sent in an HTTP POST request or via email, the data is placed in the body of the message, and the name of the media type is included in the message's Content-Type header.
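
The two placements can be sketched in Python without sending anything over the network. The helper build_request, the URL http://example.com/search and the field names are all hypothetical.

```python
from urllib.parse import urlencode

FORM_TYPE = "application/x-www-form-urlencoded"

def build_request(method, url, fields):
    """Return (method, url, headers, body) for a form submission."""
    payload = urlencode(fields)
    if method == "GET":
        # GET: the encoded data becomes the query component of the URI.
        return method, url + "?" + payload, {}, b""
    # POST: the encoded data is the message body, labeled by Content-Type.
    return method, url, {"Content-Type": FORM_TYPE}, payload.encode("ascii")

fields = {"q": "percent encoding", "lang": "en"}
get_request = build_request("GET", "http://example.com/search", fields)
post_request = build_request("POST", "http://example.com/search", fields)

print(get_request[1])   # http://example.com/search?q=percent+encoding&lang=en
print(post_request[2])  # {'Content-Type': 'application/x-www-form-urlencoded'}
print(post_request[3])  # b'q=percent+encoding&lang=en'
```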

External links

The following specifications all discuss and define reserved characters, unreserved characters, and percent-encoding, in some form or other:
  • RFC 3986 / STD 66 (plus errata), the current generic URI syntax specification.
  • RFC 2396 (obsolete, plus errata) and RFC 2732 (plus errata) together comprised the previous version of the generic URI syntax specification.
  • RFC 1738 (mostly obsolete) and RFC 1808 (obsolete), which define URLs.
  • RFC 1630 (obsolete), the first generic URI syntax specification.
  • W3C Guidelines on Naming and Addressing: URIs, URLs, ...
  • W3C explanation of UTF-8 in URIs
  • W3C HTML form content types
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 