Canonicalization

In computer science, canonicalization (abbreviated c14n, where 14 represents the number of letters between the C and the N; also sometimes called standardization or normalization) is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.

Web servers

Canonicalization of filenames is important for computer security. For example, a web server may have a security rule stating "only execute files under the cgi directory (C:\inetpub\wwwroot\cgi-bin)". The rule is enforced by checking that the path starts with "C:\inetpub\wwwroot\cgi-bin\", and if it does, the file is executed.

Should the file "C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe" be executed? No, because this trick path goes back up the directory hierarchy (through use of the '..' path specifier), not staying within cgi-bin. Accepting it at face value would be an error due to failure to canonicalize the filename to the unique (simplest) representation, namely "C:\Windows\System32\cmd.exe", before doing the path check. This type of fault is called a directory traversal vulnerability.
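
The check described above can be sketched in Python as follows. This is a minimal illustration and not the behavior of any particular web server: the names `canonicalize_path`, `is_under_cgi_root`, and `CGI_ROOT` are assumptions made for the example, and the standard `ntpath` module is used so that the Windows-style paths are handled the same way on any platform. The normalization is purely lexical, so it does not account for symbolic links or other file-system aliases.

```python
import ntpath

# Directory from the example rule; the constant name is hypothetical.
CGI_ROOT = r"C:\inetpub\wwwroot\cgi-bin"


def canonicalize_path(path: str) -> str:
    """Collapse '.' and '..' components and fold case (lexical only)."""
    return ntpath.normcase(ntpath.normpath(path))


def is_under_cgi_root(requested_path: str, root: str = CGI_ROOT) -> bool:
    """Apply the prefix rule to the canonical form, not the raw string."""
    canonical = canonicalize_path(requested_path)
    canonical_root = canonicalize_path(root)
    # Require the separator so that "cgi-bin-old" does not match "cgi-bin".
    return canonical.startswith(canonical_root + "\\")


trick = r"C:\inetpub\wwwroot\cgi-bin\..\..\..\Windows\System32\cmd.exe"
honest = r"C:\inetpub\wwwroot\cgi-bin\hello.exe"

print(canonicalize_path(trick))
print(is_under_cgi_root(trick), is_under_cgi_root(honest))
```

In this sketch the trick path is reduced to its canonical form before the prefix comparison, so it is rejected, while a path that really lies under cgi-bin is accepted.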

Unicode

Variable-length encodings in the Unicode standard, in particular UTF-8, can give a character more than one possible byte sequence. The UTF-8 definition allows only the shortest sequence for each character, but a decoder that also accepts longer ("overlong") sequences effectively gives most common characters several encodings. This makes string validation more complicated, since every possible encoding of each string character must be considered. Software that does not consider all of these encodings runs the risk of accepting strings that the application design considers invalid, which could cause bugs or allow attacks. The solution is to allow a single encoding for each character. Canonicalization is then the process of translating every string character to its single allowed encoding. An alternative is for software to determine whether a string is canonicalized and reject it if it is not. In that case, in a client/server context, canonicalization is the responsibility of the client.
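
The "reject if not canonical" alternative can be illustrated with a short Python sketch. The helper names `is_canonical_utf8` and `lenient_two_byte_decode` are assumptions made for this example. The first relies on Python's built-in strict UTF-8 decoder, which refuses overlong sequences (along with any other invalid input); the second is a deliberately naive decoder, written only to show how a careless implementation could read a two-byte sequence as the "/" character.

```python
def is_canonical_utf8(raw: bytes) -> bool:
    """Return True only if raw is valid UTF-8.

    Python's strict decoder rejects overlong sequences, so a byte string
    that passes uses the single shortest encoding for every character.
    """
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def lenient_two_byte_decode(raw: bytes) -> str:
    """Hypothetical naive decoder for one two-byte sequence.

    It combines the payload bits without checking that the result
    actually needed two bytes, which is the mistake described above.
    """
    first, second = raw
    return chr(((first & 0x1F) << 6) | (second & 0x3F))


# Overlong two-byte form of "/" (U+002F); the canonical form is b"/".
OVERLONG_SLASH = b"\xc0\xaf"

print(lenient_two_byte_decode(OVERLONG_SLASH))  # the naive decoder sees "/"
print(is_canonical_utf8(OVERLONG_SLASH))        # the strict check rejects it
print(is_canonical_utf8(b"/"))                  # the canonical form passes
```

A filter that searches the raw bytes for "/" would miss the overlong form, which is why validation should happen on the canonical encoding or reject non-canonical input outright.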

Search engines and SEO

In web search and search engine optimization (SEO), URL canonicalization deals with web content that has more than one possible URL. Having multiple URLs for the same web content can cause problems for search engines, specifically in determining which URL should be shown in search results.

Example:
  • http://wikipedia.com
  • http://www.wikipedia.com
  • http://www.wikipedia.com/
  • http://www.wikipedia.com/?source=asdf


All of these URLs point to the homepage of Wikipedia, but a search engine will consider only one of them to be the canonical form of the URL.
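
As a rough illustration of how the four URLs above could be mapped to one form, the following Python sketch uses the standard `urllib.parse` module. The rules it applies (prefer the "www." host, use "/" for an empty path, drop a `source` query parameter) are assumptions chosen to fit this example; they are not the rules of any real search engine, and the names `canonicalize_url` and `TRACKING_PARAMS` are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed not to change the content (hypothetical list).
TRACKING_PARAMS = {"source"}


def canonicalize_url(url: str) -> str:
    """Map equivalent URLs to one form under the assumed rules."""
    parts = urlsplit(url)
    # hostname is lower-cased by urlsplit; note that it also omits any port.
    host = parts.hostname or ""
    if not host.startswith("www."):
        host = "www." + host
    path = parts.path or "/"
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))


urls = [
    "http://wikipedia.com",
    "http://www.wikipedia.com",
    "http://www.wikipedia.com/",
    "http://www.wikipedia.com/?source=asdf",
]
print({canonicalize_url(u) for u in urls})
```

Under these assumed rules all four URLs collapse to a single string, so they can be compared or de-duplicated by simple equality.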

XML

A Canonical XML document is by definition an XML document that is in XML Canonical form, as defined by the Canonical XML specification. Briefly, canonicalization removes whitespace within tags, uses particular character encodings, sorts namespace references and eliminates redundant ones, removes XML and DOCTYPE declarations, and transforms relative URIs into absolute URIs.

Simple XML example: given two versions of the same XML:
  • "Data    Data"
  • "Data Data"

Note the extra spaces in the first sample. The canonicalized version of these two might be:
  • "DataData"

Note that the spaces are removed: this is one thing a canonicalizer does. A real canonicalizer may make other changes as well.

A full summary of canonicalization changes is listed below:
  • The document is encoded in UTF-8
  • Line breaks are normalized to #xA on input, before parsing
  • Attribute values are normalized, as if by a validating processor
  • Character and parsed entity references are replaced
  • CDATA sections are replaced with their character content
  • The XML declaration and document type declaration are removed
  • Empty elements are converted to start-end tag pairs
  • Whitespace outside of the document element and within start and end tags is normalized
  • All whitespace in character content is retained (excluding characters removed during line feed normalization)
  • Attribute value delimiters are set to quotation marks (double quotes)
  • Special characters in attribute values and character content are replaced by character references
  • Superfluous namespace declarations are removed from each element
  • Default attributes are added to each element
  • Fixup of xml:base attributes is performed
  • Lexicographic order is imposed on the namespace declarations and attributes of each element
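
Several of the changes listed above can be observed with Python's standard library. The sketch below uses `xml.etree.ElementTree.canonicalize` (available from Python 3.8), which implements the C14N 2.0 version of the specification, so it should not be read as covering every item in the list (for example, it does not add default attributes). The two sample documents are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical serializations of the same document. They differ in the
# XML declaration, attribute order, quote style, whitespace inside tags,
# and the empty-element form.
doc_a = "<?xml version=\"1.0\"?><root b=\"2\" a='1'><item/></root>"
doc_b = '<root   a="1"   b="2" ><item></item></root >'

canon_a = ET.canonicalize(doc_a)
canon_b = ET.canonicalize(doc_b)

print(canon_a)
print(canon_a == canon_b)
```

Because both inputs are reduced to the same canonical text, they can be compared byte for byte, which is what makes canonical forms useful for tasks such as checking two documents for equivalence.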

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.