A numeric character reference (NCR) is a common markup construct used in SGML and other SGML-based markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represent a single character from the Universal Character Set (UCS) of Unicode. NCRs are typically used to represent characters that are not directly encodable in a particular document. When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.
Example
In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma ("Σ"):
&#931;
&#0931;
&#x3A3;
&#x03A3;
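As a quick check of the equivalence described here, the following sketch decodes several spellings of a reference to Sigma (U+03A3, decimal 931) with Python's standard `html.unescape`. The list `references` is an illustrative assumption, and Python's HTML-oriented decoder stands in for any markup-aware reader.

```python
import html

# Four spellings of the same reference: decimal and hexadecimal,
# each with and without a leading zero. U+03A3 is 931 in decimal.
references = ["&#931;", "&#0931;", "&#x3A3;", "&#x03A3;"]

# A markup-aware reader treats each reference as the character itself.
decoded = [html.unescape(ref) for ref in references]
print(decoded)
```

Leading zeros and the choice of base change only the spelling of the reference, not the character it denotes.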
Discussion
Markup languages are typically defined in terms of UCS or Unicode characters. That is, a document consists, at its most fundamental level of abstraction, of a sequence of characters, which are abstract units that exist independently of any encoding.
Ideally, when the characters of a document that uses a markup language are encoded as a sequence of bits for storage or transmission over a network, the encoding will be one that can represent every character in the document, if not the whole of Unicode, directly as a particular bit sequence.
Sometimes, though, for reasons of convenience or due to technical limitations, documents are encoded with an encoding that cannot represent some characters directly. For example, the widely used encodings based on ISO 8859 can represent at most 256 unique characters, each as one 8-bit byte.
In practice, documents are rarely allowed to use more than one encoding internally, so the onus is usually on the markup language to provide a means for document authors to express unencodable characters in terms of encodable ones. This is generally done through some kind of "escaping" mechanism.
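The following sketch illustrates the problem and one escaping mechanism, assuming a document that must be stored as ISO 8859-1. It uses Python's built-in `xmlcharrefreplace` codec error handler; the sample string and variable names are illustrative only.

```python
text = "\u03a3 is the Greek capital letter Sigma"

# ISO 8859-1 has no byte for U+03A3, so a strict encode fails.
try:
    text.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "xmlcharrefreplace" error handler substitutes a decimal NCR
# for each character the target encoding cannot represent.
escaped = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(strict_ok, escaped)
```

The escaped bytes contain only characters that ISO 8859-1 can represent, yet a markup-aware reader can still recover the original character.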
The SGML-based markup languages allow document authors to use special sequences of characters from the ASCII range (the first 128 code points of Unicode) to represent, or reference, any Unicode character, regardless of whether the character being represented is directly available in the document's encoding. These special sequences are character references.
Character references that are based on the referenced character's UCS or Unicode "code point" are called numeric character references. In HTML 4 and in all versions of XHTML and XML, the code point can be expressed either as a decimal (base 10) number or as a hexadecimal (base 16) number. The syntax is as follows:
Character U+0026 (ampersand), followed by character U+0023 (number sign), followed by one of the following choices:
one or more decimal digits zero (U+0030) through nine (U+0039); or
character U+0078 ("x") followed by one or more hexadecimal digits, which are zero (U+0030) through nine (U+0039), Latin capital letter A (U+0041) through F (U+0046), and Latin small letter a (U+0061) through f (U+0066);
all followed by character U+003B (semicolon). Older versions of HTML disallowed the hexadecimal syntax.
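The syntax above can be sketched as a regular expression. This is a minimal illustration, not a conforming parser: `NCR_PATTERN` and `expand_ncrs` are hypothetical names, and the function applies none of the per-language restrictions on which code points may be referenced.

```python
import re

# Hypothetical pattern mirroring the grammar described above:
# "&#", then decimal digits or "x" plus hex digits, then ";".
NCR_PATTERN = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")


def expand_ncrs(text: str) -> str:
    """Replace each syntactically valid NCR with its character."""

    def replace(match):
        decimal, hexadecimal = match.groups()
        code_point = int(decimal) if decimal else int(hexadecimal, 16)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Beyond U+10FFFF: leave the reference text untouched.
            return match.group(0)

    return NCR_PATTERN.sub(replace, text)


print(expand_ncrs("&#931; &#x3a3; &#x03A3;"))
```

Text that does not match the grammar, such as a reference missing its semicolon, is passed through unchanged in this sketch.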
The characters that make up a numeric character reference can be represented in every character encoding used in computing and telecommunications today, so there is no risk of the reference itself being unencodable.
In the markup languages SGML, HTML, XHTML and XML, a closely related construct is the character entity reference: a reference to a particular kind of named SGML entity that has been predefined or explicitly declared in a Document Type Definition, which allows a character to be referred to by a name instead of a number. (Naming a character creates a character entity.) HTML defines some character entities, but not many; all other characters can only be included by direct encoding or using NCRs.
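To make the relationship between the two kinds of reference concrete, the sketch below looks up the HTML entity named `Sigma` in Python's standard `html.entities.name2codepoint` table and compares the named and numeric forms. The variable names are illustrative.

```python
import html
from html.entities import name2codepoint

# The HTML entity "Sigma" is a name for code point U+03A3 (931).
code_point = name2codepoint["Sigma"]

# The named reference and the numeric reference built from the same
# code point decode to the same character.
by_name = html.unescape("&Sigma;")
by_number = html.unescape(f"&#{code_point};")
print(code_point, by_name, by_number)
```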
Restrictions
The Universal Character Set defined by ISO 10646 is the "document character set" of SGML and HTML 4, so by default, any character in such a document, and any character referenced in such a document, must be in the UCS.
While the syntax of SGML does not prohibit references to unassigned code points, SGML-derived markup languages such as HTML and XML can, and often do, restrict numeric character references to only those code points that are assigned to characters or that have not been permanently left unassigned.
Restrictions may also apply for other reasons. For example, in HTML 4, &#12;, which is a reference to a non-printing "form feed" control character, is allowed because a form feed character is allowed. But in XML, the form feed character cannot be used, not even by reference. As another example, &#128;, which is a reference to another control character, is not allowed to be used or referenced in either HTML or XML, but when used in HTML, it is usually not flagged as an error by web browsers, some of which attempt to interpret it as a reference to the character represented by code value 128 in the Windows-1252 encoding: "€", which actually should be represented as &#8364;. As a further example, prior to the publication of XML 1.0 Second Edition on October 6, 2000, XML 1.0 was based on an older version of ISO 10646 and prohibited using characters above U+FFFD, except in character data, thus making a reference like &#x10000; (U+10000) illegal. In XML 1.1 and newer editions of XML 1.0, such a reference is allowed, because the available character repertoire was explicitly extended.
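The cases discussed above (form feed, code value 128, and U+10000) can be probed with Python's standard library. This is only a sketch: `html.unescape` and `xml.etree.ElementTree` are assumed here to be representative of lenient HTML handling and of a current XML 1.0 parser, and `xml_accepts` is a hypothetical helper.

```python
import html
import xml.etree.ElementTree as ET

# Code value 128 in Windows-1252 is the euro sign, U+20AC (8364).
euro = bytes([128]).decode("cp1252")

# html.unescape applies the same browser-style remapping to &#128;.
lenient = html.unescape("&#128;")
correct = html.unescape("&#8364;")


def xml_accepts(reference: str) -> bool:
    """Return True if the XML parser accepts the reference as content."""
    try:
        ET.fromstring(f"<doc>{reference}</doc>")
        return True
    except ET.ParseError:
        return False


# Form feed (U+000C) is rejected in XML, even by reference, while a
# reference above U+FFFD is accepted by current parsers.
form_feed_ok = xml_accepts("&#12;")
supplementary_ok = xml_accepts("&#x10000;")
print(euro, lenient, correct, form_feed_ok, supplementary_ok)
```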
Markup languages also place restrictions on where character references can occur.