Unicode and HTML
Encyclopedia
Web pages authored using hypertext markup language (HTML
HTML
HyperText Markup Language is the predominant markup language for web pages. HTML elements are the basic building-blocks of webpages....

) may contain multilingual text represented with the Unicode universal character set.

The relationship between Unicode
Unicode
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...

 and HTML tends to be a difficult topic for many computer professionals, document authors, and web
World Wide Web
The World Wide Web is a system of interlinked hypertext documents accessed via the Internet...

 users alike. The accurate representation of text in web page
Web page
A web page or webpage is a document or information resource that is suitable for the World Wide Web and can be accessed through a web browser and displayed on a monitor or mobile device. This information is usually in HTML or XHTML format, and may provide navigation to other web pages via hypertext...

s from different natural language
Natural language
In the philosophy of language, a natural language is any language which arises in an unpremeditated fashion as the result of the innate facility for language possessed by the human intellect. A natural language is typically used for communication, and may be spoken, signed, or written...

s and writing system
Writing system
A writing system is a symbolic system used to represent elements or statements expressible in language.-General properties:Writing systems are distinguished from other possible symbolic communication systems in that the reader must usually understand something of the associated spoken language to...

s is complicated by the details of character encoding
Character encoding
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data through telecommunication networks or storage of text in...

, markup language
Markup language
A markup language is a modern system for annotating a text in a way that is syntactically distinguishable from that text. The idea and terminology evolved from the "marking up" of manuscripts, i.e. the revision instructions by editors, traditionally written with a blue pencil on authors' manuscripts...

 syntax, font
Typeface
In typography, a typeface is the artistic representation or interpretation of characters; it is the way the type looks. Each type is designed and there are thousands of different typefaces in existence, with new ones being developed constantly....

, and varying levels of support by web browser
Web browser
A web browser is a software application for retrieving, presenting, and traversing information resources on the World Wide Web. An information resource is identified by a Uniform Resource Identifier and may be a web page, image, video, or other piece of content...

s.

HTML document characters

Web pages are typically HTML
HTML
HyperText Markup Language is the predominant markup language for web pages. HTML elements are the basic building-blocks of webpages....

 or XHTML
XHTML
XHTML is a family of XML markup languages that mirror or extend versions of the widely-used Hypertext Markup Language , the language in which web pages are written....

 documents. Both types of documents consist, at a fundamental level, of character
Character (computing)
In computer and machine-based telecommunications terminology, a character is a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol, such as in an alphabet or syllabary in the written form of a natural language....

s, which are graphemes and grapheme-like units, independent of how they manifest in computer storage
Computer storage
Computer data storage, often called storage or memory, refers to computer components and recording media that retain digital data. Data storage is one of the core functions and fundamental components of computers....

 systems and network
Computer network
A computer network, often simply referred to as a network, is a collection of hardware components and computers interconnected by communication channels that allow sharing of resources and information....

s.

An HTML document is a sequence of Unicode characters. More specifically, HTML 4.0 documents are required to consist of characters in the HTML document character set: a character repertoire wherein each character is assigned a unique, non-negative integer code point. This set is defined in the HTML 4.0 DTD
Document Type Definition
Document Type Definition is a set of markup declarations that define a document type for SGML-family markup languages...

, which also establishes the syntax (allowable sequences of characters) that can produce a valid HTML document. The HTML document character set for HTML 4.0 consists of most, but not all, of the characters jointly defined by Unicode
Unicode
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...

 and ISO/IEC 10646: the Universal Character Set
Universal Character Set
The Universal Character Set , defined by the International Standard ISO/IEC 10646, Information technology — Universal multiple-octet coded character set , is a standard set of characters upon which many character encodings are based...

 (UCS).

Like HTML documents, an XHTML document is a sequence of Unicode characters. However, an XHTML document is an XML
XML
Extensible Markup Language is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards....

 document, which, while not having an explicit "document character" layer of abstraction
Abstraction
Abstraction is a process by which higher concepts are derived from the usage and classification of literal concepts, first principles, or other methods....

, nevertheless relies upon a similar definition of permissible characters that cover most, but not all, of the Unicode/UCS character definitions. The sets used by HTML and XHTML/XML are slightly different, but these differences have little effect on the average document author.

Regardless of whether the document is HTML or XHTML, when stored on a file system
File system
A file system is a means to organize data expected to be retained after a program terminates by providing procedures to store, retrieve and update data, as well as manage the available space on the device which contain it. A file system organizes data in an efficient manner and is tuned to the...

 or transmitted over a network, the document's characters are encoded as a sequence of bit
Bit
A bit is the basic unit of information in computing and telecommunications; it is the amount of information stored by a digital device or other physical system that exists in one of two possible distinct states...

 octet
Octet (computing)
An octet is a unit of digital information in computing and telecommunications that consists of eight bits. The term is often used when the term byte might be ambiguous, as there is no standard for the size of the byte.-Overview:...

s (byte
Byte
The byte is a unit of digital information in computing and telecommunications that most commonly consists of eight bits. Historically, a byte was the number of bits used to encode a single character of text in a computer and for this reason it is the basic addressable element in many computer...

s
) according to a particular character encoding. This encoding may either be a Unicode Transformation Format, like UTF-8
UTF-8
UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...

, that can directly encode any Unicode character, or a legacy encoding, like Windows-1252
Windows-1252
Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages...

, that cannot. However, even when using encodings that do not support all Unicode characters, the encoded document may make use of numeric character references. For example ☺ is used to indicate a smiling face character in the Unicode character set.

Character encoding

In order to support Unicode, a web page must have an encoding supporting Unicode. The most popular is UTF-8
UTF-8
UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...

, where the ASCII
ASCII
The American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text...

 characters, such as English letters, digits, and some other common characters are preserved unchanged against ASCII. This makes HTML code (such as <br> and </div>) unchanged compared to ASCII. Characters outside the ASCII range are stored in 2-4 bytes. It is also possible to use UTF-16 where most characters are stored as two bytes with varying endianness
Endianness
In computing, the term endian or endianness refers to the ordering of individually addressable sub-components within the representation of a larger data item as stored in external memory . Each sub-component in the representation has a unique degree of significance, like the place value of digits...

, which is supported by modern browsers but less commonly used. It is also possible to use a one-byte-per-character encoding and use numeric character references (see below) but this is less common.

Numeric character references

In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference
Numeric character reference
A numeric character reference is a common markup construct used in SGML and other SGML-related markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represent a single character from the Universal Character Set of Unicode...

: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal
Decimal
The decimal numeral system has ten as its base. It is the numerical base most widely used by modern civilizations....

 number for the Unicode code point, or a hexadecimal
Hexadecimal
In mathematics and computer science, hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen...

 number, in which case it must be prefixed by x. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.

For example, a Unicode code point like U+53F6, which corresponds to a particular Chinese character, has to be converted to a decimal number, preceded by &# and followed by ;, like this: &#21494;, which produces this: 叶 (if it doesn't look like a Chinese character, see Template:Special characters).

The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers—but they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example &#21494; instead of &#x53F6;).

Named character entities

In HTML there is a standard set of 252 named character entities for characters — some common, some obscure — that are either not found in certain character encodings or are markup sensitive in some contexts (for example angle brackets and quotation marks). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers.

Character entities can be included in an HTML document via the use of entity references, which take the form &EntityName;, where EntityName is the name of the entity. For example, &mdash;, much like &#8212; or &#x2014;, represents : the em dash character — like this — even if the character encoding used doesn't contain that character.

For the full list, see: List of XML and HTML character entity references.

Character encoding determination

In order to correctly process HTML, a web browser must ascertain which Unicode characters are represented by the encoded form of an HTML document. In order to do this, the web browser must know what encoding was used.

Encoding information

When a document is transmitted via a MIME
MIME
Multipurpose Internet Mail Extensions is an Internet standard that extends the format of email to support:* Text in character sets other than ASCII* Non-text attachments* Message bodies with multiple parts...

 message or a transport that uses MIME content types such as an HTTP response, the message may signal the encoding via a Content-Type header, such as Content-Type: text/html; charset=UTF-8. Other external means of declaring encoding are permitted but rarely used. If the document uses an Unicode encoding
Comparison of Unicode encodings
This article compares Unicode encodings. Two situations are considered: 8-bit-clean environments and environments that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in the standards and so...

, the encoding info might also be present in the form of a Byte order mark
Byte Order Mark
The byte order mark is a Unicode character used to signal the endianness of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream...

. Finally, the encoding can be declared via the HTML syntax. For the text/html serialisation then, as long as the page is encoded in an extension of ASCII
ASCII
The American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text...

 (such as UTF-8
UTF-8
UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...

, and thus, not if the page is using UTF-16), a meta element, like <meta http-equiv="content-type" content="text/html; charset=UTF-8"> or (starting with HTML5) <meta charset="UTF-8"/> can be used. For HTML pages serialized as XML, then declaration options is to either rely on the encoding default (which for XML documents is UTF-8), or to use an XML encoding declaration — se below. The meta attribute plays no role in HTML served as XML.

Encoding defaults

An encoding default applies when there is no external or internal encoding declaration and also no Byte order mark. While the encoding default for HTML pages served as XML is required to be UTF-8, the encoding default for a regular Web page (that is: for HTML pages serialized as text/html) varies depending on the localisation of the browser. For a system set up mainly for Western European languages, it will generally be Windows-1252. For the Russian locale, the default is typically Windows-1251
Windows-1251
Windows-1251 is a popular 8-bit character encoding, designed to cover languages that use the Cyrillic alphabet such as Russian, Bulgarian, Serbian Cyrillic and other languages...

. For a browser from a location where legacy multibyte character encodings are prevalent, some form of autodetection is likely to be applied.

Encoding trends

Because of the legacy of 8-bit text representations in programming language
Programming language
A programming language is an artificial language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs that control the behavior of a machine and/or to express algorithms precisely....

s and operating system
Operating system
An operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...

s and the desire to avoid burdening users with the need to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk and often do not even allow input of characters beyond a very limited range. Consequently many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. Misunderstandings, such as the belief that the encoding declaration affects a change in the actual encoding (whereas it is actually just a label that could be inaccurate), is also a reason for this editor attitude. Another factor contributing in the same direction, is the arrival of UTF-8 — which greatly diminishes the need for other encodings, and thus modern editors tends to default, as recommended by the HTML5 specifiction , to UTF-8.

Byte order mark/Unicode sniffing

For both serializations of HTML (content-type "text/html" and content/type "application/xhtml+xml"), the Byte order mark (BOM) is an effective way to transmit encoding information within an HTML document. For UTF-8, the BOM is optional, while it is a must for the UTF-16 and the UTF-32 encodings. (Note: UTF-16 and UTF-32 without the BOM are formally known under different names, they are different encodings, and thus needs some form of encoding declaration – see UTF-16BE, UTF-16LE, UTF-32LE and UTF-32BE.) The use of the BOM character (U+FEFF) means that the encoding automatically declares itself to any processing application. Processing applications need only look for an initial 0x0000FEFF, 0xFEFF or 0xEFBBBF in the byte steam to identify the document as UTF-32, UTF-16 or UTF-8 encoded respectively. No additional metadata mechanisms are required for these encodings since the byte-order mark includes all of the information necessary for processing applications. In most circumstances the byte-order mark character is handled by editing applications separately from the other characters so there is little risk of an author removing or otherwise changing the byte order mark to indicate the wrong encoding (as can happen when the encoding is declared in English/Latin script). If the document lacks a byte-order mark, the fact that the first non-blank printable character in an HTML document is supposed to be < (U+003C) can be used to determine a UTF-8/UTF-16/UTF-32 encoding.

Encoding overriding

Many HTML documents are served with inaccurate encoding information, or no encoding information at all. In order to determine the encoding in such cases, many browsers allow the user to manually select an encoding name from a list. They may also employ an encoding autodetection algorithm that works in concert with or — in the case of the BOM and in case of HTML served as XMLagainst the manual override.

For HTML documents which are text/html serialized, manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. But note that Internet Explorer, Chrome and Safari — for both XML and text/html serializations — do not permit the encoding to be overridden whenever the page includes the BOM.

For HTML documents serialized with the preferred XML label — application/xhtml+xml, manual encoding override is not permitted. To override the encoding of such an XML document would mean that that the document stopped being XML, as it is a fatal error for XML documents to have an encoding declaration with detectable errors. Currently, Gecko browsers such as Firefox, abide to this rule, whereas the bulk of the other common browsers that support HTML as XML, such as Webkit browsers (Chrome/Safari) do — against the XML specificaiton — allow the encoding of XHTML documents to be manually overridden.

Web browser support

Many browsers are only capable of displaying a small subset of the full Unicode repertoire. Here is how your browser displays various Unicode code points:
Character HTML char ref Unicode name What your browser displays
U+0041 &#65; or &#x41; Latin
Latin alphabet
The Latin alphabet, also called the Roman alphabet, is the most recognized alphabet used in the world today. It evolved from a western variety of the Greek alphabet called the Cumaean alphabet, which was adopted and modified by the Etruscans who ruled early Rome...

 capital letter A
A
A is the first letter and a vowel in the basic modern Latin alphabet. It is similar to the Ancient Greek letter Alpha, from which it derives.- Origins :...

A
U+00DF &#223; or &#xDF; Latin small letter Sharp S
ß
In the German alphabet, ß is a letter that originated as a ligature of ss or sz. Like double "s", it is pronounced as an , but in standard spelling, it is only used after long vowels and diphthongs, while ss is used after short vowels...

ß
U+00FE &#254; or &#xFE; Latin small letter Thorn
Thorn (letter)
Thorn or þorn , is a letter in the Old English, Old Norse, and Icelandic alphabets, as well as some dialects of Middle English. It was also used in medieval Scandinavia, but was later replaced with the digraph th. The letter originated from the rune in the Elder Fuþark, called thorn in the...

þ
U+0394 &#916; or &#x394; Greek
Greek alphabet
The Greek alphabet is the script that has been used to write the Greek language since at least 730 BC . The alphabet in its classical and modern form consists of 24 letters ordered in sequence from alpha to omega...

 capital letter Delta
Delta (letter)
Delta is the fourth letter of the Greek alphabet. In the system of Greek numerals it has a value of 4. It was derived from the Phoenician letter Dalet...

Δ
U+017D &#381; or &#x17D; Latin capital letter Z with caron (used in Central Europe) Ž
U+0419 &#1049; or &#x419; Cyrillic
Cyrillic alphabet
The Cyrillic script or azbuka is an alphabetic writing system developed in the First Bulgarian Empire during the 10th century AD at the Preslav Literary School...

 capital letter Short I
Short I
Short I is a letter of the Cyrillic alphabet. It is made of the Cyrillic letter И with a breve.Short I represents the palatal approximant , like the pronunciation of ⟨y⟩ in toy....

Й
U+05E7 &#1511; or &#x5E7; Hebrew
Hebrew alphabet
The Hebrew alphabet , known variously by scholars as the Jewish script, square script, block script, or more historically, the Assyrian script, is used in the writing of the Hebrew language, as well as other Jewish languages, most notably Yiddish, Ladino, and Judeo-Arabic. There have been two...

 letter Qof
ק
U+0645 &#1605; or &#x645; Arabic
Arabic alphabet
The Arabic alphabet or Arabic abjad is the Arabic script as it is codified for writing the Arabic language. It is written from right to left, in a cursive style, and includes 28 letters. Because letters usually stand for consonants, it is classified as an abjad.-Consonants:The Arabic alphabet has...

 letter Meem
Meem
Meem is the letter Mem , the thirteenth letter of many Semitic language abjads, including Phoenician, Aramaic, Hebrew and ArabicMeem may also refer to:*Meem , a lesbian organization in Lebanon...

م
U+0E57 &#3671; or &#xE57; Thai
Thai alphabet
Thai script , is used to write the Thai language and other, minority, languages in Thailand. It has forty-four consonants , fifteen vowel symbols that combine into at least twenty-eight vowel forms, and four tone marks ....

 digit
Numerical digit
A digit is a symbol used in combinations to represent numbers in positional numeral systems. The name "digit" comes from the fact that the 10 digits of the hands correspond to the 10 symbols of the common base 10 number system, i.e...

 7
U+1250 &#4688; or &#x1250; Ge'ez
Ge'ez alphabet
Ge'ez , also called Ethiopic, is a script used as an abugida for several languages of Ethiopia and Eritrea but originated in an abjad used to write Ge'ez, now the liturgical language of the Ethiopian and Eritrean Orthodox Church...

 syllable Qha
U+3042 &#12354; or &#x3042; Hiragana
Hiragana
is a Japanese syllabary, one basic component of the Japanese writing system, along with katakana, kanji, and the Latin alphabet . Hiragana and katakana are both kana systems, in which each character represents one mora...

 letter A (Japanese)
U+53F6 &#21494; or &#x53F6; CJK Unified Ideograph
CJK Unified Ideographs
The Chinese, Japanese and Korean scripts share a common background. In the process called Han unification the common characters were identified, and named "CJK Unified Ideographs"...

-53F6 (Simplified Chinese "Leaf")
U+8449 &#33865; or &#x8449; CJK Unified Ideograph
CJK Unified Ideographs
The Chinese, Japanese and Korean scripts share a common background. In the process called Han unification the common characters were identified, and named "CJK Unified Ideographs"...

-8449 (Traditional Chinese "Leaf")
U+B5AB &#46507; or &#xB5AB; Hangul
Hangul
Hangul,Pronounced or ; Korean: 한글 Hangeul/Han'gŭl or 조선글 Chosŏn'gŭl/Joseongeul the Korean alphabet, is the native alphabet of the Korean language. It is a separate script from Hanja, the logographic Chinese characters which are also sometimes used to write Korean...

 syllable
Syllable
A syllable is a unit of organization for a sequence of speech sounds. For example, the word water is composed of two syllables: wa and ter. A syllable is typically made up of a syllable nucleus with optional initial and final margins .Syllables are often considered the phonological "building...

 Tteolp (Korean "Ssangtikeut Eo Rieulbieup")
U+16A0 &#5792; or &#x16A0; Runic
Runic alphabet
The runic alphabets are a set of related alphabets using letters known as runes to write various Germanic languages before the adoption of the Latin alphabet and for specialized purposes thereafter...

 letter Fehu
Fe (rune)
The Fe rune represents the f-sound in the Younger Futhark and Futhorc alphabets. Its name means " wealth", cognate to English fee with the original meaning of "sheep" or "cattle" .The rune derives from the unattested but reconstructed Proto-Germanic...

To display all of the characters above, you may need to install one or more large multilingual fonts, like Code2000
Code2000
Code2000 is a pan-Unicode digital font, which includes characters and symbols from a very large range of writing systems. As of the current final version 1.171 released in 2008, Code2000 is designed and implemented by James Kass to include as much of the Unicode 5.2 standard as practical , and to...

.


Some web browsers, such as Mozilla Firefox
Mozilla Firefox
Mozilla Firefox is a free and open source web browser descended from the Mozilla Application Suite and managed by Mozilla Corporation. , Firefox is the second most widely used browser, with approximately 25% of worldwide usage share of web browsers...

, Opera
Opera (web browser)
Opera is a web browser and Internet suite developed by Opera Software with over 200 million users worldwide. The browser handles common Internet-related tasks such as displaying web sites, sending and receiving e-mail messages, managing contacts, chatting on IRC, downloading files via BitTorrent,...

, Safari
Safari (web browser)
Safari is a web browser developed by Apple Inc. and included with the Mac OS X and iOS operating systems. First released as a public beta on January 7, 2003 on the company's Mac OS X operating system, it became Apple's default browser beginning with Mac OS X v10.3 "Panther". Safari is also the...

 and Internet Explorer
Internet Explorer
Windows Internet Explorer is a series of graphical web browsers developed by Microsoft and included as part of the Microsoft Windows line of operating systems, starting in 1995. It was first released as part of the add-on package Plus! for Windows 95 that year...

 (from version 7 on), are able to display multilingual web pages by intelligently choosing a font to display each individual character on the page. They will correctly display any mix of Unicode blocks
Mapping of Unicode characters
Unicode’s Universal Character Set has a potential capacity to support over 1 million characters. Each UCS character is mapped to a code point which is an integer between 0 and 1,114,111 used to represent each character within the internal logic of text processing software .As of Unicode 5.2.0,...

, as long as appropriate fonts are present in the operating system
Operating system
An operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...

.

Older browsers such as Netscape Navigator
Netscape Navigator
Netscape Navigator was a proprietary web browser that was popular in the 1990s. It was the flagship product of the Netscape Communications Corporation and the dominant web browser in terms of usage share, although by 2002 its usage had almost disappeared...

 4.77 and Internet Explorer 6
Internet Explorer 6
Internet Explorer 6 is the sixth major revision of Internet Explorer, a web browser developed by Microsoft for Windows operating systems...

 can only display text supported by the current font associated with the character encoding of the page, and may misinterpret numeric character references as being references to code values within the current character encoding, rather than references to Unicode code points. When you are using such a browser, it is unlikely that your computer has all of those fonts, or that the browser can use all available fonts on the same page. As a result, the browser will not display the text in the examples above correctly, though it may display a subset of them. Because they are encoded according to the standard, though, they will display correctly on any system that is compliant and does have the characters available. Further, those characters given names for use in named entity references are likely to be more commonly available than others.

For displaying characters outside the Basic Multilingual Plane, like the Gothic letter faihu, which is variant of runic letter Fehu in the table above, some systems (like Windows 2000) need manual adjustments of their settings.

Frequency of usage

According to internal data from Google
Google
Google Inc. is an American multinational public corporation invested in Internet search, cloud computing, and advertising technologies. Google hosts and develops a number of Internet-based services and products, and generates profit primarily from advertising through its AdWords program...

's web index, in December 2007 the UTF-8
UTF-8
UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...

 Unicode encoding became the most frequently used encoding on web pages, overtaking both ASCII
ASCII
The American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text...

 (US) and 8859-1
ISO/IEC 8859-1
ISO/IEC 8859-1:1998, Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1, is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1987. It is informally referred to as Latin-1. It is generally...

/1252
Windows-1252
Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages...

 (Western European).

See also

  • Help file for using special characters on Wikipedia
  • Character encodings in HTML
    Character encodings in HTML
    HTML has been in use since 1991, but HTML 4.0 was the first standardized version where international characters were given reasonably complete treatment...

  • Charset detection
    Charset detection
    Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. This algorithm usually involves statistical analysis of byte patterns...

  • Unicode character reference (wikibooks)

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK