Charset detection
Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique usually involves statistical analysis of byte patterns; such analysis can require the frequency distribution of trigraphs of the various languages encoded in each code page to be detected. The process is not foolproof because it depends on statistical data; for example, some versions of the Windows operating system would mis-detect the phrase "Bush hid the facts" in ASCII as UTF-16LE, rendering it as Chinese characters.

One of the few cases where charset detection works reliably is detecting UTF-8. This is due to the large percentage of byte sequences that are invalid in UTF-8: text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. Unfortunately, badly written charset detection routines do not run the reliable UTF-8 test first and may decide that UTF-8 is some other encoding.
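
Because UTF-8's rules for valid byte sequences are mechanical, such a test is short to implement. The Python sketch below relies on the standard library's strict UTF-8 decoder and classifies input as ASCII, valid UTF-8, or neither; a well-behaved detector would run a check like this before any statistical guessing.

    def utf8_check(data: bytes) -> str:
        """Classify a byte string as ASCII, valid UTF-8, or not UTF-8.

        A strict UTF-8 decode fails on any invalid byte sequence, which is what
        makes this test reliable: text in a legacy 8-bit encoding that uses
        high-bit bytes is extremely unlikely to decode cleanly as UTF-8.
        """
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            return "not UTF-8"
        # All bytes below 0x80 means the data is also plain ASCII, which every
        # ASCII-compatible encoding interprets identically.
        return "ASCII" if data.isascii() else "UTF-8"

For instance, the ASCII bytes of "Bush hid the facts" come back as "ASCII", whereas Windows-1252 text containing accented letters will almost always come back as "not UTF-8".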

Due to the unreliability of charset detection, it is usually better to properly label datasets with the correct encoding. For example, HTML documents can declare their encoding in a meta element, thus:
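
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

or, equivalently in HTML5:

    <meta charset="utf-8">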



Alternatively, when documents are transmitted over HTTP, the same metadata can be conveyed out-of-band using the Content-Type header.
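
For example, a response carrying an HTML document might include:

    Content-Type: text/html; charset=UTF-8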

See also

  • International Components for Unicode - a library that can perform charset detection.

