Bush hid the facts
Bush hid the facts is a common name for a bug present in the function IsTextUnicode of Microsoft Windows, which causes a file of text encoded in Windows-1252 or a similar encoding to be interpreted as if it were UTF-16LE, resulting in mojibake. When "Bush hid the facts" (without newline) is put in a new Notepad document and saved, closed, and reopened, the words "畂桳栠摩琠敨映捡獴" (Liu Benrenmotian Touyingjianmeng) appear instead.
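The misreading itself can be reproduced without Notepad. The following Python sketch is illustrative only: the function name misread_as_utf16le is hypothetical, and the code does not call IsTextUnicode. It encodes the sentence as Windows-1252 and then decodes the same bytes as UTF-16LE, so that every two bytes are paired into one 16-bit code unit.

```python
def misread_as_utf16le(text: str) -> str:
    """Encode text as Windows-1252, then decode the same bytes as UTF-16LE."""
    raw = text.encode("cp1252")      # one byte per character for this text
    return raw.decode("utf-16-le")   # two bytes per code unit, low byte first


garbled = misread_as_utf16le("Bush hid the facts")
print(garbled)
# Show the code point that each pair of bytes was combined into.
print([f"U+{ord(ch):04X}" for ch in garbled])
```

The 18 bytes of the sentence become 9 characters. For example, "B" (0x42) and "u" (0x75) are combined into the code unit 0x7542, which is the first character of the garbled text. A string with an odd number of bytes cannot be decoded this way and raises an error in this sketch.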

While "Bush hid the facts" is the sentence most commonly presented on the Internet to induce the error, the bug can be triggered by many sentences with characters and spaces in a particular order so that the bytes match the UTF-16LE encoding of valid (if nonsensical) Chinese Unicode characters. Other popular strings are "this app can break", "acre vai pra globo", and "aaaa aaa aaa aaaaa".
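Whether a given sentence has this property can be checked with a short script. The sketch below is a simplification and not the algorithm used by IsTextUnicode: it assumes, for illustration, that "valid Chinese" means every pair of bytes falls in the CJK Unified Ideographs block (U+4E00 to U+9FFF). The function name looks_like_cjk_utf16le is hypothetical.

```python
def looks_like_cjk_utf16le(text: str) -> bool:
    """Return True if the cp1252 bytes of text pair up into CJK ideographs."""
    raw = text.encode("cp1252")
    if not raw or len(raw) % 2:
        return False  # UTF-16 needs a non-empty, even number of bytes
    # Combine each byte pair little-endian: first byte is the low byte.
    units = [raw[i] | (raw[i + 1] << 8) for i in range(0, len(raw), 2)]
    # Assumption: only the CJK Unified Ideographs block counts as "Chinese".
    return all(0x4E00 <= unit <= 0x9FFF for unit in units)


for sentence in ["Bush hid the facts", "this app can break",
                 "acre vai pra globo", "aaaa aaa aaa aaaaa", "Hello, world"]:
    print(sentence, looks_like_cjk_utf16le(sentence))
```

All four strings named above pass this check, while "Hello, world" does not, because the pair "o," maps to U+2C6F, which lies outside the assumed range.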

The bug occurs when the string is passed to the Win32 charset detection function IsTextUnicode with no other characters. IsTextUnicode sees what it thinks is valid UTF-16LE Chinese and returns true, and the application then incorrectly interprets the text as UTF-16LE.

Many text editors and tools exhibit this behavior because they use IsTextUnicode as well.

Discovery

The bug appeared for the first time in Windows NT 3.5 but was not discovered until early 2004. Older versions of Notepad, such as those that came with Windows 95, 98, ME, and NT 3.1, do not include Unicode support, so the bug does not occur in them.

The bug existed in all successive versions of Windows through Windows XP. It does not occur in Windows Vista and Windows 7 because their version of IsTextUnicode has been altered to make it much more likely to guess a byte-based encoding rather than UTF-16LE.

Workarounds

Editing the text so that it no longer matches a triggering pattern avoids the bug; for instance, adding a newline within the first 20 characters works.

If the file is saved as "UTF-8" rather than "ANSI" (which in reality means Windows-1252 on systems using western European languages), the text displays correctly, because Notepad prepends the UTF-8 byte order mark, which is a different pattern that does not trigger this bug. UTF-8 without the byte order mark would still trigger the bug, because for this text it is byte-for-byte identical to ASCII.
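The difference between the two UTF-8 variants can be seen at the byte level. The following Python sketch uses the standard "utf-8-sig" codec as a stand-in for what Notepad writes; that correspondence is an assumption based on the description above, not a statement about Notepad's implementation.

```python
import codecs

text = "Bush hid the facts"
with_bom = text.encode("utf-8-sig")   # UTF-8 with the byte order mark prepended
without_bom = text.encode("utf-8")    # plain UTF-8, no byte order mark

# The BOM is the three bytes EF BB BF at the start of the data.
print(with_bom[:3] == codecs.BOM_UTF8)
# Without the BOM, the bytes are the same as the ASCII encoding.
print(without_bom == text.encode("ascii"))
# The BOM makes the byte count odd (3 + 18), so the bytes no longer pair up.
print(len(with_bom), len(without_bom))
```

With the byte order mark in front, the data no longer starts with the byte pairs that were misread as Chinese characters.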

The bug is also avoided by saving as "Unicode", which in reality saves as UTF-16LE.

To retrieve the original text using Notepad, bring up the "Open a file" dialog box, select the file, select "ANSI" or "UTF-8" in the "Encoding" list box, and click Open. (Under Windows 2000, Notepad lacks the "Encoding" list box. Notepad2 makes the same error (by trusting IsTextUnicode), and also lacks an option to override encoding when opening a file. However, WordPad opens the text file correctly by default.)
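Choosing "ANSI" in the dialog amounts to decoding the stored bytes as Windows-1252 instead of letting detection pick the encoding. A minimal Python sketch of the same idea, applied to text that has already been misread, is shown here; the function name recover_from_mojibake is hypothetical.

```python
def recover_from_mojibake(garbled: str) -> str:
    """Undo a UTF-16LE misreading of Windows-1252 text."""
    raw = garbled.encode("utf-16-le")  # back to the bytes stored in the file
    return raw.decode("cp1252")        # decode them with the intended encoding


# The nine characters shown by Notepad, written as code points.
garbled_text = "".join(chr(c) for c in [0x7542, 0x6873, 0x6820, 0x6469, 0x7420,
                                        0x6568, 0x6620, 0x6361, 0x7374])
print(recover_from_mojibake(garbled_text))
```

This works because the misreading does not change the bytes on disk; only their interpretation differs.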

External links

  • The Notepad file encoding problem, redux – Raymond Chen
  • IsTextUnicode – MSDN Library

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 