Byte Order Mark - AbsoluteAstronomy.com

The byte order mark is a Unicode

Unicode

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems...

character used to signal the endianness

Endianness

In computing, the term endian or endianness refers to the ordering of individually addressable sub-components within the representation of a larger data item as stored in external memory . Each sub-component in the representation has a unique degree of significance, like the place value of digits...

(byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.

Usage

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (essentially a null character). In Unicode 3.2, this usage is deprecated in favour of the "Word Joiner" character, U+2060. This allows U+FEFF to be only used as a BOM.

UTF-8

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

The Unicode Standard does permit the BOM in UTF-8

UTF-8

UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...

, but does not require or recommend its use. Byte order has no meaning in UTF-8 so in UTF-8 the BOM serves only to identify a text stream or file as UTF-8.

One reason the UTF-8 BOM is not recommended is that many pieces of software without Unicode support nevertheless are able to handle UTF-8 inside a text but not at the start of a text. For instance, the bytes of UTF-8 can be placed between the quotes of string constants in many programming languages, and that language will write the correct UTF-8 to a file or to a display, despite the language not knowing anything about UTF-8. This provides an easy migration path to convert systems to Unicode and to remove all legacy encodings, without simultaneously upgrading the programming language. The unexpected three bytes of the BOM break this however, as they are located where they are certain to be a syntax error.

A leading BOM can also defeat software that uses pattern matching on the start of a text file, since it inserts 3 bytes before the pattern. Though commonly associated with the Unix shebang

Shebang (Unix)

In computing, a shebang is the character sequence consisting of the characters number sign and exclamation point , when it occurs as the first two characters on the first line of a text file...

at the start of an interpreted script, the problem is more widespread. For instance in PHP

PHP

PHP is a general-purpose server-side scripting language originally designed for web development to produce dynamic web pages. For this purpose, PHP code is embedded into the HTML source document and interpreted by a web server with a PHP processor module, which generates the web page document...

, the existence of a BOM will cause the page to begin output before the initial code is interpreted, causing problems if the page is trying to send custom HTTP headers (which must be set before output begins).

Many Windows

Microsoft Windows

Microsoft Windows is a series of operating systems produced by Microsoft.Microsoft introduced an operating environment named Windows on November 20, 1985 as an add-on to MS-DOS in response to the growing interest in graphical user interfaces . Microsoft Windows came to dominate the world's personal...

programs (including Windows Notepad) add BOMs to UTF-8 files by default .

UTF-16

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xHexadecimal In mathematics and computer science, hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen... FE followed by 0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text display that expects the text to be ISO-8859-1, although UTF-16 text will be more or less unreadable if the text display doesn't support UTF-16.
if the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a text display that expects the text to be ISO-8859-1.

Programs expecting UTF-8 may show these or error indicators, depending on how they handle UTF-8 encoding errors. In all cases they will probably display the rest of the file as garbage (a UTF-16 text containing ASCII only will be fairly readable).

For the IANA

Internet Assigned Numbers Authority

The Internet Assigned Numbers Authority is the entity that oversees global IP address allocation, autonomous system number allocation, root zone management in the Domain Name System , media types, and other Internet Protocol-related symbols and numbers...

registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a "zero width no-break space".

Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore the presumption of big-endian is widely ignored. When those same files are accessible on the Internet, on the other hand, no such presumption can be made. Searching for ASCII characters or just the space character (U+0020) is a method of determining the UTF-16 byte order.

UTF-32

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.

Representations of byte order marks by encoding

Encoding	Representation (hexadecimal Hexadecimal In mathematics and computer science, hexadecimal is a positional numeral system with a radix, or base, of 16. It uses sixteen distinct symbols, most often the symbols 0–9 to represent values zero to nine, and A, B, C, D, E, F to represent values ten to fifteen... )	Representation (decimal Decimal The decimal numeral system has ten as its base. It is the numerical base most widely used by modern civilizations.... )	Representation (ISO-8859-1)
UTF-8 UTF-8 UTF-8 is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks...	`EF BB BF`	`239 187 191`	`ï»¿`
UTF-16 (BE)	`FE FF`	`254 255`	`þÿ`
UTF-16 (LE)	`FF FE`	`255 254`	`ÿþ`
UTF-32 (BE)	`00 00 FE FF`	`0 0 254 255`	`□□þÿ` (□ is the ascii null character)
UTF-32 (LE)	`FF FE 00 00`	`255 254 0 0`	`ÿþ□□` (□ is the ascii null character)
UTF-7 UTF-7 UTF-7 is a variable-length character encoding that was proposed for representing Unicode text using a stream of ASCII characters...	`2B 2F 76 38 2B 2F 76 39 2B 2F 76 2B 2B 2F 76 2F`In UTF-7, the fourth byte of the BOM, before encoding as base64 Base64 Base64 is a group of similar encoding schemes that represent binary data in an ASCII string format by translating it into a radix-64 representation... , is `001111xx` in binary, and `xx` depends on the next character (the first character after the BOM). Hence, technically, the fourth byte is not purely a part of the BOM, but also contains information about the next (non-BOM) character. For `xx=00`, `01`, `10`, `11`, this byte is, respectively, `38`, `39`, `2B`, or `2F` when encoded as base64. If no following character is encoded, `38` is used for the fourth byte and the following byte is `2D`.	`43 47 118 56 43 47 118 57 43 47 118 43 43 47 118 47`	`+/v8 +/v9 +/v+ +/v/`
UTF-1 UTF-1 UTF-1 is a way of transforming ISO 10646/Unicode into a stream of bytes. Due to the design it is not possible to resynchronise if decoding starts in the middle of a character and simple byte-oriented search routines cannot be reliably used with it. UTF-1 is also fairly slow due to its use of...	`F7 64 4C`	`247 100 76`	`÷dL`
UTF-EBCDIC UTF-EBCDIC UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for...	`DD 73 66 73`	`221 115 102 115`	`Ýsfs`
SCSU Standard Compression Scheme for Unicode The Standard Compression Scheme for Unicode is a Unicode Technical Standard for reducing the number of bytes needed to represent Unicode text, especially if that text uses mostly characters from one or a small number of per-language character blocks. It does so by dynamically mapping values in the...	`0E FE FF`	`14 254 255`	`□þÿ` (□ is the ascii "shift out" character)
BOCU-1	`FB EE 28`	`251 238 40`	`ûî`
GB-18030	`84 31 95 33`	`132 49 149 51`	`□1■3` (□ and ■ are unmapped ISO-8859-1 characters)

External links

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.