Binary Ordered Compression for Unicode
Binary Ordered Compression for Unicode (BOCU) is a MIME-compatible Unicode compression scheme. BOCU-1 combines the wide applicability of UTF-8 with the compactness of the Standard Compression Scheme for Unicode (SCSU). This Unicode encoding is designed to be useful for compressing short strings, and it maintains code point order. BOCU-1 is specified in a Unicode Technical Note.

For comparison, SCSU was adopted as a standard Unicode compression scheme with a byte/code point ratio similar to that of language-specific code pages. SCSU has not been widely adopted, as it is not suitable for MIME “text” media types. For example, SCSU cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. For larger amounts of Unicode text, zip, bzip2, and other industry-standard algorithms usually achieve better compression.

Both SCSU and BOCU-1 are IANA-registered charsets.

Details

All numbers in this section are hexadecimal, and all ranges are inclusive.

Code points from U+0000 to U+0020 are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, U+0021 through U+D7FF and U+E000 through U+10FFFF) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (U+0020). The initial state is U+0040. The normalization mapping is as follows:
Code range                                      Normalized code point      Notes
U+3040 to U+309F                                U+3070                     Hiragana
U+4E00 to U+9FA5                                U+7711                     Unihan
U+AC00 to U+D7A3                                U+C1D1                     Hangul
U+0020                                          encoder state kept as is   Space
U+hhhh00 to U+hhhh7F (excluding ranges above)   U+hhhh40                   middle of 128
U+hhhh80 to U+hhhhFF (excluding ranges above)   U+hhhhC0                   middle of 128
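
The normalization mapping can be sketched in a few lines of Python. This is an illustration of the table only, not the reference implementation from the Unicode Technical Note, and the function name next_state is hypothetical.

```python
def next_state(prev: int, c: int) -> int:
    """Return the encoder state after code point c (hypothetical helper).

    prev is the current state, i.e. the normalized version of the most
    recently encoded code point that was not a space.
    """
    if c == 0x20:
        return prev                  # space: encoder state kept as is
    if 0x3040 <= c <= 0x309F:
        return 0x3070                # Hiragana
    if 0x4E00 <= c <= 0x9FA5:
        return 0x7711                # Unihan
    if 0xAC00 <= c <= 0xD7A3:
        return 0xC1D1                # Hangul
    # Everything else: middle of the aligned block of 128 code points.
    return (c & ~0x7F) + 0x40
```

For example, next_state(0x40, 0x044F) gives 0x0440, and a following space leaves that state unchanged.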


The difference between the current code point and the normalized previous code point is encoded as follows:
Difference range        Byte sequence range (see below)
-10FF9F to -2DD0D       21 F0 58 D9 to 21 FF FF FF
-2DD0C to -2912         22 01 01 to 24 FF FF
-2911 to -41            25 01 to 4F FF
-40 to 3F               50 to CF
40 to 2910              D0 01 to FA FF
2911 to 2DD0B           FB 01 01 to FD FF FF
2DD0C to 10FFBF         FE 01 01 01 to FE 19 B4 54


Each byte range is lexicographically ordered, with the following thirteen byte values excluded: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20. For example, the byte sequence FC 06 FF, coding for a difference of 1156B, is immediately followed by the byte sequence FC 10 01, coding for a difference of 1156C.
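
The relation between byte sequences and differences can be checked with a short Python sketch. It treats the 243 permitted byte values as base-243 digits and reads the lead byte ranges from the table above. The names LEADS, TRAIL_VALUE, and unpack_diff are hypothetical, and the code is an illustration under those assumptions rather than the reference decoder.

```python
# The thirteen byte values that never appear in multi-byte sequences.
EXCLUDED = {0x00, *range(0x07, 0x10), 0x1A, 0x1B, 0x20}
TRAIL_BYTES = [b for b in range(0x100) if b not in EXCLUDED]  # 243 values
TRAIL_VALUE = {b: i for i, b in enumerate(TRAIL_BYTES)}       # byte -> digit

# (first lead, last lead, trail byte count, lead byte origin, difference origin)
LEADS = [
    (0x21, 0x21, 3, 0x22, -0x2DD0C),
    (0x22, 0x24, 2, 0x25, -0x2911),
    (0x25, 0x4F, 1, 0x50, -0x40),
    (0x50, 0xCF, 0, 0x90, 0),
    (0xD0, 0xFA, 1, 0xD0, 0x40),
    (0xFB, 0xFD, 2, 0xFB, 0x2911),
    (0xFE, 0xFE, 3, 0xFE, 0x2DD0C),
]

def unpack_diff(seq: bytes) -> int:
    """Decode one complete BOCU-1 difference sequence into a signed integer."""
    lead, trail = seq[0], seq[1:]
    for first, last, count, lead_origin, diff_origin in LEADS:
        if first <= lead <= last:
            if len(trail) != count:
                raise ValueError("wrong number of trail bytes")
            number = lead - lead_origin      # may be negative
            for b in trail:
                number = number * 243 + TRAIL_VALUE[b]
            return diff_origin + number
    raise ValueError("not a lead byte of a difference sequence")
```

With these definitions, unpack_diff(bytes.fromhex("FC06FF")) returns 0x1156B and unpack_diff(bytes.fromhex("FC1001")) returns 0x1156C, matching the example above.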

Any ASCII input from U+0000 to U+007F, excluding the space U+0020, resets the encoder to U+0040. Because the line-end code points U+000D and U+000A are among these and are encoded as is (0D 0A), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line. For comparison, the corruption of a single byte in UTF-8 affects at most one code point, whereas for SCSU it can affect the entire document.
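
The state handling can be illustrated with a minimal encoder sketch in Python, assembled from the rules and tables in this section. It is not the reference implementation: the names normalize, pack_diff, and encode are hypothetical, the reset byte and input validation are left out, and the input is assumed to contain no surrogate code points.

```python
EXCLUDED = {0x00, *range(0x07, 0x10), 0x1A, 0x1B, 0x20}
TRAIL_BYTES = [b for b in range(0x100) if b not in EXCLUDED]  # 243 values

# (difference origin of the range, lead byte origin, trail byte count)
POSITIVE = [(0x2DD0C, 0xFE, 3), (0x2911, 0xFB, 2), (0x40, 0xD0, 1)]
NEGATIVE = [(-0x2DD0C, 0x22, 3), (-0x2911, 0x25, 2), (-0x40, 0x50, 1)]

def normalize(c: int) -> int:
    """Normalized code point used as the next encoder state."""
    if 0x3040 <= c <= 0x309F:
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:
        return 0x7711
    if 0xAC00 <= c <= 0xD7A3:
        return 0xC1D1
    return (c & ~0x7F) + 0x40

def pack_diff(diff: int) -> bytes:
    """Encode a signed difference as a lead byte plus base-243 trail bytes."""
    if -0x40 <= diff <= 0x3F:
        return bytes([0x90 + diff])
    if diff > 0:
        origin, lead, count = next(r for r in POSITIVE if diff >= r[0])
    else:
        origin, lead, count = next(r for r in NEGATIVE if diff < r[0])
    rest = diff - origin
    trail = []
    for _ in range(count):
        rest, digit = divmod(rest, 243)  # floor division, also for negatives
        trail.append(TRAIL_BYTES[digit])
    return bytes([lead + rest, *reversed(trail)])

def encode(text: str) -> bytes:
    """Encode text, starting from the initial state U+0040."""
    prev = 0x40
    out = bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0x20:
            out.append(c)                # encoded as the byte value itself
            if c != 0x20:
                prev = 0x40              # controls reset; space keeps state
        else:
            out += pack_diff(c - prev)
            prev = normalize(c)          # ASCII lands on U+0040 again
    return bytes(out)
```

In this sketch, encode("яя") yields D4 04 9F, because the second letter is coded relative to the state U+0440, while encode("я\nя") yields D4 04 0A D4 04: the line feed resets the state, so each line is encoded independently of the one before it.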

For input texts that lack such code points, BOCU-1 offers similar robustness through the special reset byte 0xFF. When a decoder finds this octet, it resets its state to U+0040, as it does at a line end. The BOCU-1 specification does not recommend using 0xFF reset bytes, because doing so conflicts with other BOCU-1 design goals, notably binary order.

The optional use of a signature U+FEFF at the beginning of BOCU-1 encoded texts, i.e. the BOCU-1 byte sequence FB EE 28, changes the initial state from U+0040 to U+FEC0, the middle of the block U+FE80 to U+FEFF. In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (FB EE 28 FF) could avoid this effect, but the BOCU-1 specification does not recommend this practice.

In theory, UTF-1 and UTF-8 could encode the original UCS-4 set with 31 bits, up to 7FFFFFFF. BOCU-1 and UTF-16 can encode the modern Unicode set from U+0000 to U+10FFFF. BOCU-1 can use every octet value in multi-byte encodings except the thirteen protected values, which only ever encode the corresponding code points as single octets. BOCU-1 needs at most four bytes per code point, consisting of a lead byte and up to three trail bytes. The trail bytes encode the remainder of the difference in base 243 ("modulo 243"); the lead byte determines the number of trail bytes and an initial difference.
Note that the reset byte 0xFF is not protected and can occur as a trail byte.

Patent

The general BOCU algorithm is covered by United States Patent #6,737,994, which also mentions the specific BOCU-1 implementation. IBM, which employed both of the inventors of BOCU-1 at the time it was created, states in the Unicode Technical Note that implementers of a "fully compliant version of BOCU-1" must contact IBM to request a royalty-free license. BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to be encumbered with intellectual property restrictions.

By contrast, IBM also filed for a patent on UTF-EBCDIC, but it chose in that case to make the documentation and encoding scheme “freely available to anyone concerned towards making the transformation format as part of the UCS standards,” instead of requiring implementers to request a license.

See also

  • UTF-1, which contains a comparison of the UTF-1, UTF-8, and BOCU-1 designs
  • International Components for Unicode, a library that can convert between BOCU-1 and other Unicode encodings
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 