Bzip2
Encyclopedia
bzip2 is a free
Free software
Free software, software libre or libre software is software that can be used, studied, and modified without restriction, and which can be copied and redistributed in modified or unmodified form either without restriction, or with restrictions that only ensure that further recipients can also do...

 and open source
Open-source software
Open-source software is computer software that is available in source code form: the source code and certain other rights normally reserved for copyright holders are provided under a software license that permits users to study, change, improve and at times also to distribute the software.Open...

 implementation of the Burrows–Wheeler algorithm. It is developed and maintained by Julian Seward
Julian Seward
Julian Seward is a compiler writer and Free Software contributor who lives in Cambridge, UK. He is commonly known for creating the bzip2 compression tool, as well as the valgrind memory debugging toolset founded in 2000...

. Seward made the first public release of bzip2, version 0.15, in July 1996.

Compression efficiency

bzip2 compresses most files more effectively than the older LZW (.Z
Compress
Compress is a UNIX compression program based on the LZC compression method, which is an LZW implementation using variable size pointers as in LZ78.- Description of program :Files compressed by compress are typically given the extension .Z...

) and Deflate
DEFLATE
Deflate is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. It was originally defined by Phil Katz for version 2 of his PKZIP archiving tool and was later specified in RFC 1951....

 (.zip
ZIP (file format)
Zip is a file format used for data compression and archiving. A zip file contains one or more files that have been compressed, to reduce file size, or stored as is...

 and .gz
Gzip
Gzip is any of several software applications used for file compression and decompression. The term usually refers to the GNU Project's implementation, "gzip" standing for GNU zip. It is based on the DEFLATE algorithm, which is a combination of Lempel-Ziv and Huffman coding...

) compression algorithms, but is considerably slower. LZMA is generally more efficient than bzip2 at the expense of slower compression speed, while having much faster decompression.

bzip2 compresses data in blocks of size between 100 and 900 kB
Kilobyte
The kilobyte is a multiple of the unit byte for digital information. Although the prefix kilo- means 1000, the term kilobyte and symbol KB have historically been used to refer to either 1024 bytes or 1000 bytes, dependent upon context, in the fields of computer science and information...

 and uses the Burrows–Wheeler transform to convert frequently-recurring character sequences into strings of identical letters. It then applies move-to-front transform
Move-to-front transform
The move-to-front transform is an encoding of data designed to improve the performance of entropy encoding techniques of compression...

 and Huffman coding
Huffman coding
In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol where the variable-length code table has been derived in a particular way based on...

. bzip2's ancestor bzip used arithmetic coding
Arithmetic coding
Arithmetic coding is a form of variable-length entropy encoding used in lossless data compression. Normally, a string of characters such as the words "hello there" is represented using a fixed number of bits per character, as in the ASCII code...

 instead of Huffman. The change was made because of a software patent
Software patent
Software patent does not have a universally accepted definition. One definition suggested by the Foundation for a Free Information Infrastructure is that a software patent is a "patent on any performance of a computer realised by means of a computer program".In 2005, the European Patent Office...

 restriction.

bzip2 performance is asymmetric, as decompression is relatively fast. Motivated by the large CPU time required for compression, a modified version was created in 2003 called pbzip2 that supported multi-threading
Thread (computer science)
In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...

, giving almost linear speed improvements on multi-CPU and multi-core computers. , this functionality has not been incorporated into the main project.

Like gzip, bzip2 is only a data compressor. It is not an archiver like RAR or ZIP; the program itself has no facilities for multiple files, encryption or archive-splitting, but, in the UNIX
Unix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...

 tradition, relies instead on separate external utilities such as tar
Tar (file format)
In computing, tar is both a file format and the name of a program used to handle such files...

 and GnuPG
GNU Privacy Guard
GNU Privacy Guard is a GPL Licensed alternative to the PGP suite of cryptographic software. GnuPG is compliant with RFC 4880, which is the current IETF standards track specification of OpenPGP...

 for these tasks.

Compression stack

Bzip2 uses several layers of compression techniques stacked on top of each other, which occur in the following order during compression and the reverse order during decompression:
  1. Run-length encoding
    Run-length encoding
    Run-length encoding is a very simple form of data compression in which runs of data are stored as a single data value and count, rather than as the original run...

     (RLE): any sequence of 4 to 255 consecutive duplicate symbols is replaced by the first four symbols and a repeat length between 0 and 251. Thus the sequence "AAAAAAABBBBCCCD" is replaced with "AAAA\3BBBB\0CCCD". Runs of symbols are always transformed after four consecutive symbols, even if the run-length is set to zero, to keep the transformation reversible. In the worst case, it can cause an expansion of 1.25 and best case a reduction to <0.02 . While the specification theoretically allows for runs of length 256–259 to be encoded, the reference encoder will not produce such output. The author of bzip2 has stated that the RLE step was a historical mistake and was only intended to protect the original BWT implementation from pathological cases.
  2. Burrows–Wheeler transform (BWT): this is the reversible block-sort that is at the core of bzip2. The block is entirely self contained, with input and output buffers remaining the same size—in bzip2, the operating limit for this stage is 900 kB. For the block-sort, a (notional) matrix is created in which row contains the whole of the buffer, rotated to start from the symbol. Following rotation, the rows of the matrix are sorted into alphabetic (numerical) order. A 24-bit pointer is stored marking the starting position for when the block is untransformed. In practice, it is not necessary to construct the full matrix; rather, the sort is performed using pointers for each position in the buffer. The output buffer is the last column of the matrix; this contains the whole buffer, but reordered so that it is likely to contain large runs of identical symbols.
  3. Move to front
    Move-to-front transform
    The move-to-front transform is an encoding of data designed to improve the performance of entropy encoding techniques of compression...

     (MTF): again, this transform does not alter the size of the processed block. Each of the symbols in use in the document is placed in an array. When a symbol is processed, it is replaced by its location (index) in the array and that symbol is shuffled to the front of the array. The effect is that immediately recurring symbols are replaced by zero symbols (long runs of any arbitrary symbol thus become runs of zero symbols), while other symbols are remapped according to their local frequency.
    A lot of "natural" data contains identical symbols that recur within a limited range (text is a good example). As the MTF transform assigns low values to symbols that reappear frequently, this results in a data stream which contains a lot of symbols in the low integer range, many of them being identical (different recurring input symbols can actually map to the same output symbol). Such data can be very efficiently encoded by any legacy compression method.
  4. Run-length encoding (RLE): long strings of repeated symbols in the output (normally zeros by this time) are replaced by a combination of the symbol and a sequence of two special codes, RUNA and RUNB, which represent the run-length as a binary number greater than one (1). The sequence 0,0,0,0,0,1 would be represented as 0,RUNB,RUNA,1; RUNB and RUNA representing the value 4 in decimal. The run-length code is terminated by reaching another normal symbol. This RLE process is more flexible than the RLE of step 1, as it is able to encode arbitrarily long integers (in practice, this is usually limited by the block size, so that this step does not encode a run of more than 900000 bytes). The run-length is encoded in this fashion: assigning place values of 1 to the first bit, 2 to the second, 4 to the third, etc. in the RUNA/RUNB sequence, multiply each place value in a RUNB spot by 2, and add all the resulting place values (for RUNA and RUNB values alike) together. Thus, the sequence RUNB, RUNA results in the value (1*2 + 2) = 4. As a more complicated example:
    RUNA RUNB RUNA RUNA RUNB (ABAAB)
       1    2    4    8   16
       1    4    4    8   32 = 49
  5. Huffman coding
    Huffman coding
    In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol where the variable-length code table has been derived in a particular way based on...

    : this process replaces fixed length symbols in the range 0–258 with variable length codes based on the frequency of use. More frequently used codes end up shorter (2-3 bits) whilst rare codes can be allocated up to 20 bits. The codes are selected carefully so that no sequence of bits can be confused for a different code.
    The end-of-stream code is particularly interesting. If there are n different bytes (symbols) used in the uncompressed data, then the Huffman code will consist of two RLE codes (RUNA and RUNB), n-1 symbol codes and one end-of-stream code. Because of the combined result of the MTF and RLE encodings in the previous two steps, there is never any need to explicitly reference the first symbol in the MTF table, thus saving one symbol for the end-of-stream marker (and explaining why only n-1 symbols are coded in the Huffman tree). In the extreme case where only one symbol is used in the uncompressed data, there will be no symbol codes at all in the Huffman tree, and the entire block will consist of RUNA and RUNB (implicitly repeating the single byte) and an end-of-stream marker with value 2.
    0: RUNA
    1: RUNB
    2-257: byte values 0-255
    258: end of stream, finish processing. (could be as low as 2).
  6. Multiple Huffman tables
    Huffman coding
    In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol where the variable-length code table has been derived in a particular way based on...

    : several identically-sized Huffman tables can be used with a block if the gain from using them is greater than the cost of including the extra table. At least two (2) and up to six (6) tables can be present, with the most appropriate table being reselected before every 50 symbols processed. This has the advantage of having very responsive Huffman dynamics without having to continuously supply new tables, as would be required in DEFLATE
    DEFLATE
    Deflate is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. It was originally defined by Phil Katz for version 2 of his PKZIP archiving tool and was later specified in RFC 1951....

    . Run-length encoding in the previous step is designed to take care of codes that have an inverse probability of use higher than the shortest code Huffman code in use.
  7. Unary base 1 encoding: if multiple Huffman tables are in use, the selection of each table (numbered 0..5) is done from a list by a zero-terminated bit run between one (1) and six (6) bits in length. The selection is into a MTF
    Move-to-front transform
    The move-to-front transform is an encoding of data designed to improve the performance of entropy encoding techniques of compression...

     list of the tables. Using this feature results in a maximum expansion of around 1.015, but generally less. This expansion is likely to be greatly over-shadowed by the advantage of selecting more appropriate Huffman tables and the common-case of continuing to use the same Huffman table is represented as a single bit. Rather than unary encoding, effectively this is an extreme form of a Huffman tree where each code has half the probability of the previous code.
  8. Delta encoding
    Delta encoding
    Delta encoding is a way of storing or transmitting data in the form of differences between sequential data rather than complete files; more generally this is known as data differencing...

     (Δ): Huffman code bit-lengths are required to reconstruct each of the used Canonical Huffman tables. Each bit-length is stored as an encoded difference against the previous code bit-length. A zero-bit (0) means that the previous bit-length should be duplicated for the current code, whilst a one-bit (1) means that a further bit should be read and the bit-length incremented or decremented based on that value. In the common case a single bit is used per symbol per table and the worst case—going from length one (1) to length twenty (20)—would require approximately 37 bits. As a result of the earlier MTF encoding, code lengths would start at 2-3bits long (very frequently used codes) and gradually increase, meaning that the delta format is fairly efficient—requiring around 300-bits (38 bytes) per full Huffman table.
  9. Sparse bit array: a bitmap is used to show which symbols are used inside the block and should be included in the Huffman trees. Binary data is likely to use all 256 symbols representable by a byte, whereas textual data may only use a small subset of available values, perhaps covering the ASCII
    ASCII
    The American Standard Code for Information Interchange is a character-encoding scheme based on the ordering of the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that use text...

     range between 32 and 126. Storing 256 zero bits would be inefficient if they were mostly unused. A sparse method is used: the 256 symbols are divided up into 16 ranges and only if symbols are used within that block is a 16-bit array included. The presence of each of these 16 ranges is indicated by an additional 16-bit bit array at the front. The total bitmap uses between 32 and 272 bits of storage (4–34 bytes). For contrast, the DEFLATE
    DEFLATE
    Deflate is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. It was originally defined by Phil Katz for version 2 of his PKZIP archiving tool and was later specified in RFC 1951....

     algorithm would show the absence of a symbol by encoding the symbol as having zero bit-length; this is very inefficient with the Delta-encoding used in bzip2 which is the reason for bzip2 having a separate bit array.

File format

A .bz2 stream consists of a 4-byte header, followed by zero or more compressed blocks, immediately followed by an end-of-stream marker containing a 32-bit CRC for the plaintext whole stream processed. The compressed blocks are bit-aligned and no padding occurs.


.magic:16 = 'BZ' signature/magic number
.version:8 = 'h' for Bzip2 ('H'uffman coding), '0' for Bzip1 (deprecated)
.hundred_k_blocksize:8 = '1'..'9' block-size 100 kB-900 kB

.compressed_magic:48 = 0x314159265359 (BCD (pi))
.crc:32 = checksum for this block
.randomised:1 = 0=>normal, 1=>randomised (deprecated)
.origPtr:24 = starting pointer into BWT for after untransform
.huffman_used_map:16 = bitmap, of ranges of 16 bytes, present/not present
.huffman_used_bitmaps:0..256 = bitmap, of symbols used, present/not present (multiples of 16)
.huffman_groups:3 = 2..6 number of different Huffman tables in use
.selectors_used:15 = number of times that the Huffman tables are swapped (each 50 bytes)
  • .selector_list:1..6 = zero-terminated bit runs (0..62) of MTF'ed Huffman table (*selectors_used)

.start_huffman_length:5 = 0..20 starting bit length for Huffman deltas
  • .delta_bit_length:1..40 = 0=>next symbol; 1=>alter length

{ 1=>decrement length; 0=>increment length } (*(symbols+2)*groups)
.contents:2..∞ = Huffman encoded data stream until end of block

.eos_magic:48 = 0x177245385090 (BCD sqrt(pi))
.crc:32 = checksum for whole stream
.padding:0..7 = align to whole byte


Because of the first-stage RLE compression (see above), the maximum length of plaintext that a single 900 kB bzip2 block can contain is around 46 MB (45,899,235 bytes). This can occur if the whole plaintext consists entirely of repeated values (the resulting .bz2 file in this case is 46 bytes long). An even smaller file of 40 bytes can be achieved by using an input containing entirely values of 251, an apparent compression ratio of 1147480:1.

Implementations

  • bzip2: Julian Seward
    Julian Seward
    Julian Seward is a compiler writer and Free Software contributor who lives in Cambridge, UK. He is commonly known for creating the bzip2 compression tool, as well as the valgrind memory debugging toolset founded in 2000...

    's original reference implementation available under a BSD license.
  • 7-Zip
    7-Zip
    7-Zip is an open source file archiver. 7-Zip operates with the 7z archive format, but can read and write several other archive formats. The program can be used from a command line interface, graphical user interface, or with Microsoft Windows shell integration. 7-Zip began in 1999 and is actively...

    : written by Igor Pavlov in C++
    C++
    C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

    , the 7-Zip suite contains a bzip2 encoder/decoder which is freely licensed. 7-Zip comes with multi-threading support.
  • micro-bzip2: a version by Rob Landley designed for reduced compiled code size and available under the GNU LGPL.
  • PBZIP2: Parallel pthreads
    POSIX Threads
    POSIX Threads, usually referred to as Pthreads, is a POSIX standard for threads. The standard, POSIX.1c, Threads extensions , defines an API for creating and manipulating threads....

    -based implementation in C++
    C++
    C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

     by Jeff Gilchrist (and Windows version).
  • SelfImage: Nominally a Windows disk imager, SelfImage also handles parallel bzip2 file de/compression
  • bzip2smp: a modification to libbzip2 that has SMP
    Symmetric multiprocessing
    In computing, symmetric multiprocessing involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance. Most common multiprocessor systems today use an SMP architecture...

     parallelisation "hacked in" by Konstantin Isakov.
  • smpbzip2: Another go at parallel bzip2, by Niels Werensteijn.
  • pyflate: a pure-Python
    Python (programming language)
    Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...

     stand-alone bzip2 and DEFLATE
    DEFLATE
    Deflate is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. It was originally defined by Phil Katz for version 2 of his PKZIP archiving tool and was later specified in RFC 1951....

     (gzip
    Gzip
    Gzip is any of several software applications used for file compression and decompression. The term usually refers to the GNU Project's implementation, "gzip" standing for GNU zip. It is based on the DEFLATE algorithm, which is a combination of Lempel-Ziv and Huffman coding...

    ) decoder by Paul Sladen. Probably useful for research and prototyping, made available under the BSD/GPL
    GNU General Public License
    The GNU General Public License is the most widely used free software license, originally written by Richard Stallman for the GNU Project....

    /LGPL
    GNU Lesser General Public License
    The GNU Lesser General Public License or LGPL is a free software license published by the Free Software Foundation . It was designed as a compromise between the strong-copyleft GNU General Public License or GPL and permissive licenses such as the BSD licenses and the MIT License...

    , or any other DFSG
    Debian Free Software Guidelines
    The Debian Free Software Guidelines is a set of guidelines that the Debian Project uses to determine whether a software license is a free software license, which in turn is used to determine whether a piece of software can be included in Debian...

    -compatible license.
  • Arnaud Bouchez's BZ2 Delphi implementation using C library or optimized i386 assembler code
  • lbzip2: Parallel pthreads
    POSIX Threads
    POSIX Threads, usually referred to as Pthreads, is a POSIX standard for threads. The standard, POSIX.1c, Threads extensions , defines an API for creating and manipulating threads....

    -based bzip2/bunzip2 (bzip2 compressor/decompressor) filter for sequential access input/output, by László Érsek.
  • Source game engine uses bzip2 to make custom file downloads smaller and automatically unpackages them.
  • mpibzip2: An MPI
    Message Passing Interface
    Message Passing Interface is a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computers...

    -enabled implementation that compresses/decompresses in parallel, by Jeff Gilchrist, available under a BSD-style license.
  • Apache Commons Compress Apache Commons Compress project contains Java implementations of Bzip2 compressor and decompressor.
  • jbzip2 A Java implementation of bzip2 made available under the MIT license
    MIT License
    The MIT License is a free software license originating at the Massachusetts Institute of Technology . It is a permissive license, meaning that it permits reuse within proprietary software provided all copies of the licensed software include a copy of the MIT License terms...

  • DotNetZip includes a C# implementation of bzip2, derived from the Java implementation in Apache Commons Compress. Includes a multi-threaded .NET bzip2 encoder/decoder library, and a bzip2 command-line tool in managed code.

See also

  • Comparison of archive formats
    Comparison of archive formats
    There are many popular computer data archive formats for creating and maintaining archive files. The tables below compare many popular archive formats.-Purpose:The earliest use of archive formats was for backup, mobility, and archiving....

  • List of archive formats
  • List of file archivers
  • Comparison of file archivers
    Comparison of file archivers
    The following tables compare general and technical information for a number of file archivers. Please see the individual products' articles for further information. They are neither all-inclusive nor are some entries necessarily up to date...

  • List of Unix programs
  • rzip
    Rzip
    The rzip program is huge-scale data compression software designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform and entropy coding on 900 kB output chunks....


External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK