ZPAQ
ZPAQ is a proposed open format for highly compressed data, mainly created by Matt Mahoney. It is designed so that new compression algorithms may be developed without breaking compatibility with older versions of ZPAQ-compliant decompression programs. ZPAQ uses a configurable context mixing algorithm (based on PAQ) and sandboxed post-processing code. ZPAQ uses a streaming data format that supports archives, single file compression, and memory-to-memory compression. Several ZPAQ-compatible compression programs and configurations are available under the GPL license. There is also a public domain API in C++ providing compression and decompression services. The format is believed to be unencumbered by patents.

Overview

Initial work on ZPAQ was begun by Matt Mahoney on Feb. 15, 2009. After a series of incompatible experimental versions, the level 1 standard was finalized on Mar. 12, 2009. The standard consists of a document describing the format and a reference decoder (unzpaq) written in C++ and released under GPL. The standard does not define a compression algorithm; rather, it requires that any ZPAQ level 1 compliant decoder be able to read the output of any ZPAQ level 1 compliant compressor. The standard has a provision for higher levels, which must be able to read the output of lower level compliant compressors but not vice versa. However, no higher levels have yet been defined.

The following software was released under GPL:
  • unzpaq - the reference decoder.
  • zpaq - an archiver and environment for developing new compression algorithms, which are described in configuration files and external preprocessors.
  • zpipe - a simple streaming compressor that reads from standard input and compresses or decompresses to standard output.
  • zp - an archiver with three built-in configurations (fast, mid, max) to trade off speed vs. size.
  • zpaqsfx - a stub for creating self-extracting archives for 32-bit Windows.


In addition, several configurations and preprocessors were published including specialized models for text, BMP images, JPEG images, x86 executable data, and pre/post processors using LZP and BWT compression. Many of these were written by Jan Ondrus.

Data format

ZPAQ uses a streaming data format. A ZPAQ stream consists of a sequence of blocks which can be decompressed independently. Blocks can be reordered, which has the effect of reordering the data they represent. Each block consists of a sequence of one or more segments, which must be decompressed in order from the beginning of the block.

Each block header describes the algorithm needed to decompress its contents and restore the original data. Each segment describes a file, part of a file, or an array of bytes. A segment header contains an optional file name and an optional comment to support archives, followed by the compressed data, an end of data marker (4 zero bytes), and finally an optional SHA-1 checksum of the original data for integrity checking. The data is compressed using an arithmetic coder that never outputs 4 zero bytes in a row, which could be confused with an end of segment marker.
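
Because the coder never emits 4 consecutive zero bytes, a reader can locate the end of a segment's compressed data without decompressing it. A minimal sketch of such a scan in C++ (a hypothetical helper, not part of the ZPAQ tools):

#include <cstddef>
#include <cstdint>

// Scan for the end-of-data marker: 4 consecutive zero bytes.
// Returns the offset just past the marker, or n if no marker is found.
size_t find_end_of_data(const uint8_t* data, size_t n) {
    size_t zeros = 0;
    for (size_t i = 0; i < n; ++i) {
        zeros = (data[i] == 0) ? zeros + 1 : 0;
        if (zeros == 4) return i + 1;  // marker complete at offset i
    }
    return n;
}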

Compression algorithm

ZPAQ uses a context mixing
Context mixing
Context mixing is a type of data compression algorithm in which the next-symbol predictions of two or more statistical models are combined to yield a prediction that is often more accurate than any of the individual predictions. For example, one simple method is to average the probabilities...

 algorithm and an optional pre/post processor. A sandboxed assembly-like language called ZPAQL is used to compute the contexts for the context mixing algorithm and to perform any required post-processing. If post-processing is used, then the ZPAQL post-processing code is compressed prior to compressing the first segment of the block. In this case, the remainder of the block after decompression is considered input to the post-processing program.

A context mixing model compresses data one bit at a time. During compression, the model estimates the probability that the next bit will be a 0 or 1. The prediction is arithmetic coded
Arithmetic coding
Arithmetic coding is a form of variable-length entropy encoding used in lossless data compression. Normally, a string of characters such as the words "hello there" is represented using a fixed number of bits per character, as in the ASCII code...

. During decompression, the model makes an identical sequence of predictions based on previously decoded data. This prediction and the compressed data are used to decode the next bit. The decoded bits are then packed 8 at a time into bytes. An extra bit after each byte is used to mark the end of the data segment. It is predicted with a very low probability.

The context mixing model is described as an array of 1 to 255 components. Each component inputs a context (a function of past data computed by the supplied ZPAQL code) and possibly the predictions of earlier components, and outputs a new prediction. The output of the model is the output of the last component. ZPAQ describes 9 component types.
  • CONST - The prediction is a fixed value.
  • CM (context model) - The context is mapped to a prediction using a table. Then the table entry is updated by adjusting the prediction to reduce the error. The context is also mapped to a count which is used to control the adjustment. The adjustment is inversely proportional to the count, and then the count is incremented up to a user specified limit.
  • ICM (indirect context model) - The context is mapped to a bit history (an 8 bit state), and then the bit history is mapped to a prediction and updated as with a CM. The bit history represents a count of recent zeros and ones and the value of the most recent bit.
  • MIX - The predictions of a slice of the component array are combined by weighted averaging in the logistic domain, i.e. p = squash(∑ wi stretch(pi)), where wi are the mixing weights, pi are the input predictions, stretch(p) = ln(p/(1 - p)), squash(x) = 1/(1 + e^-x) is the inverse of stretch, and p is the output prediction that the next bit will be a 1. The array of weights is selected by a context. On update, the weights are adjusted in proportion to the output prediction error times stretch(pi). (A sketch of this mixing follows the list.)
  • MIX2 - A 2 input MIX from any other 2 components where the weights are constrained to add to 1.
  • AVG - A MIX2 with fixed weights.
  • SSE (secondary symbol estimation) - A prediction and a context are used to look up a new prediction from a table. The input prediction is first stretched and quantized to 32 levels. The output prediction is interpolated from the nearest two entries. On update, the prediction error is reduced as with a CM.
  • ISSE (indirect secondary symbol estimation) - The input is a prediction from another component and a context. The context is used to look up a bit history as with an ICM. The bit history is used to select a pair of weights for a 2 input MIX. The prediction is then combined with a CONST and updated as with a MIX.
  • MATCH - The context is used to look up the previous occurrence of the same context in a rotating history buffer, and the next bit in the buffer is predicted with a strength depending on the length of the match.
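
The logistic domain machinery shared by these components follows directly from the definitions above. A floating point sketch in C++ (the actual format uses the fixed point representations described under "Prediction representation" below):

#include <cmath>

// stretch maps a probability in (0,1) to the logistic domain.
double stretch(double p) { return std::log(p / (1.0 - p)); }

// squash is the inverse of stretch.
double squash(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// MIX combines input predictions by weighted averaging in the
// logistic domain: p = squash(sum of w[i] * stretch(p_in[i])).
double mix(const double* w, const double* p_in, int n) {
    double t = 0.0;
    for (int i = 0; i < n; ++i) t += w[i] * stretch(p_in[i]);
    return squash(t);
}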

Configurations

The zpaq compressor reads a configuration file which describes the compression format. The contents are stored in the block header and may be read and executed by any compliant decompresser. Other compressors such as zpipe and zp use a set of built in compression configurations instead.

Fast configuration

The following is the "fast" configuration, which is also built into the zp archiver (compression level 1). It is designed for higher speed at the expense of compression. Generally, speed depends on the number of components, in this case 2.


(ZPAQ fast configuration)
comp 1 2 0 0 2 (hh hm ph pm n)
0 icm 16 (order 2)
1 isse 19 0 (order 4)
hcomp
*b=a a=0 (save in rotating buffer M)
d=0 hash b-- hash *d=a
d++ b-- hash b-- hash *d=a
halt
post
0
end


Configuration files are not case sensitive. Comments are enclosed in parentheses.

The COMP section describes the components that make up the model. The HCOMP section contains ZPAQL code which computes the contexts for the components. POST 0 indicates that there is no post-processing. The decompressed data is output directly.

The COMP section describes two components, an ICM and an ISSE. The ICM (component number 0) takes as input an order 2 context hash (depending on the last 2 bytes of context plus any previously decoded bits from the current byte) to look up a bit history. The bit history indexes a table to look up the prediction. That prediction is fed to the ISSE (component 1), which also takes an order 4 context as input. Its output prediction goes to the arithmetic decoder. The arguments to the two components (16 and 19) give the sizes of the tables as 2^(16+6) and 2^(19+6) bit histories, respectively. Each bit history uses one byte. Thus, the model uses about 36 MiB of memory.

The HCOMP section describes the ZPAQL code which computes the context hashes for the two components. The ZPAQL machine model consists of:
  • 32-bit registers A, B, C, D, and R0 through R255.
  • One bit flag register F for comparisons and conditional jumps.
  • Array H of 32-bit values, used to store contexts for the component array.
  • Array M of 8-bit values.


The 5 arguments to COMP give the sizes of H and M for the HCOMP and PCOMP code, respectively, and the number of components. In this case, H has 2^1 = 2 elements and M has 2^2 = 4 elements.

The ZPAQL instructions are usually 1 or 2 bytes. Most arithmetic and logic instructions except assignment operate on A, B, C, D, *B, *C, *D or an 8 bit constant and put the result in A. *B and *C refer to elements of M. *D refers to an element of H. The HASH instruction computes A ← (A + *B + 512) * 773. The HCOMP code is called once per byte with the input byte in A and all other registers preserved between calls.

Thus, the code first saves the input byte A in M using B as a pointer, then computes an order 2 context hash and stores it in H[0] (pointed to by *D), then hashes two more of the preceding bytes and stores the order 4 hash in H[1]. These hashes are updated only once per byte, so for each bit prediction they are combined by addition with the bitwise context from the current byte.
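
Rendered in C++, the context computation of this HCOMP program might look like the following sketch (the 4 byte buffer and 2 element H mirror the COMP arguments; ZPAQL register semantics are simplified):

#include <cstdint>

// The ZPAQL HASH instruction: A <- (A + *B + 512) * 773.
uint32_t hash_step(uint32_t a, uint8_t mb) { return (a + mb + 512) * 773; }

// Called once per input byte, like HCOMP. m is the rotating buffer M
// (4 bytes), h receives the contexts, b is the buffer pointer B.
void hcomp(uint8_t m[4], uint32_t h[2], uint32_t& b, uint8_t input) {
    m[b & 3] = input;              // *b=a : save input in rotating buffer
    uint32_t a = 0;                // a=0
    a = hash_step(a, m[b & 3]);    // hash current byte
    --b;                           // b--
    a = hash_step(a, m[b & 3]);    // hash previous byte
    h[0] = a;                      // *d=a : order 2 hash for the ICM
    --b;                           // b--
    a = hash_step(a, m[b & 3]);    // hash
    --b;                           // b--
    a = hash_step(a, m[b & 3]);    // hash
    h[1] = a;                      // *d=a : order 4 hash for the ISSE
    // B ends 3 positions back; modulo the 4 byte buffer this equals
    // advancing by 1, so the buffer rotates one byte per call.
}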

Mid configuration

The mid level configuration is shown below. It is the default model for zpaq, zp, and zpipe. It provides a fairly good level of compression for most data at some cost in speed.


comp 3 3 0 0 8 (hh hm ph pm n)
0 icm 5 (order 0...5 chain)
1 isse 13 0
2 isse $1+17 1
3 isse $1+18 2
4 isse $1+18 3
5 isse $1+19 4
6 match $1+22 $1+24 (order 7)
7 mix 16 0 7 24 255 (order 1)
hcomp
c++ *c=a b=c a=0 (save in rotating buffer M)
d= 1 hash *d=a (orders 1...5 for isse)
b-- d++ hash *d=a
b-- d++ hash *d=a
b-- d++ hash *d=a
b-- d++ hash *d=a
b-- d++ hash b-- hash *d=a (order 7 for match)
d++ a=*c a<<= 8 *d=a (order 1 for mix)
halt
post
0
end


The compression model is similar to the one used in PAQ9A. It uses an order 0 ICM, a chain of order 1 through 5 ISSEs, and an order 7 MATCH. All of these predictions are combined using a MIX with an order 1 context. The notation $1 means that an argument may be passed to the configuration file and added to the values shown. This has the effect of doubling memory usage for each increment, because the argument is used to select the ISSE table sizes and the MATCH index table and buffer sizes. The default uses 111 MB of memory. The arguments to MIX are the table size (2^16), the range of inputs (7 components starting at 0), the learning rate (24), and a bit mask (255) which tells it to include the bits of the current byte in selecting the weight array. The second argument to each ISSE selects the previous component's prediction as input.

Max configuration

The following configuration is the maximum compression level for zp.


comp 5 9 0 0 22 (hh hm ph pm n)
0 const 160
1 icm 5 (orders 0-6)
2 isse 13 1 (sizebits j)
3 isse $1+16 2
4 isse $1+18 3
5 isse $1+19 4
6 isse $1+19 5
7 isse $1+20 6
8 match $1+22 $1+24
9 icm $1+17 (order 0 word)
10 isse $1+19 9 (order 1 word)
11 icm 13 (sparse with gaps 1-3)
12 icm 13
13 icm 13
14 icm 14 (pic)
15 mix 16 0 15 24 255 (mix orders 1 and 0)
16 mix 8 0 16 10 255 (including last mixer)
17 mix2 0 15 16 24 0
18 sse 8 17 32 255 (order 0)
19 mix2 8 17 18 16 255
20 sse 16 19 32 255 (order 1)
21 mix2 0 19 20 16 0
hcomp
c++ *c=a b=c a=0 (save in rotating buffer)
d= 2 hash *d=a b-- (orders 1,2,3,4,5,7)
d++ hash *d=a b--
d++ hash *d=a b--
d++ hash *d=a b--
d++ hash *d=a b--
d++ hash b-- hash *d=a b--
d++ hash *d=a b-- (match, order 8)
d++ a=*c a&~ 32 (lowercase words)
a> 64 if
a< 91 if (if a-z)
d++ hashd d-- (update order 1 word hash)
*d<>a a+=*d a*= 20 *d=a (order 0 word hash)
jmp 9
endif
endif
(else not a letter)
a=*d a== 0 ifnot (move word order 0 to 1)
d++ *d=a d--
endif
*d=0 (clear order 0 word hash)
(end else)
d++
d++ b=c b-- a=0 hash *d=a (sparse 2)
d++ b-- a=0 hash *d=a (sparse 3)
d++ b-- a=0 hash *d=a (sparse 4)
d++ a=b a-= 212 b=a a=0 hash
*d=a b<>a a-= 216 b<>a a=*b a&= 60 hashd (pic)
d++ a=*c a<<= 9 *d=a (mix)
d++
d++
d++ d++
d++ *d=a (sse)
halt
post
0
end


Components 1 through 8 are an ICM-ISSE chain similar to the mid configuration. Components 9 and 10 are order 0 and 1 word level contexts which improve text compression. The corresponding ZPAQL code converts lower case to upper case, then computes a context hash of bytes that fall in the range 65 to 90 (uppercase ASCII). When a non-letter is found, the order 1 hash is updated and the order 0 hash is cleared. These two components form their own ICM-ISSE chain with contexts of increasing order as before.

Components 11 through 13 are sparse order 1 contexts formed from the current byte plus one preceding byte with a gap of 1 to 3 bytes. These models are not chained. Sparse models can improve compression of some binary files.

Component 14 is specialized for CCITT fax images such as the file PIC in the Calgary Corpus. These files encode a bitmapped image with 1728 pixels (216 bytes) per line. It obtains a context hash from the corresponding bytes in the previous two scan lines.

Components 15 and 16 mix all previous predictions using an order 1 and order 0 context respectively. Component 17 mixes their outputs. This prediction is then passed through 2 SSE stages (components 18 and 20) with order 0 and 1 contexts respectively. For each stage, another mixer combines the input and output, which can sometimes improve compression further.

LZP (min) configuration

The following configuration (min.cfg) is part of the zpaq distribution intended for minimal but fast compression. It uses LZP preprocessing followed by a simple order 2 context model. Both the compressor and decompresser maintain a hash table that maps high order context hashes to the last occurrence in a rotating history buffer. If the text that follows the current location matches the text that follows the previous occurrence of the same context hash, then the current text is removed and replaced by a two byte code consisting of an escape character and the length of the match. The effect is to remove duplicate instances of long matches in the input. For example, the string "compression and decompression" might be encoded as "compression and decom,ESC,8" to tell the decompresser to copy the 8 bytes that followed the last instance of the order 3 context "com".

If the ESC byte occurs in the input, then it is encoded as ESC,0. Normally the ESC byte is chosen so that the need for such encodings are rare. Matches are encoded as ESC,1 through ESC,255. Additional parameters to LZP are the sizes of the context hash table and history buffer, the order (length) of the context, and the minimum match length. By default the minimum match length is 3 so that ESC,1 tells the decompresser to copy 3 bytes, and ESC,255 tells the decompresser to copy 257 bytes.
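
The LZP transform itself can be sketched in a few lines of C++. This toy encoder is illustrative only: it is not the actual lzppre preprocessor, and the position-0 sentinel, table size, and hash constant are simplifying assumptions taken from the description below.

#include <cstdint>
#include <vector>

const uint8_t ESC = 127;   // escape byte, as in min.cfg
const int MINLEN = 3;      // minimum match length
const int HBITS = 18;      // log2 of context hash table size

std::vector<uint8_t> lzp_encode(const std::vector<uint8_t>& in) {
    std::vector<uint8_t> out;
    std::vector<uint32_t> table(1u << HBITS, 0); // context hash -> position
    uint32_t h = 0;                              // rolling context hash
    auto consume = [&](size_t i) { h = ((h + in[i]) * 96) & ((1u << HBITS) - 1); };
    size_t i = 0;
    while (i < in.size()) {
        uint32_t prev = table[h];                // last position with this context
        table[h] = (uint32_t)i;
        size_t len = 0;
        if (prev)                                // compare text following both
            while (len < 255u + MINLEN - 1 && i + len < in.size()
                   && in[prev + len] == in[i + len]) ++len;
        if (len >= MINLEN) {                     // match: emit ESC,code
            out.push_back(ESC);
            out.push_back((uint8_t)(len - MINLEN + 1));
            while (len--) consume(i++);          // advance past the match
        } else if (in[i] == ESC) {               // literal ESC: emit ESC,0
            out.push_back(ESC); out.push_back(0); consume(i++);
        } else {                                 // ordinary literal
            out.push_back(in[i]); consume(i++);
        }
    }
    return out;
}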

ZPAQ only defines a decompression format. Thus, a compressor would be responsible for pre-processing the input using the LZP algorithm, compressing it, and appending the post-processing code in ZPAQL to the archive header. When used with the zpaq compressor, the pre-processor is an external program written in C++ which is called by the compressor to create a temporary file. The post-processing code is shown below.


(zpaq 1.07 minimum (fast) compression.
Uses 4 x 2^$1 MB memory. $2 increases minimum match length.
Requires lzppre as an external preprocessor.

(C) 2009, Ocarina Networks Inc. Written by Matt Mahoney.
This software is free under GPL v3.
http://www.gnu.org/copyleft/gpl.html )

comp 3 3 $1+18 $1+20 1 (hh hm PH PM n)
0 cm $1+19 5 (context model size=2^19, limit=5*4)
hcomp
*d<>a a^=*d a<<= 8 *d=a (order 2 context)
halt
pcomp lzppre $1+18 $1+20 127 $2+2 96 ;
(lzppre PH PM ESC MINLEN HMUL)
(If you change these values, then change them in the code too)

(The sequence ESC 0 codes for ESC. The sequence ESC LEN
codes for a match of length LEN+MINLEN at the last place
in the output buffer M (size 2^PM) that had the same context
hash in the low PH bits of D. D indexes hash table H
which points into buffer M, which contains B bytes.
When called, A contains the byte to be decoded and F=true
if the last byte was ESC. The rolling context hash D is
updated by D=D*HMUL+M[B])

if (last byte was ESC then copy from match)
a> 0 jf 50 (goto output esc)
a+= $2+2 (MINLEN)
r=a 0 (save length in R0)
c=*d (c points to match)
do (find match and output it)
*d=b (update index with last output byte)
a=*c *b=a b++ c++ out (copy and output matching byte)
d<>a a*= 96 (HMUL)
a+=d d<>a (update context hash)
a=r 0 a-- r=a 0 (decrement length)
a> 0 while (repeat until length is 0)
halt
endif

(otherwise, set F for ESC)
a== 127 (ESC) if
halt
endif

(reset state at EOF)
a> 255 if
b=0 c=0 a= 1 a<<= $1+18 d=a
do
d-- *d=0 a=d (clear index)
a> 0 while
halt (F=0)

(goto here: output esc)
a= 127 (ESC)
endif
*d=b (update index)
*b=a b++ out (update buffer and output)
d<>a a*= 96 (HMUL)
a+=d d<>a (update context hash)
halt
end


The COMP and HCOMP sections describe a simple context model with one component. Other low-order models could be substituted here for better compression at the cost of speed. The purpose of the LZP preprocessor is to compress and remove high order statistics so that a fast, low memory, low order model can be used.

The PCOMP section describes the post-processing code. The external program "lzppre" takes the 5 arguments shown in addition to the names of the input and output (temporary) files. These arguments are the sizes of the context hash table and history buffer (as powers of 2, respectively 2^18 and 2^20), the value of the ESC byte (127), the match length offset (2), and the hash multiplier (HMUL = 96). These values must match the corresponding values in the ZPAQL post-processor for correct decoding. The match length offset is added to the encoded length (1 to 255) to represent matches of length 3 to 257. The compressor arguments $1 and $2 can be passed by the user to control memory allocation and the minimum match length respectively. These values are passed to both the preprocessor and the embedded post-processing code.

The hash function is HASH ← (HASH + A) * HMUL (mod 2^18, the context hash table size), where A is the next input byte. If HMUL = 96 = 3 * 2^5, then the high 5 bits of the hash are shifted out for each byte. Thus HASH depends on only the last 4 bytes of input, effectively computing an order 4 context hash.

In the ZPAQL code, M is used as a history buffer and H is used as the context hash table. The flag bit F is used to encode whether the last input byte was an ESC. The C register is used as a pointer into M to where the next byte will be stored. D is the current context hash, which points into H. The OUT instruction outputs the contents of A to the decompressed output file. On input to the PCOMP program, A is the decoded byte from the decompressor (prior to LZP decoding), or 2^32 - 1 at the end of a segment. The program is called once for each input byte.

Burrows-Wheeler transform configuration

The configuration bwt_j3 (code not shown) by Jan Ondrus pre-processes the input with a Burrows-Wheeler transform (BWT). The input is sorted by context, which removes any high order statistics. The post-processor performs an inverse transform. The transformed data is compressed using a fast adapting order 0-1-2-4 ICM-ISSE chain followed by an order 0 MIX and an order 0 SSE adaptively bypassed by a MIX2. The model is based on the BBB compressor (http://mattmahoney.net/dc/#bbb).

BWT normally divides the input into blocks and uses 5 times the block size in memory to compress and decompress. A second configuration, bwt_slowmodel, uses a slower but more memory efficient algorithm modeled on a similar algorithm used in BBB. Blocks are compressed using 1.25 times the block size by sorting in smaller blocks to temporary files and then merging. The inverse transform, instead of building and traversing a linked list, uses a smaller index to find the approximate location of the next decoded byte, followed by a final linear search. By conserving memory, larger blocks can be used, which improves compression.

JPEG configuration

The configuration jpg_test2 by Jan Ondrus compresses JPEG files. JPEG is already compressed, but it is possible to compress it further by encoding the Huffman coded coefficients using a better model. jpg_test2 uses a preprocessor to expand the packed Huffman codes to whole bytes or byte pairs. Then it uses a 32 component model to re-compress the codes. The model consists of a CONST and 22 ICM and slow adapting CM components, followed by a hierarchy of 5 mixers and a 2 stage SSE chain with partial adaptive (MIX2) bypass.

BMP configuration

The configuration bmp_j4 by Jan Ondrus compresses .BMP bitmapped image files. The model is based on a similar model in paq8px_v64, which is slow but has a good compression ratio. It first uses a preprocessor to implement a color transform: (R, G, B) -> (G, R-G, B-G) on the image data following the 54 byte header. The context model reads the header to get the image width and then computes a slow adapting context mixing model with 33 components using various combinations of neighboring pixels as context. Except for the contexts, the model is similar to jpg_test2, with one CONST, 17 slow adapting CMs, 6 ICMs, and a hierarchy of 5 mixers and a 2 stage SSE chain with partial adaptive bypass.

x86 configuration

The configuration exe_j1 by Jan Ondrus is optimized to compress 32 bit x86 executable code (.exe and .dll files). It uses a preprocessor to implement an E8E9 transform. The transform converts relative CALL and JUMP (opcode 0xE8 and 0xE9) addresses to absolute addresses. Because programs often make several calls or jumps to the same target, converting to absolute addresses can increase redundancy. The transformed data is then compressed using the MAX configuration described above.

Prediction representation

All component predictions are represented in the logistic domain as 12 bit fixed point numbers with 6 bits after the decimal point, i.e. in the range -32 to +32 with 1/64 resolution. When a prediction is calculated outside of this range, it is clamped to the nearest representable value.

When a prediction is converted to the linear domain for error estimation or arithmetic coding, it is represented as a 15 bit unsigned integer, an odd multiple of 2^-16 in the range 2^-16 to 1 - 2^-16.

Context calculation

A ZPAQ model predicts one bit at a time. Each of the components therefore requires an updated context for each bit prediction and update. As a speed optimization, contexts are only updated once per byte by executing the ZPAQL code in the HCOMP section of the block header. This requires that the contexts be further updated after each bit. That is done using a fixed algorithm that depends on the type of component.

Direct context mapping

For a MIX, MIX2, or SSE, the bitwise context, which depends on the last 0 to 7 bits since the last byte boundary, is added to the bytewise context computed by the HCOMP program. The addition is modulo the size of the component. Since the component size is always a power of 2, this is equivalent to dropping the high order bits of the sum. The bitwise context is a 1 bit followed by the 0 to 7 bits since the last byte boundary. Thus, it ranges from 1 to 255.

A low order context is usually computed such that the low 8 bits are 0 to leave room for adding the bitwise context. A higher order context is usually a hash using all available bits.

Cache aligned mapping

For a CM, the current byte is divided into two 4-bit nibbles with a leading 1 bit added to each one, resulting in a 9 bit value as shown in the table below. Then the bitwise context is exclusive-ored with the bytewise context (usually a hash). The purpose is to preserve cache locality for better speed. Normally a CM is a table of 4 byte entries (a 22 bit prediction and a 10 bit count packed into a 32 bit integer) aligned on a 64 byte boundary. On many processors, a cache line is 64 bytes. Combining the bitwise contexts this way guarantees that 4 consecutive table lookups address the same cache line, resulting in at most 2 cache misses per byte. This method does not help with a MIX, MIX2, or SSE because the combined context normally indexes array elements that are too large to pack into cache lines.
Bits  Pattern   Direct     Cache aligned
0     (empty)   00000001   000000001
1     x         0000001x   00000001x
2     xx        000001xx   0000001xx
3     xxx       00001xxx   000001xxx
4     xxxx      0001xxxx   1xxxx0001
5     xxxxx     001xxxxx   1xxxx001x
6     xxxxxx    01xxxxxx   1xxxx01xx
7     xxxxxxx   1xxxxxxx   1xxxx1xxx


A low order context is usually set directly such that the low 9 bits are 0. A high order context is usually a hash.
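
Both mappings in the table can be computed explicitly. A C++ sketch, where c0 is the direct bitwise context (1 to 255) and the layouts follow the table above:

#include <cstdint>

// Direct form: add the bitwise context to the bytewise context,
// modulo the component size (a power of 2).
uint32_t direct_index(uint32_t bytewise, uint32_t c0, int sizebits) {
    return (bytewise + c0) & ((1u << sizebits) - 1);
}

// Cache aligned form used by CM: after a nibble completes, its 5 bit
// code (1xxxx) moves to the high bits and a new nibble starts low.
uint32_t cm_bitwise(uint32_t c0) {
    if (c0 < 16) return c0;               // 0 to 3 bits seen: same as direct
    int k = 0;                            // number of bits seen (4 to 7)
    for (uint32_t t = c0; t > 1; t >>= 1) ++k;
    uint32_t first = c0 >> (k - 4);                               // 1xxxx
    uint32_t rest = (c0 & ((1u << (k - 4)) - 1)) | (1u << (k - 4));
    return first * 16 + rest;             // e.g. 7 bits seen: 1xxxx1xxx
}

// The CM index is then (bytewise hash) XOR cm_bitwise(c0), modulo size.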

Bit history hash tables

For an ICM or ISSE, a context addresses a bit history which occupies one byte in a hash table. The context is normally a hash which is updated on nibble boundaries. This hash indexes a 16 byte array containing 15 bit histories and a 1 byte checksum used to detect hash collisions. The hash table is looked up every 4 bits. Then subsequent bitwise contexts are looked up within the 15 byte array of bit histories. This secondary index is calculated by appending a 1 to the 0 to 3 bits following the nibble boundary, resulting in an index in the range 1 to 15. The primary hash index, computed every 4 bits, is equal to (bytewise context) + 16*(direct bitwise context).

To preserve cache locality, the hash table can be aligned on a 64 byte boundary. A 32-bit hash index is computed by the HCOMP code. The low bits (depending on size) are used to index the hash table, and the next higher 8 bits are used as a checksum confirmation. A table element is found by comparing the checksum at 3 adjacent locations and picking the first match. The 3 indexes are computed by exclusive-or'ing the hash with 0, 1, and 2. This ensures that the three lookups stay within a cache line.
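
A sketch of this lookup in C++ (the row layout of 1 checksum byte plus 15 bit histories follows the description above; the replacement priority is simplified, since decoding an 8 bit state into its n0 + n1 count needs a table not shown here):

#include <cstdint>
#include <cstring>

struct Row { uint8_t check; uint8_t hist[15]; }; // 16 bytes per context

// Stand-in for the n0 + n1 count of an encoded bit history state.
static int priority(uint8_t state) { return state; }

// Find or allocate the row for a 32 bit context hash.
// bits is log2 of the table size in rows; table is 64 byte aligned.
Row* lookup(Row* table, int bits, uint32_t hash) {
    uint32_t index = hash & ((1u << bits) - 1);
    uint8_t check = (uint8_t)(hash >> bits);   // next higher 8 bits
    for (uint32_t i = 0; i < 3; ++i)           // probe 3 slots in one line
        if (table[index ^ i].check == check) return &table[index ^ i];
    Row* victim = &table[index];               // replace least used slot
    for (uint32_t i = 1; i < 3; ++i)
        if (priority(table[index ^ i].hist[0]) < priority(victim->hist[0]))
            victim = &table[index ^ i];
    std::memset(victim, 0, sizeof(Row));
    victim->check = check;
    return victim;
}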

If no checksum match is found, then the element with the lowest bit history count (n0 + n1) in the first (nibble aligned) context is replaced. This implements a least frequently used cache replacement policy.

Direct context model (CM)

A context model is a table that maps a context to a 22 bit linear prediction p (range 0 to 1 with precision 2^-22) and a 10 bit count n (range 0 to 1023). Initially, p = 1/2, n = 0. On update with bit y (0 or 1), the prediction error y - p is calculated, and the state is updated:
  • p ← p + error/(n + 1.5)
  • n ← min(n + 1, limit)


where the limit is specified by the user. Usually a higher limit compresses better for stationary sources, and a lower limit is better for fast adaptation to changing statistics.
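
In floating point, the CM update rule is only a few lines. A sketch (the real table packs p into 22 bits and n into 10 bits of a single 32 bit integer):

#include <algorithm>

struct CMEntry { double p = 0.5; int n = 0; }; // initial state

// Update a table entry with the observed bit y (0 or 1).
void cm_update(CMEntry& e, int y, int limit) {
    double error = y - e.p;          // prediction error
    e.p += error / (e.n + 1.5);      // adjustment shrinks as n grows
    e.n = std::min(e.n + 1, limit);  // count saturates at the user limit
}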

Bit history (ICM, ISSE)

An ICM or ISSE maps a context to a bit history through a hash table as described above. In an ICM, the bit history is mapped to a prediction using a CM with a maximum limit of 1023. In an ISSE, the bit history is used to select a weight table for a 2 input MIX. The MIX takes a prediction from another component and mixes it with a CONST prediction of 1 in the logistic domain. The effect in either case is to model both stationary and nonstationary sources effectively.

For example, suppose we observe a certain context 8 times and record the sequence 00000001. Now we observe the context again and wish to predict the next bit. If the source were stationary, we would assume that P(1) = 1/8. If it were nonstationary, then we would give greater weight to recent events and estimate P(1) > 1/8. If the source were random data, then P(1) = 1/2. An indirect model answers the question adaptively by observing what happened when the same bit history occurred in other contexts, counting how many times a 0 or 1 was observed in those cases.

A bit history is a tuple (n0, n1, LB) representing a count of 0 and 1 bits and the value of the last bit, respectively. Thus, the sequence 00000001 would be represented as (7, 1, 1). To fit a bit history into a single byte (256 possible values), the values are restricted as indicated in the table below. The restrictions are symmetric with respect to n0 and n1. Furthermore, LB is represented only when 1 ≤ n0 + n1 ≤ 17.
n0      n1
0       0..20
1       0..48
2       0..15
3       0..8
4       0..6
5       0..5
6       0..4
7..8    0..3
9..15   0..2
16..20  0..1
21..48  1


The update rule is to set LB to the most recent bit, increment the appropriate count, and if the count exceeds the bounds above, go to the nearest state that approximately preserves the same ratio of n0 to n1. For example, a 1 bit in state (5, 5, 0) would go to (5, 6, 1), but since this state would exceed the bounds, the next state is rounded to (4, 5, 1) instead.

Furthermore, when one of the counts is large and the opposite bit is observed, a portion of the larger count is first discarded. The rule is that a count of 6 or 7 is decremented, and larger counts are reduced to 7. For example, if state (20, 0, 0) is updated by a sequence of 1 bits, then the following states would be (7, 1, 1), (6, 2, 1), (5, 3, 1), (5, 4, 1), (5, 5, 1), (4, 5, 1). The effect is to better model sources that are likely to be nonstationary.

An ICM or ISSE effectively models a wide range of sources, but for stationary sources a CM with a high limit still gives the best results. An ISSE is usually used in a chain of contexts of increasing order, starting with a low order CM or ICM. An ISSE learns to adjust its input prediction, i.e. it learns the weights w1 and w2 such that for input prediction p (in the logistic domain), w1 + w2p gives the best possible prediction.

Mixing (AVG, MIX2, MIX)

An AVG averages two predictions in the logistic domain: p = w1p1 + w2p2, where the weights are selected by the user and constrained to add to 1.

A MIX2 allows the weights to adjust, but they are still constrained to add to 1. A weight is represented as an unsigned 16 bit fixed point integer in the range 0 to 1. The initial weights are 1/2.

A MIX allows any number of inputs and does not constrain the weights to add to 1. A weight is represented as a signed 20 bit fixed point number with 16 bits after the decimal, i.e. a range of -8 to +8 with precision 2^-16. Initially the weights are equal and add to 1. The update for each weight is:
  • wi ← min(8, max(-8, wi + ηpi(y - squash(p))))


where η is the user specified learning rate, p is the output prediction, y is the actual value of the predicted bit, and pi is the input prediction in the logistic domain. Effectively, a MIX is a simple 2-layer neural network such that the weights converge by gradient descent to minimize coding cost.
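
The update rule translates directly to code. A floating point sketch (the real weights are 20 bit fixed point values clamped to ±8):

#include <algorithm>

// w: weights selected by context; st: stretched input predictions;
// y: the actual bit; p: the mixer's output probability; rate: learning rate.
void mix_update(double* w, const double* st, int n, int y, double p,
                double rate) {
    double err = rate * (y - p);     // output prediction error
    for (int i = 0; i < n; ++i)
        w[i] = std::min(8.0, std::max(-8.0, w[i] + err * st[i]));
}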

SSE

An SSE inputs a prediction and a context (like an ISSE) and outputs a new prediction. The stretched (logistic domain) prediction is quantized to one of 32 levels in the range -16 to 16 in steps of 1/2 to select two adjacent entries from a CM-like table (a 22 bit prediction and 10 bit count). The prediction is interpolated and only the closer of the two table entries is updated as with a CM. The limit is selectable as with a CM.
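
A floating point sketch of the lookup and interpolation (simplified: the table row is assumed to hold 33 entries so each level has two adjacent neighbors, and the quantization step is taken as 1 over the stated -16 to 16 range):

#include <algorithm>
#include <cmath>

// t: the table row selected by the context (assumed 33 entries).
// p_in: input prediction; *nearer receives the index of the entry
// that should be updated afterward, as with a CM.
double sse_predict(const double* t, double p_in, int* nearer) {
    double x = std::log(p_in / (1.0 - p_in));   // stretch the input
    x = std::max(-16.0, std::min(16.0, x));     // clamp to table range
    double pos = x + 16.0;                      // position across the levels
    int i = std::min((int)pos, 31);             // lower of the two entries
    double f = pos - i;                         // interpolation weight
    *nearer = (f < 0.5) ? i : i + 1;            // entry to update
    return t[i] * (1.0 - f) + t[i + 1] * f;     // interpolated prediction
}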

MATCH

A MATCH model predicts a bit by looking up the last point in a history buffer with the same context hash and predicting the next bit that follows. Once a match is found, the length of the match (in bytes) is maintained until a bit is predicted incorrectly. Once a prediction fails, remaining predictions in the current byte are 1/2 (0 in the logistic domain). At the next byte boundary, a new match location is found by computing and looking up the context hash, and the match length is determined by comparing backward from the current and looked-up positions in the history buffer. The context hash table and history buffer are updated after every byte whether or not a lookup occurs.

If the match length is 0, then the prediction is 1/2. Otherwise the next bit is predicted with probability 1 - 1/(16L), where L is the match length in bytes; L cannot exceed 255.
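
The MATCH prediction rule can be sketched directly from this description (a floating point illustration; predicted_bit is the bit following the match in the history buffer):

#include <algorithm>

double match_predict(int predicted_bit, int L) {
    if (L == 0) return 0.5;                    // no match: no information
    double strength = 1.0 - 1.0 / (16.0 * std::min(L, 255));
    return predicted_bit ? strength : 1.0 - strength;
}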

Arithmetic coding

ZPAQ specifies only the arithmetic decoder, although the encoder has a similar design. The decoder state is a pair of 32 bit unsigned integers representing a range (LOW, HIGH), initially (1, 2^32 - 1), and a current value CURR, initialized to the first 4 bytes of compressed data. The state is constrained at all times such that LOW ≤ CURR ≤ HIGH and 0 < LOW < HIGH, except at the end of input when CURR = 0 < LOW. The decoding (and encoding) algorithm is designed so that a sequence of 4 zero bytes never appears in the compressed data. This allows a sequence of 4 zeros to mark the end of the compressed data so that streams of compressed and uncompressed data can be freely mixed and the end of the compressed data can be found quickly without decompressing.

The decoding algorithm is as follows. The input is a prediction P, 0 ≤ P < 1, as a 16 bit fixed point integer. The return value is the decoded bit Y, either 0 or 1. All arithmetic is modulo 2^32. "input" reads one byte.


decode(P)
  MID ← LOW + floor((HIGH - LOW) * P)
  If CURR ≤ MID then Y ← 1, HIGH ← MID
  else Y ← 0, LOW ← MID + 1.
  While (LOW >> 24) = (HIGH >> 24) do
    LOW ← LOW * 256
    If LOW = 0 then LOW ← 1
    HIGH ← HIGH * 256 + 255
    CURR ← CURR * 256 + input
  Return Y.


The corresponding encoding algorithm is as follows. It takes P as before and a bit Y to be encoded. LOW and HIGH are initialized as in decode. output writes one byte.


encode(P, Y)
  MID ← LOW + floor((HIGH - LOW) * P)
  If Y = 1 then HIGH ← MID
  else LOW ← MID + 1.
  While (LOW >> 24) = (HIGH >> 24) do
    output(LOW >> 24)
    LOW ← LOW * 256
    If LOW = 0 then LOW ← 1
    HIGH ← HIGH * 256 + 255
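
Both routines transcribe almost line for line into C++. A sketch, with I/O abstracted as get/put callbacks and CURR assumed to be initialized from the first 4 compressed bytes as described above:

#include <cstdint>

struct Coder { uint32_t low = 1, high = 0xFFFFFFFFu, curr = 0; };

// P(1) is given as a 16 bit fixed point value p16 (0 <= p16 < 65536).
int decode(Coder& c, uint32_t p16, uint8_t (*get)()) {
    uint32_t mid = c.low + (uint32_t)(((uint64_t)(c.high - c.low) * p16) >> 16);
    int y;
    if (c.curr <= mid) { y = 1; c.high = mid; }
    else { y = 0; c.low = mid + 1; }
    while ((c.low >> 24) == (c.high >> 24)) {  // shift out the equal byte
        c.low <<= 8;
        if (c.low == 0) c.low = 1;
        c.high = (c.high << 8) | 255;
        c.curr = (c.curr << 8) | get();
    }
    return y;
}

void encode(Coder& c, uint32_t p16, int y, void (*put)(uint8_t)) {
    uint32_t mid = c.low + (uint32_t)(((uint64_t)(c.high - c.low) * p16) >> 16);
    if (y) c.high = mid; else c.low = mid + 1;
    while ((c.low >> 24) == (c.high >> 24)) {
        put((uint8_t)(c.low >> 24));
        c.low <<= 8;
        if (c.low == 0) c.low = 1;
        c.high = (c.high << 8) | 255;
    }
}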


Predictions are always encoded with P > 0. After each group of 8 bit predictions, an EOS (end of segment) bit is coded with P = 0. The EOS bit is 1 after the last byte is encoded. In the encoder, this has the effect of setting HIGH ← LOW and flushing 4 bytes to output. After this, 4 zero bytes are written as the end of data marker. Decoding the EOS bit causes those 4 zero bytes to be read into CURR, which would be an invalid state if decoding were to continue.

ZPAQL

A ZPAQ decoder has up to two virtual machines called HCOMP and PCOMP. HCOMP is called once for each byte of decoded data and is used to compute the bytewise contexts for the decompression model. PCOMP, which is optional, is called once for each byte of decoded data and once at end of segment (EOS). If present, its output (via the OUT instruction) is the decompressed data. Otherwise, the decoded bytes are output directly.

A ZPAQL virtual machine has the following state:
  • 32 bit registers A (accumulator), B, C, D (index registers), and R0 through R255,
  • 1 bit register F (comparison result),
  • Array H of 32 bit unsigned values,
  • Array M of 8 bit unsigned values,
  • 16 bit program counter PC.


In HCOMP, the first n elements of H are the contexts for the first n components described in the COMP section. In PCOMP, H has no special meaning. The sizes of H and M are given in the block header. The sizes are powers of 2.

The A register is used as an accumulator. All binary arithmetic and logic operations except assignment take A and one other register or 8 bit immediate value and put the result in A. The other operand may be A, B, C, D, *B, *C, *D, or a constant in the range 0 to 255. Assignment and unary operators may update any of these registers. *B and *C mean the element of M indexed by the low bits of B or C (depending on the size of M). *D means the element of H indexed by the low bits of D.

The binary operators are = += -= *= /= %= &= &~ ^= |= <<= >>=. These have mostly the same meanings as in C. For example, "A+=B" means add B to A. The operator &~ means &=~, as in "A&~B" means "A &= ~B". Division (/=) and mod (%=) by 0 are defined to give 0. Shifts (>>=, <<=) are explicitly defined to use only the low 5 bits of the right operand, so that "A<<= 33" is the same as "A<<= 1".

Unary operators and assignment may put the result in A, B, C, D, *B, *C, or *D. The operators are:
  • B++ means add 1 to B.
  • B-- means subtract 1 from B.
  • B! means B = ~B (complement the bits of B).
  • B=0 is a 1 byte code that clears B.
  • B<>A means swap B with A. The right operand must be A and the left operand cannot be A. If the operands are different sizes then only the low bits are updated. Thus, *B<>A swaps the low 8 bits of A.


Comparison operators are ==, <, >. Comparison is unsigned. The left operand must be A as with other binary operators. The result is put in F as 1 if true or 0 if false. Other instructions are as follows. n is a one byte unsigned operand (0..255), or a signed operand (-128..127) in the case of jump instructions.
  • JT n (jump if F true: PC ← PC + n)
  • JF n (jump if F false: PC ← PC + n)
  • JMP n (jump: PC ← PC + n)
  • LJ n m (long jump: PC ← n + 256*m)
  • A=R n (assign register Rn (R0..R255) to A, B, C, or D)
  • R=A n (assign A to register R0..R255).
  • HASH (A ← (A + *B + 512) * 773)
  • HASHD (*D ← (*D + A + 512) * 773)
  • OUT (output A)
  • HALT (return)
  • ERROR (invalid opcode 0)


Most opcodes are one byte. Opcodes that take a numeric argument are 2 bytes except for the LJ instruction, which is 3 bytes.

The JT, JF, and JMP instruction operands are relative to the next instruction. Thus, JMP 0 simply advances to the next instruction and JMP -2 is an infinite loop. Offsets are in bytes, not instructions.

The OUT instruction outputs the low 8 bits of A to the decompressed data stream in PCOMP. In HCOMP it has no effect.

HCOMP and PCOMP are each called once per encoded or decoded byte with that input byte in the A register and PC = 0 (first instruction). PCOMP is called once more at end of segment with A = 2^32 - 1. All other machine state is preserved between calls. The initial state before the first call has all values set to 0. It is an error to execute an undefined opcode or ERROR, or for PC to go outside the range of the program.

In a ZPAQ block, HCOMP is stored uncompressed. PCOMP, if present, is compressed and prepended to the first segment. If it is absent, a single 0 byte is compressed to indicate so; otherwise a 1 byte, a 2 byte length, and the PCOMP code are compressed. The sizes of H and M in both models are stored uncompressed in the block header so that archivers can determine memory requirements without decompressing.
Implementations
The following programs are ZPAQ level 1 compliant.
  • unzpaq is the reference decoder. It will decompress or list the contents of an archive but not compress. It will check the SHA-1 checksum if present, and print a warning if it computes a different value. If filenames are stored and not overridden by the user, then unzpaq will create the named files. Otherwise the user must supply the output filename. Unnamed segments, if not overridden, are concatenated to the last named file. The ZPAQL code is interpreted.

  • zpaq has all the functionality of unzpaq. In addition, it will compress files according to a supplied configuration file. If the configuration uses post-processing, then the user must supply an external preprocessor. It includes a debugging environment which allows a user to run the HCOMP or PCOMP section of a configuration file separately, single step a program (showing register contents), or bypass the post-processor during decompression. ZPAQL code can either be interpreted or translated to C++ and, if a C++ compiler is installed, compiled and run; compiled code is about twice as fast as interpreted code for either compression or decompression. zpaq allows the user to selectively include filenames (with or without paths), comments (file size), and checksums in the archive. The compressor output is a single block; to make archives with multiple blocks it must be run multiple times. The program will also create self-extracting archives.

  • zpipe compresses or decompresses from standard input to standard output. It compresses using the mid model, although it can read any model like any compliant decompresser. If file names are present in the archive then they are ignored and all of the output is concatenated.

  • zp compresses in 3 different configurations (fast, mid, max) using compiled ZPAQL models. Thus, compression runs as fast as with zpaq but no C++ compiler is required. If an archive was compressed with any of these configurations, then zp will recognize it and use the compiled models to decompress it. Otherwise it decompresses using a slower interpreted model.

  • zpaqsfx is a stub for creating self-extracting archives. If any compressor appends its output to a copy of the stub, then executing the stub will list the contents of the compressed data and instruct the user how to run the program again to decompress them. The stub interprets the ZPAQL code. Any compliant decompresser can also decompress the data without running the stub. The stub uses a 13 byte string (a locator tag) to identify the end of its own code and the start of the compressed data. Compliant decompressers recognize locator tags, which allows compressed data to be mixed with other types of data without causing read errors, since any unrecognized data is skipped.


Only zpaq reads configuration files. Newer versions support if-else-endif and do-while statements, which the program translates into jump instructions. It supports up to 9 numeric arguments which can be added to values in the code. Code may contain comments (in parentheses), is free form, and is not case sensitive. To make opcode lengths explicit, a 1 byte instruction such as "A=B" is written without spaces, while the numeric operand of a 2 byte instruction such as "A= 123" is separated by a space to emphasize that 123 is the second byte.

zpaq, zp, and zpipe are implemented using libzpaq, a public domain library API written in C++. The library supplies functions to read and write ZPAQ byte streams, which can be files or data structures in memory.