ZIP (file format)



 
 
The ZIP file format is a data compression and archive format. A ZIP file contains one or more files that have been compressed to reduce file size, or stored as-is. The process of compressing the document(s) into a zip file is also known as deep-packing. The ZIP file format permits a number of compression algorithms, but as of 2009, only Deflate is widely used and supported.
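The difference between a compressed entry and one stored as-is can be seen with a short sketch. It assumes Python's standard zipfile module; the member names are made up for illustration.

```python
import io
import zipfile

# Build a small archive in memory; the member names are invented for this example.
buffer = io.BytesIO()
text = b"hello ZIP " * 1000
with zipfile.ZipFile(buffer, "w") as archive:
    # Compressed with Deflate, the method most tools support.
    archive.writestr("notes.txt", text, compress_type=zipfile.ZIP_DEFLATED)
    # Stored as-is, with no compression at all.
    archive.writestr("copy.txt", text, compress_type=zipfile.ZIP_STORED)

# Read the archive back and compare original and in-archive sizes.
with zipfile.ZipFile(buffer) as archive:
    sizes = {info.filename: (info.file_size, info.compress_size) for info in archive.infolist()}

for name, (original, packed) in sizes.items():
    print(f"{name}: {original} bytes -> {packed} bytes in the archive")
```

The stored entry occupies exactly as many bytes as the original data, while the Deflate entry is much smaller for this repetitive input.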

The format was originally created in 1989 by Phil Katz for PKZIP, and evolved from the previous ARC compression format by Thom Henderson. Many software utilities other than PKZIP itself are now available to create, modify, or open (unzip, decompress) ZIP files, notably WinZip, BOMArchiveHelper, StuffIt, KGB Archiver, PicoZip, Info-ZIP, WinRAR, IZArc, 7-Zip, ALZip, TUGZip, PeaZip, ZipGenius, and Universal Extractor. Microsoft has included built-in ZIP support (under the name "compressed folders") in versions of its Windows operating system since 1998. Apple has included built-in ZIP support in Mac OS X 10.3 and Mac OS X 10.4 via the BOMArchiveHelper utility, now called Archive Utility in Mac OS X 10.5. The zip, zipcloak, zipnote, and zipsplit tools are widely used on Unix-like systems.

ZIP files generally use the file extensions ".zip" or ".ZIP" and the MIME media type application/zip. Some software uses the ZIP file format as a wrapper for a large number of small items in a specific structure; when this is done, a different file extension is usually used. Examples of this usage are Java JAR files, Python .egg files, Silverlight .xap files, id Software .pk3/.pk4 files, package files for StepMania and Winamp/Windows Media Player skins, XPInstall, as well as the OpenDocument and Office Open XML office formats. Both OpenDocument and Office Open XML formats use ZIP-based packages internally, so files can be easily uncompressed and compressed using tools for ZIP files. Google Earth makes use of KMZ files, which are just KML files in ZIP format. Mozilla Firefox add-ons are ZIP files with the extension "xpi". Nokia and Sony Ericsson mobile phone themes are zipped with the extensions "nth" and "thm", respectively.
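Because such containers are ordinary ZIP archives under another extension, a generic ZIP library can usually open them. The following is a minimal sketch using Python's standard zipfile module; the in-memory container and its single member are hypothetical stand-ins for a real .kmz or .jar file.

```python
import io
import zipfile

# Build a tiny stand-in for a ZIP-based container such as a .kmz file.
# The member name and content below are hypothetical.
container = io.BytesIO()
with zipfile.ZipFile(container, "w") as archive:
    archive.writestr("doc.kml", "<kml></kml>")

# The check inspects the bytes of the file, not its extension.
looks_like_zip = zipfile.is_zipfile(container)

with zipfile.ZipFile(container) as archive:
    members = archive.namelist()

print(looks_like_zip, members)
```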

History


Early history


During the mid-1980s, System Enhancement Associates, a small company run by Thom Henderson, created a file archiving format called ARC, and a corresponding archiver (also called ARC) that could compress and decompress files into this format. This program was released as shareware for a number of platforms, with the source code included. The file format quickly became a de facto standard. Phil Katz released a file-compatible software package on the IBM PC DOS platform, known as PKXARC. It used hand-optimized 8088 assembly language and was considerably faster than SEA's original cross-platform implementation in C.

The competition from Katz did not please SEA, which sued Katz for trademark and copyright infringement, alleging that Katz had plagiarized sections of the code. Katz lost the lawsuit and was forced to pay $62,500 to SEA to cover its legal fees. It was found during the court case that Katz had used SEA's ARC source code for the majority of the application and had made only code optimizations to increase speed. Primarily, he changed the word length used by the algorithm from 12 bits to 13 bits, resulting in higher compression for typical binary files. As a result of the lawsuit, Katz changed the names of his utilities to PKPAK and PKUNPAK.

Katz then went on to create his own file format, which is known worldwide now as the ZIP format (commonly called a "ZIP file"). The ZIP format was more resistant to data loss than the ARC format because of redundant catalog storage. It also was more flexible than ARC, providing room for additional optional compression algorithms and future expansion. Along with the new format, PKZIP included at least one compression algorithm more efficient than any supported by ARC. Once PKZIP was released, many users abandoned ARC because of its slower speed and less effective compression, and because SEA alienated many by seeming to suddenly assert proprietary legal rights over the ARC file format after it had become widely used among the on-line community (similar in this respect to the later GIF patents controversy).

Katz publicly released technical documentation on the ZIP file format, making it an open format, along with the first version of his PKZIP archiver, in January 1989. Originally only bundled with registered versions of PKZIP, the APPNOTE.TXT documentation file, titled .ZIP File Format Specification, was later available on the PKWARE site.

The name zip (meaning speed) was suggested by Katz's friend, Robert Mahoney. They wanted to imply that their product would be faster than ARC and other compression formats of the time.

Beyond the command line


In the mid-1990s, as more new computers included graphical user interfaces, fewer users were comfortable with the command-line operation of PKZIP. Seeing an opportunity, shareware authors began pitching compression and archival programs with graphical user interfaces, many of them using the ZIP format; WinZip was among the most popular. PKWARE also offered a graphical version of PKZIP. These programs were easier to learn than the older command-line equivalents, but users still had to learn a specialized tool with its own interface for file archival and compression.

In the late 1990s, various file managers started integrating support for the ZIP format into their user interfaces. Even earlier, Norton Commander and its clones like Volkov Commander in DOS had started that trend, and that remains the norm for the "Commander-like" or orthodox file managers like Midnight Commander (for Linux and UNIX-like systems) and Total Commander (previously Windows Commander; for Windows). The KDE file manager (kfm) supported the ZIP format very early. ZIP support was first added to Windows Explorer with the Plus! enhancement package in Windows 98 and later included in Windows Me, Windows XP, Windows Vista, and now Windows 7. ZIP format support is also built into the Mac OS Finder (as of Mac OS X, via the BOMArchiveHelper utility), the Nautilus file manager used by GNOME, and the Konqueror file manager of newer versions of KDE. By 2002, all major desktop environments included ZIP file support in their file managers: a ZIP file is typically presented as a directory or folder, so that files are copied into and out of it in the same manner as any other folder and the compression is handled in a way largely transparent to the user. This has eliminated the need to learn a specialized tool and interface for file archival and compression.

ZIP files can also be processed on iPhones and on mobile devices based on Windows Mobile.

The ZIP format is ubiquitous.

Confusion among formats

There are numerous standards and formats dealing with compression, and people sometimes confuse or conflate them. For example, ZIP is distinct from GZIP: the former is a specification owned by PKWARE, and the latter is defined in IETF RFC 1952. Both ZIP and GZIP primarily use the DEFLATE algorithm for compression. Likewise, the ZLIB format (IETF RFC 1950) also uses the DEFLATE compression algorithm, but specifies different headers for error and consistency checking.
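The relationship between these formats can be illustrated by wrapping the same DEFLATE output in different containers. This is a sketch using Python's standard zlib and gzip modules; the sample data is arbitrary, and the header details in the comments come from the respective RFCs rather than from the text above.

```python
import gzip
import zlib

data = b"the same bytes, three containers " * 100

# Raw DEFLATE stream (RFC 1951), the form a ZIP entry holds.
# A negative wbits value asks zlib for no wrapper at all.
compressor = zlib.compressobj(wbits=-15)
raw_deflate = compressor.compress(data) + compressor.flush()

# ZLIB wrapper (RFC 1950): 2-byte header, Adler-32 checksum at the end.
zlib_stream = zlib.compress(data)

# GZIP wrapper (RFC 1952): 10-byte header, CRC-32 and length at the end.
gzip_stream = gzip.compress(data)

print("zlib starts with", zlib_stream[:1].hex())
print("gzip starts with", gzip_stream[:2].hex())
print(len(raw_deflate), len(zlib_stream), len(gzip_stream))
```

All three carry DEFLATE data, but a tool expecting one wrapper will generally reject the others, which is one source of the confusion described above.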

Version history


The .ZIP File Format Specification has its own version number, which does not necessarily correspond to the version numbers for the PKZIP tool, especially with PKZIP 6 or later. At various times, PKWARE adds preliminary features that allow PKZIP products to extract archives using advanced features, but PKZIP products that create such archives are not available until the next major release. Other companies or organizations support the PKWARE specifications at their own pace.

A summary of key advances in various versions of the PKWARE spec:

  • 2.0: File entries can be compressed with DEFLATE.


  • 4.5: Documented 64-bit ZIP format.


  • 5.0: DES, 3DES, RC2, RC4 supported for encryption.


  • 5.2: RC2-64 supported for encryption.


  • 6.1: Documented certificate storage.


  • 6.2.0: Documented Central Directory Encryption.


  • 6.3.0: Documented Unicode (UTF-8) filename storage. Expanded list of supported hash, compression, encryption algorithms.


  • 6.3.1: Corrected standard hash values for SHA-256/384/512.


  • 6.3.2: Documented compression method 97 (WavPack).


Technical information


ZIP is a simple archive format that compresses every file separately. Compressing files separately allows individual files to be retrieved without reading through other data; in theory, it may also allow better compression by using different algorithms for different files. A caveat is that archives containing a large number of small files end up significantly larger than if they were compressed as a single file (the classic example of the latter is the common tar.gz archive, which consists of a TAR archive compressed using gzip).
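The size penalty for many small files can be checked with a quick experiment. The sketch below uses Python's standard zipfile and tarfile modules; the 200 small members and their contents are invented purely for the comparison, and real results depend on the data.

```python
import io
import tarfile
import zipfile

# Two hundred small, similar files; names and contents are invented.
files = {f"item{i:03}.txt": f"record {i}: status=ok\n".encode() * 5 for i in range(200)}

# ZIP: every member is compressed on its own and carries its own headers.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, content in files.items():
        archive.writestr(name, content)

# tar.gz: the members are concatenated first, then compressed as one stream.
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w:gz") as archive:
    for name, content in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(content)
        archive.addfile(info, io.BytesIO(content))

zip_size = len(zip_buffer.getvalue())
tar_gz_size = len(tar_buffer.getvalue())
print(f"zip: {zip_size} bytes, tar.gz: {tar_gz_size} bytes")
```

The trade-off runs the other way for access: a single member can be read from the ZIP archive directly, whereas the tar.gz stream has to be decompressed up to the member that is wanted.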

The specification for ZIP indicates that files can be stored either uncompressed or using a variety of compression algorithms, but ZIP is generally used with Katz's DEFLATE algorithm, except when files being added are already compressed or are resistant to compression, in which case the file data is simply stored uncompressed.
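One way an archiver might make that choice is to try Deflate and fall back to storing when nothing is gained. The sketch below is a hypothetical policy, not the behavior of any particular tool; `pick_method` is an illustrative name, and seeded pseudo-random bytes stand in for already-compressed data.

```python
import io
import random
import zipfile
import zlib

def pick_method(data: bytes) -> int:
    """Hypothetical policy: use Deflate only when it actually shrinks the data."""
    compressor = zlib.compressobj(wbits=-15)  # raw Deflate, as held in a ZIP entry
    packed = compressor.compress(data) + compressor.flush()
    return zipfile.ZIP_DEFLATED if len(packed) < len(data) else zipfile.ZIP_STORED

rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(4096))  # stands in for compressed data
text = b"plain text compresses well. " * 200

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("noise.bin", noise, compress_type=pick_method(noise))
    archive.writestr("text.txt", text, compress_type=pick_method(text))

# Read back which method each entry ended up with.
with zipfile.ZipFile(buffer) as archive:
    methods = {info.filename: info.compress_type for info in archive.infolist()}

print(methods)
```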

ZIP supports a simple password-based symmetric encryption system, which is documented in the ZIP specification and known to be seriously flawed. In particular, it is vulnerable to known-plaintext attacks, which are in some cases made worse by poor implementations of random number generators. The ZIP specification also supports spreading archives across multiple filesystem files. Originally intended for storing large ZIP files across multiple 1.44 MB floppy disks, this feature is now used for sending ZIP archives in parts over email, or over other transports or removable media.

New features including new compression and encryption (e.g. AES) methods have been documented in the .ZIP File Format Specification since version 5.2. A WinZip-developed AES-based standard is also used by 7-Zip, XCeed, and DotNetZip, but some vendors use other formats. PKWARE SecureZIP also supports RC2, RC4, DES, and 3DES encryption methods, digital certificate-based encryption and authentication (X.509), and archive header encryption.

The original ZIP format had a limit of about 4 GiB (2^32 - 1 bytes) on various things (uncompressed size of a file, compressed size of a file and total size of the archive), as well as a limit of 65,535 entries in a ZIP archive. In version 4.5 of the specification (which is not the same as v4.5 of any particular tool), PKWARE introduced the "ZIP64" format extensions to get around these limitations. ZIP64 support is emerging. For example, Windows Explorer in Windows XP does not support ZIP64, but the Explorer in Windows Vista does. Likewise, some libraries, such as IO::Compress::Zip in Perl, have new support for ZIP64, while others, such as Java's built-in java.util.zip, still lack it.
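These limits follow from the field widths of the original format: sizes and offsets are held in 32-bit fields and entry counts in 16-bit fields. The arithmetic can be sketched as below; `needs_zip64` is a hypothetical helper that deliberately ignores header overhead and assumes the data is stored uncompressed.

```python
# Field widths in the original format: 32-bit sizes and offsets, 16-bit entry counts.
MAX_SIZE = 2**32 - 1      # 4,294,967,295 bytes, just under 4 GiB
MAX_ENTRIES = 2**16 - 1   # 65,535 entries

def needs_zip64(file_sizes):
    """Rough check: True if these members cannot fit the original fields.

    A simplification: header overhead is ignored and data is assumed
    to be stored uncompressed.
    """
    return (
        len(file_sizes) > MAX_ENTRIES
        or any(size > MAX_SIZE for size in file_sizes)
        or sum(file_sizes) > MAX_SIZE
    )

print(needs_zip64([10_000, 20_000]))   # two small files
print(needs_zip64([5 * 2**30]))        # one 5 GiB file
print(needs_zip64([1] * 70_000))       # many tiny files
```

Libraries differ in how they expose the extension; Python's zipfile module, for instance, has an `allowZip64` argument on `ZipFile`.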

The FAT filesystem of DOS only has a timestamp resolution of two seconds; ZIP file records mimic this. As a result, the built-in timestamp resolution of files in a ZIP archive is only two seconds, though extra fields can be used to store more accurate timestamps.
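
The two-second resolution comes from the DOS layout, which packs the time into 16 bits and leaves only five bits for seconds. The sketch below shows the packing and the resulting loss of odd seconds. The helper names are hypothetical.

```python
from datetime import datetime


def to_dos_datetime(dt):
    """Pack a datetime into the 16-bit DOS date and time fields."""
    dos_date = ((dt.year - 1980) << 9) | (dt.month << 5) | dt.day
    # Only 5 bits are left for seconds, so they are stored halved.
    dos_time = (dt.hour << 11) | (dt.minute << 5) | (dt.second // 2)
    return dos_date, dos_time


def from_dos_datetime(dos_date, dos_time):
    """Unpack the 16-bit DOS date and time fields into a datetime."""
    return datetime(
        ((dos_date >> 9) & 0x7F) + 1980,
        (dos_date >> 5) & 0x0F,
        dos_date & 0x1F,
        (dos_time >> 11) & 0x1F,
        (dos_time >> 5) & 0x3F,
        (dos_time & 0x1F) * 2,
    )


original = datetime(2009, 3, 14, 15, 9, 27)
restored = from_dos_datetime(*to_dos_datetime(original))
print(restored)  # 2009-03-14 15:09:26 (the odd second is lost)
```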

Since September 2007, the ZIP specification (APPNOTE.TXT) contains a provision to store file names using UTF-8, finally adding Unicode compatibility to ZIP.
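
In practice, a UTF-8 file name is signalled by bit 11 (0x800) of the entry's general purpose flags. As an illustration only, the following sketch uses Python's standard zipfile module to write one ASCII name and one non-ASCII name, then reads the flag back. The file names are made up for the example.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name and comment are UTF-8

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"ascii name")
    zf.writestr("r\u00e9sum\u00e9.txt", b"non-ascii name")

# Read the archive back and inspect the flag on each entry.
with zipfile.ZipFile(buf) as zf:
    utf8_names = {
        info.filename: bool(info.flag_bits & UTF8_FLAG)
        for info in zf.infolist()
    }
```

Python's writer sets the flag only when the name cannot be encoded as ASCII, so the first entry remains readable by tools that predate the UTF-8 provision.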

Not all ZIP features are implemented by every ZIP library and toolkit.

The Structure of a ZIP file


The contents of a ZIP file are files and directories, stored in arbitrary order. The location of each file is recorded in the so-called central directory, which is located at the end of the ZIP file. The files and directories are represented by file entries.

Each file entry is introduced by a local header with information about the file such as the compression method, file size and file name, followed by optional "Extra" data fields, and then the possibly compressed, possibly encrypted file data. The "Extra" data fields are the key to the extensibility of the ZIP format. It is the "Extra" fields that are exploited to support ZIP64 formats, WinZip-compatible AES encryption, and NTFS file timestamps. In theory, many other extensions are possible via this coded "extra" field.
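
The local header layout can be illustrated with a short sketch that unpacks the fixed 30-byte part and then splits the "Extra" data into its (header ID, payload) records. The function names are hypothetical. The header IDs in the comment are the commonly documented ones for the extensions named above.

```python
import io
import struct
import zipfile

# Fixed part of a local file header: 30 bytes, little-endian.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def read_local_header(data, offset=0):
    """Parse one local file header (hypothetical helper, classic format only)."""
    (sig, _version, flags, method, _mod_time, _mod_date, _crc32,
     comp_size, uncomp_size, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, offset)
    if sig != b"PK\x03\x04":
        raise ValueError("not a local file header")
    name_start = offset + LOCAL_HEADER.size
    extra_start = name_start + name_len
    return {
        "name": data[name_start:extra_start],
        "extra": data[extra_start:extra_start + extra_len],
        "flags": flags,
        "method": method,            # 0 = stored, 8 = Deflate
        "compressed_size": comp_size,
        "uncompressed_size": uncomp_size,
    }


def split_extra(extra):
    """Split "Extra" data into (header_id, payload) records.

    Commonly seen IDs: 0x0001 ZIP64, 0x000a NTFS timestamps, 0x9901 WinZip AES.
    """
    records, pos = [], 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        records.append((header_id, extra[pos + 4:pos + 4 + size]))
        pos += 4 + size
    return records


# Build a tiny archive in memory so that there is something to parse.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", b"hello world")

header = read_local_header(buf.getvalue())
sample_records = split_extra(struct.pack("<HH", 0x0001, 2) + b"ab")
```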

The central directory consists of file headers holding, among other metadata, the file names and the relative offset in the archive of the local headers for each file entry.

Each file entry is marked by a specific 4-byte "signature"; each entry in the central directory is likewise marked with a different particular 4-byte signature. ZIP file parsers typically look for the appropriate signatures when parsing a ZIP file.

Because the order of the file entries in the central directory need not match the order of the file entries in the archive, the format is non-sequential.

There is no BOF or EOF marker in the ZIP spec. Instead, ZIP tools scan for the signatures of the various fields.
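
As a rough sketch of such a scan, the code below searches backwards for the end-of-central-directory signature and then walks the central directory headers to collect each file name and the offset of its local header. It assumes a single-disk archive without ZIP64 and with no data before the first entry. The function name is hypothetical, and a robust parser would also validate the record it finds, since the same four bytes could occur inside a comment.

```python
import io
import struct
import zipfile

EOCD = struct.Struct("<4sHHHHIIH")           # end of central directory, 22 bytes
CDIR = struct.Struct("<4sHHHHHHIIIHHHHHII")  # central directory header, 46 bytes


def list_entries(data):
    """Return (name_bytes, local_header_offset) for every entry."""
    eocd_pos = data.rfind(b"PK\x05\x06")     # scan from the end of the file
    if eocd_pos < 0:
        raise ValueError("end of central directory not found")
    eocd = EOCD.unpack_from(data, eocd_pos)
    total_entries, cd_offset = eocd[4], eocd[6]

    entries = []
    pos = cd_offset
    for _ in range(total_entries):
        fields = CDIR.unpack_from(data, pos)
        if fields[0] != b"PK\x01\x02":
            raise ValueError("bad central directory signature")
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        local_offset = fields[16]
        name = data[pos + CDIR.size:pos + CDIR.size + name_len]
        entries.append((name, local_offset))
        # Variable-length name, extra data and comment follow each header.
        pos += CDIR.size + name_len + extra_len + comment_len
    return entries


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"first")
    zf.writestr("b.txt", b"second")

raw = buf.getvalue()
entries = list_entries(raw)
```

Each returned offset points at a local header, so `raw[offset:offset + 4]` is the local header signature for every entry.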

Combining ZIP with other file formats


The ZIP file format allows for a comment containing any data to occur at the end of the file after the central directory. Also, because the central directory specifies the offset of each file in the archive with respect to the start, it is possible in practice for the first file entry to start at an offset other than zero.

This allows arbitrary data to occur in the file both before and after the ZIP archive data, and for the archive to still be read by a ZIP application. A side effect of this is that it is possible to author a file that is both a working ZIP archive and another format, provided that the other format tolerates arbitrary data at its end, beginning, or middle. Self-extracting archives (SFX), of the form supported by WinZip and DotNetZip, take advantage of this: they are .exe files that conform to the PKZIP AppNote.txt specification and can be read by compliant ZIP tools or libraries.
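
The effect of leading data can be demonstrated with a small sketch that prepends unrelated bytes to an archive, much as a self-extractor stub would, and then opens the result. Python's standard zipfile module is used here because it compensates for the shifted offsets. The stub bytes and file name are placeholders.

```python
import io
import zipfile

# Build an ordinary archive in memory.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("payload.txt", b"still readable")

# Hypothetical stand-in for an executable stub or another file format.
stub = b"placeholder stub data that is not part of the archive\n"
combined = stub + archive.getvalue()

# The reader locates the central directory from the end of the file,
# so the leading bytes do not prevent the archive from being opened.
with zipfile.ZipFile(io.BytesIO(combined)) as zf:
    names = zf.namelist()
    content = zf.read("payload.txt")
```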

This property of the ZIP format, and of the JAR format which is a variant of ZIP, can be exploited to hide harmful Java classes inside a seemingly harmless file, such as a GIF image uploaded to the web. This so-called GIFAR exploit has been demonstrated as an effective attack against web applications such as Facebook.

Implementing a ZIP application


There are numerous ZIP tools available, and numerous ZIP libraries for various programming environments. Some of the libraries are commercial and some are not; some are open source and some are not. WinZip is perhaps the best-known ZIP tool: it runs primarily on Windows and is an end-user tool for creating or extracting ZIP files. WinRAR, IZArc, Info-ZIP, and 7-Zip are other tools, available on various platforms. Some of those tools have library or programmatic interfaces.

Some useful development libraries are available as open source, such as those from the GNU gzip project and Info-ZIP. For Java, there are a few options: Java Platform, Standard Edition contains the package java.util.zip to handle standard ZIP files; the Zip64File library specifically supports large files (larger than 4 GB) and treats ZIP files using random access; and the Apache Ant tool contains a more complete implementation released under the Apache Software License.

For .NET applications, there is a no-cost library called DotNetZip, available in source and binary form under the Microsoft Public License. It supports password-based symmetric ZIP encryption, Unicode, ZIP64, and WinZip-compatible AES encryption. The Microsoft .NET 3.5 runtime library includes a class System.IO.Packaging.Package that supports the ZIP format, but it is primarily designed for Microsoft's document formats (xlsx, pptx, docx, xps), and is somewhat unnatural to use for generic ZIP files.

The Info-ZIP implementations of the ZIP format add support for Unix filesystem features, such as user and group IDs, file permissions, and symbolic links. The Apache Ant implementation is aware of these to the extent that it can create files with predefined Unix permissions. The Info-ZIP implementations also know how to use the error correction capabilities built into the ZIP compression format. Some programs (such as IZArc) do not and will choke on a file that has errors.
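
By the convention that Info-ZIP established, the Unix mode bits are carried in the upper 16 bits of each entry's external attributes field in the central directory. The sketch below writes an entry with predefined permissions and reads them back, using Python's standard zipfile module. The entry name and contents are made up.

```python
import io
import stat
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    info = zipfile.ZipInfo("run.sh")
    info.create_system = 3  # 3 marks the entry as created on Unix
    # A regular file with rwxr-xr-x permissions, stored in the high 16 bits.
    info.external_attr = (stat.S_IFREG | 0o755) << 16
    zf.writestr(info, b"#!/bin/sh\necho hi\n")

with zipfile.ZipFile(buf) as zf:
    mode = zf.getinfo("run.sh").external_attr >> 16

print(stat.filemode(mode))  # -rwxr-xr-x
```

Whether an extractor honours these bits is tool-specific, which is one reason why permissions can change unexpectedly when an archive moves between platforms.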

The Info-ZIP Windows tools also support NTFS filesystem permissions, and will attempt to translate between NTFS permissions and Unix permissions when extracting files. This can result in unintended combinations, e.g. .exe files being created on NTFS volumes with executable permission denied.

Strong encryption controversy


When the WinZip 9.0 public beta was released in 2003, WinZip introduced its own AES-256 encryption, using a different file format, along with the documentation for the new specification. The encryption standards themselves were not proprietary, but PKWARE had not updated APPNOTE.TXT to include the Strong Encryption Specification (SES) since 2001, which had been used by PKZIP versions 5.0 and 6.0. WinZip technical consultant Kevin Kearney and StuffIt product manager Mathew Covington accused PKWARE of withholding SES, but PKZIP chief technology officer Jim Peterson claimed that certificate-based encryption was still incomplete. However, the latest publicly available APPNOTE.TXT at the time was version 4.5 (available on PKWARE's FTP site), which not only omitted SES, but also omitted the Deflate64, DCL Implode, and BZip2 compression methods used by .ZIP files created by contemporary PKZIP products.

To overcome this shortcoming, contemporary products such as PentaZip 'implemented' strong ZIP encryption by encrypting ZIP archives into a different file format.

In another controversial move, PKWARE applied for a patent on 2003-07-16 describing a method for combining .ZIP and strong encryption to create a secure .ZIP file.

In the end, PKWARE and WinZip agreed to support each other's products. On 2004-01-21, PKWARE announced support for the WinZip-based AES encryption format. A later beta version of WinZip was able to support SES-based ZIP files. PKWARE eventually released version 5.2 of the .ZIP File Format Specification to the public, which documented SES.

See also


  • PKZIP
  • List of archive formats
  • LZW compression method
  • Comparison of file archivers

External links

  • at The Data Compression News Blog
  • - reads and writes zip archives.