File format



 
 


A file format is a particular way to encode information for storage in a computer file.

Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice versa. There are different kinds of formats for different kinds of information. Within any format type, e.g., word processor documents, there will typically be several different formats. Sometimes these formats compete with each other.
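The conversion between information and bits can be illustrated with a minimal sketch. It assumes text as the information and UTF-8 as the encoding; the function names `text_to_bits` and `bits_to_text` are made up for this illustration.

```python
def text_to_bits(text: str) -> str:
    """Encode text as UTF-8 and show each byte as eight binary digits."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))


def bits_to_text(bits: str) -> str:
    """Reverse the conversion: parse each group of eight digits back into a byte."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("utf-8")


print(text_to_bits("Hi"))                 # 01001000 01101001
print(bits_to_text("01001000 01101001"))  # Hi
```

A different encoding would map the same text to different bits, which is one reason several formats can exist for the same kind of information.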

Generality

Some file formats are designed to store very particular sorts of data: the JPEG format, for example, is designed only to store static photographic images. Other file formats, however, are designed for storage of several different types of data: the GIF format supports storage of both still images and simple animations, and the QuickTime format can act as a container for many different types of multimedia. A text file is simply one that stores any text, in a format such as ASCII or UTF-8, with few if any control characters. Some file formats, such as HTML, or the source code of some particular programming language, are in fact also text files, but adhere to more specific rules which allow them to be used for specific purposes.

Specifications

Many file formats, including some of the most well-known file formats, have a published specification document (often with a reference implementation) that describes exactly how the data is to be encoded, and which can be used to determine whether or not a particular program treats a particular file format correctly. There are, however, two reasons why this is not always the case. First, some file format developers view their specification documents as trade secrets, and therefore do not release them to the public. Second, some file format developers never spend time writing a separate specification document; rather, the format is defined only implicitly, through the program(s) that manipulate data in the format.

Using file formats without a publicly available specification can be costly. Learning how the format works will require either reverse engineering it from a reference implementation or acquiring the specification document for a fee from the format developers. This second approach is possible only when there is a specification document, and typically requires the signing of a non-disclosure agreement. Both strategies require significant time, money, or both. Therefore, as a general rule, file formats with publicly available specifications are supported by a large number of programs, while non-public formats are supported by only a few programs.

Patent law, rather than copyright, is more often used to protect a file format. Although patents for file formats are not directly permitted under US law, some formats require the encoding of data with patented algorithms. For example, using compression with the GIF file format requires the use of a patented algorithm, and although initially the patent owner did not enforce it, they later began collecting fees for use of the algorithm. This resulted in a significant decrease in the use of GIFs, and is partly responsible for the development of the alternative PNG format. However, the patent expired in the US in mid-2003, and worldwide in mid-2004. Algorithms are usually held not to be patentable under current European law, which also includes a provision that members "shall ensure that, wherever the use of a patented technique is needed for a significant purpose such as ensuring conversion of the conventions used in two different computer systems or networks so as to allow communication and exchange of data content between them, such use is not considered to be a patent infringement", which would apparently allow implementation of a patented file system where necessary to allow two different computers to interoperate.

Identifying the type of a file

Since files are seen by programs as streams of data, a method is required to determine the format of a particular file within the filesystem; the format of a file is itself an example of metadata. Different operating systems have traditionally taken different approaches to this problem, with each approach having its own advantages and disadvantages.

Of course, most modern operating systems, and individual applications, need to use all of these approaches to process various files, at least to be able to read 'foreign' file formats, if not work with them completely.

Filename extension

One popular method in use by several operating systems, including Mac OS X, CP/M, DOS, VMS, VM/CMS, and Windows, is to determine the format of a file based on the section of its name following the final period. This portion of the filename is known as the filename extension. For example, HTML documents are identified by names that end with .html (or .htm), and GIF images by .gif. In the original FAT filesystem, filenames were limited to an eight-character identifier and a three-character extension, which is known as an 8.3 filename. Many formats thus still use three-character extensions, even though modern operating systems and application programs no longer have this limitation. Since there is no standard list of extensions, more than one format can use the same extension, which can confuse the operating system and consequently users.

One artifact of this approach is that the system can easily be tricked into treating a file as a different format simply by renaming it—an HTML file can, for instance, be easily treated as plain text by renaming it from filename.html to filename.txt. Although this strategy was useful to expert users who could easily understand and manipulate this information, it was frequently confusing to less technical users, who might accidentally make a file unusable (or 'lose' it) by renaming it incorrectly.
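Extension-based identification, and the renaming effect just described, can be sketched in a few lines. The table `EXTENSION_TYPES` and the function `guess_format` are hypothetical names for this illustration; real systems keep much larger registries.

```python
from pathlib import PurePath

# Hypothetical extension table, for illustration only.
EXTENSION_TYPES = {
    ".html": "HTML document",
    ".htm": "HTML document",
    ".gif": "GIF image",
    ".txt": "plain text",
}


def guess_format(filename: str) -> str:
    """Guess a format by looking only at the text after the final period."""
    extension = PurePath(filename).suffix.lower()
    return EXTENSION_TYPES.get(extension, "unknown")


print(guess_format("filename.html"))  # HTML document
print(guess_format("filename.txt"))   # plain text
```

The contents of the file are never inspected, so renaming alone changes the answer.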

This led more recent operating system shells, such as Windows 95 and Mac OS X, to hide the extension when displaying lists of recognized files. This separates the user from the complete filename, preventing the accidental changing of a file type, while still allowing expert users to retain the original functionality by enabling the display of file extensions.

A downside of hiding the extension is that it then becomes possible to have what appears to be two or more identical filenames in the same folder. This is especially true when image files are needed in more than one format for different applications. For example, a company logo may be needed both in .tif format (for publishing) and .gif format (for web sites). With the extensions visible, these would appear as the unique filenames "CompanyLogo.tif" and "CompanyLogo.gif". With the extensions hidden, these would both appear to have the identical filename "CompanyLogo", making it more difficult to determine which to select for a particular application.

A further downside is that hiding such information can become a security risk. On a system reliant on filename extensions, all usable files have such an extension (for example, all JPEG images have ".jpg" or ".jpeg" at the end of their name), so users are accustomed to seeing extensions and may depend on them when judging a file's format. With file extensions hidden, a malicious user can create an executable program with an innocent name such as "Holiday photo.jpg.exe". In this case the ".exe" will be hidden and a user will see this file as "Holiday photo.jpg", which appears to be a JPEG image, unable to harm the machine save for bugs in the application used to view it. However, the operating system will still see the ".exe" extension and thus will run the program, which is then able to cause harm. To further trick users, it is possible to store an icon inside the program, as done on Microsoft Windows, in which case the operating system's icon assignment can be overridden with an icon commonly used to represent JPEG images, making such a program look like, and appear to be named as, an image until it is opened. Users with extensions hidden must therefore be vigilant, and never open files that seem to display a known extension despite the hiding option being enabled (since such a file must have two extensions, the real one being unknown until hiding is disabled). In practice this is a problem for Windows systems, where extension hiding is turned on by default.
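A simple check for the double-extension trick might look like the following sketch. The two extension sets are assumptions chosen for the example and are far from exhaustive; `looks_disguised` is a hypothetical helper, not a feature of any operating system.

```python
from pathlib import PurePath

# Assumed, non-exhaustive lists used only for this sketch.
EXECUTABLE_EXTENSIONS = {".exe", ".com", ".bat", ".scr"}
DOCUMENT_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".txt", ".html"}


def looks_disguised(filename: str) -> bool:
    """Flag names whose real (last) extension is executable while the
    extension shown when hiding is enabled looks like a harmless document."""
    suffixes = [suffix.lower() for suffix in PurePath(filename).suffixes]
    if len(suffixes) < 2:
        return False
    return (suffixes[-1] in EXECUTABLE_EXTENSIONS
            and suffixes[-2] in DOCUMENT_EXTENSIONS)


print(looks_disguised("Holiday photo.jpg.exe"))  # True
print(looks_disguised("Holiday photo.jpg"))      # False
```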

Internal metadata

A second way to identify a file format is to store information regarding the format inside the file itself. Usually, such information is written as one or more binary strings, or as tagged or raw text, placed at fixed, specific locations within the file. Since the easiest place to locate them is at the beginning of the file, this area is usually called a file header when it is longer than a few bytes, or a magic number if it is just a few bytes long.

File Header
The metadata contained in a file header are not necessarily stored only at the beginning of the file; they may be present in other areas too, often including the end of the file, depending on the file format or the type of data it contains. Character-based (text) files have character-based (often human-readable) headers, whereas binary formats usually feature binary headers, although that is not a rule: a human-readable file header may require more bytes, but is easily discernible with simple text or hexadecimal editors. File headers may contain not only the information required by algorithms to identify the file format, but also real metadata about the file and its contents. For example, most image file formats store information about image size, resolution, colour space/format and, optionally, other authoring information such as who made the image, when and where, and what camera model and shooting parameters were used (if any; cf. EXIF). Such metadata may be used by a program reading or interpreting the file both during the loading process and afterwards, but can also be used by the operating system to quickly capture information about the file without loading it all into memory.
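As an illustration of reading header metadata without loading the whole file, the sketch below parses the first ten bytes of a GIF image. The layout used here (a six-byte ASCII signature followed by a little-endian 16-bit width and height) comes from the GIF specification rather than from this article, and `read_gif_header` is a hypothetical name.

```python
import struct


def read_gif_header(data: bytes) -> dict:
    """Parse the GIF signature and logical screen size from the first 10 bytes."""
    if len(data) < 10:
        raise ValueError("need at least 10 bytes of header")
    # "<" = little-endian, no padding; 6s = signature, H = unsigned 16-bit.
    signature, width, height = struct.unpack("<6sHH", data[:10])
    if signature not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF header")
    return {
        "version": signature[3:].decode("ascii"),
        "width": width,
        "height": height,
    }


# Synthetic header bytes for a 320x200 image (no real file is needed here).
sample = b"GIF89a" + struct.pack("<HH", 320, 200)
print(read_gif_header(sample))  # {'version': '89a', 'width': 320, 'height': 200}
```

In practice a program would read only those first bytes from disk, for example with `open(path, "rb").read(10)`.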

File headers have at least two drawbacks as a file-format identification method. First, at least a few (initial) blocks of the file need to be read in order to obtain such information; those could be fragmented across different locations of the same storage medium, thus requiring more seek and I/O time, which is particularly bad when identifying large numbers of files at once (as when a GUI browses a folder with thousands of files and must determine file icons or thumbnails for all of them). Second, if the header is binary hard-coded (i.e. the header itself requires non-trivial interpretation in order to be recognized), especially for the sake of protecting metadata content, there is some risk that the file format is misinterpreted at first sight, or even badly written at the source, often resulting in corrupt metadata (which, in extremely pathological cases, might even render the file unreadable).

A more logically sophisticated example of file header is that used in wrapper (or container) file formats.

Magic number
One way to incorporate such metadata, often associated with Unix and its derivatives, is just to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginning of a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, most especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE, or, for XHTML, the XML identifier, which begins with <?xml. The files can also begin with HTML comments, random text, or several empty lines, but still be usable HTML.
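The contrast between the two cases above can be sketched as follows. `sniff_format` is a hypothetical helper that knows only the signatures named in the text; the "markup" result is deliberately vague because <!DOCTYPE and <?xml are not unique to HTML.

```python
def sniff_format(head: bytes) -> str:
    """Guess a format from the first bytes of a file."""
    # GIF: a fixed signature at offset 0, so a direct comparison is reliable.
    if head.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # HTML/XHTML: only a heuristic. Skip leading whitespace, ignore case.
    text = head.lstrip().lower()
    if text.startswith((b"<html", b"<!doctype", b"<?xml")):
        return "markup"
    return "unknown"


print(sniff_format(b"GIF89a\x01\x00\x01\x00"))     # gif
print(sniff_format(b"\n\n<HTML><body>"))           # markup
print(sniff_format(b"some text <html>...</html>")) # unknown
```

The last call shows the weakness described above: a file that starts with arbitrary text may still be usable HTML, yet this test does not recognize it.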

The magic number approach offers better guarantees that the format will be identified correctly, and can often determine more precise information about the file. Since reliable "magic number" tests can be fairly complex, and each file must effectively be tested against every possibility in the magic database, this approach is also relatively inefficient, especially for displaying large lists of files (in contrast, filename and metadata-based methods need check only one piece of data, and match it against a sorted index). Also, data must be read from the file itself, increasing latency as opposed to metadata stored in the directory. Where filetypes don't lend themselves to recognition in this way, the system must fall back to metadata. It is, however, the best way for a program to check if a file it has been told to process is of the correct format: while the file's name or metadata may be altered independently of its content, failing a well-designed magic number test is a pretty sure sign that the file is either corrupt or of the wrong type.

So-called shebang lines in script files are a special case of magic numbers. Here, the magic number is human-readable text that identifies a specific command interpreter and options to be passed to the command interpreter.
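A minimal sketch of how such a line could be split into interpreter and options is shown below. `parse_shebang` is a hypothetical helper; actual operating systems differ in how they split and limit the option text, so this is an approximation rather than the behaviour of any particular kernel.

```python
def parse_shebang(first_line: bytes):
    """Return (interpreter, options) for a '#!' line, or None if absent."""
    if not first_line.startswith(b"#!"):
        return None
    # Split once on whitespace: the interpreter path, then the remaining text.
    parts = first_line[2:].strip().split(None, 1)
    if not parts:
        return None
    interpreter = parts[0].decode()
    options = parts[1].decode() if len(parts) > 1 else ""
    return interpreter, options


print(parse_shebang(b"#!/bin/sh -e\n"))  # ('/bin/sh', '-e')
print(parse_shebang(b"echo hello\n"))    # None
```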

Another operating system using magic numbers is AmigaOS, where magic numbers were called "Magic Cookies" and were adopted as a standard system to recognize executables in the Hunk executable file format, and also to let individual programs, tools and utilities deal automatically with their saved data files, or any other kind of file type, when saving and loading data. This system was later enhanced with the Amiga standard Datatype recognition system.

External metadata

A final way of storing the format of a file is to explicitly store information about the format in the file system, rather than within the file itself.

This approach keeps the metadata separate from both the main data and the name, but is also less portable than either file extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions (for instance, for compatibility with MS-DOS's three-character limit), most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.

Note that zip files or archive files solve the problem of handling metadata. A utility program collects multiple files together, along with metadata about each file and the folders/directories they came from, all within one new file (e.g. a zip file with extension .zip). The new file is also compressed and possibly encrypted, but is now transmissible as a single file across operating systems by FTP or as an email attachment. At the destination, it must be unzipped by a compatible utility to be useful, but the problems of transmission are solved this way.
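The way an archive carries per-file metadata alongside the data can be seen with Python's standard `zipfile` module. This is a small in-memory sketch; the entry names and the helper `build_and_list` are invented for the example.

```python
import io
import zipfile


def build_and_list() -> list:
    """Pack two small files into an in-memory zip, then list their metadata."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("docs/readme.txt", "hello")
        archive.writestr("images/logo.gif", b"GIF89a")

    # Reopen the same bytes, as the receiving system would.
    with zipfile.ZipFile(buffer) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


for name, size in build_and_list():
    print(name, size)
```

Each entry keeps its original path and size (and, not printed here, a timestamp), independently of how the destination filesystem represents such metadata.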

Mac OS type-codes
The Mac OS
Mac OS

' Hierarchical File System stores codes for creator and type as part of the directory entry for each file. These codes are referred to as OSTypes; for instance, a HyperCard "stack" file has a creator of WILD (from HyperCard's previous name, "WildCard") and a type of STAK. RISC OS uses a similar system, consisting of a 12-bit number which can be looked up in a table of descriptions; e.g. the hexadecimal number FF5 is "aliased" to PoScript, representing a PostScript file.
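As a rough illustration of the two schemes just described, the following Python sketch treats an OSType as four bytes and a RISC OS file type as a 12-bit table key. The function names and the one-entry table are hypothetical; packing the four bytes into a big-endian integer is an assumption made here for display purposes only.

```python
def ostype_to_int(code: str) -> int:
    """Pack a four-character code such as 'WILD' into a 32-bit integer."""
    raw = code.encode("mac_roman")      # OSTypes are usually ASCII or Mac OS Roman
    if len(raw) != 4:
        raise ValueError("an OSType is exactly four bytes long")
    return int.from_bytes(raw, "big")   # assumption: big-endian packing


# Hypothetical, deliberately tiny table; only the FF5 entry comes from the text.
RISC_OS_FILETYPES = {0xFF5: "PoScript"}


def describe_filetype(number: int):
    """Look up a 12-bit RISC OS file type; return None if it is not in the table."""
    if not 0 <= number <= 0xFFF:
        raise ValueError("a RISC OS file type is a 12-bit number")
    return RISC_OS_FILETYPES.get(number)


creator = ostype_to_int("WILD")
file_type = ostype_to_int("STAK")
alias = describe_filetype(0xFF5)
```

Here `creator` and `file_type` hold the HyperCard stack codes as integers, and `alias` is the textual name for file type FF5.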

Mac OS X Uniform Type Identifiers (UTIs)
A Uniform Type Identifier (UTI) is a method used in Mac OS X for uniquely identifying "typed" classes of entity, such as file formats. It was developed by Apple as a replacement for OSType (type and creator codes).

A UTI is a Core Foundation string that follows reverse-DNS naming. Common or standard types use the public domain (e.g. public.png for a Portable Network Graphics image), while other domains can be used for third-party types (e.g. com.adobe.pdf for Portable Document Format). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, public.png conforms to a supertype of public.image, which itself conforms to a supertype of public.data. A UTI can exist in multiple hierarchies, which provides great flexibility.
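The conformance hierarchy can be pictured as a small directed graph. The sketch below is not Apple's API; it is a minimal Python model in which the `CONFORMS_TO` table and the `com.example.*` identifiers are hypothetical, and only the public.png, public.image and public.data relationships are taken from the text.

```python
# Each UTI maps to the list of supertypes it directly conforms to.
CONFORMS_TO = {
    "public.png": ["public.image"],
    "public.image": ["public.data"],
    "public.data": [],
    # Hypothetical type that sits in two hierarchies at once.
    "com.example.note": ["public.data", "com.example.document"],
    "com.example.document": [],
}


def conforms_to(uti: str, ancestor: str) -> bool:
    """Return True if `uti` is `ancestor` or conforms to it, directly or indirectly."""
    if uti == ancestor:
        return True
    return any(conforms_to(parent, ancestor) for parent in CONFORMS_TO.get(uti, []))


png_is_data = conforms_to("public.png", "public.data")
data_is_png = conforms_to("public.data", "public.png")
```

A program that can handle any public.data item can therefore accept a public.png file, while the reverse does not hold.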

In addition to file formats, UTIs can also be used for other entities which can exist in the OS X file system, including:

  • Pasteboard data
  • Folders (directories)
  • Translatable types (as handled by the Translation Manager)
  • Bundles
  • Frameworks
  • Streaming data
  • Aliases and symlinks


OS/2 Extended Attributes
The HPFS, FAT12 and FAT16 (but not FAT32) filesystems allow the storage of "extended attributes" with files. These comprise an arbitrary set of triplets, each with a name, a coded type for the value, and a value, where the names are unique and values can be up to 64 KB long. There are standardized meanings for certain types and names (under OS/2). One such is that the ".TYPE" extended attribute is used to determine the file type. Its value comprises a list of one or more file types associated with the file, each of which is a string, such as "Plain Text" or "HTML document". Thus a file may have several types.

The NTFS filesystem also allows OS/2 extended attributes to be stored, as one of the file forks, but this feature is present only to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be entirely parsed by applications.

POSIX extended attributes
On Unix and Unix-like systems, the ext2, ext3, ReiserFS version 3, XFS, JFS, FFS, and HFS+ filesystems allow the storage of extended attributes with files. These include an arbitrary list of "name=value" strings, where the names are unique, which can be accessed by their "name" parts.

PRONOM Unique Identifiers (PUIDs)
The PRONOM Persistent Unique Identifier (PUID) is an extensible scheme of persistent, unique and unambiguous identifiers for file formats, which has been developed by The National Archives of the UK as part of its PRONOM technical registry service. PUIDs can be expressed as Uniform Resource Identifiers using the info:pronom/ namespace. Although not yet widely used outside of UK government and some digital preservation programmes, the PUID scheme does provide greater granularity than most alternative schemes.

MIME types
MIME types are widely used in many Internet-related applications, and increasingly elsewhere, although their usage for on-disc type information is rare. They form a standardised system of identifiers (managed by IANA), each consisting of a type and a sub-type separated by a slash; for instance, text/html or image/gif. These were originally intended as a way of identifying what type of file was attached to an e-mail, independent of the source and target operating systems. MIME types identify files on BeOS, AmigaOS 4.0 and MorphOS, and also store unique application signatures for application launching. In AmigaOS and MorphOS the MIME type system works in parallel with the Amiga-specific Datatype system.

MIME types have problems, though: several organisations and people have created their own MIME types without registering them properly with IANA, which makes the use of this standard awkward in some cases.
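The type/sub-type structure, and the registration problem, can be sketched in a few lines of Python. The function names are hypothetical, and the `REGISTERED` set is a deliberately tiny illustrative subset (the two examples from the text), not the IANA registry.

```python
def split_mime_type(mime_type: str) -> tuple[str, str]:
    """Split 'type/subtype' into its two parts, normalised to lower case."""
    main, sep, sub = mime_type.strip().partition("/")
    if not sep or not main or not sub or "/" in sub:
        raise ValueError(f"not a type/subtype identifier: {mime_type!r}")
    return main.lower(), sub.lower()


# Illustrative subset only; a real check would consult the IANA registry.
REGISTERED = {("text", "html"), ("image", "gif")}


def is_registered(mime_type: str) -> bool:
    """Report whether the identifier appears in the (tiny) sample registry."""
    return split_mime_type(mime_type) in REGISTERED


parts = split_mime_type("text/html")
home_made = is_registered("application/x-my-private-format")  # hypothetical type
```

An application that meets an identifier outside its registry has no standard way to know what it means, which is the awkwardness described above.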

File format identifiers (FFIDs)
File format identifiers (FFIDs) are another, not widely used, way to identify file formats according to their origin and their file category. The scheme was created for the Description Explorer suite of software. An FFID has the form NNNNNNNNN-XX-YYYYYYY. The first part indicates the originating or maintaining organisation (the number represents a value in a company/standards organisation database), and the following two digits categorize the type of file in hexadecimal. The final part is the usual file extension of the file or the international standard number of the file, padded left with zeros. For example, the PNG file specification has the FFID 000000001-31-0015948, where 31 indicates an image file, 0015948 is the standard number and 000000001 indicates the ISO Organisation.
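To make the three-part layout concrete, here is a small Python sketch that splits an FFID into its fields. The `FFID` tuple and `parse_ffid` function are hypothetical names, and the exact character set allowed in the final field (letters as well as digits) is an assumption based on the description above.

```python
import re
from typing import NamedTuple


class FFID(NamedTuple):
    origin: int      # organisation number (a database key)
    category: int    # file category, written in hexadecimal in the identifier
    ident: str       # extension or standard number, zero padding removed


# Nine digits, two hex digits, seven characters (assumed alphanumeric).
_FFID_RE = re.compile(r"^(\d{9})-([0-9A-Fa-f]{2})-([0-9A-Za-z]{7})$")


def parse_ffid(text: str) -> FFID:
    """Split an identifier of the form NNNNNNNNN-XX-YYYYYYY into its fields."""
    match = _FFID_RE.match(text)
    if match is None:
        raise ValueError(f"not an FFID: {text!r}")
    origin, category, ident = match.groups()
    return FFID(int(origin), int(category, 16), ident.lstrip("0"))


png = parse_ffid("000000001-31-0015948")
```

For the PNG example, `png.origin` is 1, `png.category` is hexadecimal 31, and `png.ident` is the standard number with its padding removed.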

File structure

There are several ways to structure data in a file. The most common ones are described below.

Raw memory dumps/unstructured formats

Earlier file formats were raw: they consisted of the memory images of one or more structures dumped directly into the file.

This has several drawbacks. Unless the memory images also have reserved spaces for future extensions, extending and improving this type of structured file is very difficult. It also creates files that might be specific to one platform or programming language (for example a structure containing a Pascal string is not recognized as such in C). On the other hand, developing tools for reading and writing these types of files is very simple.
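The Pascal-versus-C string problem can be reproduced with Python's `struct` module, whose `p` format code stores a length-prefixed (Pascal-style) string. The record layout below is hypothetical and chosen only to show the effect.

```python
import struct

# Hypothetical record layout: a 16-bit little-endian version number followed by
# an 8-byte Pascal-style string field (1 length byte + up to 7 characters).
RECORD = struct.Struct("<H8p")

# The "memory image" that would be dumped straight into the file.
raw = RECORD.pack(3, b"abc")

# A reader that knows the layout recovers both fields directly.
version, name = RECORD.unpack(raw)

# A reader expecting a C-style NUL-terminated string in the same field
# picks up the length byte as if it were the first character.
c_view = raw[2:].split(b"\x00", 1)[0]
```

Writing and reading take one call each, which is the simplicity mentioned above; but `c_view` starts with the stray length byte, and a reader assuming a different byte order would misread `version` as well.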

The limitations of the unstructured formats led to the development of other types of file formats that could be easily extended and be backward compatible at the same time.

Chunk based formats

Electronic Arts and Commodore-Amiga pioneered this kind of format in 1985 with IFF (Interchange File Format). In this kind of file structure, each piece of data is embedded in a container that contains a signature identifying the data, as well as the length of the data (for binary encoded files). This type of container is called a "chunk". The signature is usually called a chunk id, chunk identifier, or tag identifier.

With this type of file structure, tools that do not know certain chunk identifiers simply skip those that they do not understand.
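The skipping behaviour can be sketched as follows. This is a simplified, flat sequence of IFF-style chunks (four-byte identifier, 32-bit big-endian length, payload padded to an even size); the chunk names, helper functions and the absence of an enclosing group chunk are assumptions made for brevity.

```python
import io
import struct


def write_chunk(out, chunk_id: bytes, payload: bytes) -> None:
    """Write one chunk: identifier, length, payload, and a pad byte if needed."""
    out.write(struct.pack(">4sI", chunk_id, len(payload)))
    out.write(payload)
    if len(payload) % 2:
        out.write(b"\x00")              # keep the next chunk on an even offset


def iter_chunks(stream):
    """Yield (identifier, payload) for every chunk in the stream."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return                      # end of data
        chunk_id, length = struct.unpack(">4sI", header)
        payload = stream.read(length)
        if length % 2:
            stream.read(1)              # discard the pad byte
        yield chunk_id, payload


KNOWN = {b"NAME", b"BODY"}              # hypothetical identifiers this tool understands


def read_known(stream) -> dict:
    """Collect the chunks this tool understands and silently skip the rest."""
    return {cid: data for cid, data in iter_chunks(stream) if cid in KNOWN}


buf = io.BytesIO()
write_chunk(buf, b"NAME", b"demo")
write_chunk(buf, b"XTRA", b"???")       # added by a newer tool; unknown to this reader
write_chunk(buf, b"BODY", b"\x01\x02")
buf.seek(0)
result = read_known(buf)
```

Because every chunk carries its own length, the reader can step over `XTRA` without knowing anything about its contents, so older tools keep working when newer ones add chunks.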

This concept has been reused again and again: by RIFF (the Microsoft-IBM equivalent of IFF), PNG, JPEG storage, DER (Distinguished Encoding Rules) encoded streams and files (which were originally described in CCITT X.409:1984 and therefore predate IFF), and Structured Data Exchange Format (SDXF). Even XML can be considered a kind of chunk-based format, since each data element is surrounded by tags which are akin to chunk identifiers.

Directory based formats

This is another extensible format, one that closely resembles a file system (OLE Documents are actual filesystems): the file is composed of 'directory entries' that contain the location of the data within the file itself as well as its signatures (and in certain cases its type). Good examples of these types of file structures are disk images, OLE documents and TIFF images.
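A toy version of such a layout is sketched below. It is not the TIFF or OLE format; the header, the 12-byte directory entry (tag, offset, length) and the tag names are all hypothetical, chosen only to show how a reader locates data through the directory rather than by scanning.

```python
import struct

ENTRY = struct.Struct(">4sII")          # hypothetical entry: tag, offset, length


def build(items) -> bytes:
    """Lay out a file as: entry count, directory entries, then the data itself."""
    data_start = 4 + len(items) * ENTRY.size
    directory = b""
    blob = b""
    for tag, payload in items:
        directory += ENTRY.pack(tag, data_start + len(blob), len(payload))
        blob += payload
    return struct.pack(">I", len(items)) + directory + blob


def read_entry(image: bytes, wanted: bytes):
    """Find `wanted` in the directory and return its data, or None if absent."""
    (count,) = struct.unpack_from(">I", image, 0)
    for index in range(count):
        tag, offset, length = ENTRY.unpack_from(image, 4 + index * ENTRY.size)
        if tag == wanted:
            return image[offset:offset + length]
    return None


image = build([(b"TITL", b"hello"), (b"PIXL", b"\x00\x01\x02")])
pixels = read_entry(image, b"PIXL")
```

Since entries point at their data by offset, a reader can jump straight to the item it wants, and new entries can be added without disturbing items an older reader already understands.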

See also

  • Audio file format
  • Chemical file format
  • Container format (digital)
  • Document file format
  • DROID file format identification utility
  • File (Unix), a file type identification utility
  • Filename extension
  • Free file format
  • Future proofing
  • Graphics file format summary
  • List of archive formats
  • Image file formats
  • List of file formats
  • List of motion and gesture file formats
  • Magic number (programming)
  • Object file format
  • Object file
  • Open format
  • TrID, a freeware file type identification utility
  • Windows file types
