Sparse file - AbsoluteAstronomy.com

In computer science, a sparse file is a type of computer file that attempts to use file system space more efficiently when the file itself is mostly empty. This is achieved by writing brief information (metadata) representing the empty blocks to disk instead of the actual "empty" space which makes up the block, using less disk space. A block is written to disk at its full size only when it contains "real" (non-empty) data.

When reading sparse files, the file system transparently converts metadata representing empty blocks into "real" blocks filled with zero bytes at runtime. The application is unaware of this conversion.
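
The zero-fill behaviour can be observed with a short shell sketch. It assumes a POSIX shell with dd, tr and wc available; the file name hole-demo and both function names are made up for this illustration. Whether the skipped region is really left unallocated depends on the file system, but it reads back as zero bytes either way.

```shell
#!/bin/sh
# Sketch: create a file with a hole, then confirm the hole reads as zeros.
# "hole-demo" is an arbitrary file name chosen for this example.

make_hole_file() {
    rm -f "$1"
    # Write a single byte at offset 1 MiB; the first 1 MiB is never written.
    printf 'x' | dd of="$1" bs=1 seek=1048576 2>/dev/null
}

count_nonzero_bytes() {
    # Delete every NUL byte and count whatever remains.
    tr -d '\000' < "$1" | wc -c | tr -d ' '
}

make_hole_file hole-demo
count_nonzero_bytes hole-demo   # prints 1: only the 'x' is non-zero
```

The file is 1,048,577 bytes long, yet the only non-zero byte any reader sees is the one that was explicitly written.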

Most modern file systems support sparse files, including most Unix variants and NTFS, but notably not Apple's HFS+. Sparse files are commonly used for disk images, database snapshots, log files and in scientific applications.

Advantages

The advantage of sparse files is that storage is only allocated when actually needed: disk space is saved, and large files can be created even if there is insufficient free space on the file system.

Disadvantages

Disadvantages are that sparse files may become fragmented; file system free space reports may be misleading; filling up file systems containing sparse files can have unexpected effects (such as disk-full or quota-exceeded errors when merely overwriting an existing portion of a file that happened to have been sparse); and copying a sparse file with a program that does not explicitly support them may copy the entire apparent size of the file, including the sparse, zero-filled sections which are not on disk, losing the benefits of the sparse property in the file. Sparse files are also not fully supported by all backup software or applications.

Sparse Files in Unix

Sparse files are typically handled transparently to the user, but the differences between a normal file and a sparse file become apparent in some situations.

Creation

The Unix command:
dd if=/dev/null of=sparse-file bs=1k seek=5120
will create a file of five mebibytes in size, but with no data stored on disk (only metadata). (GNU dd has this behavior because it calls ftruncate to set the file size; other implementations may merely create an empty file.)

Similarly, the truncate command may be used, if available:
truncate -s 5M sparse-file
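
Because neither dd nor truncate behaves identically everywhere, a script may want a fallback. The following is a minimal sketch, not a standard utility: the function name make_sparse is hypothetical, and the fallback branch writes one zero byte at the last offset so the file reaches the requested length even where dd does not extend the file on its own (at the cost of allocating one block).

```shell
#!/bin/sh
# Sketch: create a sparse file of SIZE bytes (SIZE >= 1).
# make_sparse is a hypothetical helper name used only in this example.

make_sparse() {
    file=$1
    size=$2
    if command -v truncate >/dev/null 2>&1; then
        # Sets the length directly; no data blocks need to be written.
        truncate -s "$size" "$file"
    else
        # Fallback: start from an empty file, then write a single zero
        # byte at offset SIZE-1 so the length becomes exactly SIZE.
        : > "$file"
        dd if=/dev/zero of="$file" bs=1 count=1 seek=$((size - 1)) 2>/dev/null
    fi
}

make_sparse sparse-file 5242880
wc -c < sparse-file   # apparent size in bytes: 5242880
```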

Detection

The -s option of the ls command shows the occupied space in blocks, and -k sets the block size to 1 KiB. Combined with -l, which lists the apparent size, both figures appear on one line:
ls -lks sparse-file
Or use -h to print both in human-readable format.

Alternatively, try the du command, which prints the occupied space, while ls -l prints the apparent size.
The option --block-size=1 prints the occupied space in bytes instead of blocks,
so that it can be compared to the ls output:
du --block-size=1 sparse-file
ls -l sparse-file
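
The comparison can be wrapped in a small helper. This is a sketch under stated assumptions: sparse_report is a made-up name, du -k and wc -c are used because they are widely available, and the "allocated is less than apparent" test is only a heuristic, since file systems with transparent compression can also report less allocated space than the file's length.

```shell
#!/bin/sh
# Sketch: print apparent and allocated size (both in KiB) for one file.
# Exit status is 0 when the file looks sparse, non-zero otherwise.
# sparse_report is a hypothetical helper name used only in this example.

sparse_report() {
    # Apparent size: byte length of the file, rounded up to whole KiB.
    apparent_kib=$(( ($(wc -c < "$1") + 1023) / 1024 ))
    # Allocated size: what du reports as actually occupied on disk.
    allocated_kib=$(du -k "$1" | cut -f1)
    echo "apparent=${apparent_kib}KiB allocated=${allocated_kib}KiB"
    # Heuristic: less space allocated than the length implies.
    [ "$allocated_kib" -lt "$apparent_kib" ]
}
```

A typical use would be `sparse_report sparse-file && echo "looks sparse"`.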

Copying

Normally, the GNU version of cp is good at detecting whether a file is sparse, so it suffices to run:
cp sparse-file new-file
and new-file will be sparse. However, GNU cp does have a --sparse=WHEN option. This is especially useful if a sparse file has somehow become non-sparse (i.e. the empty blocks have been written out to disk in full). Disk space can be recovered by doing:
cp --sparse=always formerly-sparse-file recovered-sparse-file
Most other cp implementations, such as FreeBSD's, do not support the --sparse option and will always expand sparse files. A partially viable alternative on those systems is to use rsync with its own --sparse option instead of cp. Unfortunately, older versions of rsync cannot combine --sparse with --inplace, so rsync'ing huge files across the network with them will always be wasteful of either network bandwidth or disk bandwidth.
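
Recovering space in place can be sketched as follows. The sketch assumes GNU cp (for --sparse=always); the function name resparsify and the temporary file suffix are invented for this example. Note that the original is replaced by a new file, so any hard links to it are not carried over.

```shell
#!/bin/sh
# Sketch: re-create holes in FILE with GNU cp, replacing the original
# only if the copy is byte-identical. Assumes GNU cp is installed.
# resparsify is a hypothetical helper name used only in this example.

resparsify() {
    tmp="$1.resparse.$$"
    # -p keeps mode and timestamps; --sparse=always turns runs of
    # zero bytes into holes wherever the file system allows it.
    cp -p --sparse=always "$1" "$tmp" &&
        cmp -s "$1" "$tmp" &&
        mv "$tmp" "$1" &&
        return 0
    # Something failed: discard the partial copy, keep the original.
    rm -f "$tmp"
    return 1
}
```

For example, `resparsify formerly-sparse-file` leaves the contents unchanged while letting the file system drop zero-filled blocks.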

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.