ZFS - AbsoluteAstronomy.com

In computing, ZFS is a combined file system and logical volume manager designed by Sun Microsystems. The features of ZFS include data integrity verification against data corruption modes (like bit rot), support for high storage capacities, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, continuous integrity checking and automatic repair, RAID-Z and native NFSv4 ACLs. ZFS is implemented as open-source software, licensed under the Common Development and Distribution License (CDDL). The ZFS name is a trademark of Oracle.

History

ZFS was designed and implemented by a team at Sun led by Jeff Bonwick. It was announced on September 14, 2004. Source code for ZFS was integrated into the main trunk of Solaris development on October 31, 2005 and released as part of build 27 of OpenSolaris on November 16, 2005. Sun announced that ZFS was included in the 6/06 update to Solaris 10 in June 2006, one year after the opening of the OpenSolaris community.

The name originally stood for "Zettabyte File System". A ZFS file system can store up to 256 quadrillion zettabytes (ZB), where a zettabyte is 2⁷⁰ bytes.

Version numbers

As new features are introduced, the version numbers of the zpool and the ZFS file system are incremented to designate the format and features available.
Notable ZFS storage pool versions include:

10 - Supported by Solaris 10 U7
14 - Supported by OpenSolaris 2009.06, FreeBSD 8.1
15 - Supported by Solaris 10 10/09 (U8), FreeBSD 8.2
17 - Triple Parity RAID-Z
19 - Supported by Solaris 10 09/10
21 - Deduplication
22 - Solaris 10 9/10 (U9)
28 - FreeBSD 9.0, OpenIndiana/Illumos, ZFSOnLinux, ZFS-FUSE
29 - Solaris 10 8/11 (U10)
30 - Encryption support; not compatible with open implementations, available only in the closed-source, for-license Solaris 11 Express release.

Data Integrity

One major feature that distinguishes ZFS from other file systems is that ZFS is designed from the ground up with a focus on data integrity. That is, it is designed to protect the user's data on disk against silent corruption caused by, for example, bit rot, cosmic radiation, current spikes, bugs in disk firmware, and ghost writes.

Data integrity is a high priority in ZFS because recent research shows that neither the currently widespread file systems (such as Ext, XFS, JFS, ReiserFS, or NTFS) nor hardware RAID provides sufficient protection against such problems. It is well known that hardware RAID has some issues with data integrity. Initial research indicates that ZFS protects data better than earlier solutions.

For ZFS, data integrity is achieved by using a Fletcher-based checksum or a SHA-2 hash throughout the file system tree. Each block of data is checksummed and the checksum value is then saved in the pointer to that block, rather than in the block itself. Next, the block pointer is checksummed, with the value being saved at its pointer. This checksumming continues all the way up the file system's data hierarchy to the root node, which is also checksummed, thus creating a Merkle tree. When a block is accessed, regardless of whether it is data or metadata, its checksum is calculated and compared with the stored checksum value of what it "should" be. If the checksums match, the data is passed up the programming stack to the process that asked for it. If the values do not match, then ZFS can heal the data if the storage pool has redundancy via ZFS mirroring or RAID. If the storage pool consists of a single disk, it is possible to provide such redundancy by specifying "copies=2" (or "copies=3"), which means that data will be stored twice (or three times) on the disk, effectively reducing the storage capacity of the disk to one half (or one third). If redundancy exists, then ZFS fetches the second copy of the data (or recreates it via a RAID recovery mechanism) and recalculates the checksum, ideally reproducing the original value this time. If the data passes the integrity check, the system can then update the first copy with known-good data so that redundancy can be restored.
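The read path described above can be sketched in Python. This is a minimal illustration, not ZFS code: the `Mirror` class, its dictionary-backed "disks", and the use of SHA-256 as the only checksum are assumptions made for the example. It shows the checksum living in the block pointer rather than in the block, and a read that detects a bad copy and repairs it from a good one.

```python
import hashlib
from typing import Tuple

def checksum(data: bytes) -> bytes:
    """SHA-256 stands in here for whichever checksum the pool uses."""
    return hashlib.sha256(data).digest()

class Mirror:
    """Hypothetical two-way mirror: every address holds one copy per disk."""

    def __init__(self, copies: int = 2):
        self.disks = [dict() for _ in range(copies)]
        self.next_addr = 0

    def write(self, data: bytes) -> Tuple[int, bytes]:
        """Store data on every disk; return a block pointer (address, checksum)."""
        addr = self.next_addr
        self.next_addr += 1
        for disk in self.disks:
            disk[addr] = data
        return addr, checksum(data)

    def read(self, pointer: Tuple[int, bytes]) -> bytes:
        """Return a copy that matches the pointer's checksum, repairing bad copies."""
        addr, expected = pointer
        good = None
        for disk in self.disks:
            if checksum(disk[addr]) == expected:
                good = disk[addr]
                break
        if good is None:
            raise IOError(f"block {addr}: no copy matches its checksum")
        for disk in self.disks:
            if disk[addr] != good:
                disk[addr] = good          # self-heal the damaged copy
        return good

pool = Mirror()
leaf = pool.write(b"user data")
# The parent block embeds the child's address and checksum, so the parent's
# own checksum (held one level further up) also covers the child's checksum.
parent = pool.write(leaf[0].to_bytes(8, "big") + leaf[1])

pool.disks[0][leaf[0]] = b"user dat\x00"   # simulate silent corruption on disk 0
print(pool.read(leaf))                     # the good copy is returned
print(pool.disks[0][leaf[0]])              # disk 0 has been repaired
```

Because the expected checksum is stored apart from the data it describes, a disk that returns the wrong block, or a stale one, is caught in the same way as a flipped bit.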

ZFS cannot fully protect the user's data when using a hardware RAID controller, as it is not able to perform the automatic self-healing unless it controls the redundancy of the disks and data. ZFS prefers direct, exclusive access to the disks, with nothing in between that interferes. If the user insists on using hardware-level RAID, the controller should be configured in JBOD mode (i.e. with RAID functionality turned off) for ZFS to be able to guarantee data integrity. Note that hardware RAID configured as JBOD may still detach disks that do not respond in time, and as such may require TLER/CCTL/ERC-enabled disks to prevent drive dropouts: http://wdc.custhelp.com/app/answers/detail/a_id/1397/~/difference-between-desktop-edition-and-raid-%28enterprise%29-edition-drives

These limitations do not apply when using a non-RAID controller, which is the preferred method of supplying disks to ZFS. A non-RAID controller is generally called a Host Bus Adapter (HBA) and allows the operating system to control timeout and error control, rather than the RAID controller, which generally has very strict timeout control.

A modern hard disk devotes a large portion of its capacity to error detection data. Many errors occur during normal usage, but are corrected by the disk's internal software, and thus are not visible to the host software. A tiny fraction of errors are not corrected. For example, a modern enterprise SAS disk specification estimates this fraction to be one uncorrected error in every 10¹⁶ bits, or approximately one in every 1.25 PB. A smaller fraction of errors are not even detected by the disk firmware or the host operating system. This is known as "silent corruption". In a recent study, CERN found this issue to be problematic.

These problems were not a serious concern while storage devices remained relatively small and slow. A user very rarely faced silent corruption, so it was not deemed to be a problem that required a solution. With the advent of larger drives and very fast RAID setups, a user is capable of transferring 10¹⁶ bits in a sufficiently short time. In particular, ZFS creator Jeff Bonwick stated that the fast database at Greenplum (a database software company located in San Mateo, California, specializing in enterprise data cloud solutions for large-scale data warehousing and analytics) faces silent corruption every 15 minutes, which is one of the reasons that Greenplum now bases its fast database solution on ZFS. These large and fast RAID setups require new file systems that focus on data integrity. This is one of the design goals of ZFS, as explained by Jeff Bonwick.
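To make "a sufficiently short time" concrete, the following sketch converts the quoted rate of one uncorrected error per 10¹⁶ bits into an expected interval at a given read throughput. The throughput figures are hypothetical examples, not measurements from the text, and the function names are invented for illustration.

```python
BITS_PER_ERROR = 10**16   # one uncorrected error per 10^16 bits, as quoted above

def bytes_per_error(bits: int = BITS_PER_ERROR) -> float:
    """Amount of data read, in bytes, per expected uncorrected error."""
    return bits / 8

def days_until_expected_error(throughput_bytes_per_s: float) -> float:
    """Days of sustained reading before one uncorrected error is expected."""
    seconds = bytes_per_error() / throughput_bytes_per_s
    return seconds / 86_400

# Hypothetical sustained throughputs, chosen only for illustration.
for label, rate in [("single disk at 100 MB/s", 100e6),
                    ("fast array at 2 GB/s", 2e9)]:
    print(f"{label}: {days_until_expected_error(rate):.1f} days")
```

The interval shrinks in proportion to throughput, which is why an error rate that is negligible for one slow disk becomes a routine event on a large, fast array.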

ZFS has no "fsck" repair tool, which is common on Unix/Linux file systems and examines and repairs the file system. Instead, ZFS has a repair tool called "scrub", which examines and repairs silent corruption and other problems. Some differences are:

fsck must be run on an offline filesystem, which means the filesystem must be unmounted and is not usable while being repaired.
fsck usually checks only metadata (such as the journal log) but never checks the data itself. This means that, after an fsck, the data might still be corrupt.
scrub does not need the ZFS filesystem to be taken offline. scrub is designed to be used on a working, mounted, live filesystem.
scrub checks everything, including metadata and the data.

The official recommendation from Sun/Oracle is to scrub once every month when using enterprise disks, because they have much higher reliability than cheap commodity disks. With cheap commodity disks, scrub every week.
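The idea behind a scrub can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the pool is a list of dictionaries acting as mirror sides, the expected checksums are kept in a separate table, and the `scrub` function and its report format are invented for the example. It only shows that every copy of every block, data and metadata alike, is verified and repaired where a good copy exists.

```python
import hashlib

def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def scrub(checksums: dict, mirrors: list) -> dict:
    """Verify every block on every mirror side; repair from any good copy."""
    report = {"checked": 0, "repaired": 0, "unrecoverable": 0}
    for block_id, expected in checksums.items():
        copies = [side[block_id] for side in mirrors]
        good = next((c for c in copies if sha(c) == expected), None)
        for side in mirrors:
            report["checked"] += 1
            if sha(side[block_id]) == expected:
                continue                      # this copy is intact
            if good is None:
                report["unrecoverable"] += 1  # no intact copy anywhere
            else:
                side[block_id] = good         # rewrite the damaged copy
                report["repaired"] += 1
    return report

blocks = {"metadata-0": b"directory entries", "data-0": b"file contents"}
checksums = {name: sha(data) for name, data in blocks.items()}
mirrors = [dict(blocks), dict(blocks)]
mirrors[1]["data-0"] = b"file c0ntents"       # simulate silent corruption

print(scrub(checksums, mirrors))
```

Nothing in the loop requires exclusive access to the blocks, which is the property that lets a scrub run against a mounted, live filesystem.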

However, no system is immune to bugs or hardware not following standards.

"...For example: FLUSH CACHE should only return, when the cache is flushed. But there are dirt cheap converter chips that sends the FLUSH CACHE to disk, but returns a successful FLUSH CACHE in the same moment back to the OS (of course without having NVRAM on disk or in a controller as this would allow to ignore CACHE FLUSH). Or interface converters reordering commands in really funny ways. By such reordering it may happen, that the uberblock is written to disk, before the rest of the structure has been written to disk..."
http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html

Thus, there are known cases where ZFS has had problems. Therefore, as an extra safety measure, it is possible to go back in time by using the "-F" flag with the "zpool" command. ZFS uses copy-on-write, which means old data is not altered. Whenever data is edited and updated, the old data is left intact, and only the edits are stored, in a new place on the disk. This means every change can be traced back in time. This allows the user to discard the latest change that caused the problem, and instead go back to an earlier functioning state. This is also how ZFS snapshots work.

Storage pools

Unlike traditional file systems, which reside on single devices and thus require a volume manager to use more than one device, ZFS filesystems are built on top of virtual storage pools called zpools. A zpool is constructed of virtual devices (vdevs), which are themselves constructed of block devices: files, hard drive partitions, or entire drives, with the last being the recommended usage. Block devices within a vdev may be configured in different ways, depending on needs and space available: non-redundantly (similar to RAID 0), as a mirror (RAID 1) of two or more devices, as a RAID-Z (similar to RAID-5) group of three or more devices, or as a RAID-Z2 (similar to RAID-6) group of four or more devices. In July 2009, triple-parity RAID-Z3 was added to OpenSolaris.

Thus, a zpool (ZFS storage pool) is vaguely similar to a computer's RAM. The total RAM capacity depends on the number of memory sticks and the size of each stick. Likewise, a zpool consists of one or more vdevs. Each vdev can be viewed as a group of hard disks (or partitions, or files, etc.). Each vdev should have redundancy, because if a vdev is lost, then the whole zpool is lost. Thus, each vdev should be configured as RAID-Z1, RAID-Z2, mirror, etc. It is not possible to change the number of drives in an existing vdev (Block Pointer Rewrite will allow this, and also allow defragmentation), but it is always possible to increase storage capacity by adding a new vdev to a zpool. It is also possible to swap a drive for a larger drive and resilver (repair) the zpool. If this procedure is repeated for every disk in a vdev, then the zpool will grow in capacity when the last drive is resilvered. Every drive in a vdev is treated as having the capacity of the smallest drive in the group. For instance, a vdev consisting of three 500 GB drives and one 700 GB drive will have a capacity of 4 × 500 GB.
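The sizing rule in the paragraph above can be checked with a small calculation. The function names are invented for the example, and the usable-capacity figures are an approximation that subtracts only whole parity drives; real pools also lose some space to metadata and padding.

```python
PARITY_DRIVES = {"raidz1": 1, "raidz2": 2, "raidz3": 3}

def raw_capacity_gb(drive_sizes_gb):
    """Every drive counts as if it were the size of the smallest drive."""
    return len(drive_sizes_gb) * min(drive_sizes_gb)

def usable_capacity_gb(layout, drive_sizes_gb):
    """Approximate space left for data after redundancy (an assumption)."""
    smallest = min(drive_sizes_gb)
    if layout == "mirror":
        return smallest                       # all drives hold the same data
    return (len(drive_sizes_gb) - PARITY_DRIVES[layout]) * smallest

vdev = [500, 500, 500, 700]
print(raw_capacity_gb(vdev))                  # the 4 x 500 GB case from the text
print(usable_capacity_gb("raidz1", vdev))

# Replacing the three 500 GB drives with 700 GB drives, one resilver at a
# time, grows the vdev only once the last small drive is gone.
print(raw_capacity_gb([700, 700, 500, 700]))
print(raw_capacity_gb([700, 700, 700, 700]))
```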

In addition, pools can have hot spares to compensate for failing disks. ZFS also supports both read and write caching, for which special devices can be used. Solid State Devices can be used for the L2ARC, or Level 2 adaptive replacement cache, speeding up read operations, while NVRAM buffered SLC memory can be boosted with supercapacitors to implement a fast, non-volatile write cache, improving synchronous writes. Finally, when mirroring, block devices can be grouped according to physical chassis, so that the filesystem can continue in the case of the failure of an entire chassis.

Storage pool composition is not limited to similar devices but can consist of ad-hoc, heterogeneous collections of devices, which ZFS seamlessly pools together, subsequently doling out space to diverse filesystems as needed. Arbitrary storage device types can be added to existing pools to expand their size at any time. The storage capacity of all vdevs is available to all of the file system instances in the zpool. A quota can be set to limit the amount of space a file system instance can occupy, and a reservation can be set to guarantee that space will be available to a file system instance.
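How a quota and a reservation interact with shared pool space can be sketched as simple bookkeeping. The `Pool` class below is a hypothetical model written for this illustration, with capacities in arbitrary units; it is not how ZFS accounts for space internally.

```python
class Pool:
    """Toy pool whose datasets all draw from one shared capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.datasets = {}   # name -> {"used", "quota", "reservation"}

    def _free_for(self, name=None):
        """Space not used by, or reserved for, any other dataset."""
        committed = sum(max(d["used"], d["reservation"])
                        for n, d in self.datasets.items() if n != name)
        return self.capacity - committed

    def create(self, name, quota=None, reservation=0):
        if reservation > self._free_for():
            raise ValueError("not enough free space to reserve")
        self.datasets[name] = {"used": 0, "quota": quota,
                               "reservation": reservation}

    def write(self, name, amount):
        d = self.datasets[name]
        new_used = d["used"] + amount
        if d["quota"] is not None and new_used > d["quota"]:
            raise IOError("quota exceeded")          # upper limit
        if new_used > self._free_for(name):
            raise IOError("pool out of space")       # others' space is protected
        d["used"] = new_used

pool = Pool(capacity=100)
pool.create("db", reservation=40)     # guaranteed 40 units
pool.create("home", quota=30)         # may never exceed 30 units
pool.create("scratch")                # no limit, no guarantee
pool.write("scratch", 60)             # fills everything that is not reserved
pool.write("db", 40)                  # still succeeds thanks to the reservation
```

In this model a quota is a ceiling checked against the dataset's own usage, while a reservation is space subtracted from what every other dataset may consume.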

Capacity

ZFS is a 128-bit file system, so it can address 1.84 × 10¹⁹ times more data than 64-bit systems such as NTFS. The limitations of ZFS are designed to be so large that they would never be encountered. This was assured by surpassing physical rather than theoretical limitations: there simply is not enough usable matter on the planet Earth to support a maximized ZFS filesystem. Some theoretical limits in ZFS are:

2⁴⁸ - Number of entries in any individual directory
16 exabytes - Maximum size of a single file
16 exabytes - Maximum size of any attribute
256 zettabytes (2⁷⁸ bytes) - Maximum size of any zpool
2⁵⁶ - Number of attributes of a file (actually constrained to 2⁴⁸ for the number of files in a ZFS file system)
2⁶⁴ - Number of devices in any zpool
2⁶⁴ - Number of zpools in a system
2⁶⁴ - Number of file systems in a zpool
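The figures in this list can be cross-checked with integer arithmetic. The sketch assumes the binary interpretation of the prefixes that the article itself uses (a zettabyte as 2⁷⁰ bytes, hence an exabyte as 2⁶⁰ bytes); the helper `log2_exact` is written for this example. Reading "quadrillion" as 2⁵⁰ in the last line is a further assumption, made because it is the reading under which "256 quadrillion zettabytes" equals the full 128-bit space.

```python
EIB = 2**60   # exabyte in the binary sense assumed here
ZIB = 2**70   # zettabyte as defined in the text

def log2_exact(n: int) -> int:
    """Return k such that n == 2**k, or raise if n is not a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1

print(log2_exact(16 * EIB))             # exponent for the maximum file size
print(log2_exact(256 * ZIB))            # exponent for the maximum zpool size
print(f"{2**128 // 2**64:.2e}")         # ratio of 128-bit to 64-bit address space
print(log2_exact(256 * 2**50 * ZIB))    # assumption: quadrillion read as 2**50
```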

Copy-on-write transactional model

ZFS uses a copy-on-write transactional object model. All block pointers within the filesystem contain a 256-bit checksum or 256-bit hash (currently a choice between Fletcher-2, Fletcher-4, or SHA-256) of the target block, which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and an intent log is used when synchronous write semantics are required. The blocks are arranged in a tree, as are their checksums (see Merkle signature scheme).
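The "never overwrite in place" rule can be illustrated with a small immutable tree in Python. This is a sketch of the general copy-on-write technique, not of ZFS's on-disk format: the `Node` class, the `cow_write` function and the use of SHA-256 are assumptions made for the example. Writing a leaf allocates a new leaf and new copies of each ancestor up to the root, while untouched subtrees are shared between the old and new trees.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Node:
    """Immutable block: a leaf holds bytes, an interior block holds children."""
    payload: Union[bytes, Tuple["Node", ...]]

    @property
    def checksum(self) -> str:
        if isinstance(self.payload, bytes):
            return hashlib.sha256(self.payload).hexdigest()
        joined = "".join(child.checksum for child in self.payload)
        return hashlib.sha256(joined.encode()).hexdigest()

def cow_write(root: Node, path: Tuple[int, ...], data: bytes) -> Node:
    """Return a new root with the leaf at `path` replaced; `root` is untouched."""
    if not path:
        return Node(data)                         # newly allocated leaf
    children = list(root.payload)
    children[path[0]] = cow_write(children[path[0]], path[1:], data)
    return Node(tuple(children))                  # newly allocated ancestor

old_root = Node((Node((Node(b"a"), Node(b"b"))), Node((Node(b"c"),))))
new_root = cow_write(old_root, (0, 1), b"B")

print(old_root.payload[0].payload[1].payload)      # old tree still reads b"b"
print(new_root.payload[0].payload[1].payload)      # new tree reads b"B"
print(new_root.payload[1] is old_root.payload[1])  # untouched subtree is shared
print(new_root.checksum != old_root.checksum)      # the change reaches the root
```

Because the old root remains valid until the new root is published, an interrupted update leaves the previous consistent tree in place.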

Snapshots and clones

An advantage of copy-on-write is that when ZFS writes new data, the blocks containing the old data can be retained, allowing a snapshot version of the file system to be maintained. ZFS snapshots are created very quickly, since all the data composing the snapshot is already stored; they are also space efficient, since any unchanged data is shared among the file system and its snapshots.

Writeable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist. This is an implementation of the copy-on-write principle.
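Block sharing between a file system, its snapshots and its clones can be modelled with Python references. The `Dataset` class and `clone_of` function are hypothetical names for this sketch; a real implementation tracks blocks on disk, whereas here each block is an immutable `bytes` object and "sharing" means two mappings refer to the same object.

```python
from types import MappingProxyType

class Dataset:
    """Toy file system: maps block numbers to immutable bytes objects."""

    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})

    def write(self, blkno, data):
        # Rebinds the reference to a new block; the old block is not modified.
        self.blocks[blkno] = data

    def snapshot(self):
        """Read-only view that copies only references, never block contents."""
        return MappingProxyType(dict(self.blocks))

def clone_of(snapshot):
    """Writable file system that starts out sharing every block of a snapshot."""
    return Dataset(snapshot)

fs = Dataset({0: b"alpha", 1: b"beta"})
snap = fs.snapshot()
clone = clone_of(snap)

fs.write(1, b"BETA")          # diverges from the snapshot in block 1 only
clone.write(0, b"ALPHA")      # the clone diverges in block 0 only

print(snap[0], snap[1])               # the snapshot still shows the old state
print(fs.blocks[0] is snap[0])        # unchanged block shared with the snapshot
print(clone.blocks[1] is snap[1])     # unchanged block shared with the clone
```

Taking the snapshot costs one reference per block, independent of how much data the blocks hold, which mirrors why snapshots are quick to create and cheap to keep.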

Dynamic striping

Dynamic striping across all devices to maximize throughput means that as additional devices are added to the zpool, the stripe width automatically expands to include them; thus all disks in a pool are used, which balances the write load across them.
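A stripe that widens as devices are added can be sketched as follows. The `StripedPool` class is invented for this example and uses plain round-robin placement, which is a simplifying assumption; a real allocator may weigh other factors, such as how full each device is.

```python
class StripedPool:
    """Toy pool that spreads successive writes across all current devices."""

    def __init__(self, devices):
        self.devices = {name: [] for name in devices}
        self._cursor = 0

    def add_device(self, name):
        self.devices[name] = []       # takes part in the stripe immediately

    def write(self, block):
        names = list(self.devices)
        target = names[self._cursor % len(names)]
        self.devices[target].append(block)
        self._cursor += 1
        return target

pool = StripedPool(["disk0", "disk1"])
print([pool.write(i) for i in range(4)])    # alternates between two devices

pool.add_device("disk2")
print([pool.write(i) for i in range(6)])    # now cycles over three devices
```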

Variable block sizes

ZFS uses variable-sized blocks of up to 128 kilobytes. The currently available code allows the administrator to tune the maximum block size used, as certain workloads do not perform well with large blocks. If data compression (LZJB) is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve I/O throughput (though at the cost of increased CPU use for the compression and decompression operations).
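The effect of compression on the on-disk block size can be sketched in Python. Two assumptions are made loudly here: `zlib` stands in for LZJB, which is not in the Python standard library, and the 512-byte allocation unit is a value chosen for the example rather than something stated in the text. The function name `on_disk_size` is likewise invented.

```python
import os
import zlib

SECTOR = 512                 # assumed allocation unit for this sketch
MAX_BLOCK = 128 * 1024       # the 128-kilobyte ceiling from the text

def on_disk_size(data: bytes, compress: bool = True) -> int:
    """Bytes allocated for one logical block, rounded up to whole sectors."""
    if len(data) > MAX_BLOCK:
        raise ValueError("logical block exceeds the maximum block size")
    payload = zlib.compress(data) if compress else data
    if len(payload) >= len(data):
        payload = data                       # compression did not help
    sectors = -(-len(payload) // SECTOR)     # ceiling division
    return sectors * SECTOR

zeros = bytes(MAX_BLOCK)                     # highly compressible
noise = os.urandom(4096)                     # effectively incompressible

print(on_disk_size(zeros), "bytes on disk for", len(zeros), "logical bytes")
print(on_disk_size(noise), "bytes on disk for", len(noise), "logical bytes")
```

A block that shrinks occupies fewer sectors and takes less time to read back, while a block that does not shrink is stored unmodified, so only the CPU time of the attempt is lost.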

Lightweight filesystem creation

In ZFS, filesystem manipulation within a storage pool is easier than volume manipulation within a traditional filesystem; the time and effort required to create or resize a ZFS filesystem is closer to that of making a new directory than it is to volume manipulation in some other systems.

Cache management

ZFS also uses the ARC (Adaptive Replacement Cache), a new method for read cache management, instead of the traditional Solaris virtual memory page cache. For write caching, ZFS employs the Intent Log (ZIL). ZFS makes allowances for both of these methods to incorporate separate virtual devices to improve the total IOPS. For read operations it is the "cache" vdev and for write operations it is the "log" vdev.

Adaptive endianness

Pools and their associated ZFS file systems can be moved between different platform architectures, including systems implementing different byte orders. The ZFS block pointer format stores filesystem metadata in an endian-adaptive way; individual metadata blocks are written with the native byte order of the system writing the block. When reading, if the stored endianness does not match the endianness of the system, the metadata is byte-swapped in memory.

This does not affect the stored data itself; as is usual in POSIX systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness.
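The endian-adaptive scheme for metadata can be sketched with Python's `struct` module. The record layout used here (one flag byte followed by 64-bit words) is a hypothetical format invented for the example, not the actual ZFS block pointer layout. The writer records its own byte order, and the reader swaps bytes in memory only when that order differs from its own.

```python
import struct
import sys

NATIVE = "<" if sys.byteorder == "little" else ">"

def write_meta(values, order=NATIVE):
    """Serialize 64-bit words in the writer's byte order, plus a flag byte."""
    flag = b"\x01" if order == "<" else b"\x00"
    return flag + struct.pack(f"{order}{len(values)}Q", *values)

def byteswap64(word: int) -> int:
    return int.from_bytes(word.to_bytes(8, "little"), "big")

def read_meta(blob, host_order=NATIVE):
    """Load the words as the host sees them, swapping only on a mismatch."""
    stored_order = "<" if blob[0] == 1 else ">"
    count = (len(blob) - 1) // 8
    words = struct.unpack(f"{host_order}{count}Q", blob[1:])
    if stored_order != host_order:
        words = tuple(byteswap64(w) for w in words)
    return list(words)

record = write_meta([1, 2**40 + 7], order=">")   # written on a big-endian host
print(read_meta(record, host_order="<"))         # read on a little-endian host
print(read_meta(record, host_order=">"))         # read on the same kind of host
```

Writing in native order keeps the common case, where writer and reader share an architecture, free of any conversion cost.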

Deduplication

Data deduplication capability was added to the ZFS source repository at the end of October 2009. The OpenSolaris ZFS development packages have been available since December 3, 2009 (build 128).

Effective use of deduplication requires additional hardware. ZFS designers recommend 2 GB of RAM for every 1 TB of storage. Example: at least 32 GB of memory is recommended for 20 TB of storage. If RAM is lacking, consider adding an SSD as a cache, which will automatically handle the large deduplication tables. This can speed up deduplication performance 8x or more. Insufficient physical memory or lack of ZFS cache results in virtual memory thrashing, which lowers performance.
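Why deduplication leans so heavily on memory becomes clearer from a sketch of a deduplication table. The `DedupStore` class below is a simplified model written for this illustration, keyed by SHA-256 and holding blocks in a Python dictionary; the point is only that every write must look up the block's checksum in the table before deciding whether to store anything.

```python
import hashlib

class DedupStore:
    """Toy block store that keeps one physical copy per distinct block."""

    def __init__(self):
        self.table = {}   # checksum -> [data, reference count]

    def write(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        entry = self.table.setdefault(key, [data, 0])   # lookup on every write
        entry[1] += 1
        return key

    def free(self, key: str) -> None:
        entry = self.table[key]
        entry[1] -= 1
        if entry[1] == 0:
            del self.table[key]       # last reference gone, reclaim the block

    def physical_bytes(self) -> int:
        return sum(len(data) for data, _ in self.table.values())

store = DedupStore()
a = store.write(b"x" * 4096)
b = store.write(b"x" * 4096)      # duplicate: only the reference count grows
c = store.write(b"y" * 4096)

print(a == b)                     # both writes resolve to the same entry
print(store.physical_bytes())     # two distinct blocks are stored, not three
```

The table grows with the number of distinct blocks in the pool, so when it no longer fits in RAM or a fast cache device, each write can stall on a slow lookup.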

As of Solaris 11 Express, deduplication can cause several problems for users who are not aware of its limitations.

Encryption

The encryption capability in ZFS is embedded into the I/O pipeline. During writes a block may be compressed, encrypted, checksummed and then deduplicated, in that order. The policy for encryption is set at the dataset level when datasets (file systems or ZVOLs) are created. The wrapping keys provided by the user/administrator can be changed at any time without taking the file system offline. The default behaviour is for the wrapping key to be inherited by any child datasets. The data encryption keys are randomly generated at dataset creation time. Only descendant datasets (snapshots and clones) share data encryption keys. A command is provided to switch to a new data encryption key, either for a clone or at any other time; this does not re-encrypt already existing data.

Additional capabilities

Explicit I/O priority with deadline scheduling.
Claimed globally optimal I/O sorting and aggregation.
Multiple independent prefetch streams with automatic length and stride detection.
Parallel, constant-time directory operations.
End-to-end checksumming, using a kind of "Data Integrity Field", allowing data corruption detection (and recovery if you have redundancy in the pool).
Transparent filesystem compression. Supports LZJB and gzip.
Intelligent scrubbing and resilvering (resyncing).
Load and space usage sharing among disks in the pool.
Ditto blocks: Configurable data replication per filesystem, with zero, one or two extra copies requested per write for user data, and with that same base number of copies plus one or two for metadata (according to metadata importance). If the pool has several devices, ZFS tries to replicate over different devices. Ditto blocks are primarily an additional protection against corrupted sectors, not against total disk failure.
ZFS design (copy-on-write + uberblocks) is safe when using disks with write cache enabled, if they honor the write barriers. This feature provides safety and a performance boost compared with some other filesystems.
When entire disks are added to a ZFS pool, ZFS automatically enables their write cache. This is not done when ZFS only manages discrete slices of the disk, since it does not know if other slices are managed by non-write-cache safe filesystems, like UFS.
Per-user and per-group quota support.
Filesystem encryption since Solaris 11 Express.
Pools can be imported read-only.
At import time, recovery by rolling back whole transactions is possible.
Planned features:
- The so-called Block Pointer rewrite functionality is due to be added, paving the way for resizing pools, defragmentation, (re-)applying compression on filesystems and so on.
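The combination of end-to-end checksumming and redundancy listed above (detection always, recovery only when another copy exists) can be sketched as a "self-healing read". This is an illustrative model, not the ZFS implementation: the function names are hypothetical, SHA-256 stands in for the pool's checksum, and the expected checksum is assumed to be stored apart from the data it covers.

```python
import hashlib
from typing import List


def checksum(data: bytes) -> str:
    """Stand-in checksum; a real pool may use a different algorithm."""
    return hashlib.sha256(data).hexdigest()


def self_healing_read(copies: List[bytearray], expected: str) -> bytes:
    """Return the first copy matching the expected checksum and repair bad copies in place.

    Raises IOError when every copy is corrupt: the damage is detected but
    cannot be repaired without a good redundant copy.
    """
    good = None
    for copy in copies:
        if checksum(bytes(copy)) == expected:
            good = bytes(copy)
            break
    if good is None:
        raise IOError("all copies are corrupt: detected but not recoverable")
    for copy in copies:
        if checksum(bytes(copy)) != expected:
            copy[:] = good  # overwrite the damaged copy with the verified data
    return good
```

With a single copy the function can only raise an error on corruption; with two or more copies (a mirror, or ditto blocks) it returns verified data and rewrites the damaged copy.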

Limitations

Capacity expansion is normally achieved by adding groups of disks as a top-level vdev: simple device, RAID-Z, RAID-Z2, RAID-Z3, or mirrored. Newly written data will dynamically start to use all available vdevs. It is also possible to expand the array by iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to heal itself — the heal time will depend on the amount of stored information, not the disk size. The new free space will not be available until all the disks have been swapped.
It is currently not possible to reduce the number of top-level vdevs in a pool or otherwise reduce pool capacity. This functionality was said to be in development as early as 2007, but it is not available as of Solaris 10 9/10 (also known as update 9).
It is not possible to add a disk as a column to a RAID-Z, RAID-Z2, or RAID-Z3 vdev. This feature depends on the block pointer rewrite functionality due to be added soon. One can however create a new RAID-Z vdev and add it to the zpool.
Vdevs cannot be nested, so a mirror or RAID-Z top-level vdev can only contain files or disks. Mirrors of mirrors (or other combinations) are not allowed.
Reconfiguring the number of devices in a top-level vdev requires copying data offline, destroying the pool, and recreating the pool with the new top-level vdev configuration. There are two exceptions: extra redundancy can be added to an existing mirror at any time, and if all top-level vdevs are mirrors with sufficient redundancy, the zpool split command can be used to remove a vdev from each top-level vdev in the pool, creating a second pool with identical data.
If you use a single disk, by default ZFS can only detect and report silent data corruption (because of the checksums) but not repair it. For ZFS to both detect and repair data corruption, you must specify "copies=2" http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection which tells ZFS to store data twice on the disk (halving your storage capacity). If a data block gets corrupted, ZFS will repair it from the other copy. "copies" does not help against a disk crash: to recover from one, you need disk redundancy such as raidz1, raidz2 or mirror, which requires two or more disks. This applies to all file systems; no file system can protect your data against a disk crash when you use a single disk. Thus, ZFS "copies" is less a limitation than an advantage, because it lets ZFS repair corrupted data even when using only a single disk. To further increase safety, "copies=3" can be used, which stores three copies of the data.
Resilver (repair) of a crashed disk in a ZFS raid takes a long time. This applies to all types of RAID, in one way or another. Future large disks, say 5 TB or 6 TB, could take several days to repair. For this reason raidz1 (similar to RAID-5) should be avoided with large disks: repairing a raid puts additional stress on the other disks, which might cause them to crash, and with raidz1 a second failure loses all data in the storage pool. With large disks one should therefore use raidz2 (which tolerates two failed disks) or raidz3 (which tolerates three). Adam Leventhal explains this problem further http://dtrace.org/blogs/ahl/2009/07/21/triple-parity-raid-z/. Note, however, that ZFS RAID differs from conventional RAID solutions in that it reconstructs only the data when replacing a disk, not the entirety of the disk, so replacing a member disk in a ZFS pool that is half full takes only about half the time it would with conventional RAID.
IOPS performance of a ZFS storage pool can suffer if the ZFS raid is not appropriately configured. This applies to all types of RAID, in one way or another. If the zpool consists of only one group of disks configured as, say, raidz2, then the IOPS performance will be that of a single disk. To get high IOPS performance, the zpool should therefore consist of several vdevs, because each vdev gives the IOPS of a single disk. There are ways to mitigate this problem, for instance adding SSDs as L2ARC cache, which can boost IOPS into the hundreds of thousands http://blogs.sun.com/brendan/entry/a_quarter_million_nfs_iops . In short, a zpool should consist of several vdevs, each consisting of 8-12 disks. It is not recommended to create a zpool with a single large vdev, say 20 disks, because IOPS performance will be that of a single disk, and resilver time will be very long (possibly weeks with future large drives).
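The two rules of thumb in this list, that each raidz vdev delivers roughly the IOPS of one disk and that resilver time scales with the amount of stored data, can be sketched as simple estimates. The function names are hypothetical, the 100 IOPS per disk figure in the example is an assumed illustrative value, and the model ignores caches such as L2ARC.

```python
def layout_iops(total_disks: int, disks_per_vdev: int, single_disk_iops: float) -> float:
    """Estimate pool IOPS, assuming each raidz vdev performs like a single disk."""
    if disks_per_vdev <= 0 or total_disks < disks_per_vdev:
        raise ValueError("need at least one full vdev")
    vdev_count = total_disks // disks_per_vdev
    return vdev_count * single_disk_iops


def resilver_time_ratio(pool_fill_fraction: float) -> float:
    """Resilver time relative to a whole-disk rebuild, assuming only stored data is copied."""
    if not 0.0 <= pool_fill_fraction <= 1.0:
        raise ValueError("fill fraction must be between 0 and 1")
    return pool_fill_fraction


if __name__ == "__main__":
    # 20 disks as one vdev versus two vdevs of 10 disks, at an assumed 100 IOPS per disk.
    print(layout_iops(20, 20, 100.0))
    print(layout_iops(20, 10, 100.0))
    print(resilver_time_ratio(0.5))
```

Under these assumptions, splitting 20 disks into two 10-disk vdevs doubles the estimated IOPS compared with a single 20-disk vdev, and a half-full pool resilvers in about half the time of a whole-disk rebuild.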

Solaris 10

ZFS is part of Sun's own Solaris operating system and is thus available on both SPARC and x86-based systems. Since the code for ZFS is open source, a port to other operating systems and platforms can be produced without Sun's involvement.

Solaris 11

After Oracle's Solaris 11 Express release, the OS/Net consolidation (the main OS code) was made proprietary and closed-source, and further ZFS upgrades and implementations inside Solaris (such as encryption) are not compatible with other non-proprietary implementations which use previous versions of ZFS.

When creating a new ZFS pool, to retain the ability to access the pool from other non-proprietary Solaris-based distributions, it is recommended to upgrade to Solaris 11 Express from OpenSolaris (snv_134b), and thereby stay at ZFS version 28.

OpenSolaris

OpenSolaris 2008.05 and 2009.06 use ZFS as their default filesystem. There are over a dozen third-party distributions, of which nearly a dozen are mentioned here. (OpenIndiana and Illumos are two new distributions not included on the OpenSolaris distribution reference page.)

OpenIndiana

OpenIndiana 148 and 151 use ZFS version 28, as implemented in Illumos.

By upgrading from OpenSolaris snv_134 to both OpenIndiana and Solaris 11 Express, one can upgrade and separately boot Solaris 11 Express on the same ZFS pool. Solaris 11 Express should not be installed first, however, because of the ZFS incompatibilities introduced by Oracle past ZFS version 28.

FreeBSD

Pawel Jakub Dawidek ported ZFS to FreeBSD, and it has been part of FreeBSD since version 7.0. This includes zfsboot, which allows booting FreeBSD directly from a ZFS volume.

FreeBSD's ZFS implementation is fully functional; the only missing features are a kernel CIFS server and iSCSI, but at least the latter can be added using externally available packages. A CIFS server can be emulated in user space using Samba.

FreeBSD 7-stable (the branch to which updates for the 7.x series are committed) uses zpool version 6.

FreeBSD version 8 includes a much-updated implementation of ZFS, and zpool version 13 is supported in FreeBSD release 8.0. zpool version 14 support was added to the 8-stable branch on 11 January 2010, and is included in FreeBSD release 8.1. zpool version 15 is supported in release 8.2.
The 8-stable branch gained support for zpool version 28 and zfs version 5 in early June 2011. Version 28 will therefore be supported in the 8.x FreeBSD series with the release of FreeBSD 8.3.

The 9-current development branch of FreeBSD uses ZFS Pool version 28.

FreeNAS

FreeNAS, an embedded open source network-attached storage (NAS) distribution based on FreeBSD, has the same ZFS support as FreeBSD.

GNU/kFreeBSD

Being based on the FreeBSD kernel, GNU/kFreeBSD has ZFS support from the kernel. However, it depends on the distribution of GNU/kFreeBSD whether the necessary userland tools are available. The only distribution of this system to date (Debian GNU/kFreeBSD) provides ZFS utilities in the zfsutils package. Additionally, the Debian installer supports installing the operating system on ZFS on the amd64 architecture.

NetBSD

The NetBSD ZFS port was started as a part of the 2007 Google Summer of Code, and in August 2009 the code was merged into NetBSD's source tree.

Mac OS X

The first indication of Apple Inc.'s interest in ZFS was an April 2006 post on the opensolaris.org zfs-discuss mailing list where an Apple employee mentioned being interested in porting ZFS to their Mac OS X operating system.

In the release version of Mac OS X 10.5, ZFS was available in read-only mode from the command line, without the ability to create zpools or write to them. Before the 10.5 release, Apple released the "ZFS Beta Seed v1.1", which allowed read-write access and the creation of zpools. However, the installer for the "ZFS Beta Seed v1.1" has been reported to work only on version 10.5.0, and has not been updated for version 10.5.1 and above.

In August 2007, Apple opened a ZFS project on their Mac OS Forge site. On that site, Apple provided the source code and binaries of their port of ZFS which includes read-write access, but there was no installer available until a third-party developer created one.

In October 2009, Apple announced a shutdown of the ZFS project on Mac OS Forge. No explanation was given, just the following statement: "The ZFS project has been discontinued. The mailing list and repository will also be removed shortly." Versions of the previously released source and binaries, as well as the wiki, have been preserved and development has been adopted by a group of enthusiasts.

Complete ZFS support was once advertised as a feature of Snow Leopard Server (Mac OS X Server 10.6). However, all references to this feature have been silently removed; it is no longer listed on the Snow Leopard Server features page. Apple has not commented regarding the omission.

The maczfs project mirrored the public archives before they disappeared, and a community-maintained project currently (as of 5 May 2011) provides basic ZFS software for most recent versions of OS X, including Lion (10.7).

In March 2011, the company Ten's Complement LLC (founded by Don Brady, a former Apple engineer who was technical lead on the original HFS+ team and worked on Apple's abandoned internal project to port ZFS) announced that it was close to releasing a version of ZFS for Mac OS X called "Z410 Storage". Z410 Storage would be targeted at prosumers. As of 15 August 2011 Z-410 is in a private beta state and supports zpool v28.

Linux

Porting ZFS to Linux is complicated by the fact that the GNU General Public License, which governs the Linux kernel, is incompatible with the Sun CDDL under which ZFS is distributed. According to some developers a single derived work of both projects cannot be legally distributed, as it is not possible to simultaneously meet both licenses' requirements. To include ZFS in the Linux kernel it would have to be cleanly reimplemented, and patents may hamper this.

Linux FUSE

Another solution to this problem was to port ZFS to Linux's FUSE system so the filesystem runs in userspace instead, where it is not considered a derived work of the kernel. A project to do this was sponsored by Google's Summer of Code program in 2006. The original ZFS on FUSE project is available here. Development for ZFS on FUSE/Linux now takes place at zfs-fuse.net.

Native ZFS on Linux

A native port of ZFS for Linux is in development. This ZFS on Linux port was produced at the Lawrence Livermore National Laboratory (LLNL) under Contract No. DE-AC52-07NA27344 (Contract 44) between the U.S. Department of Energy (DOE) and Lawrence Livermore National Security, LLC (LLNS) for the operation of LLNL. It has been approved for release under LLNL-CODE-403049. The port is currently in release candidate status for version 0.6.0, which supports mounting filesystems.

Another native port was being worked on by KQ Infotech. This port used the LLNL ZVOL implementation as a starting point. A GA release supporting zpool v28 was released in January 2011. In mid-2011, KQ Infotech was acquired by another company, and their work on ZFS ceased. Their code can be found on GitHub.

Comparisons

List of operating systems, distributions and add-ons that support ZFS, the zpool version each supports, and the Solaris build each is based on (if any):

OS	Zpool version	Sun/Oracle Build #	Comments
Oracle Solaris Express 11 2010.11	31	b151a	licensed for testing only
OpenSolaris 2009.06	14	b111b
OpenSolaris (last dev)	22	b134
OpenIndiana	28	b147	OpenIndiana creates a name clash with naming their code b151a
Nexenta Core 3.0.1	26	b134+	GNU userland
NexentaStor Community 3.1.0	28	b134+	GNU userland
NexentaStor Community 3.0.1	26	b134+	up to 18 TB, web admin
NexentaStor Enterprise	28	b134+	not free, web admin
FreeBSD 8.2-RELEASE	15		no CIFS or iSCSI
FreeBSD 8-STABLE / 9-CURRENT	28		no CIFS or iSCSI
Linux FUSE 0.7.0	23		low efficiency
Native Linux port (LLNL)	28		no stable POSIX layer, release candidate has basic POSIX layer
Native Linux port (KQ Infotech)	28		includes POSIX layer
Belenix 0.8b1	14	b111
Schillix 0.7.2	28	b147
StormOS "hail"			based on Nexenta
Jaris			Japanese
MilaX 0.5	20	b128a	small size
FreeNAS 8.0.2	15
Korona 4.5.0	22	b134	KDE
EON NAS	22	b130	embedded NAS
Mac OS X 10.6 (kernel extension / module)	8		Somewhat stable with installable packages for those who wish to use it and test, 1 reported crash. Project Page
Mac OS X 10.6/10.7 (Z-410)			Commercial/nonfree port. Currently (November 2011) in beta. http://tenscomplement.com/z-410-storage-main-features

(updated 2011/11/26)
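The table can be used to check which systems could import a given pool. The sketch below assumes the usual rule for numbered pool versions, namely that a system can import a pool only if the pool's version is no higher than the highest version that system supports. The function names are hypothetical, and the version numbers are copied from a few rows of the table above.

```python
# Highest zpool version supported, taken from selected rows of the table above.
SUPPORTED_ZPOOL_VERSION = {
    "Oracle Solaris Express 11 2010.11": 31,
    "OpenSolaris 2009.06": 14,
    "OpenIndiana": 28,
    "FreeBSD 8.2-RELEASE": 15,
    "FreeBSD 8-STABLE / 9-CURRENT": 28,
    "Linux FUSE 0.7.0": 23,
}


def can_import(pool_version: int, system: str) -> bool:
    """True if the named system supports a pool of the given version (assumed rule)."""
    return pool_version <= SUPPORTED_ZPOOL_VERSION[system]


def systems_able_to_import(pool_version: int) -> list:
    """Alphabetical list of the systems above that could import a pool of this version."""
    return sorted(
        name
        for name, supported in SUPPORTED_ZPOOL_VERSION.items()
        if pool_version <= supported
    )


if __name__ == "__main__":
    for version in (14, 28, 31):
        print(version, systems_able_to_import(version))
```

Under that rule, a pool kept at version 28 remains importable by OpenIndiana and FreeBSD 8-STABLE / 9-CURRENT as well as Solaris 11 Express, whereas a pool upgraded to version 31 is importable only by the Oracle release.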

External links

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
