Data consistency - AbsoluteAstronomy.com

Data consistency summarizes the validity, accuracy, usability and integrity of related data between applications and across an IT enterprise. This ensures that each user observes a consistent view of the data, including visible changes made by the user's own transactions

Database transaction

A transaction comprises a unit of work performed within a database management system against a database, and treated in a coherent and reliable way independent of other transactions...

and transactions of other users or processes. Data Consistency problems may arise at any time but are frequently introduced during or following recovery situations when backup

Backup

In information technology, a backup or the process of backing up is making copies of data which may be used to restore the original after a data loss event. The verb form is back up in two words, whereas the noun is backup....

copies of the data are used in place of the original data.

Various kinds of data consistency have been identified. These include Application Consistency , Transaction Consistency and Point-in-Time (PiT) Consistency.

Point-in-time consistency

Data is point-in-time consistent if all of the interrelated data components (either a group of data sets or a set of logical volumes) are as they were at any single instant in time.

Point-in-time consistency is an important property of backup files and a critical objective of software that creates backups. It is also relevant to the design of disk memory systems, specifically relating to what happens when they are unexpectedly shut down.

As a relevant backup example, consider a website with a database such as the online encyclopedia Wikipedia

Wikipedia

Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Its 20 million articles have been written collaboratively by volunteers around the world. Almost all of its articles can be edited by anyone with access to the site,...

, which needs to be operational around the clock, but also must be backed up with regularity to protect against disaster. Portions of Wikipedia are constantly being updated every minute of every day, meanwhile, Wikipedia's database is stored on servers in the form of one or several very large files which require minutes or hours to back up.

These large files - as with any database - contain numerous data structures which reference each other by location. For example, some structures are indexes

Index (database)

A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space...

which permit the database subsystem to quickly find search results. If the data structures cease to reference each other properly, then the database can be said to be corrupted.

Counter example

The importance of point-in-time consistency can be illustrated with what would happen if a backup were made without it.

Assume Wikipedia's database is a huge file, which has an important index located 20% of the way through, and saves article data at the 75% mark. Consider a scenario where an editor comes and creates a new article at the same time a backup is being performed, which is being made as a simple "file copy" which copies from the beginning to the end of the large file(s) and doesn't consider data consistency - and at the time of the article edit, it is 50% complete. The new article is added to the article space (at the 75% mark) and a corresponding index entry is added (at the 20% mark).

Because the backup is already halfway done and the index already copied, the backup will be written with the article data present, but with the index reference missing. As a result of the inconsistency, this file is considered corrupted.

In real life, a real database such as Wikipedia's may be edited thousands of times per hour, and references are virtually always spread throughout the file and can number into the millions, billions, or more. A sequential "copy" backup would literally contain so many small corruptions that the backup would be completely unusable without a lengthy repair process which could provide no guarantee as to the completeness of what has been recovered.

A backup process which properly accounts for data consistency ensures that the backup is a snapshot of how the entire database looked at a single moment. In the given Wikipedia example, it would ensure that the backup was written without the added article at the 75% mark, so that the article data would be consistent with the index data previously written.

Disk caching systems

Point-in-time consistency is also relevant to computer disk subsystems.

Specifically, operating system

Operating system

An operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...

s and file system

File system

A file system is a means to organize data expected to be retained after a program terminates by providing procedures to store, retrieve and update data, as well as manage the available space on the device which contain it. A file system organizes data in an efficient manner and is tuned to the...

s are designed with the expectation that the computer system they are running on could lose power, crash, fail, or otherwise cease operating at any time. When properly designed, they ensure that data will not be unrecoverably corrupted if the power is lost. Operating systems and file systems do this by ensuring that data is written to a hard disk in a certain order, and rely on that in order to detect and recover from unexpected shutdowns.

On the other hand, rigorously writing data to disk in the order that maximizes data integrity also impacts performance. A process of write caching is used to consolidate and re-sequence write operations such that they can be done faster by minimizing the time spent moving disk heads.

Data consistency concerns arise when write caching changes the sequence in which writes are carried out, because it there exists the possibility of an unexpected shutdown that violates the operating system's expectation that all writes will be committed sequentially.

For example, in order to save a typical document or picture file, an operating system might write the following records to a disk in the following order:

Journal entry saying file XYZ is about to be saved into sector 123.
The actual contents of the file XYZ are written into sector 123.
Sector 123 is now flagged as occupied in the record of free/used space.
Journal entry noting the file completely saved, and its name is XYZ and is located in sector 123.

The operating system relies on the assumption that if it sees item #1 is present (saying the file is about to be saved), but that item #4 is missing (confirming success), that the save operation was unsuccessful and so it should undo any incomplete steps already taken to save it (e.g. marking sector 123 free since it never was properly filled, and removing any record of XYZ from the file directory). It relies on these items being committed to disk in sequential order.

Suppose a caching algorithm determines it would be fastest to write these items to disk in the order 4-3-1-2, and starts doing so, but the power gets shut down after 4 get written, before 3, 1 and 2, and so those writes never occur. When the computer is turned back on, the file system would then show it contains a file named XYZ which is located in sector 123, but this sector really does not contain the file. (Instead, the sector will contain garbage, or zeroes, or a random portion of some old file - and that is what will show if the file is opened).

Further, the file system's free space map will not contain any entry showing that sector 123 is occupied, so later, it will likely assign that sector to the next file to be saved, believing it is available. The file system will then have two files both unexpectedly claiming the same sector (known as a cross-linked file). As a result, a write to one of the files will overwrite part of the other file, invisibly damaging it.

A disk caching subsystem that ensures point-in-time consistency guarantees that in the event of an unexpected shutdown, the four elements would be written one of only five possible ways: completely (1-2-3-4), partially (1, 1-2, 1-2-3), or not at all.

High-end hardware disk controllers of the type found in servers include a small battery back-up unit on their cache memory so that they may offer the performance gains of write caching while mitigating the risk of unintended shutdowns. The battery back-up unit keeps the memory powered even during a shutdown so that when the computer is powered back up, it can quickly complete any writes it has previously committed. With such a controller, the operating system may request four writes (1-2-3-4) in that order, but the controller may decide the quickest way to write them is 4-3-1-2. The controller essentially lies to the operating system and reports that the writes have been completed in order (a lie that improves performance at the expense of data corruption if power is lost), and the battery backup hedges against the risk of data corruption by giving the controller a way to silently fix any and all damage that could occur as a result.

If the power gets shut off after element 4 has been written, the battery backed memory contains the record of commitment for the other three items and ensures that they are written ("flushed") to the disk at the next available opportunity.

Transaction Consistency

A transaction

Database transaction

A transaction comprises a unit of work performed within a database management system against a database, and treated in a coherent and reliable way independent of other transactions...

is a logical unit of work that may include any number of file or database updates. Transaction consistency is also frequently referred to as atomicity

Atomicity

In database systems, atomicity is one of the ACID transaction properties. In an atomic transaction, a series of database operations either all occur, or nothing occurs...

.

A good example of the importance of transaction consistency is a database that handles the transfer of money.

Suppose a money transfer requires two operations: writing a debit in one place, and a credit in another.

If the system crashes or shuts down when one operation has completed but the other has not, and there is nothing in place to correct this, the system can be said to lack transaction consistency.

With a money transfer, it is desirable that either the entire transaction completes, or none of it completes. Both of these scenarios keep the balance in check.

Transaction consistency ensures just that - that a system is programmed to be able to detect incomplete transactions when powered on, and undo (or "roll back") the portion of any incomplete transactions that are found.

Application Consistency

Application

Application software

Application software, also known as an application or an "app", is computer software designed to help the user to perform specific tasks. Examples include enterprise software, accounting software, office suites, graphics software and media players. Many application programs deal principally with...

Consistency is similar to Transaction consistency, but instead of data consistency within the scope of a single transaction, data must be consistent within the confines of many different transaction streams from one or more applications.

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.