Replication is the process of sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility. It could be data replication if the same data is stored on multiple storage devices, or computation replication if the same computing task is executed many times. A computational task is typically replicated in space, i.e. executed on separate devices, or it could be replicated in time, if it is executed repeatedly on a single device.
Access to a replicated entity is typically uniform with access to a single, non-replicated entity. The replication itself should be transparent to an external user. Also, in a failure scenario, a failover of replicas is hidden as much as possible.
It is common to talk about active and passive replication in systems that replicate data or services. Active replication is performed by processing the same request at every replica. In passive replication, each single request is processed on a single replica and then its state is transferred to the other replicas. If at any time one master replica is designated to process all the requests, then we are talking about the primary-backup scheme (master-slave scheme) predominant in high-availability clusters. On the other hand, if any replica processes a request and then distributes a new state, then this is a multi-primary scheme (called multi-master in the database field). In the multi-primary scheme, some form of distributed concurrency control must be used, such as a distributed lock manager.
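The difference between the two styles can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Replica` class, its `processed` counter, and the two helper functions are hypothetical names, everything runs in one process, and failures are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A hypothetical replica holding a key-value state."""
    name: str
    state: dict = field(default_factory=dict)
    processed: int = 0  # number of requests this replica executed itself

    def process(self, key, value):
        # Execute the request locally and count the work done.
        self.state[key] = value
        self.processed += 1

    def install_state(self, new_state):
        # Accept a state transfer without executing any request.
        self.state = dict(new_state)


def active_replicate(replicas, key, value):
    # Active replication: every replica processes the same request.
    for replica in replicas:
        replica.process(key, value)


def passive_replicate(primary, backups, key, value):
    # Passive replication: one replica processes the request,
    # then its resulting state is transferred to the others.
    primary.process(key, value)
    for backup in backups:
        backup.install_state(primary.state)


active = [Replica("a1"), Replica("a2"), Replica("a3")]
active_replicate(active, "x", 1)

primary, *backups = [Replica("p"), Replica("b1"), Replica("b2")]
passive_replicate(primary, backups, "x", 1)
```

In both cases all replicas end up with the same state. What differs is where the work happens: every active replica executes the request, while the passive backups only install the state produced by the primary.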
Load balancing is different from task replication, since it distributes a load of different (not the same) computations across machines, and allows a single computation to be dropped in case of failure. Load balancing, however, sometimes uses data replication (especially multi-master) internally, to distribute its data among machines.
Backup is different from replication, since it saves a copy of data unchanged for a long period of time. Replicas, on the other hand, are frequently updated and quickly lose any historical state.
Replication in distributed systems
Replication is one of the oldest and most important topics in the overall area of distributed systems.
Whether one replicates data or computation, the objective is to have some group of processes that handle incoming events. If we replicate data, these processes are passive and operate only to maintain the stored data, reply to read requests, and apply updates. When we replicate computation, the usual goal is to provide fault-tolerance. For example, a replicated service might be used to control a telephone switch, with the objective of ensuring that even if the primary controller fails, the backup can take over its functions. But the underlying needs are the same in both cases: by ensuring that the replicas see the same events in equivalent orders, they stay in consistent states and hence any replica can respond to queries.
Replication models in distributed systems
A number of widely cited models exist for data replication, each having its own properties and performance:
- Transactional replication. This is the model for replicating transactional data, for example a database or some other form of transactional storage structure. The one-copy serializability model is employed in this case, which defines legal outcomes of a transaction on replicated data in accordance with the overall ACID properties that transactional systems seek to guarantee.
- State machine replication. This model assumes that the replicated process is a deterministic finite state machine and that atomic broadcast of every event is possible. It is based on a distributed computing problem called distributed consensus and has a great deal in common with the transactional replication model. This is sometimes mistakenly used as a synonym of active replication.
- Virtual synchrony. This computational model is used when a group of processes cooperate to replicate in-memory data or to coordinate actions. The model defines a new distributed entity called a process group. A process can join a group, which is much like opening a file: the process is added to the group, but is also provided with a checkpoint containing the current state of the data replicated by group members. Processes can then send events (multicasts) to the group and will see incoming events in the identical order, even if events are sent concurrently. Membership changes are handled as a special kind of platform-generated event that delivers a new membership view to the processes in the group.
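The core idea behind state machine replication, that deterministic replicas fed the same events in the same order stay identical, can be sketched as follows. The `Counter` and `Sequencer` classes are hypothetical; the sequencer is only a single-process stand-in for atomic broadcast and does not implement consensus.

```python
class Counter:
    """A deterministic state machine: same inputs in the same order give the same state."""

    def __init__(self):
        self.value = 0

    def apply(self, event):
        op, amount = event
        if op == "add":
            self.value += amount
        elif op == "mul":
            self.value *= amount
        else:
            raise ValueError(f"unknown operation: {op}")


class Sequencer:
    """Hypothetical stand-in for atomic broadcast: assigns one global order."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.log = []

    def broadcast(self, event):
        # Append to the single log, then deliver to every replica in log order.
        self.log.append(event)
        for replica in self.replicas:
            replica.apply(event)


replicas = [Counter(), Counter(), Counter()]
sequencer = Sequencer(replicas)
for event in [("add", 2), ("mul", 5), ("add", 1)]:
    sequencer.broadcast(event)

# Without an agreed order, a replica that saw the events differently diverges.
stray = Counter()
for event in [("mul", 5), ("add", 2), ("add", 1)]:
    stray.apply(event)
```

The three sequenced replicas all reach the value 11, while the replica that applied the same events in a different order ends at 3, which is why the model depends on ordered delivery.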
Levels of performance vary widely depending on the model selected. Transactional replication is slowest, at least when one-copy serializability guarantees are desired (better performance can be obtained when a database uses log-based replication, but at the cost of possible inconsistencies if a failure causes part of the log to be lost). Virtual synchrony is the fastest of the three models, but the handling of failures is less rigorous than in the transactional model. State machine replication lies somewhere in between; the model is faster than transactions, but much slower than virtual synchrony.
The virtual synchrony model is popular in part because it allows the developer to use either active or passive replication. In contrast, state machine replication and transactional replication are highly constraining and are often embedded into products at layers where end-users would not be able to access them.
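As a rough illustration of the process group abstraction, the following sketch simulates joins, checkpoints, ordered multicasts and membership views inside a single Python process. The `Member` and `ProcessGroup` classes are hypothetical and do not correspond to the API of any real virtual synchrony toolkit; a real platform would do this over a network and would also handle failures.

```python
import copy


class Member:
    """A hypothetical group member holding replicated in-memory data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.delivered = []   # events in the order this member saw them
        self.view = ()        # latest membership view

    def deliver(self, event):
        self.delivered.append(event)
        kind, payload = event
        if kind == "view":
            self.view = payload
        elif kind == "update":
            key, value = payload
            self.data[key] = value


class ProcessGroup:
    """Single-process simulation of a process group."""

    def __init__(self):
        self.members = []

    def _multicast(self, event):
        # Every member sees every event in the same order.
        for member in self.members:
            member.deliver(event)

    def join(self, member):
        if self.members:
            # The joiner receives a checkpoint of the current replicated state.
            member.data = copy.deepcopy(self.members[0].data)
        self.members.append(member)
        # Membership changes arrive as a platform-generated view event.
        self._multicast(("view", tuple(m.name for m in self.members)))

    def send(self, key, value):
        self._multicast(("update", (key, value)))


group = ProcessGroup()
p, q = Member("p"), Member("q")
group.join(p)
group.send("x", 1)
group.join(q)          # q starts from a checkpoint containing x
group.send("y", 2)
```

After the run, `q` holds both `x` (from the checkpoint) and `y` (from the multicast), and both members have seen the view change and the later update in the same order.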
Database replication
Database replication can be used on many database management systems, usually with a master/slave relationship between the original and the copies. The master logs the updates, which then ripple through to the slaves. The slave outputs a message stating that it has received the update successfully, thus allowing the sending (and potentially re-sending until successfully applied) of subsequent updates.
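The log-and-acknowledge cycle described above might look like the following sketch. The `Master` and `Slave` classes, the `drop_first_attempt` flag used to simulate one lost update, and the `max_attempts` limit are all assumptions made for illustration; real database systems ship log records over a network and persist them.

```python
class Slave:
    """Hypothetical slave that applies updates strictly in log order."""

    def __init__(self, drop_first_attempt=False):
        self.data = {}
        self.applied = 0  # number of log entries applied so far
        self._drop = drop_first_attempt

    def receive(self, index, key, value):
        # Simulate one lost update so the master has to re-send it.
        if self._drop:
            self._drop = False
            return False
        if index != self.applied:
            return False  # out of order; refuse until the gap is filled
        self.data[key] = value
        self.applied += 1
        return True  # acknowledgement back to the master


class Master:
    def __init__(self, slaves):
        self.data = {}
        self.log = []  # ordered list of (key, value) updates
        self.slaves = slaves
        self.sends = 0

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def ship(self, max_attempts=3):
        # Send each slave the entries it has not acknowledged, in order.
        for slave in self.slaves:
            while slave.applied < len(self.log):
                index = slave.applied
                key, value = self.log[index]
                for _ in range(max_attempts):
                    self.sends += 1
                    if slave.receive(index, key, value):
                        break
                else:
                    raise RuntimeError("slave did not acknowledge update")


slaves = [Slave(), Slave(drop_first_attempt=True)]
master = Master(slaves)
master.write("a", 1)
master.write("b", 2)
master.ship()
```

The second slave misses its first update, so the master sends five messages in total for four log entries; the next update is only sent once the previous one has been acknowledged.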
Multi-master replication, where updates can be submitted to any database node and then ripple through to other servers, is often desired, but introduces substantially increased costs and complexity which may make it impractical in some situations. The most common challenge that exists in multi-master replication is transactional conflict prevention or resolution. Most synchronous or eager replication solutions do conflict prevention, while asynchronous solutions have to do conflict resolution. For instance, if a record is changed on two nodes simultaneously, an eager replication system would detect the conflict before confirming the commit and abort one of the transactions. A lazy replication system would allow both transactions to commit and run a conflict resolution during resynchronization. The resolution of such a conflict may be based on a timestamp of the transaction, on the hierarchy of the origin nodes or on much more complex logic, which decides consistently on all nodes.
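A lazy conflict resolution rule of the kind mentioned above could be sketched as follows. The sketch assumes that timestamps are comparable across nodes and that each node has a fixed rank in a hierarchy; the `Version` type and the `resolve` and `resynchronize` functions are hypothetical names, not part of any particular database product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: int   # commit time of the transaction (assumed comparable across nodes)
    node_rank: int   # position of the origin node in a fixed hierarchy


def resolve(a, b):
    # Later timestamp wins; the origin node's rank breaks ties.
    # The rule is symmetric, so every node picks the same winner.
    return max(a, b, key=lambda v: (v.timestamp, v.node_rank))


def resynchronize(store_a, store_b):
    # Merge two key -> Version stores after both committed independently.
    merged = {}
    for key in store_a.keys() | store_b.keys():
        if key in store_a and key in store_b:
            merged[key] = resolve(store_a[key], store_b[key])
        else:
            merged[key] = store_a[key] if key in store_a else store_b[key]
    return merged


node1 = {"row7": Version("from node 1", timestamp=100, node_rank=1)}
node2 = {"row7": Version("from node 2", timestamp=100, node_rank=2),
         "row9": Version("only on node 2", timestamp=90, node_rank=2)}

merged_on_1 = resynchronize(node1, node2)
merged_on_2 = resynchronize(node2, node1)
```

Both nodes arrive at the same merged store regardless of which side starts the merge, which is the property the text calls deciding consistently on all nodes.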
Database replication becomes difficult when it scales up. Usually, scale-up has two dimensions, horizontal and vertical: horizontal scale-up has more data replicas, while vertical scale-up has data replicas located farther apart. Problems raised by horizontal scale-up can be alleviated by a multi-layer multi-view access protocol. Vertical scale-up runs into less trouble since internet reliability and performance are improving.
Disk storage replication
Active (real-time) storage replication is usually implemented by distributing updates of a block device to several physical hard disks. This way, any file system supported by the operating system can be replicated without modification, as the file system code works on a level above the block device driver layer. It is implemented either in hardware (in a disk array controller) or in software (in a device driver).
The most basic method is disk mirroring, typical for locally-connected disks. Notably, the storage industry narrows the definitions, so mirroring is a local (short-distance) operation. Replication is extendable across a computer network, so the disks can be located in physically distant locations. The purpose is to avoid damage done by, and improve availability in case of, local failures or disasters. Typically the above master-slave theoretical replication model is applied. The main characteristic of such solutions is handling write operations:
- Synchronous replication - guarantees "zero data loss" by means of an atomic write operation, i.e. a write either completes on both sides or not at all. A write is not considered complete until it is acknowledged by both local and remote storage. Most applications wait for a write transaction to complete before proceeding with further work, hence overall performance decreases considerably. Inherently, performance drops proportionally to distance, as latency is caused by the speed of light. For 10 km distance, the fastest possible roundtrip takes 67 μs, whereas nowadays a whole local cached write completes in about 10-20 μs.
- An often-overlooked aspect of synchronous replication is the fact that failure of the remote replica, or even just of the interconnection, by definition stops any and all writes (freezing the local storage system). This is the behaviour that guarantees zero data loss. However, many commercial systems do not freeze at such a potentially dangerous point, but just proceed with local writes, losing the desired zero recovery point objective.
- Asynchronous replication - a write is considered complete as soon as local storage acknowledges it. Remote storage is updated, but probably with a small lag. Performance is greatly increased, but in case of losing the local storage, the remote storage is not guaranteed to have the current copy of data and the most recent data may be lost.
- Semi-synchronous replication - this usually means that a write is considered complete as soon as local storage acknowledges it and a remote server acknowledges that it has received the write either into memory or to a dedicated log file. The actual remote write is not performed immediately but is performed asynchronously, resulting in better performance than synchronous replication but with increased risk of the remote write failing.
- Point-in-time replication - introduces periodic snapshots that are replicated instead of primary storage.
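The write-handling modes above, and the round-trip figure quoted for synchronous replication, can be illustrated with a small sketch. It assumes the speed of light in vacuum as the lower bound (real links are slower) and uses hypothetical `Volume` and `ReplicatedVolume` classes; it does not model rollback, link failures, or the semi-synchronous and point-in-time variants.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in vacuum; real links are slower


def min_round_trip_us(distance_km):
    # Lower bound: signal travels to the remote site and back.
    return 2 * distance_km / SPEED_OF_LIGHT_KM_PER_S * 1_000_000


class Volume:
    """Hypothetical block store: block number -> data."""

    def __init__(self):
        self.blocks = {}

    def write(self, block, data):
        self.blocks[block] = data
        return True  # acknowledgement


class ReplicatedVolume:
    def __init__(self, mode):
        if mode not in ("sync", "async"):
            raise ValueError("mode must be 'sync' or 'async'")
        self.mode = mode
        self.local = Volume()
        self.remote = Volume()
        self.pending = []  # writes acknowledged locally but not yet remote

    def write(self, block, data):
        local_ack = self.local.write(block, data)
        if self.mode == "sync":
            # Complete only when both sides have acknowledged.
            return local_ack and self.remote.write(block, data)
        # Async: complete on the local acknowledgement; remote lags behind.
        self.pending.append((block, data))
        return local_ack

    def drain(self):
        # Background transfer of the lagging writes to the remote site.
        while self.pending:
            self.remote.write(*self.pending.pop(0))


sync_vol = ReplicatedVolume("sync")
sync_vol.write(0, b"A")

async_vol = ReplicatedVolume("async")
async_vol.write(0, b"A")
lost_if_local_fails_now = dict(async_vol.pending)  # not yet on the remote
async_vol.drain()
```

Calling `min_round_trip_us(10)` gives roughly 66.7 μs, matching the 67 μs figure in the text. In asynchronous mode, anything still in `pending` when the local storage is lost is exactly the data that would be missing from the remote copy.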
Most important implementations:
- DRBD module for Linux.
- EMC SRDF
- IBM PPRC and Global Mirror (known together as IBM Copy Services)
- Hitachi TrueCopy
- Symantec Veritas Volume Replicator (VVR)
- FalconStor Replication & Mirroring (sub-block heterogeneous point-in-time, async, sync)
Distributed shared memory replication
Another example of using replication appears in distributed shared memory systems, where it may happen that many nodes of the system share the same page of the memory, which usually means that each node has a separate copy (replica) of this page.
Primary-backup and multi-primary replication
Many classical approaches to replication are based on a primary/backup model where one device or process has unilateral control over one or more other processes or devices. For example, the primary might perform some computation, streaming a log of updates to a backup (standby) process, which can then take over if the primary fails. This approach is the most common one for replicating databases, despite the risk that if a portion of the log is lost during a failure, the backup might not be in a state identical to the one the primary was in, and transactions could then be lost.
A weakness of primary/backup schemes is that in settings where both processes could have been active, only one is actually performing operations. We're gaining fault-tolerance but spending twice as much money to get this property. For this reason, starting in the period around 1985, the distributed systems research community began to explore alternative methods of replicating data. An outgrowth of this work was the emergence of schemes in which a group of replicas could cooperate, with each process backing up the others, and each handling some share of the workload.
Jim Gray, a towering figure within the database community, analyzed multi-primary replication schemes under the transactional model and ultimately published a widely cited paper skeptical of the approach ("The Dangers of Replication and a Solution"). In a nutshell, he argued that unless data splits in some natural way so that the database can be treated as n disjoint sub-databases, concurrency control conflicts will result in seriously degraded performance and the group of replicas will probably slow down as a function of n. Indeed, he suggests that the most common approaches are likely to result in degradation that scales as O(n³). His solution, which is to partition the data, is only viable in situations where data actually has a natural partitioning key.
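Partitioning by key can be pictured with the following sketch, in which each key is owned by exactly one of n sub-databases, so updates to different keys never contend for the same copy. The `PartitionedStore` class and the choice of a CRC32 hash are assumptions made for illustration and are not taken from Gray's paper.

```python
import zlib


class PartitionedStore:
    """Hypothetical store split into n disjoint sub-databases by key."""

    def __init__(self, n):
        self.partitions = [{} for _ in range(n)]

    def owner(self, key):
        # A stable hash maps each key to exactly one partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def write(self, key, value):
        # Only the owning partition accepts the update.
        self.partitions[self.owner(key)][key] = value

    def read(self, key):
        return self.partitions[self.owner(key)].get(key)


accounts = ["alice", "bob", "carol", "dave", "erin"]
store = PartitionedStore(4)
for account in accounts:
    store.write(account, 100)
store.write("alice", 75)
```

This only works when operations touch a single key, or keys that hash to the same partition; a transaction spanning several partitions would bring back the coordination costs the partitioning was meant to avoid.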
The situation is not always so bleak. For example, in the 1985-1987 period, the virtual synchrony model was proposed and emerged as a widely adopted standard (it was used in the Isis Toolkit, Horus, Transis, Ensemble, Totem, Spread, C-Ensemble, Phoenix and Quicksilver systems, and is the basis for the CORBA fault-tolerant computing standard; the model is also used in IBM Websphere to replicate business logic and in Microsoft's Windows Server 2008 enterprise clustering technology). Virtual synchrony permits a multi-primary approach in which a group of processes cooperate to parallelize some aspects of request processing. The scheme can only be used for some forms of in-memory data, but when feasible, provides linear speedups in the size of the group.
A number of modern products support similar schemes. For example, the Spread Toolkit supports this same virtual synchrony model and can be used to implement a multi-primary replication scheme; it would also be possible to use C-Ensemble or Quicksilver in this manner.
WANdisco permits active replication where every node on a network is an exact copy or replica and hence every node on the network is active at one time; this scheme is optimized for use in a wide area network.
See also
- Cloud computing
- Cluster (computing)
- Cluster manager
- Failover
- Fault tolerant system
- Log Shipping
- Optimistic replication
- Process group
- Software transactional memory
- Transaction
- Transparency (computing)
- Virtual synchrony