Change data capture
Encyclopedia
In database
Database
A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...

s, change data capture (CDC) is a set of software design patterns
Design pattern (computer science)
In software engineering, a design pattern is a general reusable solution to a commonly occurring problem within a given context in software design. A design pattern is not a finished design that can be transformed directly into code. It is a description or template for how to solve a problem that...

 used to determine (and track) the data that has changed so that action can be taken using the changed data. Also, Change data capture (CDC) is an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources.

CDC solutions occur most often in data-warehouse
Data warehouse
In computing, a data warehouse is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting.A data warehouse...

 environments since capturing and preserving the state of data across time is one of the core functions of a data warehouse, but CDC can be utilized in any database or data repository system.

Methodology

System developers can set up CDC mechanisms in a number of ways and in any one or a combination of system layers from application logic down to physical storage.

In a simplified CDC context, one computer system has data believed to have changed from a previous point in time, and a second computer system needs to take action based on that changed data. The former is the source, the latter is the target. It is possible that the source and target are the same system physically, but that does not change the design patterns logically.

Not uncommonly, multiple CDC solutions can exist in a single system.

Timestamps on rows

Tables whose changes must be captured may have a column that represents the time of last change. Names such as LAST_UPDATE, etc. are common. Any row in any table that has a timestamp in that column that is more recent than the last time data was captured is considered to have changed.

Version Numbers on rows

Database designers give tables whose changes must be captured a column that contains a version number. Names such as VERSION_NUMBER, etc. are common. When data in a row changes, its version number is updated to the current version. A supporting construct such as a reference table with the current version in it is needed. When a change capture occurs, all data with the latest version number is considered to have changed. When the change capture is complete, the reference table is updated with a new version number.

Three or four major techniques exist for doing CDC with version numbers, the above paragraph is just one.

Status indicators on rows

This technique can either supplement or complement timestamps and versioning. It can configure an alternative if, for example, a status column is set up on a table row indicating that the row has changed (e.g. a boolean column that, when set to true, indicates that the row has changed). Otherwise, it can act as a complement to the previous methods, indicating that a row, despite having a new version number or an earlier date, still shouldn't be updated on the target (for example, the data may require human validation).

Time/Version/Status on rows

This approach combines the three previously discussed methods. As noted, it is not uncommon to see multiple CDC solutions at work in a single system, however, the combination of time, version, and status provides a particularly powerful mechanism and programmers should utilize them as a trio where possible. The three elements are not redundant or superfluous. Using them together allows for such logic as, "Capture all data for version 2.1 that changed between 6/1/2005 12:00 a.m. and 7/1/2005 12:00 a.m. where the status code indicates it is ready for production."

Triggers on tables

May include a publish/subscribe
Observer pattern
The observer pattern is a software design pattern in which an object, called the subject, maintains a list of its dependents, called observers, and notifies them automatically of any state changes, usually by calling one of their methods...

 pattern to communicate the changed data to multiple targets. In this approach, triggers
Database trigger
A database trigger is procedural code that is automatically executed in response to certain events on a particular table or view in a database. The trigger is mostly used for keeping the integrity of the information on the database...

 log events that happen to the transactional table into another queue table that can later be "played back". For example, imagine an Accounts table, when transactions are taken against this table, triggers would fire that would then store a history of the event or even the deltas into a separate queue table. The queue table might have schema with the following fields: Id, TableName, RowId, TimeStamp, Operation. The data inserted for our Account sample might be: 1, Accounts, 76, 11/02/2008 12:15am, Update.
More complicated designs might log the actual data that changed. This queue table could then be "played back" to replicate the data from the source system to a target.

[More discussion needed]

An example of this technique is the pattern known as the log trigger
Log trigger
In relational databases, the Log trigger or History trigger is a mechanism for automatic recording of information about changes -inserting, updating and deleting rows- in a database table....

.

Log scanners on databases

Most database management systems manage a transaction log
Transaction log
In the field of databases in computer science, a transaction log is a history of actions executed by a database management system to guarantee ACID properties over crashes or hardware failures...

 that records changes made to the database contents and to metadata
Metadata
The term metadata is an ambiguous term which is used for two fundamentally different concepts . Although the expression "data about data" is often used, it does not apply to both in the same way. Structural metadata, the design and specification of data structures, cannot be about data, because at...

. By scanning and interpreting the contents of the database transaction log one can capture the changes made to the database in a non-intrusive manner.

Using transaction logs for change data capture offers a challenge in that the structure, contents and use of a transaction log is specific to a database management system. Unlike data access, no standard exists for transaction logs. Most database management systems do not document the internal format of their transaction logs, although some provide programmatic interfaces to their transaction logs (for example: Oracle, DB2, SQL/MP, SQL/MX and SQL Server 2008).

Other challenges in using transaction logs for change data capture include:
  • Coordinating the reading of the transaction logs and the archiving of log files (database management software typically archives log files off-line on a regular basis).
  • Translation between physical storage formats that are recorded in the transaction logs and the logical formats typically expected by database users (e.g., some transaction logs save only minimal buffer differences that are not directly useful for change consumers).
  • Dealing with changes to the format of the transaction logs between versions of the database management system.
  • Eliminating uncommitted changes that the database wrote to the transaction log and later rolled back
    Rollback (data management)
    In database technologies, a rollback is an operation which returns the database to some previous state. Rollbacks are important for database integrity, because they mean that the database can be restored to a clean copy even after erroneous operations are performed...

    .
  • Dealing with changes to the metadata of tables in the database.


CDC solutions based on transaction log files have distinct advantages that include:
  • minimal impact on the database (even more so if one uses log shipping
    Log shipping
    Log shipping is the process of automating the backup of a database and transaction log files on a primary database server, and then restoring them onto a standby server. This technique is supported by Microsoft SQL Server and PostgreSQL...

     to process the logs on a dedicated host).
  • no need for programmatic changes to the applications that use the database.
  • low latency
    Latency (engineering)
    Latency is a measure of time delay experienced in a system, the precise definition of which depends on the system and the time being measured. Latencies may have different meaning in different contexts.-Packet-switched networks:...

     in acquiring changes.
  • transactional integrity
    Data integrity
    Data Integrity in its broadest meaning refers to the trustworthiness of system resources over their entire life cycle. In more analytic terms, it is "the representational faithfulness of information to the true state of the object that the information represents, where representational faithfulness...

    : log scanning can produce a change stream that replays the original transactions in the order they were committed. Such a change stream include changes made to all tables participating in the captured transaction.
  • no need to change the database schema


Several off-the-shelf products perform change data capture using database transaction log files. These include:
  • Attunity Stream
  • Centerprise Data Integrator from Astera
  • DatabaseSync from WisdomForce
  • GoldenGate Transactional Data Integration
  • HVR from HVR Software
  • DBMoto from HiT Software
    HiT Software
    Founded in 1994 in San Jose, California, HiT Software is a commercial software development company with headquarters in San Jose, California and offices in Italy, Hong Kong and Chile. HiT Software produces Data Integration and Change Data Capture software tools...

  • Shadowbase from Gravic
  • IBM InfoSphere
    IBM InfoSphere
    IBM Infosphere is a branded product line from IBM under its Information Management Software brand, announced in February 2008, which includes software products from its WebSphere and Information Server product lines...

     Change Data Capture (previously DataMirror
    DataMirror
    DataMirror Corporation provides real-time data integration, protection, and Java database solutions. Founded in 1993, DataMirror is headquartered in Markham, Ontario, Canada with offices worldwide...

     Transformation Server)
  • Informatica PowerExchange CDC Option (previously Striva)
  • Oracle Streams
    Oracle Streams
    In computing, the Oracle Streams product from Oracle Corporation encourages users of Oracle databases to propagate information within and between databases. It provides tools to capture, process and manage database events via Advanced Queuing queues....

  • Oracle Data Guard
    Oracle Data Guard
    The software which Oracle Corporation markets as Oracle Data Guard forms an extension to the Oracle RDBMS. It aids in establishing and maintaining secondary "standby databases" as alternative/supplementary repositories to production "primary databases"....

  • Replicate1 from Vision Solutions
  • SharePlex from Quest Software
    Quest Software
    Quest Software is a computer software manufacturer headquartered in Aliso Viejo, California. Founded in 1987, Quest develops, manufactures and supports software used by Information Technology professionals in a variety of industries...

  • FlexCDC, part of Flexviews for MySQL

Confounding factors

As often occurs in complex domains, the final solution to a CDC problem may have to balance many competing concerns.

Push versus pull

CDC Tool Comparison
-----------------------

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK