Cdb (software)
Encyclopedia
cdb, short for "constant database", refers to both a library and data format created by Daniel J. Bernstein
Daniel J. Bernstein
Daniel Julius Bernstein is a mathematician, cryptologist, programmer, and professor of mathematics at the University of Illinois at Chicago...

. cdb acts as an on-disk associative array
Associative array
In computer science, an associative array is an abstract data type composed of a collection of pairs, such that each possible key appears at most once in the collection....

, mapping keys to values, and allows multiple values to be stored for a single key. A constant database allows for only two operations: creation and reading. Both operations are designed to be very fast and highly reliable. Since the database
Database
A database is an organized collection of data for one or more purposes, usually in digital form. The data are typically organized to model relevant aspects of reality , in a way that supports processes requiring this information...

 does not change while it is in use, multiple processes can access a single database without locking. Additionally, since all modifications are actually the creation of a replacement database, it can take advantage of UNIX filesystem semantics to provide a guarantee of reliability.

Keys, values, and total database size have no arbitrary limits, though the use of 4-byte
Byte
The byte is a unit of digital information in computing and telecommunications that most commonly consists of eight bits. Historically, a byte was the number of bits used to encode a single character of text in a computer and for this reason it is the basic addressable element in many computer...

 offsets restricts it to 4 gigabytes or smaller. cdb is used by djbdns
Djbdns
The djbdns software package is a DNS implementation created by Daniel J. Bernstein due to his frustrations with repeated BIND security holes. A $1000 prize for the first person to find a privilege escalation security hole in djbdns was awarded in March 2009 to Matthew Dempsky., djbdns's tinydns...

, fastforward, mess822, qmail
Qmail
qmail is a mail transfer agent that runs on Unix. It was written, starting December 1995, by Daniel J. Bernstein as a more secure replacement for the popular Sendmail program...

 and ucspi-tcp
Ucspi-tcp
ucspi-tcp is a public domain Unix TCP command-line tool for building TCP client-server applications. It consists of super-server tcpserver and tcpclient application....

 to provide highly efficient, reliable, and simple data access.

Structure

A database contains an entire data set (e.g. a single associative array) in a single computer file
Computer file
A computer file is a block of arbitrary information, or resource for storing information, which is available to a computer program and is usually based on some kind of durable storage. A file is durable in the sense that it remains available for programs to use after the current program has finished...

. It consists of three parts: a fixed-size header, data, and a set of hash table
Hash table
In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys , to their associated values . Thus, a hash table implements an associative array...

s. Lookups are designed for exact keys only, though other types of searches could be performed by scanning the entire database. Lookups are performed using the following algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

:
  • Hash the key.
  • Determine at which hash table and slot this record
    Storage record
    In computer science, a storage record is:* A group of related data, words, or fields treated as a meaningful unit; for instance, a Name, Address, and Telephone Number can be a "Personal Record"....

     should be located.
  • Test the indicated slot in the hash table.
    • If the slot is empty, the record does not exist. Abort the search.
    • If the slot's hash matches the key's hash, seek to the record. Read and compare the key. If it matches, the data has been found, so abort the search.
    • The record is not in this slot. Proceed to the next slot, wrapping around to the beginning of the hash table if necessary.


For lookups of keys multiple values, additional values may be found by simply resuming the search at the next slot.

Format

All numbers -- offsets, lengths, and hash values -- are unsigned 32-bit
Bit
A bit is the basic unit of information in computing and telecommunications; it is the amount of information stored by a digital device or other physical system that exists in one of two possible distinct states...

 integer
Integer
The integers are formed by the natural numbers together with the negatives of the non-zero natural numbers .They are known as Positive and Negative Integers respectively...

s, stored in little endian format. Keys and data are considered to be opaque byte strings
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

, and have no special treatment.

The fixed-size header at the beginning of the database describes 256 hash tables by listing their position within the file and their length in slots. Data is stored as a series of records, each storing key length, data length, key, and data. There are no alignment or sorting rules. The records are followed by a set of 256 hash tables of varying lengths. Since zero is a valid length, there may be fewer than 256 hash tables physically stored in the database, but there are nonetheless considered to be 256 tables. Hash tables contain a series of slots, each of which contains a hash value and a record offset. "Empty slots" have an offset of zero.

Hashes are unsigned 32 bit integers, and start with a value of 5381. For each byte of the key, the current hash is multiplied by 33, then XOR'ed with the current byte of the key. Overflow bits are discarded. Slots and tables are trivially computed from hashes. The target table is simply the lowest eight bits of the hash (e.g. hash modulo 256), and the slot within the table is the remaining bits of the hash modulo the table length (e.g. hash divided by 256 modulo table length).

Library

The official cdb library code is public domain
Public domain
Works are in the public domain if the intellectual property rights have expired, if the intellectual property rights are forfeited, or if they are not covered by intellectual property rights at all...

: the individual source files are marked as such, and are also available in the public domain djbdns
Djbdns
The djbdns software package is a DNS implementation created by Daniel J. Bernstein due to his frustrations with repeated BIND security holes. A $1000 prize for the first person to find a privilege escalation security hole in djbdns was awarded in March 2009 to Matthew Dempsky., djbdns's tinydns...

 package. However, the rest of the cdb package is license-free software, meaning it must be distributed verbatim. The unusual licensing and simplicity of the format has prompted others to re-implement the library and release it under more common terms, such as Michael Tokarev's TinyCDB library, available under the public domain.

Notably, the creator of cdb does not intend for cdb to be used as a shared library. This differs from virtually all similar databases, such as Berkeley DB
Berkeley DB
Berkeley DB is a computer software library that provides a high-performance embedded database for key/value data. Berkeley DB is a programmatic software library written in C with API bindings for C++, PHP, Java, Perl, Python, Ruby, Tcl, Smalltalk, and most other programming languages...

.

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK