All Topics  
Hash table

 

   Email Print
   Bookmark   Link






 

Hash table



 
 
In computer science
Computer science

Computer science is the study of the theoretical foundations of information and computation, and of practical techniques for their implementation and application in computer systems....
, a hash table, or a hash map, is a data structure
Data structure

A data structure in computer science is a way of storing data in a computer so that it can be used efficiently. It is an organization of mathematical and logical concepts of data....
 that associates keys
Unique key

In relational database design, a unique key or primary key is a candidate key to uniquely identify each row in a table . A unique key or primary key comprises a single column or set of columns....
 with values
Value (mathematics)

In mathematics, value commonly refers to the 'output' of a Function . In the most basic case, that of unary, single-valued functions, there is one input and one output .The function of the example is real-valued, since each and every possible function value is real....
.

The primary operation that hash functions support efficiently is a lookup: given a key (e.g., a person's name), find the corresponding value (e.g., that person's telephone number). It works by transforming the key using a hash function
Hash function

A hash function is any algorithm or function which converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an array index into an array....
 into a hash, a number that is used as an index in an array
Array

In computer science, an array is a data structure consisting of a group of element s that are accessed by index . In most programming languages each element has the same data type and the array occupies a contiguous area of computer memory....
 to locate the desired location ("bucket") where the values should be.

In most cases the hash function is deliberately chosen to have pseudo-random properties, so that small changes of a key give a large and apparently random (although of course reproducible) effect on the hash returned.






Discussion
Ask a question about 'Hash table'
Start a new discussion about 'Hash table'
Answer questions from other users
Full Discussion Forum



Encyclopedia


In computer science
Computer science

Computer science is the study of the theoretical foundations of information and computation, and of practical techniques for their implementation and application in computer systems....
, a hash table, or a hash map, is a data structure
Data structure

A data structure in computer science is a way of storing data in a computer so that it can be used efficiently. It is an organization of mathematical and logical concepts of data....
 that associates keys
Unique key

In relational database design, a unique key or primary key is a candidate key to uniquely identify each row in a table . A unique key or primary key comprises a single column or set of columns....
 with values
Value (mathematics)

In mathematics, value commonly refers to the 'output' of a Function . In the most basic case, that of unary, single-valued functions, there is one input and one output .The function of the example is real-valued, since each and every possible function value is real....
.

The primary operation that hash functions support efficiently is a lookup: given a key (e.g., a person's name), find the corresponding value (e.g., that person's telephone number). It works by transforming the key using a hash function
Hash function

A hash function is any algorithm or function which converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an array index into an array....
 into a hash, a number that is used as an index in an array
Array

In computer science, an array is a data structure consisting of a group of element s that are accessed by index . In most programming languages each element has the same data type and the array occupies a contiguous area of computer memory....
 to locate the desired location ("bucket") where the values should be.

In most cases the hash function is deliberately chosen to have pseudo-random properties, so that small changes of a key give a large and apparently random (although of course reproducible) effect on the hash returned. Because of this random effect, in some cases, the calculated index can be the same for two different keys (a "collision"); different hash table designs handle this issue in different ways.

Hash tables support the efficient lookup, insertion and deletion of elements in constant time on average (O(1)
Big O notation

In mathematics, big O notation describes the asymptotic analysis of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions....
) that does not vary with the number of elements stored in the table; although may vary somewhat depending on how full the table is.

Hashtb08

Basic operation

A hash table works by transforming the key using a hash function
Hash function

A hash function is any algorithm or function which converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an array index into an array....
 into a hash, a number that is used as an index in an array to locate the desired location ("bucket") where the values should be. The number is normally converted into the index by taking a modulo operation
Modulo operation

In computing, the modulo operation finds the remainder of division of one number by another.Given two numbers, and , a modulo n is the remainder, on division of a by n....
, or sometimes bit masking is used where the array size is a power of two. The optimal hash function for any given use of a hash table can vary widely, however, depending on the nature of the key.

Typical operations on a hash table include insertion, deletion and lookup (although some hash tables are precalculated so that only lookups may be done on a live system). These operations are all performed in amortized constant time, which makes maintaining and accessing a huge hash table very efficient.

It is also possible to create a hash table statically where, for example, there is a fairly limited fixed set of input values - such as the values representable by a single byte (or possibly two bytes ) from which an index can be constructed directly (see section below on creating hash tables). The hash table can also be used simultaneously for tests of validity on the values that are disallowed.

Load factor

An important property of a hash table is how "full" the table is, that is the ratio between the number of entries n and the size s of the hash table (i.e. the size of the array it uses to store values). The quotient n/s is therefore called the load factor of the hash table.

Most hash table implementations only perform well if the load factor is kept in a certain range.

Collision resolution

Since the intent of hashing is to create unique hash (or location) for each key, when the hash function produces the same hash for multiple keys, an alternate location must be determined. The process of finding an alternate location is called collision resolution. A collision resolution strategy ensures future key lookup operations return the correct, respective records.

An efficient collision resolution strategy is an important part of any hash table. Consider the following case, derived using the birthday paradox
Birthday paradox

In probability theory, the birthday problem, or birthday paradox pertains to the probability that in a set of randomly chosen people some pair of them will have the same birthday....
, of a hash table containing 1 million indices. Even if a hash function
Hash function

A hash function is any algorithm or function which converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an array index into an array....
 were to output random indices uniformly distributed
Uniform distribution (discrete)

In probability theory and statistics, the discrete uniform distribution is a discrete probability distribution that can be characterized by saying that all values of a finite set of possible values are equally probable....
 over the array, there is a 95% chance of a collision occurring before it contains 2500 records.

There are a number of collision resolution techniques, but the most popular are open addressing and chaining.

Separate chaining


Hashtb32
Sometimes called simply chaining or direct chaining, in its simplest form each slot in the array is a linked list
Linked list

In computer science, a linked list is one of the fundamental data structures, and can be used to implement other data structures. It consists of a sequence of node s, each containing arbitrary data Field s and one or two reference s pointing to the next and/or previous nodes....
, or the head cell of a linked list, where the list contains the elements that hashed to the same location. Insertion requires finding the correct slot, then appending to either end of the list in that slot; deletion requires searching the list and removal.

Chained hash tables have advantages over open addressed hash tables in that the removal operation is simple and resizing the table can be postponed for a much longer time because performance degrades more gracefully even when every slot is used. Indeed, many chaining hash tables may not require resizing at all since performance degradation is linear as the table fills. For example, a chaining hash table containing twice its recommended capacity of data would only be about twice as slow on average as the same table at its recommended capacity.

Chained hash tables inherit the disadvantages of linked lists. When storing small records, the overhead of the linked list can be significant. An additional disadvantage is that traversing a linked list has poor cache performance
Locality of reference

In computer science, locality of reference, also known as the principle of locality, is the phenomenon of the same value or related computer storage locations being frequently accessed....
.

Alternative data structures can be used for chains instead of linked lists. By using a self-balancing tree
Self-balancing binary search tree

In computer science, a self-balancing binary search tree or height-balanced binary search tree is a binary search tree that attempts to keep its height, or the number of levels of nodes beneath the root, as small as possible at all times, automatically....
, for example, the theoretical worst-case time of a hash table can be brought down to O(log n) rather than O(n). However, since each list is intended to be short, this approach is usually inefficient unless the hash table is designed to run at full capacity or there are unusually high collision rates, as might occur in input designed to cause collisions. Dynamic array
Dynamic array

In computer science, a dynamic array, growable array, resizable array, dynamic table, or array list is an array data structure that can be resized and allows elements to be added or removed....
s can also be used to decrease space overhead and improve cache performance, forming a cache-conscious resolution scheme as discussed in the section Cache-conscious collision resolution.

Some chaining implementations use an optimization where the first record of each chain is stored in the table. The purpose is to increase cache efficiency of hash table access. In order to avoid wasting large amounts of space, such hash tables would maintain a load factor of 1.0 or greater. The term direct chaining is sometimes used to describe implementations that do not use this optimization.

Open addressing


Hashtb12
Open addressing hash tables store the records directly within the array. This approach is also called closed hashing. A hash collision is resolved by probing, or searching through alternate locations in the array (following a probe sequence) until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table. Well known probe sequences include:

linear probing
Linear probing

Linear probing is a scheme in computer programming for resolving hash collisions of values of hash functions by sequentially searching the hash table for a free location....
 : in which the interval between probes is fixed - often at 1. quadratic probing
Quadratic probing

Quadratic probing is a scheme in computer programming for resolving collisions in hash tables.Quadratic probing operates by taking the original hash value and adding successive values of an arbitrary quadratic polynomial to the starting value....
 : in which the interval between probes increases proportional to the square of the probe number (the interval thus increasing linearly and the indices are described by a quadratic function). double hashing
Double hashing

Double hashing is a computer programming technique used in hash tables to resolve hash collisions, cases when two different values to be searched for produce the same hash key....
 : in which the interval between probes is computed by another hash function.

Open addressing versus chaining

Chained hash tables have the following benefits over open addressing:
  • They are simple to implement effectively and only require basic data structures.
  • From the point of view of writing suitable hash functions, chained hash tables are insensitive to clustering, only requiring minimization of collisions. Open addressing depends upon better hash functions to avoid clustering. This is particularly important if novice programmers can add their own hash functions, but even experienced programmers can be caught out by unexpected clustering effects.
  • They degrade in performance more gracefully. Although chains grow longer as the table fills, a chained hash table cannot "fill up" and does not exhibit the sudden increases in lookup times that occur in a near-full table with open addressing. (see chart)
  • If the hash table stores large records, about 5 or more words per record, chaining uses less memory than open addressing.
  • If the hash table is sparse (that is, it has a big array with many free array slots), chaining uses less memory than open addressing even for small records of 2 to 4 words per record due to its external storage.


Hash Table Average Insertion Time
For small record sizes (a few words or less) the benefits of in-place open addressing compared to chaining are:
  • They can be more space-efficient than chaining since they don't need to store any pointers or allocate any additional space outside the hash table. Simple linked lists require a word of overhead per element.
  • Insertions avoid the time overhead of memory allocation, and can even be implemented in the absence of a memory allocator.
  • Because it uses internal storage, open addressing avoids the extra indirection required for chaining's external storage. It also has better locality of reference
    Locality of reference

    In computer science, locality of reference, also known as the principle of locality, is the phenomenon of the same value or related computer storage locations being frequently accessed....
    , particularly with linear probing. With small record sizes, these factors can yield better performance than chaining, particularly for lookups.
  • They can be easier to serialize
    Serialization

    In computer science, in the context of data storage and transmission, serialization is the process of converting an object into a sequence of bits so that it can be stored on a storage medium or transmitted across a computer network connection link....
    , because they don't use pointers.


On the other hand, normal open addressing is a poor choice for large elements, since these elements fill entire cache lines
CPU cache

A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access computer storage. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations....
 (negating the cache advantage), and a large amount of space is wasted on large empty table slots. If the open addressing table only stores references to elements (external storage), it uses space comparable to chaining even for large records but loses its speed advantage.

Generally speaking, open addressing is better used for hash tables with small records that can be stored within the table (internal storage) and fit in a cache line. They are particularly suitable for elements of one word or less. In cases where the tables are expected to have high load factors, the records are large, or the data is variable-sized, chained hash tables often perform as well or better.

Ultimately, used sensibly, any kind of hash table algorithm is usually fast enough; and the percentage of a calculation spent in hash table code is low. Memory usage is rarely considered excessive. Therefore, in most cases the differences between these algorithms is marginal, and other considerations typically come into play.

Coalesced hashing


A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes within the table itself. Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements than table slots.

Perfect hashing


If all of the keys that will be used are known ahead of time, and there are no more keys than can fit the hash table, perfect hashing can be used to create a perfect hash table, in which there will be no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well.

Perfect hashing gives a hash table where the time to make a lookup is constant in the worst case. This is in contrast to chaining and open addressing methods, where the time for lookup is low on average, but may be arbitrarily large. A simpler alternative, which also gives worst case constant lookup time, is cuckoo hashing
Cuckoo hashing

Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a hash table. Cuckoo hashing was first described by Rasmus Pagh and Flemming Friche Rodler in 2001....
.

Dynamic perfect hashing


In dynamic perfect hashing, a guarantee of constant worst-case lookup times by using a second-level hash table with no collisions for each bucket is given. If there are k elements in the bucket, the bucket's hash table is given size k2, and its hash function is selected at random from a universal hash function set. If a collision ever occurs, the table is rebuilt with a different randomly-selected hash function. Because the load factor of the table is kept low (1/k), rebuilding is infrequent and amortized cost of insertions is low.

Although each bucket requires quadratic space, if the keys inserted into the (first-level) hash table are uniformly distributed, the structure as a whole occupies expected O(n) space, since bucket sizes are small with high probability.

Probabilistic hashing


Perhaps the simplest solution to a collision is to replace the value that is already in the slot with the new value, or slightly less commonly, drop the record that is to be inserted. In later searches, this may result in a search not finding a record which has been inserted. This technique is particularly useful for implementing caching.

An even more space-efficient solution which is similar to this is use a bit array
Bit array

A bit array is an array data structure which compactly stores individual bits . It implements a simple set data structure storing a subset of and is effective at exploiting bit-level parallelism in hardware to perform operations quickly....
 (an array of one-bit fields) for our table. Initially all bits are set to zero, and when we insert a key, we set the corresponding bit to one. False negatives cannot occur, but false positives can, since if the search finds a 1 bit, it will claim that the value was found, even if it was just another value that hashed into the same array slot by coincidence. In reality, such a hash table is merely a specific type of Bloom filter
Bloom filter

The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set ....
.

Robin Hood hashing


One interesting variation on double-hashing collision resolution is that of Robin Hood hashing. The idea is that a key already inserted may be displaced by a new key if its probe count is larger than the key at the current position. The net effect of this is that it reduces worst case search times in the table. This is similar to Knuth's ordered hash tables except the criteria for bumping a key does not depend on a direct relationship between the keys. Since both the worst case and the variance on the number of probes is reduced dramatically an interesting variation is to probe the table starting at the expected successful probe value and then expand from that position in both directions.

Cache-conscious collision resolution


A chained hash table is a simple and flexible data structure but it does not make good use of CPU cache. This is especially the case when variable-length strings are used as keys. When a collision occurs, a linked list is traversed which can attract a surplus of cache-misses. In addition, linked lists will waste an excessive amount of space due to pointers and memory allocation overheads. We can address these issues by replacing linked lists with dynamic arrays, forming a cache-conscious hash table known as an array hash.

When a string is hashed to a slot, it is appended to the end of the dynamic array that is assigned to that slot. In this manner, nodes and pointers are eliminated and strings are accessed in a sequential manner, promoting good use of cache. The array hash was also shown to be substantially faster and more compact than a chained hash table using 32-bit or 64-bit integer keys.

Table resizing


With a good hash function, a hash table can typically contain about 70%–80% as many elements as it does table slots and still perform well. Depending on the collision resolution mechanism, performance can begin to suffer either gradually or dramatically as more elements are added. To deal with this, when the load factor exceeds some threshold, it is necessary to allocate a new, larger table, and add all the contents of the original table to this new table. In Java
Java (programming language)

Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java ....
's class, for example, the default load factor threshold is 0.75.

This can be a very expensive operation, and the necessity for it is one of the hash table's disadvantages. In fact, some naive methods for doing this, such as enlarging the table by one each time you add a new element, reduce performance so drastically as to make the hash table useless. However, if the table is enlarged by some fixed percent, such as 10% or 100%, it can be shown using amortized analysis
Amortized analysis

In computer science, especially analysis of algorithms, amortized analysis finds the average running time per operation over a worst-case sequence of operations....
 that these resizings are so infrequent that the average time per insertion remains constant-time. To see why this is true, suppose a hash table using chaining begins at the minimum size of 1 and is doubled each time it fills above 100%. If in the end it contains n elements, then the total add operations performed for all the resizings is:

Because the costs of the resizings form a geometric series
Geometric series

In mathematics, a geometric series is a series with a constant ratio between successive term . For example, the seriesis geometric, because each term is equal to half of the previous term....
, the total cost is O(n). But it is necessary also to perform n operations to add the n elements in the first place, so the total time to add n elements with resizing is O(n), an amortized time of O(1) per element.

On the other hand, some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. One simple approach is to initially allocate the table with enough space for the expected number of elements and forbid the addition of too many elements. Another useful but more memory-intensive technique is to perform the resizing gradually:
  • Allocate the new hash table, but leave the old hash table and check both tables during lookups.
  • Each time an insertion is performed, add that element to the new table and also move k elements from the old table to the new table.
  • When all elements are removed from the old table, deallocate it.
To ensure that the old table will be completely copied over before the new table itself needs to be enlarged, it's necessary to increase the size of the table by a factor of at least (k + 1)/k during the resizing.

Linear hashing is a hash table algorithm that permits incremental hash table expansion. It is implemented using a single hash table, but with two possible look-up functions.

Another way to decrease the cost of table resizing is to choose a hash function in such a way that the hashes of most values do not change when the table is resized. This approach, called consistent hashing
Consistent hashing

Consistent hashing is a scheme that provides hash table functionality in a way that the addition or removal of one slot does not significantly change the mapping of keys to slots....
, is prevalent in disk-based and distributed hashes, where resizing is prohibitively costly.

Time complexity and common uses of hash tables

Hash tables are often used to implement associative array
Associative array

An associative array is an abstract data type composed of a Collection of unique keys and a collection of values, where each key is associated with one value ....
s, sets
Set (computer science)

In computer science, a set is a collection of certain values, without any particular Canonical order, and no repeated values. It corresponds with a finite set in mathematics....
 and cache
Cache

In computer science, a cache is a collection of data duplicating original values stored elsewhere or computed earlier, where the original data is expensive to fetch or to compute, compared to the cost of reading the cache....
s. Like array
Array

In computer science, an array is a data structure consisting of a group of element s that are accessed by index . In most programming languages each element has the same data type and the array occupies a contiguous area of computer memory....
s, hash tables provide constant-time O(1) lookup on average, regardless of the number of items in the table. While theoretically the worst-case lookup time can be as bad as O(n), this is, for practical purposes, statistically unlikely unless the hash function is poorly designed or unless the set of keys is maliciously chosen with the given hash function in mind. These corner cases are addressed in mathematical analysis with the Simple Uniform Hashing Assumption
SUHA

In computer science, SUHA or the uniform hashing assumption is a basic assumption that facilitates the mathematical analysis of Hash_Table....
, which puts basic assumed conditions on the hash function.

Compared to other associative array data structures, hash tables are most useful when large numbers of records are to be stored, especially if the size of the data set can be predicted.

Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structure
Persistent data structure

In computing, a persistent data structure is a data structure which always preserves the previous version of itself when it is modified; such data structures are effectively immutable, as their operations do not update the structure in-place, but instead always yield a new updated structure....
s; database indices
Index (database)

A database index is a data structure that improves the speed of operations on a Table . Indexes can be created using one or more column , providing the basis for both rapid random look ups and efficient access of ordered records....
 sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.

Choosing a good hash function

A good hash function is essential for good hash table performance. A poor choice of a hash function is likely to lead to clustering, in which probability of keys mapping to the same hash bucket (i.e. a collision) is significantly greater than would be expected from a random function. A nonzero collision probability is inevitable in any hash implementation, but usually the number of operations required to resolve a collision scales linearly with the number of keys mapping to the same bucket, so excess collisions will degrade performance significantly. In addition, some hash functions are computationally expensive, so the amount of time (and, in some cases, memory) taken to compute the hash may be burdensome.

Choosing a good hash function is tricky. The literature is replete with poor choices, at least when measured by modern standards. For example, the very popular multiplicative hash advocated by Donald Knuth
Donald Knuth

Donald Ervin Knuth is a renowned computer science and Emeritus of the Art of Computer Programming at Stanford University.Author of the seminal multi-volume work The Art of Computer Programming , Knuth has been called the "father" of the run-time analysis, contributing to the development of, and systematizing formal mathematical techn...
 in The Art of Computer Programming
The Art of Computer Programming

The Art of Computer Programming is a comprehensive monograph written by Donald Knuth that covers many kinds of programming algorithms and their analysis....
 (see reference below) has particularly poor clustering behavior. However, since poor hashing merely degrades hash table performance for particular input key distributions, such problems commonly go undetected.

The literature is similarly sparse on the criteria for choosing a hash function. Unlike most other fundamental algorithms and data structures, there is no universal consensus on what makes a "good" hash function. The remainder of this section is organized by three criteria: simplicity, speed, and strength. In addition, it surveys algorithms known to perform well by these criteria.

Simplicity and speed are readily measured objectively (by number of lines of code and CPU benchmarks, for example), but strength is a more slippery concept. Obviously, a cryptographic hash function
Cryptographic hash function

A cryptographic hash function is a algorithm that takes an arbitrary block of data and returns a fixed-size bit string, the hash value, such that an accidental or intentional change to the data will almost certainly change the hash value....
 such as SHA-1
SHA hash functions

The SHA hash functions are a set of cryptographic hash functions designed by the National Security Agency and published by the National Institute of Standards and Technology as a U.S....
 would satisfy the relatively lax strength requirements needed for hash tables, but their slowness and complexity makes them unappealing. However, using cryptographic hash functions can protect against collision attacks when the hash table modulus and its factors can not be kept secret from the attacker, or alternatively, by applying a secret salt
Salt (cryptography)

In cryptography, a salt comprises Random Number Generator bits that are used as one of the inputs to a key derivation function. The other input is usually a password or passphrase....
. However, for these specialized cases, a universal hash function can be used instead of one static hash.

In the absence of a standard measure for hash function strength, the current state of the art is to employ a battery of statistical
Statistics

Statistics is a Mathematics pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It also provides tools for prediction and forecasting based on data....
 tests to measure whether the hash function can be readily distinguished from a random function. Arguably the most important test is to determine whether the hash function displays the avalanche effect
Avalanche effect

In cryptography, the avalanche effect refers to a desirable property of cryptographic algorithms, typically block ciphers and cryptographic hash functions....
, which essentially states that any single-bit change in the input key should affect, on average, half the bits in the output. Bret Mulvey advocates testing the strict avalanche condition in particular, which states that, for any single-bit change, each of the output bits should change with probability one-half, independent of the other bits in the key. Purely additive hash functions such as CRC
Cyclic redundancy check

A cyclic redundancy check is a type of function that takes as input a data stream of any length, and produces as output a value of a certain space, commonly a 32-bit integer....
 fail this stronger condition miserably.

Clearly, a strong hash function should have a uniform distribution
Uniform distribution

Uniform distribution can refer to:...
 of hash values. Bret Mulvey proposes the use of a chi-squared test
Chi-square test

A chi-square test is any statistical hypothesis test in which the test statistic has a chi-square distribution when the null hypothesis is true, or any in which the probability distribution of the test statistic can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough....
 for uniformity, based on power of two
Power of two

In mathematics, a power of two is any of the integer exponentiation of the number 2 ; in other words, two multiplication by itself a certain number of times....
 hash table sizes ranging from 21 to 216. This test is considerably more sensitive than many others proposed for measuring hash functions, and finds problems in many popular hash functions.

Fortunately, there are good hash functions that satisfy all these criteria. The simplest class all consume one byte of the input key per iteration of the inner loop. Within this class, simplicity and speed are closely related, as fast algorithms simply don't have time to perform complex calculations.

A mathematical byte-by-byte implementation that performs particularly well is the Jenkins One-at-a-time hash, adapted here from , its creator.

uint32_t jenkins_one_at_a_time_hash(unsigned char *key, size_t key_len)



The avalanche behavior of this hash is shown on the right. The image was made using Bret Mulvey's AvalancheTest in his .

Each of the 24 rows corresponds to a single bit in the 3-byte input key, and each of the 32 columns corresponds to a bit in the output hash. Colors are chosen by how well the input key bit affects the given output hash bit: a green square indicates good mixing behavior, a yellow square weak mixing behavior, and red would indicate no mixing. Only a few bits in the last byte of the input key are weakly mixed to a minority of bits in the output hash, a performance vastly better than a number of widely used hash functions.

Many commonly used hash functions perform poorly when subjected to such rigorous avalanche testing. The widely favored FNV
Fowler Noll Vo hash

Fowler/Noll/Vo is a non-Cryptographic hash function hash function created by Glenn Fowler, Landon Curt Noll, and Phong Vo....
 hash, for example, shows many bits with no mixing at all, especially for short keys. See the by Bret Mulvey for a more thorough analysis.

If speed is more important than simplicity, then the class of hash functions which consume multibyte chunks per iteration may be of interest. One of the most sophisticated is "lookup3" by Bob Jenkins, which consumes input in 12 byte (96 bit) chunks. Note, though, that any speed improvement from the use of this hash is only likely to be useful for large keys, and that the increased complexity may also have speed consequences such as preventing an optimizing compiler from inlining the hash function. Bret Mulvey analyzed an , and found it to have excellent avalanche behavior.

One desirable property of a hash function is that conversion from the hash value (typically 32 bits) to a bucket index for a particular-size hash table can be done simply by masking, preserving only the lower k bits for a table of size 2k (an operation equivalent to computing the hash value modulo
Modulo operation

In computing, the modulo operation finds the remainder of division of one number by another.Given two numbers, and , a modulo n is the remainder, on division of a by n....
 the table size). This property enables the technique of incremental doubling of the size of the hash table - each bucket in the old table maps to only two in the new table. Because of its use of XOR-folding, the FNV hash does not have this property. Some older hashes are even worse, requiring table sizes to be a prime number rather than a power of two, again computing the bucket index as the hash value modulo the table size. In general, such a requirement is a sign of a fundamentally weak function; using a prime table size is a poor substitute for using a stronger function.

Ordered retrieval issue

Hash tables store data in pseudo-random locations, so accessing the data in a sorted manner is a very time consuming operation. Other data structures such as self-balancing binary search tree
Self-balancing binary search tree

In computer science, a self-balancing binary search tree or height-balanced binary search tree is a binary search tree that attempts to keep its height, or the number of levels of nodes beneath the root, as small as possible at all times, automatically....
s generally operate more slowly (since their lookup time is O(log n)) and are rather more complex to implement than hash tables but maintain a sorted data structure at all times. See a comparison of hash tables and self-balancing binary search trees
Associative array

An associative array is an abstract data type composed of a Collection of unique keys and a collection of values, where each key is associated with one value ....
.

Problems with hash tables


Although hash table lookups use constant time on average, the time spent can be significant. Evaluating a good hash function can be a slow operation. In particular, if simple array indexing can be used instead, this is usually faster.

Hash tables in general exhibit poor locality of reference
Locality of reference

In computer science, locality of reference, also known as the principle of locality, is the phenomenon of the same value or related computer storage locations being frequently accessed....
—that is, the data to be accessed is distributed seemingly at random in memory. Because hash tables cause access patterns that jump around, this can trigger microprocessor cache
CPU cache

A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access computer storage. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations....
 misses that cause long delays. Compact data structures such as arrays, searched with linear search
Linear search

In computer science, linear search is a search algorithm, also known as sequential search, that is suitable for searching a list of data for a particular value....
, may be faster if the table is relatively small and keys are cheap to compare, such as with simple integer keys. According to Moore's Law
Moore's Law

Moore's law describes a long-term trend in the history of computing hardware. Since the invention of the integrated circuit in 1958, the number of transistors that can be placed inexpensively on an integrated circuit has increased exponential growth, doubling approximately every two years....
, cache sizes are growing exponentially and so what is considered "small" may be increasing. The optimal performance point varies from system to system.

More significantly, hash tables are more difficult and error-prone to write and use. Hash tables require the design of an effective hash function for each key type, which in some situations is more difficult and time-consuming to design and debug than the simple comparison function required for a self-balancing binary search tree
Self-balancing binary search tree

In computer science, a self-balancing binary search tree or height-balanced binary search tree is a binary search tree that attempts to keep its height, or the number of levels of nodes beneath the root, as small as possible at all times, automatically....
. In open-addressed hash tables it is fairly easy to create a poor hash function.

Additionally, in some applications, a black hat
Black hat

A black hat is the villain or bad guy, especially in a Western in which such a character would wear a black hat in contrast to the hero white hat....
 with knowledge of the hash function may be able to supply information to a hash which creates worst-case behavior by causing excessive collisions, resulting in very poor performance (i.e., a denial of service attack). In critical applications, either universal hashing
Universal hashing

Universal hashing is a randomized algorithm for selecting a hash function F with the following property: for any two distinct inputs x and y, the probability that F=F is the same as if F was a random function....
 can be used or a data structure with better worst-case guarantees may be preferable. For details, see Crosby and Wallach's .

Implementations

While many programming languages already provide hash table functionality (see language support for associative arrays
Associative array

An associative array is an abstract data type composed of a Collection of unique keys and a collection of values, where each key is associated with one value ....
), there are several independent implementations worth mentioning.
  • The Google SparseHash project contains several hash-map implementations in use at Google, with different performance characteristics, including an implementation that optimizes for space and one that optimizes for speed. The memory-optimized one is extremely memory-efficient with only 2 bits/entry of overhead.
  • An open source C library for hash table storage of arbitrary data objects with lock-free lookups, built-in reference counting and guaranteed order iteration. The library can participate in external reference counting systems or use its own built-in reference counting. It comes with a variety of hash functions and allows the use of runtime supplied hash functions via callback mechanism. Source code is well documented.
  • This is an easy-to-use hash table for C structures.
  • A number of language runtimes and/or standard libraries use hash tables to implement their support for associative arrays.
  • Software written to minimize memory usage can conserve memory by keeping all allocated strings in a hash table. If an already existing string is found a pointer to that string is returned; otherwise, a new string is allocated and added to the hash table. (This is the normal technique used in Lisp for the names of variables and functions; see the documentation for the intern and intern-soft functions if you are using that language.) The data compression achieved in this manner is usually around 40%.


See also


Further reading

  • Donald Knuth
    Donald Knuth

    Donald Ervin Knuth is a renowned computer science and Emeritus of the Art of Computer Programming at Stanford University.Author of the seminal multi-volume work The Art of Computer Programming , Knuth has been called the "father" of the run-time analysis, contributing to the development of, and systematizing formal mathematical techn...
    . The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998. ISBN 0-201-89685-0. Section 6.4: Hashing, pp.513–558.
  • Thomas H. Cormen
    Thomas H. Cormen

    Thomas H. Cormen is the co-author of Introduction to Algorithms, along with Charles Leiserson, Ron Rivest, and Clifford Stein. He is a Full Professor of computer science at Dartmouth College and currently Chair of the Dartmouth College Writing Program....
    , Charles E. Leiserson
    Charles E. Leiserson

    Charles Eric Leiserson is a computer scientist, specializing in the theory of parallel computing and distributed computing, and particularly practical applications thereof; as part of this effort, he developed the Cilk multithreaded language....
    , Ronald L. Rivest, and Clifford Stein
    Clifford Stein

    Clifford Stein, a computer scientist, is currently a professor of industrial engineering and operations research at Columbia University in New York, NY, where he also holds an appointment in the Department of Computer Science....
    . Introduction to Algorithms
    Introduction to Algorithms

    Introduction to Algorithms is a book by Thomas H. Cormen, Charles E. Leiserson, Ron Rivest, and Clifford Stein. It is used as the textbook for algorithms courses at many universities....
    , Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 11: Hash Tables, pp.221–252.
  • Michael T. Goodrich and Roberto Tamassia. Data Structures and Algorithms in Java, 4th edition. John Wiley & Sons, Inc. ISBN 0-471-73884-0. Chapter 9: Maps and Dictionaries. pp.369–418
  • Aaron M. Tenenbaum, Yedidyah Langsam, and Moshe J. Augenstein. Data Structures Using C, Prentice Hall ISBN 0-13-199746-7. Chapter 7, Section 7.4 - Hashing. pp.454–502


External links

  • by Bob Jenkins.
  • by Bret Mulvey (Pluto Scarab) - with nice graphs
  • by Paul Hsieh
  • by Maciej Stachowiak
  • is one of the most feature rich hash libraries (built-in hash functions, several collision strategies, extensive analysis functionality, ...)
  • NIST entry on
  • Open addressing hash table removal algorithm from ICI programming language
    ICI programming language

    The ICI Programming Language is a general purpose Interpreter , computer programming language originally developed by Tim Long in 1992. It has dynamic typing and flexible data types, with the basic syntax, flow control constructs and operators of C ....
    , ici_set_unassign in (and other occurrences, with permission).
  • The Perl Wikibook - Perl Hash Variables
  • — free online text and file hashing with different algorithms
  • — opensource library
  • — opensource library
  • — two simple and clear examples of hash tables implementation in C with linear probing and chaining
  • MIT OCW lecture Video
  • MIT OCW lecture Video