Hash table - AbsoluteAstronomy.com

Computer science

Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

, a hash table or hash map is a data structure

Data structure

In computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks...

that uses a hash function

Hash function

A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...

to map identifying values, known as keys

Unique key

In relational database design, a unique key can uniquely identify each row in a table, and is closely related to the Superkey concept. A unique key comprises a single column or a set of columns. No two distinct rows in a table can have the same value in those columns if NULL values are not used...

(e.g., a person's name), to their associated values

Value (mathematics)

In mathematics, value commonly refers to the 'output' of a function. In the most basic case, that of unary, single-valued functions, there is one input and one output .The function f of the example is real-valued, since each and every possible function value is real...

(e.g., their telephone number). Thus, a hash table implements an associative array

Associative array

In computer science, an associative array is an abstract data type composed of a collection of pairs, such that each possible key appears at most once in the collection....

. The hash function is used to transform the key into the index (the hash) of an array

Array data type

In computer science, an array type is a data type that is meant to describe a collection of elements , each selected by one or more indices that can be computed at run time by the program. Such a collection is usually called an array variable, array value, or simply array...

element (the slot or bucket) where the corresponding value is to be sought.

Ideally, the hash function should map each possible key to a unique slot index, but this ideal is rarely achievable in practice (unless the hash keys are fixed; i.e. new entries are never added to the table after it is created). Instead, most hash table designs assume that hash collisions—different keys that map to the same hash value—will occur and must be accommodated in some way.

In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at constant average (indeed, amortized

Amortized analysis

In computer science, amortized analysis is a method of analyzing algorithms that considers the entire sequence of operations of the program. It allows for the establishment of a worst-case bound for the performance of an algorithm irrespective of the inputs by looking at all of the operations...

) cost per operation.

In many situations, hash tables turn out to be more efficient than search tree

Search tree

In computer science, a search tree is a binary tree data structure in whose nodes data values are stored from some ordered set, in such a way that in-order traversal of the tree visits the nodes in ascending order of the stored values...

s or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative array

Associative array

In computer science, an associative array is an abstract data type composed of a collection of pairs, such that each possible key appears at most once in the collection....

s, database indexing, cache

Cache

In computer engineering, a cache is a component that transparently stores data so that future requests for that data can be served faster. The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere...

s, and sets.

Hash function

At the heart of the hash table algorithm is a simple array of items; this is often simply called the hash table. Hash table algorithms calculate an index from the data item's key and use this index to place the data into the array. The implementation of this calculation is the hash function

Hash function

A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...

, f:

index = f(key, arrayLength)

The hash function calculates an index within the array from the data key. arrayLength is the size of the array.
For assembly language

Assembly language

An assembly language is a low-level programming language for computers, microprocessors, microcontrollers, and other programmable devices. It implements a symbolic representation of the machine codes and other constants needed to program a given CPU architecture...

or other low-level programs, a trivial hash function can often create an index with just one or two inline machine instructions.

Choosing a good hash function

A good hash function and implementation algorithm are essential for good hash table performance, but may be difficult to achieve. Poor hashing usually degrades hash table performance by a constant factor, but hashing is often only a small part of the overall computation.

A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions, and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g. a Pearson's chi-squared test for uniform distributions
.

The distribution needs to be uniform only for table sizes s that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of s, the hash function needs to be uniform only when s is a power

Power function

In mathematics, a power function is a function of the form , where c and a are constant real numbers and x is a variable.Power functions are a special case of power law relationships, which appear throughout mathematics and statistics....

of two. On the other hand, some hashing algorithms provide uniform hashes only when s is a prime number

Prime number

A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. A natural number greater than 1 that is not a prime number is called a composite number. For example 5 is prime, as only 1 and 5 divide it, whereas 6 is composite, since it has the divisors 2...

.

For open addressing

Open addressing

Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method a hash collision is resolved by probing, or searching through alternate locations in the array until either the target record is found, or an unused array slot is found, which indicates that...

schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash is claimed to have particularly poor clustering behavior.

Cryptographic hash functions are believed to provide good hash functions for any table size s, either by modulo

Modulo operation

In computing, the modulo operation finds the remainder of division of one number by another.Given two positive numbers, and , a modulo n can be thought of as the remainder, on division of a by n...

reduction or by bit masking. They may also be appropriate, if there is a risk of malicious users trying to sabotage a network service by submitting requests designed to generate a large number of collisions in the server's hash tables. However, these presumed qualities are hardly worth their much larger computational cost and algorithmic complexity, and the risk of sabotage can be avoided by cheaper methods (such as applying a secret salt

Salt (cryptography)

In cryptography, a salt consists of random bits, creating one of the inputs to a one-way function. The other input is usually a password or passphrase. The output of the one-way function can be stored rather than the password, and still be used for authenticating users. The one-way function...

to the data, or using a universal hash function).

Some authors claim that good hash functions should have the avalanche effect

Avalanche effect

In cryptography, the avalanche effect refers to a desirable property of cryptographic algorithms, typically block ciphers and cryptographic hash functions. The avalanche effect is evident if, when an input is changed slightly the output changes significantly...

; that is, a single-bit change in the input key should affect, on average, half the bits in the output. Some popular hash functions do not have this property.

Perfect hash function

If all keys are known ahead of time, a perfect hash function

Perfect hash function

A perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented.- Properties...

can be used to create a perfect hash table that has no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well.

Perfect hashing allows for constant time lookups in the worst case. This is in contrast to most chaining and open addressing methods, where the time for lookup is low on average, but may be very large (proportional to the number of entries) for some sets of keys.

Collision resolution

Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2,500 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday problem there is a 95% chance of at least two of the keys being hashed to the same slot.

Therefore, most hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.

Load factor

The performance of most collision resolution methods does not depend directly on the number n of stored entries, but depends strongly on the table's load factor, the ratio n/s between n and the size s of its array of buckets. Sometimes this is referred to as the fill factor, as it represents the portion of the s buckets in the structure that are filled with one of the n stored entries. With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 up to 0.7(about 2/3 full) or so. Beyond that point, the probability of collisions and the cost of handling them increases.

On the other hand, as the load factor approaches zero, the proportion of the unused areas in the hash table increases but there is not necessarily any improvement in the search cost, resulting in wasted memory.

Separate chaining

In the strategy known as separate chaining, direct chaining, or simply chaining, each slot of the bucket array is a pointer to a linked list

Linked list

In computer science, a linked list is a data structure consisting of a group of nodes which together represent a sequence. Under the simplest form, each node is composed of a datum and a reference to the next node in the sequence; more complex variants add additional links...

that contains the key-value pairs that hashed to the same location. Lookup requires scanning the list for an entry with the given key. Insertion requires adding a new entry record to either end of the list belonging to the hashed slot. Deletion requires searching the list and removing the element. (The technique is also called open hashing or closed addressing, which should not be confused with 'open addressing' or 'closed hashing'.)

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods.

The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform

SUHA

In computer science, SUHA or the uniform hashing assumption is a basic assumption that facilitates the mathematical analysis of hash tables. The assumption states that a hypothetical hashing function will evenly distribute items into the slots of a hash table...

, the average cost of a lookup depends only on the average number of keys per bucket—that is, on the load factor.

Chained hash tables remain effective even when the number of table entries n is much higher than the number of slots. Their performance degrades more gracefully (linearly) with the load factor. For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1); but still 1000 times faster than a plain sequential list, and possibly even faster than a balanced search tree.

For separate-chaining, the worst-case scenario is when all entries were inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries; so the worst-case cost is proportional to the number n of entries in the table.

The bucket chains are often implemented as ordered lists

Sequence

In mathematics, a sequence is an ordered list of objects . Like a set, it contains members , and the number of terms is called the length of the sequence. Unlike a set, order matters, and exactly the same elements can appear multiple times at different positions in the sequence...

, sorted by the key field; this choice approximately halves the average cost of unsuccessful lookups, compared to an unordered list. However, if some keys are much more likely to come up than others, an unordered list with move-to-front heuristic may be more effective. More sophisticated data structures, such as balanced search trees, are worth considering only if the load factor is large (about 10 or more), or if the hash distribution is likely to be very non-uniform, or if one must guarantee good performance even in the worst-case. However, using a larger table and/or a better hash function may be even more effective in those cases.

Chained hash tables also inherit the disadvantages of linked lists. When storing small keys and values, the space overhead of the next pointer in each entry record can be significant. An additional disadvantage is that traversing a linked list has poor cache performance

Locality of reference

In computer science, locality of reference, also known as the principle of locality, is the phenomenon of the same value or related storage locations being frequently accessed. There are two basic types of reference locality. Temporal locality refers to the reuse of specific data and/or resources...

, making the processor cache ineffective.

Separate chaining with list heads

Some chaining implementations store the first record of each chain in the slot array itself. The purpose is to increase cache efficiency of hash table access. To save memory space, such hash tables often have about as many slots as stored entries, meaning that many slots have two or more entries.

Separate chaining with other structures

Instead of a list, one can use any other data structure that supports the required operations. For example, by using a self-balancing tree

Self-balancing binary search tree

In computer science, a self-balancing binary search tree is any node based binary search tree that automatically keeps its height small in the face of arbitrary item insertions and deletions....

, the theoretical worst-case time of common hash table operations (insertion, deletion, lookup) can be brought down to O(log n)

Big O notation

In mathematics, big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation, or...

rather than O(n). However, this approach is only worth the trouble and extra memory cost if long delays must be avoided at all costs (e.g. in a real-time application), or if one expects to have many entries hashed to the same slot (e.g. if one expects extremely non-uniform or even malicious key distributions).

The variant called array hash table uses a dynamic array

Dynamic array

In computer science, a dynamic array, growable array, resizable array, dynamic table, or array list is a random access, variable-size list data structure that allows elements to be added or removed...

to store all the entries that hash to the same slot. Each newly inserted entry gets appended to the end of the dynamic array that is assigned to the slot. The dynamic array is resized in an exact-fit manner, meaning it is grown only by as many bytes as needed. Alternative techniques such as growing the array by block sizes or pages were found to improve insertion performance, but at a cost in space. This variation makes more efficient use of CPU caching

CPU cache

A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations...

and the translation lookaside buffer

Translation Lookaside Buffer

A translation lookaside buffer is a CPU cache that memory management hardware uses to improve virtual address translation speed. All current desktop and server processors use a TLB to map virtual and physical address spaces, and it is ubiquitous in any hardware which utilizes virtual memory.The...

(TLB), because slot entries are stored in sequential memory positions. It also dispenses with the next pointers that are required by linked lists, which saves space. Despite frequent array resizing, space overheads incurred by operating system such as memory fragmentation, were found to be small.

An elaboration on this approach is the so-called dynamic perfect hashing

Dynamic perfect hashing

In computer science, dynamic perfect hashing is a programming technique for resolving collisions in a hash table data structure. This technique is useful for situations where fast queries, insertions, and deletions must be made on a large set, S, of elements.-Details:In this method, the entries...

, where a bucket that contains k entries is organized as a perfect hash table with k² slots. While it uses more memory (n² slots for n entries, in the worst case), this variant has guaranteed constant worst-case lookup time, and low amortized time for insertion.

Open addressing

In another strategy, called open addressing

Open addressing

, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table. The name "open addressing" refers to the fact that the location ("address") of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with "open hashing" or "closed addressing" that usually mean separate chaining.)

Well-known probe sequences include:

Linear probing
Linear probing
Linear probing is a scheme in computer programming for resolving hash collisions of values of hash functions by sequentially searching the hash table for a free location. This is accomplished using two values - one as a starting value and one as an interval between successive values in modular...

, in which the interval between probes is fixed (usually 1)
Quadratic probing
Quadratic probing
Quadratic probing is a scheme in computer programming for resolving collisions in hash tables.It is an open addressing method to handle overflows after a collision takes place in some bucket of a hash table....

, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the starting value given by the original hash computation
Double hashing
Double hashing
Double hashing is a computer programming technique used in hash tables to resolve hash collisions, cases when two different values to be searched for produce the same hash key...

, in which the interval between probes is computed by another hash function

A drawback of all these open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array. In fact, even with good hash functions, their performance dramatically degrades when the load factor grows beyond 0.7 or so. Thus a more aggressive resize scheme is needed. Separate linking works correctly with any load factor, although performance is likely to be reasonable if it is kept below 2 or so. For many applications, these restrictions mandate the use of dynamic resizing, with its attendant costs.

Open addressing schemes also put more stringent requirements on the hash function: besides distributing the keys more uniformly over the buckets, the function must also minimize the clustering of hash values that are consecutive in the probe order. Using separate chaining, the only concern is that too many objects map to the same hash value; whether they are adjacent or nearby is completely irrelevant.

Even experienced programmers may find such clustering hard to avoid.

Open addressing only saves memory if the entries are small (less than 4 times the size of a pointer) and the load factor is not too small. If the load factor is close to zero (that is, there are far more buckets than stored entries), open addressing is wasteful even if each entry is just two words.

Open addressing avoids the time overhead of allocating each new entry record, and can be implemented even in the absence of a memory allocator. It also avoids the extra indirection required to access the first entry of each bucket (that is, usually the only one). It also has better locality of reference

Locality of reference

, particularly with linear probing. With small record sizes, these factors can yield better performance than chaining, particularly for lookups.

Hash tables with open addressing are also easier to serialize

Serialization

In computer science, in the context of data storage and transmission, serialization is the process of converting a data structure or object state into a format that can be stored and "resurrected" later in the same or another computer environment...

, because they do not use pointers.

On the other hand, normal open addressing is a poor choice for large elements, because these elements fill entire CPU cache

CPU cache

lines (negating the cache advantage), and a large amount of space is wasted on large empty table slots. If the open addressing table only stores references to elements (external storage), it uses space comparable to chaining even for large records but loses its speed advantage.

Generally speaking, open addressing is better used for hash tables with small records that can be stored within the table (internal storage) and fit in a cache line. They are particularly suitable for elements of one word or less. If the table is expected to have a high load factor, the records are large, or the data is variable-sized, chained hash tables often perform as well or better.

Ultimately, used sensibly, any kind of hash table algorithm is usually fast enough; and the percentage of a calculation spent in hash table code is low. Memory usage is rarely considered excessive. Therefore, in most cases the differences between these algorithms are marginal, and other considerations typically come into play.

Coalesced hashing

A hybrid of chaining and open addressing, coalesced hashing

Coalesced hashing

Coalesced hashing, also called coalesced chaining, is a strategy of collision resolution in a hash table that forms a hybrid of separate chaining and open addressing. In a separate chaining hash table, items that hash to the same address are placed on a list at that address...

links together chains of nodes within the table itself. Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements than table slots.

Robin Hood hashing

One interesting variation on double-hashing collision resolution is Robin Hood hashing. The idea is that a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position. The net effect of this is that it reduces worst case search times in the table. This is similar to Knuth's ordered hash tables except that the criterion for bumping a key does not depend on a direct relationship between the keys. Since both the worst case and the variation in the number of probes is reduced dramatically, an interesting variation is to probe the table starting at the expected successful probe value and then expand from that position in both directions.
External Robin Hashing is an extension of this algorithm where the table is stored in an external file and each table position corresponds to a fixed-sized page or bucket with B records.

Cuckoo hashing

Another alternative open-addressing solution is cuckoo hashing

Cuckoo hashing

Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table. Cuckoo hashing was first described by Rasmus Pagh and Flemming Friche Rodler in 2001...

, which ensures constant lookup time in the worst case, and constant amortized time for insertions and deletions. It uses two or more hash functions, which means any key/value pair could be in two or more locations. For lookup, the first hash function is used; if the key/value is not found, then the second hash function is used, and so on. If a collision happens during insertion, then the key is re-hashed with the second hash function to map it to another bucket. If all hash functions are used and there is still a collision, then the key it collided with is removed to make space for the new key, and the old key is re-hashed with one of the other hash functions, which maps it to another bucket. If that location also results in a collision, then the process repeats until there is no collision or the process traverses all the buckets, at which point the table is resized. By combining multiple hash functions with multiple cells per bucket, very high space utilisation can be achieved.

Hopscotch hashing

Another alternative open-addressing solution is hopscotch hashing

Hopscotch hashing

Hopscotch hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table using open addressing. It is also well suited for implementing a concurrent hash table. Hopscotch hashing was introduced by Maurice Herlihy, Nir Shavit and Moran Tzafrir in 2008...

, which combines the approaches of cuckoo hashing

Cuckoo hashing

and linear probing

Linear probing

Linear probing is a scheme in computer programming for resolving hash collisions of values of hash functions by sequentially searching the hash table for a free location. This is accomplished using two values - one as a starting value and one as an interval between successive values in modular...

, yet seems in general to avoid their limitations. In particular it works well even when the load factor grows beyond 0.9. The algorithm is well suited for implementing a resizable concurrent hash table.

The hopscotch hashing algorithm works by defining a neighborhood of buckets near the original hashed bucket, where a given entry is always found. Thus, search is limited to the number of entries in this neighborhood, which is logarithmic in the worst case, constant on average, and with proper alignment of the neighborhood typically requires one cache miss. When inserting an entry, one first attempts to add it to a bucket in the neighborhood. However, if all buckets in this neighborhood are occupied, the algorithm traverses buckets in sequence until an open slot (an unoccupied bucket) is found (as in linear probing). At that point, since the empty bucket is outside the neighborhood, items are repeatedly displaced in a sequence of hops. (This is similar to cuckoo hashing, but with the difference that in this case the empty slot is being moved into the neighborhood, instead of items being moved out with the hope of eventually finding an empty slot.) Each hop brings the open slot closer to the original neighborhood, without invalidating the neighborhood property of any of the buckets along the way. In the end, the open slot has been moved into the neighborhood, and the entry being inserted can be added to it.

Dynamic resizing

To keep the load factor under a certain limit, e.g. under 3/4, many table implementations expand the table when items are inserted. For example, in Java's

Java (programming language)

Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...

HashMap class the default load factor threshold for table expansion is 0.75.
Since buckets are usually implemented on top of a dynamic array

Dynamic array

and any constant proportion for resizing greater than 1 will keep the load factor under the desired limit, the exact choice of the constant is determined by the same space-time tradeoff

Space-time tradeoff

In computer science, a space–time or time–memory tradeoff is a situation where the memory use can be reduced at the cost of slower program execution...

as for dynamic array

Dynamic array

s.

Resizing is accompanied by a full or incremental table rehash whereby existing items are mapped to new bucket locations.

To limit the proportion of memory wasted due to empty buckets, some implementations also shrink the size of the table—followed by a rehash—when items are deleted. From the point of space-time tradeoff

Space-time tradeoff

In computer science, a space–time or time–memory tradeoff is a situation where the memory use can be reduced at the cost of slower program execution...

s, this operation is similar to the deallocation in dynamic arrays.

Resizing by copying all entries

A common approach is to automatically trigger a complete resizing when the load factor exceeds some threshold r_max. Then a new larger table is allocated, all the entries of the old table are removed and inserted into this new table, and the old table is returned to the free storage pool. Symmetrically, when the load factor falls below a second threshold r_min, all entries are moved to a new smaller table.

If the table size increases or decreases by a fixed percentage at each expansion, the total cost of these resizings, amortized

Amortized analysis

over all insert and delete operations, is still a constant, independent of the number of entries n and of the number m of operations performed.

For example, consider a table that was created with the minimum possible size and is doubled each time the load ratio exceeds some threshold. If m elements are inserted into that table, the total number of extra re-insertions that occur in all dynamic resizings of the table is at most m−1. In other words, dynamic resizing roughly doubles the cost of each insert or delete operation.

Incremental resizing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually:

During the resize, allocate the new hash table, but keep the old table unchanged.
In each lookup or delete operation, check both tables.
Perform insertion operations only in the new table.
At each insertion also move r elements from the old table to the new table.
When all elements are removed from the old table, deallocate it.

To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it
is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.

Monotonic keys

If it is known that key values will always increase monotonically

Monotonic function

In mathematics, a monotonic function is a function that preserves the given order. This concept first arose in calculus, and was later generalized to the more abstract setting of order theory....

, then a variation of consistent hashing

Consistent hashing

Consistent hashing is a special kind of hashing. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped...

can be achieved by keeping a list of the single most recent key value at each hash table resize operation. Upon lookup, keys that fall in the ranges defined by these list entries are directed to the appropriate hash function—and indeed hash table—both of which can be different for each range. Since it is common to grow the overall number of entries by doubling, there will only be O(lg(N)) ranges to check, and binary search time for the redirection would be O(lg(lg(N))).

Performance analysis

In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

Adding rehashing to this model is straightforward. As in a dynamic array

Dynamic array

, geometric resizing by a factor of b implies that only k/bⁱ keys are inserted i or more times, so that the total number of insertions is bounded above by bk/(b-1), which is O(k). By using rehashing to maintain k < n, tables using both chaining and open addressing can have unlimited elements and perform successful lookup in a single comparison for the best choice of hash function.

In more realistic models, the hash function is a random variable

Random variable

In probability and statistics, a random variable or stochastic variable is, roughly speaking, a variable whose value results from a measurement on some type of random process. Formally, it is a function from a probability space, typically to the real numbers, which is measurable functionmeasurable...

over a probability distribution of hash functions, and performance is computed on average over the choice of hash function. When this distribution is uniform, the assumption is called "simple uniform hashing" and it can be shown that hashing with chaining requires Θ(1 + k/n) comparisons on average for an unsuccessful lookup, and hashing with open addressing requires Θ(1/(1 - k/n)). Both these bounds are constant, if we maintain k/n < c using table resizing, where c is a fixed constant less than 1.

Advantages

The main advantage of hash tables over other table data structures is speed. This advantage is more apparent when the number of entries is large (thousands or more). Hash tables are particularly efficient when the maximum number of entries can be predicted in advance, so that the bucket array can be allocated once with the optimum size and never resized.

If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may reduce the average lookup cost by a careful choice of the hash function, bucket table size, and internal data structures. In particular, one may be able to devise a hash function that is collision-free, or even perfect (see below). In this case the keys need not be stored in the table.

Drawbacks

Although operations on a hash table take constant time on average, the cost of a good hash function can be significantly higher than the inner loop of the lookup algorithm for a sequential list or search tree. Thus hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)

For certain string processing applications, such as spell-checking

Spell checker

In computing, a spell checker is an application program that flags words in a document that may not be spelled correctly. Spell checkers may be stand-alone capable of operating on a block of text, or as part of a larger application, such as a word processor, email client, electronic dictionary,...

, hash tables may be less efficient than trie

Trie

In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the...

s, finite automata, or Judy array

Judy array

In computer science and software engineering, a Judy array is a data structure that has high performance, low memory usage and implements an associative array. Unlike normal arrays, Judy arrays may be sparse, that is, they may have large ranges of unassigned indices. They can be used for storing...

s. Also, if each key is represented by a small enough number of bits, then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case.

The entries stored in a hash table can be enumerated efficiently (at constant cost per entry), but only in some pseudo-random order. Therefore, there is no efficient way to locate an entry whose key is nearest to a given key. Listing all n entries in some specific order generally requires a separate sorting step, whose cost is proportional to log(n) per entry. In comparison, ordered search trees have lookup and insertion cost proportional to log(n), but allow finding the nearest key at about the same cost, and ordered enumeration of all entries at constant cost per entry.

If the keys are not stored (because the hash function is collision-free), there may be no easy way to enumerate the keys that are present
in the table at any given moment.

Although the average cost per operation is constant and fairly small, the cost of a single operation may be quite high. In particular, if the hash table uses dynamic resizing, an insertion or deletion operation may occasionally take time proportional to the number of entries. This may be a serious drawback in real-time or interactive applications.

Hash tables in general exhibit poor locality of reference

Locality of reference

—that is, the data to be accessed is distributed seemingly at random in memory. Because hash tables cause access patterns that jump around, this can trigger microprocessor cache

CPU cache

misses that cause long delays. Compact data structures such as arrays searched with linear search

Linear search

In computer science, linear search or sequential search is a method for finding a particular value in a list, that consists of checking every one of its elements, one at a time and in sequence, until the desired one is found....

may be faster, if the table is relatively small and keys are integers or other short strings. According to Moore's Law

Moore's Law

Moore's law describes a long-term trend in the history of computing hardware: the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years....

, cache sizes are growing exponentially and so what is considered "small" may be increasing. The optimal performance point varies from system to system.

Hash tables become quite inefficient when there are many collisions. While extremely uneven hash distributions are extremely unlikely to arise by chance, a malicious adversary

Black hat

A black hat is the villain or bad guy, especially in a western movie in which such a character would stereotypically wear a black hat in contrast to the hero's white hat, especially in black and white movies....

with knowledge of the hash function may be able to supply information to a hash that creates worst-case behavior by causing excessive collisions, resulting in very poor performance (e.g., a denial of service attack). In critical applications, either universal hashing

Universal hashing

Using universal hashing refers to selecting a hash function at random from a family of hash functions with a certain mathematical property . This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary...

can be used or a data structure with better worst-case guarantees may be preferable.

Associative arrays

Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative array

Associative array

In computer science, an associative array is an abstract data type composed of a collection of pairs, such that each possible key appears at most once in the collection....

s (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming language

Programming language

A programming language is an artificial language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs that control the behavior of a machine and/or to express algorithms precisely....

s like AWK, Perl

Perl

Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...

, and PHP

PHP

PHP is a general-purpose server-side scripting language originally designed for web development to produce dynamic web pages. For this purpose, PHP code is embedded into the HTML source document and interpreted by a web server with a PHP processor module, which generates the web page document...

.

When storing a new item into a multimap and a hash collision occurs, the multimap unconditionally stores both items.

When storing a new item into a typical associative array and a hash collision occurs, but the actual keys themselves are different, the associative array likewise stores both items. However, if the key of the new item exactly matches the key of an old item, the associative array typically erases the old item and overwrites it with the new item, so every item in the table has a unique key.

Database indexing

Hash tables may also be used as disk-based data structures and database indices

Index (database)

A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space...

(such as in dbm

Dbm

dbm was the first of a family of simple database engines, originally written by Ken Thompson and released by AT&T in 1979. The name is a three letter acronym for database manager....

) although B-tree

B-tree

In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children...

s are more popular in these applications.

Caches

Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries—usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.

Sets

Besides recovering the entry that has a given key, many hash table implementations can also tell whether such an entry exists or not.

Those structures can therefore be used to implement a set data structure, which merely records whether a given key belongs to a specified set of keys. In this case, the structure can be simplified by eliminating all parts that have to do with the entry values. Hashing can be used to implement both static and dynamic sets.

Object representation

Several dynamic languages, such as Perl

Perl

, Python

Python (programming language)

Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...

, JavaScript

JavaScript

JavaScript is a prototype-based scripting language that is dynamic, weakly typed and has first-class functions. It is a multi-paradigm language, supporting object-oriented, imperative, and functional programming styles....

, and Ruby

Ruby (programming language)

Ruby is a dynamic, reflective, general-purpose object-oriented programming language that combines syntax inspired by Perl with Smalltalk-like features. Ruby originated in Japan during the mid-1990s and was first developed and designed by Yukihiro "Matz" Matsumoto...

, use hash tables to implement objects. In this representation, the keys are the names of the members and methods of the object, and the values are pointers to the corresponding member or method.

Unique data representation

Hash tables can be used by some programs to avoid creating multiple character strings with the same contents. For that purpose, all strings in use by the program are stored in a single hash table, which is checked whenever a new string has to be created. This technique was introduced in Lisp interpreters under the name hash consing

Hash consing

In computer science, particularly in functional programming, hash consing is a technique used to share values that are structurally equal. The term hash consing originates from implementations of Lisp that attempt to reuse cons cells that have been constructed before, avoiding the penalty of memory...

, and can be used with many other kinds of data (expression trees in a symbolic algebra system, records in a database, files in a file system, binary decision diagrams, etc.)

In programming languages

Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules. In C++11, for example, the

unordered mapUnordered map (C++)
unordered_map is a class template representing a hash table in the C++ Technical Report 1  and the C++11 standard. It is similar to the hash_map class template of the original STL which was also included in several implementations of the C++ Standard Library unordered_map is a class template...

class provides hash tables for keys and values of arbitrary type.

In PHP

PHP

5, the Zend 2 engine uses one of the hash functions from Daniel J. Bernstein

Daniel J. Bernstein

Daniel Julius Bernstein is a mathematician, cryptologist, programmer, and professor of mathematics at the University of Illinois at Chicago...

to generate the hash values used in managing the mappings of data pointers stored in a HashTable. In the PHP source code, it is labelled as "DJBX33A" (Daniel J. Bernstein, Times 33 with Addition).

Python

Python (programming language)

's built-in hash table implementation, in the form of the dict type, as well as Perl

Perl

's hash type (%) are highly optimized as they are used internally to implement namespaces.

In the .NET Framework

.NET Framework

The .NET Framework is a software framework that runs primarily on Microsoft Windows. It includes a large library and supports several programming languages which allows language interoperability...

, support for hash tables is provided via the non-generic Hashtable and generic Dictionary classes, which store key-value pairs, and the generic HashSet class, which stores only values.

Independent packages

Google Sparse Hash The Google SparseHash project contains several C++ hash-map implementations in use at Google, with different performance characteristics, including an implementation that optimizes for memory use and one that optimizes for speed. The memory-optimized one is extremely memory-efficient with only 2 bits/entry of overhead.
SunriseDD An open source C library for hash table storage of arbitrary data objects with lock-free lookups, built-in reference counting and guaranteed order iteration. The library can participate in external reference counting systems or use its own built-in reference counting. It comes with a variety of hash functions and allows the use of runtime supplied hash functions via callback mechanism. Source code is well documented.
uthash This is an easy-to-use hash table for C structures.

History

The idea of hashing arose independently in different places. In January 1953, H. P. Luhn wrote an internal IBM memorandum that used hashing with chaining. G. N. Amdahl

Gene Amdahl

Gene Myron Amdahl is a Norwegian-American computer architect and high-tech entrepreneur, chiefly known for his work on mainframe computers at IBM and later his own companies, especially Amdahl Corporation...

, E. M. Boehme, N. Rochester, and Arthur Samuel

Arthur Samuel

Arthur Lee Samuel was an American pioneer in the field of computer gaming and artificial intelligence. The Samuel Checkers-playing Program appears to be the world's first self-learning program, and as such a very early demonstration of the fundamental concept of artificial intelligence...

implemented a program using hashing at about the same time. Open addressing with linear probing (relatively prime stepping) is credited to Amdahl, but Ershov (in Russia) had the same idea.

External links

A Hash Function for Hash Table Lookup by Bob Jenkins.
Hash Tables by SparkNotes—explanation using C
Hash functions by Paul Hsieh
Design of Compact and Efficient Hash Tables for Java
Libhashish hash library
NIST entry on hash tables
Open addressing hash table removal algorithm from ICI programming language
ICI programming language
ICI is a general purpose interpreted, computer programming language originally developed by Tim Long in the late 1980s. It has dynamic typing and flexible data types, with the basic syntax, flow control constructs and operators of C. It can be considered broadly similar to Perl, with which it is...

, ici_set_unassign in set.c (and other occurrences, with permission).
A basic explanation of how the hash table works by Reliable Software
Lecture on Hash Tables
Hash-tables in C—two simple and clear examples of hash tables implementation in C with linear probing and chaining
MIT's Introduction to Algorithms: Hashing 1 MIT OCW lecture Video
MIT's Introduction to Algorithms: Hashing 2 MIT OCW lecture Video
How to sort a HashMap (Java) and keep the duplicate entries

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.

Hash function

Choosing a good hash function

Perfect hash function

Collision resolution

Load factor

Separate chaining

Separate chaining with list heads

Separate chaining with other structures

Open addressing

Coalesced hashing

Robin Hood hashing

Cuckoo hashing

Hopscotch hashing

Dynamic resizing

Resizing by copying all entries

Incremental resizing

Monotonic keys

Other solutions

Performance analysis

Advantages

Drawbacks

Associative arrays

Database indexing

Caches

Sets

Object representation

Unique data representation

In programming languages

Independent packages

History

See also

Related data structures

External links