Count-Min sketch
Encyclopedia
The Count-Min sketch is a probabilistic
Randomized algorithm
A randomized algorithm is an algorithm which employs a degree of randomness as part of its logic. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of random bits...

 sub-linear space
Complexity class
In computational complexity theory, a complexity class is a set of problems of related resource-based complexity. A typical complexity class has a definition of the form:...

 streaming algorithm
Streaming algorithm
In computer science, streaming algorithms are algorithms forprocessing data streams in which the input is presented as a sequence ofitems and can be examined in only a few passes...

 which can be used to summarize a data stream in many different ways. The algorithm was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan.

Setup

The data structure is parameterized by the constants and which determine the time and space needs and the probability of error of the queries. The algorithm needs a two dimensional array with width and depth called here count. A series of hash functions must be randomly drawn from a pairwise independent
Pairwise independence
In probability theory, a pairwise independent collection of random variables is a set of random variables any two of which are independent. Any collection of mutually independent random variables is pairwise independent, but some pairwise independent collections are not mutually independent...

 hash function
Hash function
A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...

family.

For later convenience we assign and .

Update

When a new value arrives we update as follows: , . That is, for each row we take the corresponding hash function, apply it to the newly received value and add one to the column corresponding to the hash value.

Query

The array can then be used to estimate any of several different statistics at any point. If we want to estimate, for instance, the number of times a specific value appeared so far in the stream we would compute (this assume all added values are positive). This estimate has the guarantee that with probability .

Small modifications to the data structure can be used to sketch other different stream statistics.

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK