Rabin–Karp string search algorithm
The Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text. For text of length n and p patterns of combined length m, its average and best case running time is O(n+m) in space O(p), but its worst-case time is O(nm). In contrast, the Aho–Corasick string matching algorithm has asymptotic worst-time complexity O(n+m) in space O(m).

A practical application of Rabin–Karp is detecting plagiarism. Given source material, Rabin–Karp can rapidly search through a paper for instances of sentences from the source material, ignoring details such as case and punctuation. Because of the abundance of the sought strings, single-string searching algorithms are impractical.

Shifting substrings search and competing algorithms

A brute-force substring search algorithm checks all possible positions:

1 function NaiveSearch(string s[1..n], string sub[1..m])
2     for i from 1 to n-m+1
3         for j from 1 to m
4             if s[i+j-1] ≠ sub[j]
5                 jump to next iteration of outer loop
6         return i
7     return not found

This algorithm works well in many practical cases, but can exhibit relatively long running times on certain examples, such as searching for a string of 10,000 "a"s followed by a "b" in a string of 10 million "a"s, in which case it exhibits its worst-case Θ(mn) time.
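A minimal rendering of this brute-force search in Python (the function name and the zero-based indexing are illustrative choices, not part of the pseudocode above):

def naive_search(s, sub):
    # Compare sub against every alignment of s, character by character.
    n, m = len(s), len(sub)
    for i in range(n - m + 1):
        for j in range(m):
            if s[i + j] != sub[j]:
                break            # mismatch: try the next alignment
        else:
            return i             # all m characters matched
    return None                  # not found

print(naive_search("abracadabra", "cad"))   # 4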

The Knuth–Morris–Pratt algorithm reduces this to Θ(n) time using precomputation to examine each text character only once; the Boyer–Moore algorithm skips forward not by 1 character, but by as many as possible for the search to succeed, effectively decreasing the number of times we iterate through the outer loop, so that the number of characters examined can be as small as n/m in the best case. The Rabin–Karp algorithm focuses instead on speeding up lines 3-6.

Use of hashing for shifting substring search

Rather than pursuing more sophisticated skipping, the Rabin–Karp algorithm seeks to speed up the testing of equality of the pattern to the substrings in the text by using a hash function. A hash function is a function which converts every string into a numeric value, called its hash value; for example, we might have hash("hello") = 5. Rabin–Karp exploits the fact that if two strings are equal, their hash values are also equal. Thus, it would seem all we have to do is compute the hash value of the substring we're searching for, and then look for a substring with the same hash value.

However, there are two problems with this. First, because there are so many different strings, to keep the hash values small we have to assign some strings the same number. This means that if the hash values match, the strings might not match; we have to verify that they do, which can take a long time for long substrings. Luckily, a good hash function promises us that on most reasonable inputs, this won't happen too often, which keeps the average search time good. The second problem, addressed below, is that naively computing the hash value of each substring takes as long as comparing the strings directly, so the hash values must be computed incrementally.
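For instance, with a toy hash that simply sums character codes (the same simple scheme reappears as a rolling hash below), two distinct strings can share a hash value, so a hash match must still be verified by a direct comparison; a small Python illustration:

def toy_hash(s):
    # Toy hash: the sum of the character codes of s.
    return sum(ord(c) for c in s)

print(toy_hash("ab") == toy_hash("ba"))   # True: the hash values collide
print("ab" == "ba")                       # False: the strings differ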

The algorithm is as shown:

1 function RabinKarp(string s[1..n], string sub[1..m])
2     hsub := hash(sub[1..m]); hs := hash(s[1..m])
3     for i from 1 to n-m+1
4         if hs = hsub
5             if s[i..i+m-1] = sub
6                 return i
7         hs := hash(s[i+1..i+m])
8     return not found

Lines 2, 5, and 7 each require Θ(m) time. However, line 2 is only executed once, and line 5 is only executed if the hash values match, which is unlikely to happen more than a few times. Line 4 is executed n − m + 1 times, but only requires constant time. So the only problem is line 7.

If we naively recompute the hash value for the substring s[i+1..i+m], this would require Θ(m) time, and since this is done on each loop, the algorithm would require Ω(mn) time, the same as the most naive algorithms. The trick to solving this is to note that the variable hs already contains the hash value of s[i..i+m-1]. If we can use this to compute the next hash value in constant time, then our problem will be solved.

We do this using what is called a rolling hash. A rolling hash is a hash function specially designed to enable this operation. One simple example is adding up the values of each character in the substring. Then, we can use this formula to compute the next hash value in constant time:
hash(s[i+1..i+m]) = hash(s[i..i+m-1]) - s[i] + s[i+m]
This simple function works, but will result in statement 5 being executed more often than other more sophisticated rolling hash functions such as those discussed in the next section.
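A minimal Python sketch of this additive rolling hash (the function names are illustrative assumptions):

def additive_hash(s):
    # Hash of a string: the sum of its character codes.
    return sum(ord(c) for c in s)

def roll(prev_hash, old_char, new_char):
    # Constant-time update: drop the character leaving the window, add the one entering it.
    return prev_hash - ord(old_char) + ord(new_char)

s, m = "abracadabra", 3
h = additive_hash(s[0:m])
for i in range(1, len(s) - m + 1):
    h = roll(h, s[i - 1], s[i + m - 1])
    assert h == additive_hash(s[i:i + m])   # matches recomputing from scratch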

Notice that if we're very unlucky, or have a very bad hash function such as a constant function, line 5 might very well be executed n times, on every iteration of the loop. Because it requires Θ(m) time, the whole algorithm then takes a worst-case Θ(mn) time.

Hash function used

The key to Rabin–Karp performance is the efficient computation of hash values of the successive substrings of the text. One popular and effective rolling hash function treats every substring as a number in some base, the base being usually a large prime. For example, if the substring is "hi" and the base is 101, the hash value would be 104 × 101¹ + 105 × 101⁰ = 10609 (the ASCII code of 'h' is 104 and of 'i' is 105).
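The arithmetic can be checked directly, for instance in Python:

base = 101
h = ord('h') * base**1 + ord('i') * base**0
print(h)   # 10609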

Technically, this algorithm is only similar to the true number in a non-decimal system representation, since for example we could have the "base" less than one of the "digits". See hash function for a much more detailed discussion. The essential benefit achieved by such representation is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths.

For example, if we have text "abracadabra" and we are searching for a pattern of length 3, we can compute the hash of "bra" from the hash for "abr" (the previous substring) by subtracting the number added for the first 'a' of "abr", i.e. 97 × 101² (97 is ASCII for 'a' and 101 is the base we are using), multiplying by the base and adding for the last 'a' of "bra", i.e. 97 × 101⁰ = 97. If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes.
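A short Python sketch of this update on the same example (the names are illustrative, and no modular reduction is applied here so the numbers stay visible):

base = 101

def poly_hash(s):
    # Treat the string as a number in the given base, most significant character first.
    h = 0
    for c in s:
        h = h * base + ord(c)
    return h

h_abr = poly_hash("abr")
# Drop the leading 'a', shift one position, append the trailing 'a'.
h_bra = (h_abr - ord('a') * base**2) * base + ord('a')
assert h_bra == poly_hash("bra")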

Theoretically, there exist other algorithms that could provide convenient recomputation, e.g. multiplying together ASCII values of all characters so that shifting substring would only entail dividing by the first character and multiplying by the last. The limitation, however, is the limited size of the integer data type and the necessity of using modular arithmetic to scale down the hash results, for which see the hash function article; meanwhile, those naive hash functions that would not produce large numbers quickly, like just adding ASCII values, are likely to cause many hash collisions and hence slow down the algorithm. Hence the described hash function is typically the preferred one in Rabin–Karp.
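Putting the pieces together, here is a minimal single-pattern Rabin–Karp sketch in Python using the polynomial rolling hash reduced modulo a large prime; the base, the modulus, and the function names are assumptions made for illustration:

BASE = 101
MOD = (1 << 61) - 1   # a large prime keeps hash values within a machine word

def rabin_karp(s, sub):
    n, m = len(s), len(sub)
    if m == 0 or m > n:
        return None
    # Hash of the pattern and of the first window of the text.
    hsub = hs = 0
    for j in range(m):
        hsub = (hsub * BASE + ord(sub[j])) % MOD
        hs = (hs * BASE + ord(s[j])) % MOD
    high = pow(BASE, m - 1, MOD)   # weight of the character leaving the window
    for i in range(n - m + 1):
        # A hash match is verified with a direct comparison to rule out collisions.
        if hs == hsub and s[i:i + m] == sub:
            return i
        if i < n - m:
            # Roll the window: remove s[i], shift, append s[i+m].
            hs = ((hs - ord(s[i]) * high) * BASE + ord(s[i + m])) % MOD
    return None

print(rabin_karp("abracadabra", "bra"))   # 1 (0-based index)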

Rabin–Karp and multiple pattern search

Rabin–Karp is inferior for single pattern searching to the Knuth–Morris–Pratt algorithm, the Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.

That is, if we want to find any of a large number, say k, fixed length patterns in a text, we can create a simple variant of Rabin–Karp that uses a Bloom filter or a set data structure to check whether the hash of a given string belongs to a set of hash values of patterns we are looking for:

function RabinKarpSet(string s[1..n], set of string subs, m):
    set hsubs := emptySet
    for each sub in subs
        insert hash(sub[1..m]) into hsubs
    hs := hash(s[1..m])
    for i from 1 to n-m+1
        if hs ∈ hsubs and s[i..i+m-1] ∈ subs
            return i
        hs := hash(s[i+1..i+m])
    return not found

We assume all the substrings have a fixed length m.

A naïve way to search for k patterns is to repeat a single-pattern search k times, taking O(n) time per pattern and O(nk) time in total. In contrast, the variant of Rabin–Karp above can find all k patterns in O(n+k) expected time, because a hash table (or set) checks whether a substring hash equals any of the pattern hashes in O(1) expected time.
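Below is a minimal Python sketch of this variant; it is an illustration rather than a transcription of the pseudocode above (it reports every match rather than only the first, and the base, modulus, and helper names are assumptions carried over from the earlier sketch):

BASE = 101
MOD = (1 << 61) - 1

def rabin_karp_set(s, subs, m):
    # All patterns in subs are assumed to have the same length m.
    def poly_hash(t):
        h = 0
        for c in t:
            h = (h * BASE + ord(c)) % MOD
        return h

    pattern_set = set(subs)
    hsubs = {poly_hash(sub) for sub in subs}
    n = len(s)
    if m > n:
        return []
    hs = poly_hash(s[:m])
    high = pow(BASE, m - 1, MOD)
    matches = []
    for i in range(n - m + 1):
        # A hash hit is confirmed against the actual pattern set to rule out collisions.
        if hs in hsubs and s[i:i + m] in pattern_set:
            matches.append(i)
        if i < n - m:
            hs = ((hs - ord(s[i]) * high) * BASE + ord(s[i + m])) % MOD
    return matches

print(rabin_karp_set("abracadabra", {"bra", "cad"}, 3))   # [1, 4, 8]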