Boyer–Moore string search algorithm

The Boyer–Moore string search algorithm is a particularly efficient string searching algorithm

String searching algorithm

String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text....

, and it has been the standard benchmark for the practical string search literature. It was developed by Bob Boyer and J Strother Moore

J Strother Moore

J Strother Moore is a computer scientist, and he is a co-developer of the Boyer–Moore string search algorithm and the Boyer–Moore automated theorem prover, Nqthm. An example of the workings of the Boyer–Moore string search algorithm is given...

in 1977. The algorithm

Algorithm

In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

preprocesses

Preprocessor

In computer science, a preprocessor is a program that processes its input data to produce output that is used as input to another program. The output is said to be a preprocessed form of the input data, which is often used by some subsequent programs like compilers...

the target string

String (computer science)

In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

(key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize

Amortization

Amortization is the process of decreasing, or accounting for, an amount over a period. The word comes from Middle English amortisen to kill, alienate in mortmain, from Anglo-French amorteser, alteration of amortir, from Vulgar Latin admortire to kill, from Latin ad- + mort-, mors death.When used...

the expense of the preprocessing by searching repeatedly). The execution time of the Boyer-Moore algorithm, while still linear in the size of the string being searched, can have a significantly lower constant factor than many other search algorithms: it doesn't need to check every character of the string to be searched, but rather skips over some of them. Generally the algorithm gets faster as the key being searched for becomes longer. Its efficiency derives from the fact that with each unsuccessful attempt to find a match between the search string and the text it is searching, it uses the information gained from that attempt to rule out as many positions of the text as possible where the string cannot match.

How the algorithm works

What people frequently find surprising about the Boyer-Moore algorithm, when they first encounter it, is that its verifications—its attempts to check whether a match exists at a particular position—work backwards. If it starts a search at the beginning of a text for the word "ANPANMAN

Anpanman

is one of the most popular anime cartoon series for young children in Japan. , the titular Anpanman is the most popular fictional character among people age 0 to 12 years in Japan in 10 consective years, according to research by Bandai. The series is written by Takashi Yanase, a Japanese writer of...

", for instance, it checks the eighth position of the text to see if it contains an "N". If it finds the "N", it moves to the seventh position to see if that contains the last "A" of the word, and so on until it checks the first position of the text for an "A".

Why Boyer-Moore takes this backward approach is clearer when we consider what happens if the verification fails—for instance, if instead of an "N" in the eighth position, we find an "X". The "X" doesn't appear anywhere in "ANPANMAN", and this means there is no match for the search string at the very start of the text—or at the next seven positions following it, since those would all fall across the "X" as well. After checking the eight characters of the word "ANPANMAN" for just one character "X", we're able to skip ahead and start looking for a match ending at the sixteenth position of the text.

This explains why the best-case performance of the algorithm, for a text of length

and a fixed pattern of length

, is

: in the best case, only one in

characters needs to be checked. This also explains the somewhat counter-intuitive result that the longer the pattern we are looking for, the faster the algorithm will usually be able to find it.

The algorithm precomputes two tables to process the information it obtains in each failed verification: one table calculates how many positions ahead to start the next search based on the value of the character that caused the mismatch; the other makes a similar calculation based on how many characters were matched successfully before the match attempt failed. (Because these two tables return results indicating how far ahead in the text to "jump", they are sometimes called "jump tables", which should not be confused with the more common meaning of jump tables

Branch table

In computer programming, a branch table is a term used to describe an efficient method of transferring program control to another part of a program using a table of branch instructions. It is a form of multiway branch...

in computer science.) The algorithm will shift the larger of the two jump values when a mismatch occurs.

The first table

Populate the first table as follows. For each i less than the length of the search string, construct the pattern consisting of the last i characters of the string preceded by a mis-matched character, right-align the pattern and string, and record the fewest characters the pattern must shift left for a match.

For instance, for the search string ANPANMAN, the table would be as follows:

(NMAN signifies a substring in ANPANMAN consisting of a character that is not 'N' plus the characters 'MAN'.)

i	Pattern	Left Shift
0	N	It is true that the next letter to the left in 'ANPANMAN' is not N (it is A), therefore the pattern N must shift one left position for a match; then = 1
1	AN	AN is not a substring in ANPANMAN, then : Left_Shift is the number of letters in 'ANPANMAN' = 8
2	MAN	Substring MAN match with ANPANMAN three positions to the left. Then Left_Shift = 3
3	NMAN	We see that 'NMAN' is not a substring of 'ANPANMAN' but 'NMAN' is a possible substring 6 positions away to the left : ('NMANPANMAN'); then = 6
4	ANMAN	6
5	PANMAN	6
6	NPANMAN	6
7	ANPANMAN	6

The second table

The second table is easier to calculate: Start at the last character of the sought string and move towards the first character. Each time you move left, if the character you are on is not in the table already, add it; its Shift value is its distance from the rightmost character. All other characters receive a count equal to the length of the search string.

Example: For the string ANPANMAN, the second table would be as shown (for clarity, entries are shown in the order they would be added to the table):
(The N which is supposed to be zero is based on the second N from the right because we only record the calculation for the first

letters)

Character	Shift
A	1
M	2
N	3
P	5
all other characters	8

The amount of shift calculated by the second table is sometimes called the "bad character shift".

Performance of the Boyer-Moore string search algorithm

The worst-case to find all occurrences in an aperiodic text needs approximately

comparisons, hence the complexity is

, regardless whether the text contains a match or not. This proof took some years to determine. In the year the algorithm was devised, 1977, the maximum number of comparisons was shown to be no more than

; in 1980 it was shown to be no more than

, until Cole's result in Sep 1991.

C

include
include

define ALPHABET_LEN 255
define NOT_FOUND patlen
define max(a, b) ((a < b) ? b : a)

// delta1 table: delta1[c] contains the distance between the last
// character of pat and the rightmost occurence of c in pat.
// If c does not occur in pat, then delta1[c] = patlen.
// If c is at string[i] and c != pat[patlen-1], we can
// safely shift i over by delta1[c], which is the minimum distance
// needed to shift pat forward to get string[i] lined up
// with some character in pat.
// this algorithm runs in alphabet_len+patlen time.
void make_delta1(int *delta1, uint8_t *pat, int32_t patlen) {
int i;
for (i=0; i < ALPHABET_LEN; i++) {
delta1[i] = NOT_FOUND;
}
for (i=0; i < patlen-1; i++) {
delta1[pat[i]] = patlen-1 - i;
}
}

// true if the suffix of word starting from word[pos] is a prefix
// of word
int is_prefix(uint8_t *word, int wordlen, int pos) {
int i;
int suffixlen = wordlen - pos;
// could also use the strncmp library function here
for (i = 0; i < suffixlen; i++) {
if (word[i] != word[pos+i]) {
return 0;
}
}
return 1;
}

// length of the longest suffix of word ending on word[pos].
// suffix_length("dddbcabc", 8, 4) = 2
int suffix_length(uint8_t *word, int wordlen, int pos) {
int i;
// increment suffix length i to the first mismatch or beginning
// of the word
for (i = 0; (word[pos-i] word[wordlen-1-i]) && (i < pos); i++);
return i;
}

// delta2 table: given a mismatch at pat[pos], we want to align
// with the next possible full match could be based on what we
// know about pat[pos+1] to pat[patlen-1].
//
// In case 1:
// pat[pos+1] to pat[patlen-1] does not occur elsewhere in pat,
// the next plausible match starts at or after the mismatch.
// If, within the substring pat[pos+1 .. patlen-1], lies a prefix
// of pat, the next plausible match is here (if there are multiple
// prefixes in the substring, pick the longest). Otherwise, the
// next plausible match starts past the character aligned with
// pat[patlen-1].
//
// In case 2:
// pat[pos+1] to pat[patlen-1] does occur elsewhere in pat. The
// mismatch tells us that we are not looking at the end of a match.
// We may, however, be looking at the middle of a match.
//
// The first loop, which takes care of case 1, is analogous to
// the KMP table, adapted for a 'backwards' scan order with the
// additional restriction that the substrings it considers as
// potential prefixes are all suffixes. In the worst case scenario
// pat consists of the same letter repeated, so every suffix is
// a prefix. This loop alone is not sufficient, however:
// Suppose that pat is "ABYXCDEYX", and text is ".....ABYXCDEYX".
// We will match X, Y, and find B != E. There is no prefix of pat
// in the suffix "YX", so the first loop tells us to skip forward
// by 9 characters.
// Although superficially similar to the KMP table, the KMP table
// relies on information about the beginning of the partial match
// that the BM algorithm does not have.
//
// The second loop addresses case 2. Since suffix_length may not be
// unique, we want to take the minimum value, which will tell us
// how far away the closest potential match is.
void make_delta2(int *delta2, uint8_t *pat, int32_t patlen) {
int p;
int last_prefix_index = patlen-1;

// first loop
for (p=patlen-1; p>=0; p--) {
if (is_prefix(pat, patlen, p+1)) {
last_prefix_index = p+1;
}
delta2[p] = last_prefix_index + (patlen-1 - p);
}

// second loop
for (p=0; p < patlen-1; p++) {
int slen = suffix_length(pat, patlen, p);
if (pat[p - slen] != pat[patlen-1 - slen]) {
delta2[patlen-1 - slen] = patlen-1 - p + slen;
}
}
}

char* boyer_moore (uint8_t *string, uint32_t stringlen, uint8_t *pat, uint32_t patlen) {
int i;
int delta1[ALPHABET_LEN];
int *delta2 = malloc(patlen * sizeof(int));
make_delta1(&delta1, pat, patlen);
make_delta2(delta2, pat, patlen);

i = patlen-1;
while (i < stringlen) {
int j = patlen-1;
while (j >= 0 && (string[i] pat[j])) {
--i;
--j;
}
if (j < 0) {
free(delta2);
return (string + i+1);
}

i += max(delta1[string[i]], delta2[j]);
}
free(delta2);
return NULL;
}

Variants

The Turbo Boyer-Moore algorithm takes an additional constant amount of space to complete a search within

comparisons (as opposed to

for Boyer-Moore), where

is the number of characters in the text to be searched.

The Boyer-Moore-Horspool algorithm is a simplification of the Boyer-Moore algorithm that leaves out the "first table". The Boyer-Moore-Horspool algorithm requires (in the worst case)

comparisons, while the Boyer-Moore algorithm requires (in the worst case) only

comparisons.

External links

String Searching Applet animation
Original article
An example of the Boyer-Moore algorithm from the homepage of J Strother Moore
J Strother Moore
J Strother Moore is a computer scientist, and he is a co-developer of the Boyer–Moore string search algorithm and the Boyer–Moore automated theorem prover, Nqthm. An example of the workings of the Boyer–Moore string search algorithm is given...

, co-inventor of the algorithm
An explanation of the algorithm (with sample C code)
Cole et al., Tighter lower bounds on the exact complexity of string matching
An implementation of the algorithm in Ruby
Scala functional implementation with source code

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.