Bitap algorithm - AbsoluteAstronomy.com

The bitap algorithm is an approximate string matching

Approximate string matching

In computing, approximate string matching is the technique of finding strings that match a pattern approximately...

algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance

Levenshtein distance

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...

— if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operation

Bitwise operation

A bitwise operation operates on one or more bit patterns or binary numerals at the level of their individual bits. This is used directly at the digital hardware level as well as in microcode, machine code and certain kinds of high level languages...

s, which are extremely fast.

The bitap algorithm is perhaps best known as one of the underlying algorithms of the Unix

Unix

Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...

utility

Programming tool

A programming tool or software development tool is a program or application that software developers use to create, debug, maintain, or otherwise support other programs and applications...

agrep

Agrep

agrep is a proprietary fuzzy string searching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system...

, written by Udi Manber

Udi Manber

Udi Manber is an Israeli computer scientist. He is one of the authors of agrep and GLIMPSE. As of April 2008, he is employed by Google as one of their vice presidents of engineering.-Biography:...

, Sun Wu, and Burra Gopal. Manber and Wu's original paper gives extensions of the algorithm to deal with fuzzy matching of general regular expression

Regular expression

In computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...

s.

Due to the data structures required by the algorithm, it performs best on patterns less than a constant length (typically the word length of the machine in question), and also prefers inputs over a small alphabet. Once it has been implemented for a given alphabet and word length m, however, its running time

Running Time

Running Time may refer to:* Running Time * see Analysis of algorithms...

is completely predictable — it runs in O

Big O notation

In mathematics, big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation, or...

(mn) operations, no matter the structure of the text or the pattern.

The bitap algorithm for exact string searching was invented by Bálint Dömölki in 1964 and extended by R. K. Shyamasundar in 1977, before being reinvented in the context of fuzzy string searching by Manber

Udi Manber

Udi Manber is an Israeli computer scientist. He is one of the authors of agrep and GLIMPSE. As of April 2008, he is employed by Google as one of their vice presidents of engineering.-Biography:...

and Wu in 1991 based on work done by Ricardo Baeza-Yates

Ricardo Baeza-Yates

Ricardo Baeza-Yates is a Chilean computer scientist and currently VP for EMEA and Latin America leading the Yahoo! Research labs at Barcelona, Spain and Santiago, Chile, supervising also the lab in Haifa, Israel. His Ph.D...

and Gaston Gonnet

Gaston Gonnet

Gaston H. Gonnet is a Uruguayan computer scientist and entrepreneur. He is best known for his contributions to the Maple computer algebra system and the creation of an electronic version of the Oxford English Dictionary.- Education and professional life :...

. The algorithm was improved by Baeza-Yates and Navarro in 1996 and later by Gene Myers for long patterns in 1998.

Exact searching

The bitap algorithm for exact string searching

String searching algorithm

String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text....

, in full generality, looks like this when implemented in C

C (programming language)

C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

:

#include
#include

typedef char BIT; /* needs only to hold the values 0 and 1 */

const char *bitap_search(const char *text, const char *pattern)
{
const char *result = NULL;
int m = strlen(pattern);
BIT *R;
int i, k;

if (pattern[0] '\0') return text;

/* Initialize the bit array R */
R = malloc((m+1) * sizeof *R);
R[0] = 1;
for (k=1; k <= m; ++k)
R[k] = 0;

for (i=0; text[i] != '\0'; ++i) {
/* Update the bit array. */
for (k=m; k >= 1; --k)
R[k] = R[k-1] && (text[i] pattern[k-1]);

if (R[m]) {
result = (text+i - m) + 1;
break;
}
}

free(R);
return result;
}

Bitap distinguishes itself from other well-known string searching algorithms in its natural mapping onto simple bitwise operations, as in the following modification of the above program. Notice that in this implementation, counterintuitively, each bit with value zero indicates a match, and each bit with value 1 indicates a non-match. The same algorithm can be written with the intuitive semantics for 0 and 1, but in that case we must introduce another instruction into the inner loop

Inner loop

In computer programs, an important form of control flow is the loop. For example, this small pseudo-code program uses two nested loops to iterate over all the entries of an n×n matrix, changing their values so that the matrix becomes an identity matrix: for a in 1..n for b in 1..n ...

to set R |= 1. In this implementation, we take advantage of the fact that left-shifting a value shifts in zeros on the right, which is precisely the behavior we need.

Notice also that we require CHAR_MAX additional bitmasks in order to convert the (text[i] pattern[k-1]) condition in the general implementation into bitwise operations. Therefore, the bitap algorithm performs better when applied to inputs over smaller alphabets.

#include
#include

const char *bitap_bitwise_search(const char *text, const char *pattern)
{
int m = strlen(pattern);
unsigned long R;
unsigned long pattern_mask[CHAR_MAX+1];
int i;

if (pattern[0] '\0') return text;
if (m > 31) return "The pattern is too long!";

/* Initialize the bit array R */
R = ~1;

/* Initialize the pattern bitmasks */
for (i=0; i <= CHAR_MAX; ++i)
pattern_mask[i] = ~0;
for (i=0; i < m; ++i)
pattern_mask[pattern[i]] &= ~(1UL << i);

for (i=0; text[i] != '\0'; ++i) {
/* Update the bit array */
R |= pattern_mask[text[i]];
R <<= 1;

if (0

(R & (1UL << m)))
return (text + i - m) + 1;
}

return NULL;
}

Fuzzy searching
To perform fuzzy string searching using the bitap algorithm, it is necessary to extend the bit array R into a second dimension. Instead of having a single array R that changes over the length of the text, we now have k distinct arrays R_1..k. Array R_i holds a representation of the prefixes of pattern that match any suffix of the current string with i or fewer errors. In this context, an "error" may be an insertion, deletion, or substitution; see Levenshtein distance

Levenshtein distance

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...

for more information on these operations.

The implementation below performs fuzzy matching

Fuzzy matching

Fuzzy matching is a technique used in computer-assisted translation and some other information technology applications such as record linkage. It works with matches that may be less than 100% perfect when finding correspondences between segments of a text and entries in a database of previous...

(returning the first match with up to k errors) using the fuzzy bitap algorithm. However, it only pays attention to substitutions, not to insertions or deletions — in other words, a Hamming distance

Hamming distance

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different...

of k. As before, the semantics of 0 and 1 are reversed from their intuitive meanings.

#include
#include
#include

const char *bitap_fuzzy_bitwise_search(const char *text, const char *pattern, int k)
{
const char *result = NULL;
int m = strlen(pattern);
unsigned long *R;
unsigned long pattern_mask[CHAR_MAX+1];
int i, d;

if (pattern[0] '\0') return text;
if (m > 31) return "The pattern is too long!";

/* Initialize the bit array R */
R = malloc((k+1) * sizeof *R);
for (i=0; i <= k; ++i)
R[i] = ~1;

/* Initialize the pattern bitmasks */
for (i=0; i <= CHAR_MAX; ++i)
pattern_mask[i] = ~0;
for (i=0; i < m; ++i)
pattern_mask[pattern[i]] &= ~(1UL << i);

for (i=0; text[i] != '\0'; ++i) {
/* Update the bit arrays */
unsigned long old_Rd1 = R[0];

R[0] |= pattern_mask[text[i]];
R[0] <<= 1;

for (d=1; d <= k; ++d) {
unsigned long tmp = R[d];
/* Substitution is all we care about */
R[d] = (old_Rd1 & (R[d] | pattern_mask[text[i]])) << 1;
old_Rd1 = tmp;
}

if (0

(R[k] & (1UL << m))) {
result = (text+i - m) + 1;
break;
}
}

free(R);
return result;
}

External links and references

Bálint Dömölki, An algorithm for syntactical analysis, Computational Linguistics 3, Hungarian Academy of Science pp. 29–46, 1964.
Bálint Dömölki, A universal compiler system based on production rules, BIT Numerical Mathematics, 8(4), pp 262–275, 1968. doi:10.1007/BF01933436
R. K. Shyamasundar, Precedence parsing using Dömölki's algorithm, International Journal of Computer Mathematics, 6(2)pp 105–114, 1977
Udi Manber, Sun Wu. "Fast text searching with errors." Technical Report TR-91-11. Department of Computer Science, University of Arizona
University of Arizona
The University of Arizona is a land-grant and space-grant public institution of higher education and research located in Tucson, Arizona, United States. The University of Arizona was the first university in the state of Arizona, founded in 1885...

, Tucson, June 1991. ([ftp://ftp.cs.arizona.edu/agrep/agrep.ps.1.Z gzipped PostScript])
Udi Manber, Sun Wu. "Fast text search allowing errors." Communications of the ACM
Communications of the ACM
Communications of the ACM is the flagship monthly journal of the Association for Computing Machinery . First published in 1957, CACM is sent to all ACM members, currently numbering about 80,000. The articles are intended for readers with backgrounds in all areas of computer science and information...

, 35(10): pp. 83–91, October 1992, doi:10.1145/135239.135244.
Ricardo A. Baeza-Yates, Gastón H. Gonnet. "A New Approach to Text Searching." Communications of the ACM, 35(10): pp. 74–82, October 1992.
R. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In Dan Hirchsberg and Gene Myers, editors, Combinatorial Pattern Matching (CPM'96), LNCS 1075, pages 1–23, Irvine, CA, June 1996.
G. Myers. "A fast bit-vector algorithm for approximate string matching based on dynamic programming." Journal of the ACM 46 (3), May 1999, 395–415.
Libbitap, a free implementation that shows how the algorithm can easily be extended for most regular expressions. Unlike the code above, it places no limit on the pattern length.
Baeza-Yates. Modern Information Retrieval. ISBN 0-201-39829-X.

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.

Exact searching

(R & (1UL << m))) return (text + i - m) + 1; } return NULL; }

(R[k] & (1UL << m))) { result = (text+i - m) + 1; break; } } free(R); return result; }

(R & (1UL << m)))
return (text + i - m) + 1;
}

return NULL;
}

(R[k] & (1UL << m))) {
result = (text+i - m) + 1;
break;
}
}

free(R);
return result;
}