Pattern matching
In computer science, pattern matching is the act of checking some sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact. The patterns generally have the form of either sequences or tree structures. Uses of pattern matching include outputting the locations (if any) of a pattern within a token sequence, outputting some component of the matched pattern, and substituting the matching pattern with some other token sequence (i.e., search and replace).

Sequence patterns (e.g., a text string) are often described using regular expressions and matched using techniques such as backtracking.
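
As a rough illustration of matching by backtracking, the sketch below handles a small, assumed pattern language (literal characters, . for any single character, and a postfix * for repetition). The function names matchHere, matchStar and matchAnywhere are illustrative only and do not come from any library.

```haskell
import Data.List (tails)

-- A tiny backtracking matcher for an assumed mini-language:
--   c   matches the literal character c
--   .   matches any single character
--   x*  matches zero or more repetitions of x (x is a literal or '.')

-- True if the pattern matches a prefix of the input.
matchHere :: String -> String -> Bool
matchHere [] _ = True
matchHere (p : '*' : rest) input = matchStar p rest input
matchHere (p : rest) (c : cs)
  | p == '.' || p == c = matchHere rest cs
matchHere _ _ = False

-- Try zero repetitions first; if the rest of the pattern fails,
-- consume one more character and retry (the backtracking step).
matchStar :: Char -> String -> String -> Bool
matchStar _ rest input
  | matchHere rest input = True
matchStar p rest (c : cs)
  | p == '.' || p == c = matchStar p rest cs
matchStar _ _ _ = False

-- True if the pattern matches starting at any position of the input.
matchAnywhere :: String -> String -> Bool
matchAnywhere pat input = any (matchHere pat) (tails input)
```

Production regular expression engines are considerably more elaborate; this version only shows how a failed attempt is abandoned and the next candidate tried.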

Tree patterns are used in some programming languages as a general tool to process data based on its structure. For example, Haskell, ML and the symbolic mathematics language Mathematica have special syntax for expressing tree patterns and a language construct for conditional execution and value retrieval based on them. For simplicity and efficiency reasons, these tree patterns lack some features that are available in regular expressions.

Often it is possible to give alternative patterns that are tried one by one, which yields a powerful conditional programming construct. Pattern matching sometimes includes support for guards.
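
A minimal sketch of alternatives combined with guards, in Haskell. The function classify and its result strings are hypothetical and chosen only for illustration.

```haskell
-- Alternatives are tried from top to bottom; a guard (after '|') must
-- also hold for its alternative to be chosen. If every guard of an
-- alternative fails, matching falls through to the next alternative.
classify :: [Int] -> String
classify [] = "empty"
classify [x]
  | x < 0     = "single negative"
  | otherwise = "single non-negative"
classify (x : y : _)
  | x == y    = "starts with a repeated element"
classify _ = "something else"
```

For instance, classify [1, 2] matches the third pattern structurally, but its guard fails, so the final catch-all alternative applies.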

Term rewriting and graph rewriting languages rely on pattern matching for the fundamental way a program evaluates into a result.

History

The first computer programs to use pattern matching were text editors. At Bell Labs, Ken Thompson extended the seeking and replacing features of the QED editor to accept regular expressions. Early programming languages with pattern matching constructs include SNOBOL from 1962, SASL from 1976, NPL from 1977, and KRC from 1981. The first programming language with tree-based pattern matching features was Fred McBride's extension of LISP, in 1970.

Primitive patterns

The simplest pattern in pattern matching is an explicit value or a variable. For an example, consider a simple function definition in Haskell syntax (function parameters are not in parentheses but are separated by spaces, = is not assignment but definition):


f 0 = 1


Here, 0 is a single value pattern. Now, whenever f is given 0 as argument the pattern matches and the function returns 1. With any other argument, the matching and thus the function fail. As the syntax supports alternative patterns in function definitions, we can continue the definition extending it to take more generic arguments:


f n = n * f (n-1)


Here, the first n is a single variable pattern, which will match absolutely any argument and bind it to name n to be used in the rest of the definition. In Haskell (unlike at least Hope), patterns are tried in order so the first definition still applies in the very specific case of the input being 0, while for any other argument the function returns n * f (n-1) with n being the argument.

The wildcard pattern (often written as _) is also simple: like a variable name, it matches any value, but does not bind the value to any name.
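
The three primitive patterns can be put side by side in a short sketch. The type signatures and the helper isZero are additions for illustration; f is the function from the text and is assumed to be called with a non-negative argument.

```haskell
-- Literal pattern first, then a variable pattern that matches anything.
-- (Assumes a non-negative argument; a negative one would never reach 0.)
f :: Integer -> Integer
f 0 = 1
f n = n * f (n - 1)

-- The wildcard matches any value without binding it to a name.
isZero :: Integer -> Bool
isZero 0 = True
isZero _ = False
```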

Tree patterns

More complex patterns can be built from the primitive ones of the previous section, usually in the same way as values are built by combining other values. The difference is that, with variable and wildcard parts, a pattern does not build into a single value but matches a group of values: every combination of the concrete elements with whatever is allowed to vary within the structure of the pattern.

A tree pattern describes a part of a tree by starting with a node and specifying some branches and nodes and leaving some unspecified with a variable or wildcard pattern. It may help to think of the abstract syntax tree of a programming language and algebraic data types.

In Haskell, the following line defines an algebraic data type Color that has a single data constructor ColorConstructor that wraps an integer and a string.


data Color = ColorConstructor Integer String


The constructor is a node in a tree and the integer and string are leaves in branches.

To make Color an abstract data type, we write functions that act as the interface to it. Such functions need to extract data from a value of the type, for example just the string or just the integer part of a Color.

If we pass a variable that is of type Color, how can we get the data out of this variable? For example, for a function to get the integer part of Color, we can use a simple tree pattern and write:


integerPart (ColorConstructor theInteger _) = theInteger


Similarly:

stringPart (ColorConstructor _ theString) = theString


The creation of these functions can be automated by Haskell's data record syntax.
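
A sketch of the record-syntax form, reusing the names integerPart and stringPart from the hand-written accessors above:

```haskell
-- Record syntax declares the fields and generates the accessor
-- functions integerPart and stringPart automatically.
data Color = ColorConstructor
  { integerPart :: Integer
  , stringPart  :: String
  }
```

With this declaration, integerPart (ColorConstructor 255 "red") yields the integer field and stringPart yields the string field, without any explicit pattern being written.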

Filtering data with patterns

Pattern matching can be used to filter data of a certain structure. For instance, in Haskell a list comprehension could be used for this kind of filtering:


[A x|A x <- [A 1, B 1, A 2, B 2]]


evaluates to
[A 1, A 2]
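
The comprehension above presupposes a type with constructors A and B. A self-contained sketch, assuming a hypothetical type Item and helper onlyAs:

```haskell
-- A hypothetical two-constructor type for the filtering example.
data Item = A Int | B Int
  deriving (Show, Eq)

-- Elements that fail to match the pattern 'A x' are skipped silently.
onlyAs :: [Item] -> [Item]
onlyAs items = [A x | A x <- items]
```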

Pattern matching in Mathematica

In Mathematica, the only structure that exists is the tree, which is populated by symbols. In the Haskell syntax used thus far, this could be defined as
data SymbolTree = Symbol String [SymbolTree]
An example tree could then look like
Symbol "a" [Symbol "b" [], Symbol "c" []]

In the traditional, more suitable syntax, the symbols are written as they are and the levels of the tree are represented using [], so that for instance a[b,c] is a tree with a as the parent, and b and c as the children.

A pattern in Mathematica involves putting "_" at positions in that tree. For instance, the pattern

A[_]

will match elements such as A[1], A[2], or more generally A[x] where x is any entity. In this case, A is the concrete element, while _ denotes the piece of tree that can be varied. A symbol prepended to _ binds the match to that variable name while a symbol appended to _ restricts the matches to nodes of that symbol.

The Mathematica function Cases filters elements of the first argument that match the pattern in the second argument:

Cases[{a[1], b[1], a[2], b[2]}, a[_] ]

evaluates to

{a[1], a[2]}

Pattern matching applies to the structure of expressions. In the example below,

Cases[ {a[b], a[b, c], a[b[c], d], a[b[c], d[e]], a[b[c], d, e]}, a[b[_], _] ]

returns

{a[b[c],d], a[b[c],d[e]]}

because only these elements will match the pattern a[b[_],_] above.
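
The same structural matching can be approximated in Haskell over the SymbolTree type defined earlier. This is only a sketch: the Pattern type and the functions matches and cases are hypothetical, and they model just the single-node wildcard _, not the rest of Mathematica's pattern language.

```haskell
-- Symbol trees as defined earlier in the text.
data SymbolTree = Symbol String [SymbolTree]
  deriving (Show, Eq)

-- A hypothetical pattern type: a wildcard, or a symbol with sub-patterns.
data Pattern = Wild | PSymbol String [Pattern]

-- A wildcard matches any tree; a symbol pattern matches a tree with the
-- same head and the same number of children, each matching in turn.
matches :: Pattern -> SymbolTree -> Bool
matches Wild _ = True
matches (PSymbol name pats) (Symbol name' kids) =
  name == name'
    && length pats == length kids
    && and (zipWith matches pats kids)

-- Rough analogue of Cases: keep the trees that match the pattern.
cases :: [SymbolTree] -> Pattern -> [SymbolTree]
cases trees pat = filter (matches pat) trees
```

Here the pattern a[b[_], _] would be written PSymbol "a" [PSymbol "b" [Wild], Wild].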

In Mathematica, it is also possible to extract structures as they are created in the course of computation, regardless of how or where they appear. The function Trace can be used to monitor a computation, and return the elements that arise which match a pattern. For example, we can define the Fibonacci sequence as

fib[0|1]:=1
fib[n_]:= fib[n-1] + fib[n-2]

Then, we can ask the question: Given fib[3], what is the sequence of recursive Fibonacci calls?

Trace[fib[3], fib[_]]

returns a structure that represents the occurrences of the pattern fib[_] in the computational structure:

{fib[3],{fib[2],{fib[1]},{fib[0]}},{fib[1]}}

Declarative programming

In symbolic programming languages, it is easy to have patterns as arguments to functions or as elements of data structures. A consequence of this is the ability to use patterns to declaratively make statements about pieces of data and to flexibly instruct functions how to operate.

For instance, the Mathematica function Compile can be used to make more efficient versions of the code. In the following example the details do not particularly matter; what matters is that the subexpression {{com[_], Integer}} instructs Compile that expressions of the form com[_] can be assumed to be integers for the purposes of compilation:

com[i_] := Binomial[2i, i]
Compile[{x, {i, _Integer}}, x^com[i], {{com[_], Integer}}]

Mailboxes in Erlang also work this way.

The Curry-Howard correspondence between proofs and programs relates ML-style pattern matching to case analysis and proof by exhaustion.

Pattern matching and strings

By far the most common form of pattern matching involves strings of characters. In many programming languages, a particular syntax of strings is used to represent regular expressions, which are patterns describing string characters.

However, it is possible to perform some string pattern matching within the same framework that has been discussed throughout this article.

Tree patterns for strings

In Mathematica, strings are represented as trees of root StringExpression and all the characters in order as children of the root. Thus, to match "any amount of trailing characters", a new wildcard ___ is needed in contrast to _ that would match only a single character.

In Haskell and functional programming languages in general, strings are represented as functional lists of characters. A functional list is defined as an empty list, or an element constructed on an existing list. In Haskell syntax:
[] -- an empty list
x:xs -- an element x constructed on a list xs

The structure for a list with some elements is thus element:list. When pattern matching, we assert that a certain piece of data is equal to a certain pattern. For example, in the function:

head (element:list) = element


we assert that the first element of head's argument is called element, and the function returns this. We know that this is the first element because of the way lists are defined, a single element constructed onto a list. This single element must be the first. The empty list would not match the pattern at all, as an empty list does not have a head (the first element that is constructed).

In the example, we have no use for list, so we can disregard it, and thus write the function:

head (element:_) = element


The equivalent Mathematica transformation is expressed as

head[element_, ___]:=element
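
Because the pattern element:_ does not match the empty list, the Haskell head above is a partial function. One common way to cover the missing case is to add an alternative for it; the name safeHead below is hypothetical.

```haskell
-- A total variant: the empty list gets its own alternative instead of
-- causing a match failure at run time.
safeHead :: [a] -> Maybe a
safeHead []            = Nothing
safeHead (element : _) = Just element
```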

Example string patterns

In Mathematica, for instance,

StringExpression["a", _]

will match a string that has two characters and begins with "a".

The same pattern in Haskell:

['a', _]


Symbolic entities can be introduced to represent many different classes of relevant features of a string. For instance,

StringExpression[LetterCharacter, DigitCharacter]

will match a string that consists of a letter first, and then a number.

In Haskell, guards could be used to achieve the same matches:

[letter, digit] | isAlpha letter && isDigit digit
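
The fragment above is only a pattern with a guard. Placed in a complete definition it might look as follows; the function name letterThenDigit is hypothetical, while isAlpha and isDigit come from the standard Data.Char module.

```haskell
import Data.Char (isAlpha, isDigit)

-- True exactly for two-character strings made of a letter followed by
-- a digit; every other string falls through to the wildcard case.
letterThenDigit :: String -> Bool
letterThenDigit [letter, digit]
  | isAlpha letter && isDigit digit = True
letterThenDigit _ = False
```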


The main advantage of symbolic string manipulation is that it can be completely integrated with the rest of the programming language, rather than being a separate, special purpose subunit. The entire power of the language can be leveraged to build up the patterns themselves or analyze and transform the programs that contain them.

SNOBOL

SNOBOL (String Oriented Symbolic Language) is a computer programming language developed between 1962 and 1967 at AT&T Bell Laboratories by David J. Farber, Ralph E. Griswold and Ivan P. Polonsky.

SNOBOL4 stands apart from most programming languages by having patterns as a first-class data type (i.e. a data type whose values can be manipulated in all ways permitted to any other data type in the programming language) and by providing operators for pattern concatenation and alternation. Strings generated during execution can be treated as programs and executed.
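
The idea of patterns as ordinary values with concatenation and alternation operators can be imitated in Haskell. The sketch below is not SNOBOL and does not reproduce its semantics; the type Pattern and the names lit, cat, alt, matchesAll, greeting and balanced are all hypothetical.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (maybeToList)

-- A pattern maps an input to every possible remainder after a match;
-- an empty result list means the pattern failed.
type Pattern = String -> [String]

-- Match a literal string at the start of the input.
lit :: String -> Pattern
lit s input = maybeToList (stripPrefix s input)

-- Concatenation: match p, then match q on whatever p left over.
cat :: Pattern -> Pattern -> Pattern
cat p q input = concatMap q (p input)

-- Alternation: collect the results of p, then those of q.
alt :: Pattern -> Pattern -> Pattern
alt p q input = p input ++ q input

-- True if the pattern can consume the whole input.
matchesAll :: Pattern -> String -> Bool
matchesAll p input = any null (p input)

-- Patterns are plain values, so they can be named and combined.
greeting :: Pattern
greeting = (lit "hello" `alt` lit "hi") `cat` lit " world"

-- A recursive pattern for balanced parentheses, a language that
-- regular expressions in the formal sense cannot describe.
balanced :: Pattern
balanced = (lit "(" `cat` balanced `cat` lit ")" `cat` balanced) `alt` lit ""
```

Representing a pattern as a function that returns all remainders is one simple design; it makes alternation a list append and gives backtracking for free through lazy lists.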

SNOBOL was quite widely taught in larger US universities in the late 1960s and early 1970s and was widely used in the 1970s and 1980s as a text manipulation language in the humanities.

Since SNOBOL's creation, newer languages such as Awk and Perl have made string manipulation by means of regular expressions fashionable. SNOBOL4 patterns, however, subsume BNF grammars, which are equivalent to context-free grammars and more powerful than regular expressions.

See also

  • AIML for an AI language based on matching patterns in speech
  • AWK language
  • Coccinelle pattern matches C source code
  • glob (programming)
  • Pattern recognition for fuzzy patterns
  • PCRE Perl Compatible Regular Expressions, a common modern implementation of string pattern matching ported to many languages
  • REBOL parse dialect for pattern matching used to implement language dialects
  • Tom (pattern matching language)
  • SNOBOL for a programming language based on one kind of pattern matching
  • Unification, a similar concept in Prolog



The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 