Maximal munch
Encyclopedia
In computer programming
Computer programming
Computer programming is the process of designing, writing, testing, debugging, and maintaining the source code of computer programs. This source code is written in one or more programming languages. The purpose of programming is to create a program that performs specific operations or exhibits a...

 and computer science
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

, "maximal munch" or "longest match" is the principle that when creating some construct, as much of the available input as possible should be consumed. It is similar, though not entirely identical, to the concept of a greedy algorithm
Greedy algorithm
A greedy algorithm is any algorithm that follows the problem solving heuristic of making the locally optimal choice at each stagewith the hope of finding the global optimum....

.

Application

For instance, the lexical syntax
Lexical grammar
In computer science, a lexical grammar can be thought of as the syntax of tokens. That is, the rules governing how a character sequence is divided up into subsequences of characters, each part of which represents an individual token....

 of many programming languages requires that tokens be built from the maximum possible number of characters from the input stream. This is done to resolve the problem of inherent ambiguity in commonly used regular expression
Regular expression
In computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...

s such as [a-z]+ (one or more lower-case letters).

The term has also been used in the field of compilers to describe a process of "tiling" — determining how a structured tree representing a program in an intermediate language
Intermediate language
In computer science, an intermediate language is the language of an abstract machine designed to aid in the analysis of computer programs. The term comes from their use in compilers, where a compiler first translates the source code of a program into a form more suitable for code-improving...

 should be converted into linear machine code
Machine code
Machine code or machine language is a system of impartible instructions executed directly by a computer's central processing unit. Each instruction performs a very specific task, typically either an operation on a unit of data Machine code or machine language is a system of impartible instructions...

. An entire subtree might be converted into just one machine instruction, and the problem is how to split the tree into non-overlapping "tiles," each representing one machine instruction. An effective strategy is simply to make a tile of the largest subtree possible at any given point, which is called "maximal munch."

Drawbacks

In some situations, "maximal munch" leads to undesirable or unintuitive outcomes. For example, C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

 uses the "angle bracket" characters < and > in the syntax for template specialization
Template (programming)
Templates are a feature of the C++ programming language that allow functions and classes to operate with generic types. This allows a function or class to work on many different data types without being rewritten for each one....

, but two consecutive > characters are interpreted as the right-shift operator >>. Prior to C++11, the following code would produce a parse error, because the right-shift operator token is encountered instead of two right-angle-bracket tokens:


// This is not valid syntax in standard C++.
std::vector> incorrect;
// Adding whitespace between the brackets yields the correct parse.
std::vector > correct;


The C++11 standard adopted in August 2011 amended the grammar so that a right-shift token is accepted as synonymous with a pair of right-angle brackets, (as in Java
Java (programming language)
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...

,) which complicates the grammar but allows the continued use of the maximal munch principle.

Alternatives

Programming languages researchers have also responded by replacing or supplementing the principle of maximal munch with other lexical disambiguation tactics. One approach is to utilize "follow restrictions," which instead of directly taking the longest match will put some restrictions on what characters can follow a valid match. For example, stipulating that strings matching [a-z]+ cannot be followed by an alphabetic character achieves the same effect as maximal munch with that regular expression. Another approach is to keep the principle of maximal munch but make it subordinate to some other principle, such as context (e.g., the right-shift token in Java would not be matched in the context of a template expression, where it is syntactically invalid).
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK