Parsing expression grammar - AbsoluteAstronomy.com

A parsing expression grammar, or PEG, is a type of analytic formal grammar

Formal grammar

A formal grammar is a set of formation rules for strings in a formal language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax...

, i.e. it describes a formal language

Formal language

A formal language is a set of words—that is, finite strings of letters, symbols, or tokens that are defined in the language. The set from which these letters are taken is the alphabet over which the language is defined. A formal language is often defined by means of a formal grammar...

in terms of a set of rules for recognizing strings

String (computer science)

In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing language

Top-down parsing language

Top-Down Parsing Language is a type of analytic formal grammar developed by Alexander Birman in the early 1970s in order to study formally the behavior of a common class of practical top-down parsers that support a limited form of backtracking...

s introduced in the early 1970s.
Syntactically, PEGs also look similar to context-free grammar

Context-free grammar

In formal language theory, a context-free grammar is a formal grammar in which every production rule is of the formwhere V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals ....

s (CFGs), but they have a different interpretation: the rules of PEGs are ordered. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser

Recursive descent parser

A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures where each such procedure usually implements one of the production rules of the grammar...

.

Unlike CFGs, PEGs cannot be ambiguous

Ambiguous grammar

In computer science, a context-free grammar is said to be an ambiguous grammar if there exists a string that can be generated by the grammar in more than one way...

; if a string parses, it has exactly one valid parse tree

Parse tree

A concrete syntax tree or parse tree or parsing treeis an ordered, rooted tree that represents the syntactic structure of a string according to some formal grammar. In a parse tree, the interior nodes are labeled by non-terminals of the grammar, while the leaf nodes are labeled by terminals of the...

. This makes PEGs well-suited to parsing computer languages, but not natural language

Natural language

In the philosophy of language, a natural language is any language which arises in an unpremeditated fashion as the result of the innate facility for language possessed by the human intellect. A natural language is typically used for communication, and may be spoken, signed, or written...

Syntax

Formally, a parsing expression grammar consists of:

A finite set N of nonterminal symbols.
A finite set Σ of terminal symbols that is disjoint from N.
A finite set P of parsing rules.
An expression e_S termed the starting expression.

Each parsing rule in P has the form A ← e, where A is a nonterminal symbol and e is a parsing expression. A parsing expression is a hierarchical expression

Expression (mathematics)

In mathematics, an expression is a finite combination of symbols that is well-formed according to rules that depend on the context. Symbols can designate numbers , variables, operations, functions, and other mathematical symbols, as well as punctuation, symbols of grouping, and other syntactic...

similar to a regular expression

Regular expression

In computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...

, which is constructed in the following fashion:

An atomic parsing expression consists of:
- any terminal symbol,
- any nonterminal symbol, or
- the empty string ε.
Given any existing parsing expressions e, e₁, and e₂, a new parsing expression can be constructed using the following operators:
- Sequence: e₁ e₂
- Ordered choice: e₁ / e₂
- Zero-or-more: e*
- One-or-more: e+
- Optional: e?
- And-predicate: &e
- Not-predicate: !e

Semantics

The fundamental difference between context-free grammars and parsing expression grammars is that the PEG's choice operator is ordered. If the first alternative succeeds, the second alternative is ignored. Thus ordered choice is not commutative

Commutativity

In mathematics an operation is commutative if changing the order of the operands does not change the end result. It is a fundamental property of many binary operations, and many mathematical proofs depend on it...

, unlike unordered choice as in context-free grammars and regular expressions. Ordered choice is analogous to soft cut

Cut (logic programming)

The cut, in Prolog, is a goal, written as !, which always succeeds, but cannot be backtracked past. It is best used to prevent unwanted backtracking, for example, to prevent extra solutions being found by Prolog and avoid additional computations that are not desired or required in a program.The cut...

operators available in some logic programming

Logic programming

Logic programming is, in its broadest sense, the use of mathematical logic for computer programming. In this view of logic programming, which can be traced at least as far back as John McCarthy's [1958] advice-taker proposal, logic is used as a purely declarative representation language, and a...

languages.

The consequence is that if a CFG is transliterated directly to a PEG, any ambiguity in the former is resolved by deterministically picking one parse tree from the possible parses. By carefully choosing the order in which the grammar alternatives are specified, a programmer has a great deal of control over which parse tree is selected.

Like boolean context-free grammars

Boolean grammar

Boolean grammars are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with conjunction and negation operations...

, parsing expression grammars also add the and- and not- predicates. These facilitate further disambiguation, if reordering the alternatives cannot specify the exact parse tree desired.

Operational interpretation of parsing expressions

Each nonterminal in a parsing expression grammar essentially represents a parsing function

Function (mathematics)

In mathematics, a function associates one quantity, the argument of the function, also known as the input, with another quantity, the value of the function, also known as the output. A function assigns exactly one output to each input. The argument and the value may be real numbers, but they can...

in a recursive descent parser

Recursive descent parser

A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures where each such procedure usually implements one of the production rules of the grammar...

, and the corresponding parsing expression represents the "code" comprising the function. Each parsing function conceptually takes an input string as its argument, and yields one of the following results:

success, in which the function may optionally move forward or consume one or more characters of the input string supplied to it, or
failure, in which case no input is consumed.

A nonterminal may succeed without actually consuming any input, and this is considered an outcome distinct from failure.

An atomic parsing expression consisting of a single terminal succeeds if the first character of the input string matches that terminal, and in that case consumes the input character; otherwise the expression yields a failure result. An atomic parsing expression consisting of the empty string always trivially succeeds without consuming any input. An atomic parsing expression consisting of a nonterminal A represents a recursive

Recursion

Recursion is the process of repeating items in a self-similar way. For instance, when the surfaces of two mirrors are exactly parallel with each other the nested images that occur are a form of infinite recursion. The term has a variety of meanings specific to a variety of disciplines ranging from...

call to the nonterminal-function A.

The sequence operator e₁ e₂ first invokes e₁, and if e₁ succeeds, subsequently invokes e₂ on the remainder of the input string left unconsumed by e₁, and returns the result. If either e₁ or e₂ fails, then the sequence expression e₁ e₂ fails.

The choice operator e₁ / e₂ first invokes e₁, and if e₁ succeeds, returns its result immediately. Otherwise, if e₁ fails, then the choice operator backtracks

Backtracking

Backtracking is a general algorithm for finding all solutions to some computational problem, that incrementally builds candidates to the solutions, and abandons each partial candidate c as soon as it determines that c cannot possibly be completed to a valid solution.The classic textbook example...

to the original input position at which it invoked e₁, but then calls e₂ instead, returning e₂'s result.

The zero-or-more, one-or-more, and optional operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression e, respectively. Unlike in context-free grammar

Context-free grammar

s and regular expression

Regular expression

s, however, these operators always behave greedily

Greedy algorithm

A greedy algorithm is any algorithm that follows the problem solving heuristic of making the locally optimal choice at each stagewith the hope of finding the global optimum....

, consuming as much input as possible and never backtracking. (Regular expression matchers may start by matching greedily, but will then backtrack and try shorter matches if they fail to match.) For example, the expression a* will always consume as many a's as are consecutively available in the input string, and the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match.

Finally, the and-predicate and not-predicate operators implement syntactic predicate

Syntactic predicate

A syntactic predicate specifies the syntactic validity of applying a production in a formal grammar and is analogous to a semantic predicate that specifies the semantic validity of applying a production. It is a simple and effective means of dramatically improving the recognition strength of an LL...

s. The expression &e invokes the sub-expression e, and then succeeds if e succeeds and fails if e fails, but in either case never consumes any input. Conversely, the expression !e succeeds if e fails and fails if e succeeds, again consuming no input in either case. Because these can use an arbitrarily complex sub-expression e to "look ahead" into the input string without actually consuming it, they provide a powerful syntactic lookahead

Lookahead

Lookahead is a tool in algorithms for looking ahead a few more input items before making a cost effective decision at one stage of the algorithm.- Lookahead in search problems :...

and disambiguation facility.

Examples

This is a PEG that recognizes mathematical formulas that apply the basic four operations to non-negative integers.



Value   ← [0-9]+ / '(' Expr ')'

Product ← Value (('*' / '/') Value)*

Sum     ← Product (('+' / '-') Product)*

Expr    ← Sum

In the above example, the terminal symbols are characters of text, represented by characters in single quotes, such as '(' and ')'. The range [0-9] is also a shortcut for ten characters, indicating any one of the digits 0 through 9. (This range syntax is the same as the syntax used by regular expression

Regular expression

s.) The nonterminal symbols are the ones that expand to other rules: Value, Product, Sum, and Expr.

The examples below drop quotation marks in order to be easier to read. Lowercase letters are terminal symbols, while capital letters in italics are nonterminals. Actual PEG parsers would require the lowercase letters to be in quotes.

The parsing expression (a/b)* matches and consumes an arbitrary-length sequence of a's and b's. The rule S ← a S? b describes the simple context-free

Context-free language

In formal language theory, a context-free language is a language generated by some context-free grammar. The set of all context-free languages is identical to the set of languages accepted by pushdown automata.-Examples:...

"matching language"

. The following parsing expression grammar describes the classic non-context-free language



S ← &(A c) a+ B !(a/b/c)

A ← a A? b

B ← b B? c

The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a context-free grammar

Context-free grammar

, this construct yields the classic dangling else

Dangling else

The dangling else is a problem in computer programming in which a seemingly well-defined statement can become ambiguous. In many programming languages one may write conditionally executed code in two forms: the if-then form, and the if-then-else form:...

ambiguity.)



S ← if C then S else S / if C then S

The parsing expression foo &(bar) matches and consumes the text "foo" but only if it is followed by the text "bar". The parsing expression foo !(bar) matches the text "foo" but only if it is not followed by the text "bar". The expression !(a+ b) a matches a single "a" but only if it is not the first in an arbitrarily-long sequence of a's followed by a b.

The following recursive rule matches Pascal-style nested comment syntax, (* which can (* nest *) like this *). The comment symbols appear in double quotes to distinguish them from PEG operators.



Begin ← "(*"

End   ← "*)"

C     ← Begin N* End

N     ← C / (!Begin !End Z)

Z     ← any single character

Implementing parsers from parsing expression grammars

Any parsing expression grammar can be converted directly into a recursive descent parser

Recursive descent parser

A recursive descent parser is a top-down parser built from a set of mutually-recursive procedures where each such procedure usually implements one of the production rules of the grammar...

. Due to the unlimited lookahead

Lookahead

Lookahead is a tool in algorithms for looking ahead a few more input items before making a cost effective decision at one stage of the algorithm.- Lookahead in search problems :...

capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time performance in the worst case.

It is possible to obtain better performance for any parsing expression grammar by converting its recursive descent parser into a packrat parser, which always runs in linear time, at the cost of substantially greater storage space requirements. A packrat parser
is a form of parser

Parsing

In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens , to determine its grammatical structure with respect to a given formal grammar...

similar to a recursive descent parser in construction, except that during the parsing process it memoizes

Memoization

In computing, memoization is an optimization technique used primarily to speed up computer programs by having function calls avoid repeating the calculation of results for previously processed inputs...

the intermediate results of all invocations of the mutually recursive

Mutual recursion

Mutual recursion is a form of recursion where two mathematical or computational functions are defined in terms of each other.For instance, consider two functions even? and odd? defined as follows: function even?...

parsing functions, ensuring that each parsing function is only invoked at most once at a given input position. Because of this memoization, a packrat parser has the ability to parse many context-free grammar

Context-free grammar

s and any parsing expression grammar (including some that do not represent context-free languages) in linear time. Examples of memoized recursive descent parsers are known from at least as early as 1993.
Note that this analysis of the performance of a packrat parser assumes that enough memory is available to hold all of the memoized results; in practice, if there were not enough memory, some parsing functions might have to be invoked more than once at the same input position, and consequently the parser could take more than linear time.

It is also possible to build LL parser

LL parser

An LL parser is a top-down parser for a subset of the context-free grammars. It parses the input from Left to right, and constructs a Leftmost derivation of the sentence...

s and LR parser

LR parser

In computer science, an LR parser is a parser that reads input from Left to right and produces a Rightmost derivation. The term LR parser is also used; where the k refers to the number of unconsumed "look ahead" input symbols that are used in making parsing decisions...

s from parsing expression grammars, with better worst-case performance than a recursive descent parser, but the unlimited lookahead capability of the grammar formalism is then lost. Therefore, not all languages that can be expressed using parsing expression grammars can be parsed by LL or LR parsers.

Advantages

Compared to pure regular expressions (i.e. without back-references), PEGs are strictly more powerful, but require significantly more memory. For example, a regular expression inherently cannot find an arbitrary number of matched pairs of parentheses, because it is not recursive, but a PEG can. However, a PEG will require an amount of memory proportional to the length of the input, while a regular expression matcher will require only a constant amount of memory.

Any PEG can be parsed in linear time by using a packrat parser, as described above.

Parsers for languages expressed as a CFG, such as LR parsers, require a separate tokenization

Tokenization

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining...

step to be done first, which breaks up the input based on the location of spaces, punctuation, etc. The tokenization is necessary because of the way these parsers use lookahead to parse CFGs that meet certain requirements in linear time. PEGs do not require tokenization to be a separate step, and tokenization rules can be written in the same way as any other grammar rule.

Many CFGs contain ambiguities, even when they're intended to describe unambiguous languages. The "dangling else

Dangling else

" problem in C, C++, and Java is one example. These problems are often resolved by applying a rule outside of the grammar. In a PEG, these ambiguities never arise, because of prioritization.

Memory consumption

PEG parsing is typically carried out via packrat parsing, which uses memoization

Memoization

to eliminate redundant parsing steps. Packrat parsing requires storage proportional to the total input size, rather than the depth of the parse tree as with LR parsers. This is a significant difference in many domains: for example, hand-written source code has an effectively constant expression nesting depth independent of the length of the program—expressions nested beyond a certain depth tend to get refactored.

For some grammars and some inputs, the depth of the parse tree can be proportional to the input size,

so both an LR Parser and a packrat parser will appear to have the same worst-case asymptotic performance. A more accurate analysis would take the depth of the parse tree into account separately from the input size. This is similar to a situation which arises in graph algorithms: the Bellman–Ford algorithm and Floyd–Warshall algorithm appear to have the same running time (

) if only the number of vertices is considered. However, a more accurate analysis which accounts for the number of edges as a separate parameter assigns the Bellman–Ford algorithm a time of

, which is only quadratic in the size of the input (rather than cubic).

Indirect left recursion

PEGs cannot express left-recursive

Left recursion

In computer science, left recursion is a special case of recursion.In terms of context-free grammar, a non-terminal r is left-recursive if the left-most symbol in any of r’s ‘alternatives’ either immediately or through some other non-terminal definitions rewrites to r again.- Definition :"A...

rules where a rule refers to itself without moving forward in the string. For example, in the arithmetic grammar above, it would be tempting to move some rules around so that the precedence order of products and sums could be expressed in one line:



Value   ← [0-9.]+ / '(' Expr ')'

Product ← Expr (('*' / '/') Expr)*

Sum     ← Expr (('+' / '-') Expr)*

Expr    ← Product / Sum / Value

In this new grammar matching an Expr requires testing if a Product matches while matching a Product requires testing if an Expr matches. This circular definition

Circular definition

A circular definition is one that uses the term being defined as a part of the definition or assumes a prior understanding of the term being defined. Either the audience must already know the meaning of the key term, or the definition is deficient in including the term to be defined in the...

cannot be resolved. However, left-recursive rules can always be rewritten to eliminate left-recursion. For example, the following left-recursive CFG rule:



string-of-a ← string-of-a 'a' | 'a'

can be rewritten in a PEG using the plus operator:



string-of-a ← 'a'+

The process of rewriting indirectly left-recursive rules is complex in some packrat parsers, especially when semantic actions are involved.

With some modification, traditional packrat parsing can support direct left recursion.
, but doing so results in a loss of the linear-time parsing property which is generally the justification for using PEGs and Packrat Parsing in the first place. Only the OMeta parsing algorithm supports full direct and indirect left recursion without additional attendant complexity (but again, at a loss of the linear time complexity), whereas all GLR parser

GLR parser

A GLR parser is an extension of an LR parser algorithm to handle nondeterministic and ambiguous grammars. First described in a 1984 paper by Masaru Tomita, it has also been referred to as a "parallel parser"...

s support left recursion.

Subtle grammatical errors

In order to express a grammar as a PEG, the grammar author must convert all instances of nondeterministic choice into prioritized choice. Unfortunately, this process is error prone, and often results in grammars which mis-parse certain inputs. An example and commentary can be found here.

Expressive power

Packrat parsers cannot recognize some unambiguous grammars, such as the following (example taken from)



S ← 'x' S 'x' | 'x'

In fact, neither LL(k) nor LR(k) parsing algorithms are capable of recognizing this example.

Maturity

PEGs are new and not widely implemented. By contrast, regular expressions and CFGs have been around for decades, the code to parse them has been extensively optimized, and many programmers are familiar with how to use them. This is not all that compelling an objection to the adoption and use of PEGs, of course, since there was a time when the same considerations applied to the older approaches as well.

External links

Parsing Expression Grammars: A Recognition-Based Syntactic Foundation (PDF slides)
The Packrat Parsing and Parsing Expression Grammars Page
Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking
The constructed language
Constructed language
A planned or constructed language—known colloquially as a conlang—is a language whose phonology, grammar, and/or vocabulary has been consciously devised by an individual or group, instead of having evolved naturally...

Lojban
Lojban
See also discussed by Arthur Protin, Bob LeChevalier, Carl Burke, Doug Landauer, Guy Steele, Jack Waugh, Jeff Prothero, Jim Carter, and Robert Chassell, as well as , the concepts which "average English speakers won't recognize" because most of them "have no exact English counterpart".Like most...

has a fairly large PEG grammar allowing unambiguous parsing of Lojban text.
The Aurochs parser generator has an on-line parsing demo that displays the parse tree for any given grammar and input
REBOL parse dialect is PEG-compatible
Parboiled is a mixed Java/Scala library for building PEG-based parsers
Mouse transcribes PEG into a set of recursive Java procedures

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.

Syntax

Semantics

Operational interpretation of parsing expressions

Examples

Implementing parsers from parsing expression grammars

Advantages

Memory consumption

Indirect left recursion

Subtle grammatical errors

Expressive power

Maturity

See also

External links