Perl Compatible Regular Expression - AbsoluteAstronomy.com

Perl Compatible Regular Expressions (PCRE) is a regular expression C library inspired by Perl's external interface, written by Philip Hazel. PCRE's syntax is much more powerful and flexible than either of the POSIX regular expression flavors and many classic regular expression libraries. The name is misleading, because PCRE is Perl-compatible only if you consider a subset of PCRE's settings and a subset of Perl's regular expression facilities.

The PCRE library is incorporated into a number of prominent open-source programs, such as the Apache HTTP Server and the PHP and R scripting languages. Because it is distributed under the BSD license, it can also be incorporated into proprietary software. As of Perl 5.10, PCRE is also available as a replacement for Perl's default regular expression engine through the re::engine::PCRE module.

The library can be built using configure and make (typical of Unix-like environments), as well as in Unix, Windows and other environments using CMake. Numerous default settings are selected at build time. In addition to the PCRE library, a POSIX C wrapper, a Google-contributed native C++ wrapper, several test programs, and the utility program pcregrep are also included in the distribution and are built in tandem with the library. The PCRE library provides matching only; the C++ wrapper, if used, adds multiple match and replacement functionality.

Unless the "NoRecurse" PCRE build option (aka "--disable-stack-for-recursion") is chosen, adequate stack space must be allocated to PCRE by the calling application or operating system. The amount of stack needed varies for each pattern. For example, completing the tests provided with pcretest requires 8 MB of stack space. While PCRE's documentation cautions that the "NoRecurse" build option makes PCRE slower than the alternative, using it avoids the issue of stack overflows entirely.

Features

PCRE has developed an extensive and in some ways unique feature set. While it originally aimed at feature-equivalence with Perl, over time a number of features have been first implemented in PCRE and only much later added to Perl. During the PCRE 7.x and Perl 5.9.x (development track) phase the two projects have coordinated development and are to the extent possible feature equivalent. In some cases PCRE has included in mainline releases features that originated with Perl 5.9.x and in some cases Perl 5.9.x has included features that were previously only available in PCRE.

PCRE includes the following features:

Just-in-time compiler support: Version 8.20 includes Zoltan Herczeg's just-in-time compiler support, which is available if optionally enabled when the PCRE library is built. Large performance benefits are expected when (for example) the calling program uses the feature with compatible patterns that are executed repeatedly.

Consistent escaping rules: Like Perl, PCRE has consistent escaping rules: any non-alphanumeric character may be escaped to mean its literal value by prefixing it with a \ (backslash), and conversely, an alphanumeric character preceded by a backslash typically has a special meaning. Where the sequence has not been defined to be special, it is also treated as a literal; however, this usage is not forward compatible, as new versions of PCRE may give such sequences a special meaning. A good example of this is \R, which has no special meaning prior to PCRE 7. In POSIX regular expressions, a backslash sometimes escapes a non-alphanumeric character (e.g. \.) and sometimes introduces a special feature (e.g. \(\)).
Extended character classes: Single-letter character classes are supported in addition to the longer POSIX names. For example, \d matches any digit exactly as [[:digit:]] would in POSIX regular expressions.
Minimal matching (a.k.a. "ungreedy"): A ? may be placed after any repeat count to indicate that the shortest match should be used. The default is to attempt the longest match first, and backtrack through shorter matches. For example, "a.*?b" would match "ab" in "ababab", where "a.*b" would match the entire string.
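The difference between the two quantifier forms is easy to check. The sketch below uses Python's built-in re module rather than PCRE itself, on the assumption that both engines treat this particular syntax the same way; the variable names are illustrative only.

```python
import re

subject = "ababab"

# Minimal ("ungreedy") match: stop at the first "b" that completes the pattern.
lazy = re.search(r"a.*?b", subject).group()

# Default greedy match: take as much as possible, backtracking only if needed.
greedy = re.search(r"a.*b", subject).group()

print(lazy)    # ab
print(greedy)  # ababab
```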
Unicode character properties: Unicode defines several properties for each character. Patterns in PCRE can match these properties. For example, \p{Ps}.*?\p{Pe} would match a string beginning with any "opening punctuation" and ending with any "close punctuation", such as "[abc]". Since version 8.10, matching of certain "normal" metacharacters can be driven by Unicode properties when the compile option PCRE_UCP is set. The option can be set for a pattern by including (*UCP) at the start of the pattern. The option alters the behavior of the following metacharacters: \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. For example, the set of characters matched by \w (word characters) is expanded to include letters and accented letters as defined by Unicode properties. Such matching is slower than the normal (ASCII-only) non-UCP alternative. Note that the UCP option requires the PCRE library to have been built to include Unicode property support.
Multiline matching: ^ and $ can match at the beginning and end of a string only, or at the start and end of each "line" within the string, depending on what options are set.
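As a rough illustration of the two anchoring modes, the following sketch uses Python's re module, whose re.MULTILINE flag is assumed here to play the same role as PCRE's multiline option. It is not PCRE code, and the sample text is made up.

```python
import re

text = "one\ntwo\nthree"

# Without the multiline option, ^ and $ anchor only to the whole string,
# so a pattern that allows no newline cannot match this three-line text.
whole = re.findall(r"^\w+$", text)

# With the option set, ^ and $ also match at the start and end of each line.
per_line = re.findall(r"^\w+$", text, flags=re.MULTILINE)

print(whole)     # []
print(per_line)  # ['one', 'two', 'three']
```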
Newline/linebreak options: When PCRE is compiled, a newline default is selected. The newline/linebreak convention in effect determines where PCRE detects ^ line beginnings and $ line ends (in multiline mode), as well as what the dot matches (regardless of multiline mode, unless the dotall (?s) option is set). It also affects PCRE's matching procedure (since version 7.0): when an unanchored pattern fails to match at the start of a newline sequence, PCRE advances past the entire newline sequence before retrying the match. If the newline option in effect includes CRLF as one of the valid linebreaks, PCRE does not skip the \n in a CRLF if the pattern contains specific \r or \n references (since version 7.3). Since version 8.10, the metacharacter \N always matches any character other than linebreak characters. It has the same behavior as "." when the dotall option, (?s), is not in effect.

The newline option can be altered with external options when a pattern is compiled as well as when it is run. Few applications using PCRE provide users with the means to apply this setting via an external option, so since version 7.3 the newline option can also be stated at the start of the pattern using one of the following:

(*LF) Newline is a linefeed character. Corresponding linebreaks can be matched with \n.

(*CR) Newline is a carriage return. Corresponding linebreaks can be matched with \r.

(*CRLF) Newline/linebreak is a carriage return followed by a linefeed. Corresponding linebreaks can be matched with \r\n.

(*ANYCRLF) Any of the above encountered in the data will trigger newline processing. Corresponding linebreaks can be matched with (?>\r\n|[\r\n]) or with \R. See below for configuration and options concerning what matches Backslash-R.

(*ANY) Any of the above plus special Unicode linebreaks. When not in UTF-8 mode, corresponding linebreaks can be matched with (?>\r\n|\n|\x0b|\f|\r|\x85) or \R. In UTF-8 mode, two additional characters are recognized as line breaks with (*ANY): LS (line separator, U+2028), and PS (paragraph separator, U+2029). On Windows, in non-Unicode data, some of the ANY linebreak characters have other meanings. For example, \x85 can match a horizontal ellipsis, and if encountered while the ANY newline is in effect, it would trigger newline processing. See below for configuration and options concerning what matches Backslash-R.
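To make the (*ANYCRLF) set of linebreaks concrete, here is a small sketch in Python's re module, not PCRE. It assumes the plain alternation \r\n|[\r\n] as a stand-in for (?>\r\n|[\r\n]): the atomic group is dropped because older Python versions do not support it, and nothing follows the alternation here that could force backtracking into it. The sample data is hypothetical.

```python
import re

# Hypothetical sample mixing the three conventions covered by (*ANYCRLF).
data = "alpha\r\nbeta\ngamma\rdelta"

# \r\n is listed first so that a CRLF pair counts as one linebreak, not two.
ANYCRLF = re.compile(r"\r\n|[\r\n]")

lines = ANYCRLF.split(data)
breaks = ANYCRLF.findall(data)

print(lines)   # ['alpha', 'beta', 'gamma', 'delta']
print(breaks)  # ['\r\n', '\n', '\r']
```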

Backslash-R options: New in version 7.4: When PCRE is compiled, a default is selected for what matches \R. The default can be either to match the linebreaks associated with ANYCRLF or those corresponding to ANY. The default can be overridden when necessary by including (*BSR_UNICODE) or (*BSR_ANYCRLF) at the start of the pattern. When providing a (*BSR..) option, you can also provide a (*newline) option, e.g., (*BSR_UNICODE)(*ANY)rest-of-pattern. The Backslash-R options can also be changed with external options by the application calling PCRE, when a pattern is compiled as well as when it is run.
Beginning of pattern options: Linebreak options such as (*LF), documented above; Backslash-R options such as (*BSR_ANYCRLF), documented above; the Unicode Character Properties option (*UCP), documented above; and the (*UTF8) option, documented as follows: Since version 7.9, if your PCRE library has been compiled with UTF-8 support, you can specify the (*UTF8) option at the beginning of a pattern instead of setting an external option to invoke UTF-8 mode.
Named subpatterns: A subpattern (surrounded by parentheses, like (...)) may be named by including a leading "?P<name>" after the opening parenthesis. Named subpatterns are a feature that PCRE adopted from Python regular expressions. Since PCRE 7.0, named groups can be defined using (?<name>...) or (?'name'...) as well as (?P<name>...). Named groups can then be invoked with, for example, (?&name) or (?P>name).
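Since the (?P<name>...) form originated in Python, it can be tried directly in Python's re module. The sketch below is only an illustration of naming a group and referring back to it by name with (?P=name); the group names and the subject string are invented for the example.

```python
import re

# (?P<quote>...) names the opening quote character; (?P=quote) then requires
# the same character again, so the closing quote must match the opening one.
pattern = re.compile(r"(?P<quote>['\"])(?P<body>.*?)(?P=quote)")

m = pattern.search("say 'hello' now")
quote = m.group("quote")
body = m.group("body")

print(quote)  # '
print(body)   # hello
```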
Backreferences: A pattern may refer back to the results of a previous match. For example, (a|b)c\1 would match "a" or "b" followed by a "c". Then it would look for the same character (an "a" or a "b") that matched in the first subpattern.
Subroutines: While a backreference provides a mechanism to refer to that part of the subject that has previously matched a subpattern, a subroutine provides a mechanism to reuse an underlying previously defined subpattern. The subpattern's options, such as case independence, are fixed when the subpattern is defined. (a.c)(?1) would match aacabc or abcadc, whereas using a backreference (a.c)\1 would not, though both would match aacaac or abcabc. Starting with version 7.7, PCRE also supports a non-Perl Oniguruma construct for subroutines. They are specified using \g<subpat-number> or \g<subpat-name>.
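The contrast between a subroutine call and a backreference can be sketched in Python, with one caveat: Python's re module has no (?1) construct, so the example below substitutes plain textual reuse of the subpattern source for the subroutine call. That is an assumption made for illustration, not how PCRE implements it, and the names SUB, as_subroutine and as_backref are hypothetical.

```python
import re

SUB = r"a.c"  # the subpattern being reused

# Stand-in for (a.c)(?1): the subpattern itself is applied a second time.
as_subroutine = re.compile(rf"({SUB})(?:{SUB})")

# (a.c)\1: the text matched the first time must appear again verbatim.
as_backref = re.compile(rf"({SUB})\1")

subjects = ["aacabc", "abcadc", "aacaac", "abcabc"]
sub_hits = [s for s in subjects if as_subroutine.fullmatch(s)]
ref_hits = [s for s in subjects if as_backref.fullmatch(s)]

print(sub_hits)  # ['aacabc', 'abcadc', 'aacaac', 'abcabc']
print(ref_hits)  # ['aacaac', 'abcabc']
```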

Atomic grouping: Atomic grouping is a way of preventing backtracking in a pattern. For example, a++bc will match as many "a"s as possible, and never back up to try one less.
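The effect of refusing to "back up" shows most clearly on a pattern that only matches if the engine gives a character back. The sketch below runs in Python's re module and deliberately avoids the a++ syntax, which only newer Python versions accept. Instead it emulates an atomic group with a look-ahead capture followed by a backreference, a swapped-in technique that is assumed to behave like PCRE's atomic grouping for this simple case.

```python
import re

subject = "aaab"

# Ordinary greedy a+ gives back one "a" so that the trailing "ab" can match.
backtracking = re.search(r"a+ab", subject)

# Emulated atomic group: the look-ahead captures the longest run of "a"s,
# and \1 then consumes exactly that text with no way to shorten it,
# so no "a" is left over for the literal "ab" that follows.
atomic = re.search(r"(?=(a+))\1ab", subject)

print(backtracking.group())  # aaab
print(atomic)                # None
```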
Look-ahead and look-behind assertions: Patterns may assert that previous text or subsequent text contains a pattern without consuming matched text (zero-width assertion). For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab.

Look-behind assertions cannot be of uncertain length.

Since version 7.2, \K can be used in a pattern to reset the start of the current whole match. This provides a flexible alternative approach to look-behind assertions because the discarded part of the match (the part that precedes \K) need not be fixed in length.
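The assertions described above can be tried in Python's re module, which is assumed here to share the look-ahead and look-behind syntax. Python has no \K, and its fixed-width rule for look-behind is its own rather than PCRE's, so the last step only illustrates the general kind of restriction. The subject string is invented.

```python
import re

subject = "key\tvalue"

# Look-ahead: match the word before a tab without consuming the tab.
before_tab = re.search(r"\w+(?=\t)", subject)

# Fixed-length look-behind: match the word that follows a tab.
after_tab = re.search(r"(?<=\t)\w+", subject)

# A look-behind of uncertain length is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=\t+)\w+")
    variable_ok = True
except re.error:
    variable_ok = False

print(before_tab.group(), before_tab.span())  # key (0, 3)
print(after_tab.group())                      # value
print(variable_ok)                            # False
```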

Escape sequences for zero-width assertions: E.g. \b for matching zero-width "word boundaries", similar to (?<=\W)(?=\w)|(?<=\w)(?=\W).
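The word "similar" matters here: the spelled-out form needs a character on both sides, so it does not fire at the very start or end of the subject, whereas \b does. A quick sketch in Python's re module, assumed to agree with PCRE on these constructs, lists the positions where each one matches.

```python
import re

subject = "ab cd"
spelled_out = r"(?<=\W)(?=\w)|(?<=\w)(?=\W)"

# Positions of every zero-width match for each pattern.
boundary_positions = [m.start() for m in re.finditer(r"\b", subject)]
spelled_positions = [m.start() for m in re.finditer(spelled_out, subject)]

print(boundary_positions)  # [0, 2, 3, 5]
print(spelled_positions)   # [2, 3]
```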
Comments: A comment begins with (?# and ends at the next closing parenthesis.
Recursive patterns: A pattern can refer back to itself recursively or to any subpattern. For example, the pattern "\((a*|(?R))*\)" will match any combination of balanced parentheses and "a"s.
Generic callouts: PCRE expressions can embed "(?Cn)" where n is some number. This will call out to an external, user-defined function through the PCRE API, and can be used to embed arbitrary code in a pattern.
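To show what the recursive pattern \((a*|(?R))*\) describes, the sketch below writes the same recursion out by hand in Python. Python's re module has no (?R), so this is not a regular expression and not PCRE's algorithm, only a hypothetical recursive matcher (match_balanced and is_balanced are names made up for this example) that accepts the same strings.

```python
def match_balanced(s, i=0):
    """Return the index just past one balanced group starting at s[i], or None."""
    if i >= len(s) or s[i] != "(":
        return None
    i += 1
    while i < len(s):
        if s[i] == "a":            # the a* alternative
            i += 1
        elif s[i] == "(":          # the (?R) alternative: recurse
            end = match_balanced(s, i)
            if end is None:
                return None
            i = end
        else:
            break
    if i < len(s) and s[i] == ")":
        return i + 1
    return None


def is_balanced(s):
    """True if the whole string is one balanced group of parentheses and a's."""
    return match_balanced(s) == len(s)


print(is_balanced("(aa(a)())"))  # True
print(is_balanced("(a(a)"))      # False
```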

Differences from Perl

PCRE differs from Perl's regular expressions (as of Perl 5.9.4) in the following aspects of external behaviour:

Recursive matches are atomic in PCRE and non-atomic in Perl: this means that "<!>!>><>>!>!>!>" =~ /^(<(?:[^<>]+|(?3)|(?1))*>)(!>!>!>)$/ will match in Perl but not in PCRE.
The value of a capture buffer deriving from the ? quantifier (match 1 or 0 times) when nested in another quantified capture buffer is different: "aba" =~ /^(a(b)?)+$/; will result in $1 containing 'a' and $2 containing undef in Perl, but in PCRE will result in $2 containing 'b'.
PCRE allows named capture buffers to be given numeric names, whereas Perl requires the name to follow the rule of barewords: this means that \g{} is unambiguous in Perl, but potentially ambiguous in PCRE.
PCRE does not support certain "experimental" constructs in Perl, such as (??{...}) (a callback whose return value is evaluated as being part of the pattern) and the (?{}) construct, although the latter can be emulated using (?Cn). Recursion control verbs added in the Perl 5.9.x series are also not supported. Support for experimental backtracking control verbs (added in Perl 5.10) has been available in PCRE since version 7.3. They are (*FAIL), (*F), (*PRUNE), (*SKIP), (*THEN), (*COMMIT), and (*ACCEPT). Perl's corresponding use of arguments with backtracking control verbs is not generally supported. Note, however, that since version 8.10 PCRE supports the following verbs with a specified argument: (*MARK:markName), (*SKIP:markName), (*PRUNE:markName), and (*THEN:markName).
PCRE and Perl differ slightly in their tolerance of erroneous constructs: for example, Perl allows quantifiers on the (?!) construct, which is meaningless but harmless (albeit inefficient), whereas PCRE produces an error. (Such assertions can be harmlessly quantified in PCRE beginning with version 8.13, so this example applies only to earlier versions.)
PCRE has a hard limit on recursion depth, whereas Perl does not: with default build options, "bbbbXcXaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /.X(.+)+X/ will fail to match due to stack overflow, but Perl will match this correctly. Perl uses the heap for recursion and has no hard limit for recursion depth, whereas PCRE has a compile-time hard limit.

With the exception of the above points PCRE is capable of passing the tests in the Perl 't/op/re_tests' file, one of the main syntax level regression tests for Perl's regular expression engine.

External links

PCRE home page
PCRE documentation in Windows chm help file format: [ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Contrib/pcre-8.00.chm PCRE-8.00], [ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Contrib/pcre-7.1.chm PCRE-7.1], [ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Contrib/pcre-7.2.chm PCRE-7.2], [ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Contrib/pcre-7.4.chm PCRE-7.4]

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.