Wildmat
wildmat is a pattern matching library developed by Rich Salz. Based on the wildcard syntax already used in the Bourne shell, wildmat provides a uniform mechanism for matching patterns across applications with simpler syntax than that typically offered by regular expressions. Patterns are implicitly anchored at the beginning and end of each string when testing for a match.
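
The effect of implicit anchoring can be illustrated with a short sketch. It uses Python's standard fnmatch module as a stand-in, since fnmatch implements a similar shell-style wildcard syntax and is also anchored at both ends; it is not wildmat itself, and the helper name anchored_match is only illustrative.

```python
import fnmatch
import re


def anchored_match(text: str, pattern: str) -> bool:
    """Shell-style match: the pattern must cover the whole string."""
    return fnmatch.fnmatchcase(text, pattern)


# An anchored pattern has to account for every character of the text.
print(anchored_match("minimum", "mini"))    # False: "mum" is left over
print(anchored_match("minimum", "mini*"))   # True: '*' absorbs the rest

# An unanchored regular expression search, by contrast, accepts a substring hit.
print(re.search("mini", "minimum") is not None)  # True
```

To match a substring with an anchored wildcard pattern, the pattern has to be wrapped in asterisks explicitly, as in *mini*.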

Pattern matching operations

In addition to a strict one-to-one match between the characters of the pattern and the string being checked, there are five pattern matching operations:
  • Asterisk (*) to match any sequence of zero or more characters.
  • Question mark (?) to match any single character.
  • Set of specified characters, enclosed in square brackets. It is specified as a list of characters, or as a range of characters where the beginning and end of the range are separated by a minus (or dash) character, or as any combination of lists and ranges. The dash can itself be included in the set if it is the first or last character of the set. The close square bracket (]) can be included in the set if it is the first character of the set.
  • Negation of a set. It is specified the same way as a set, with the addition of a caret character (^) immediately after the open square bracket.
  • Backslash (\) character to invalidate the special meaning of the open square bracket ([), the asterisk, backslash or the question mark. Two backslashes in sequence will result in the evaluation of the backslash as a character with no special meaning.
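
The five operations above can be sketched as a small backtracking matcher. This is a minimal illustration written for this article, not Salz's C implementation: the function names wildmat and _match_set are hypothetical, an unterminated set is treated as an error (an assumption), and backslash handling inside sets is not attempted.

```python
def _match_set(pat: str, i: int, ch: str):
    """Test ch against the set whose '[' is at pat[i].

    Returns (matched, index just past the closing ']').
    """
    i += 1
    negate = i < len(pat) and pat[i] == "^"
    if negate:
        i += 1
    found = False
    first = True  # a ']' in first position is an ordinary member
    while i < len(pat) and (first or pat[i] != "]"):
        first = False
        lo = pat[i]
        if i + 2 < len(pat) and pat[i + 1] == "-" and pat[i + 2] != "]":
            found = found or lo <= ch <= pat[i + 2]  # range such as a-z
            i += 3
        else:
            found = found or ch == lo  # single listed character
            i += 1
    if i >= len(pat):
        raise ValueError("unterminated set in pattern")  # assumption: reject
    return found != negate, i + 1


def wildmat(text: str, pat: str) -> bool:
    """Return True if the whole of text matches the wildmat-style pattern."""

    def match(t: int, p: int) -> bool:
        while p < len(pat):
            c = pat[p]
            if c == "*":
                # Try every possible length for the run absorbed by '*'.
                return any(match(k, p + 1) for k in range(t, len(text) + 1))
            if t >= len(text):
                return False
            if c == "?":
                p += 1
            elif c == "[":
                ok, p = _match_set(pat, p, text[t])
                if not ok:
                    return False
            else:
                if c == "\\" and p + 1 < len(pat):
                    p += 1  # the escaped character is taken literally
                if pat[p] != text[t]:
                    return False
                p += 1
            t += 1
        return t == len(text)  # implicit anchoring at the end

    return match(0, 0)


print(wildmat("a foo b", "*foo*"))  # True
print(wildmat("minimum", "mini"))   # False
print(wildmat("]", "[^]-]"))        # False
print(wildmat("*", "\\*"))          # True
```

The recursive handling of the asterisk keeps the sketch short but can take exponential time on patterns with many asterisks; a production matcher would normally bound or avoid that backtracking.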

Usage

wildmat is most commonly seen in NNTP implementations such as Salz's own INN, but it is also used in unrelated software such as GNU tar.

The full wildmat syntax is unable to handle multibyte character sets, and poses problems when the text being searched may contain multiple incompatible character sets. A simplified version of wildmat oriented toward UTF-8 encoding has been developed by the IETF NNTP working group, to be included in a standards document.
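
The multibyte problem can be seen when a wildcard that is meant to match one character is applied to raw bytes. The sketch below again uses Python's fnmatch module as a stand-in for a byte-oriented matcher; it is an illustration of the general issue under that assumption, not a reproduction of wildmat's behavior.

```python
import fnmatch

text = "é"                  # one character
raw = text.encode("utf-8")  # two bytes in UTF-8: b'\xc3\xa9'

# Matching on decoded characters: '?' consumes the single character.
char_level = fnmatch.fnmatchcase(text, "?")

# Matching on undecoded bytes: '?' consumes one byte, leaving one unmatched.
byte_level = fnmatch.fnmatchcase(raw, b"?")

print(len(raw))     # 2
print(char_level)   # True
print(byte_level)   # False
```

A matcher that is aware of the encoding avoids this by treating each complete UTF-8 sequence as a single character.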

Examples

  • *foo* matches any string containing "foo".
  • mini* matches anything that begins with "mini" (including the string "mini" itself).
  • ???* matches any string of three or more characters.
  • [0-9a-zA-Z] matches any single alphanumeric ASCII character.
  • [^]-] matches any single character other than a close square bracket or a dash.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.