Quex
Quex is a lexical analyzer generator that creates C and C++ lexical analyzers. Significant features include the ability to generate lexical analyzers that operate on Unicode input, the creation of direct coded (non-table-based) lexical analyzers, and the use of inheritance relationships among lexical analysis modes.

Direct coded lexical analyzers

Quex uses the traditional steps of Thompson construction to create nondeterministic finite-state machines from regular expressions, conversion to a deterministic finite-state machine, and then Hopcroft optimization to reduce the number of states to a minimum. These mechanisms, though, have been adapted to deal with character sets rather than single characters, which reduces the computation time significantly. Since the Unicode character set consists of many more code points than plain ASCII, such optimizations are necessary to produce lexical analyzers in a reasonable amount of time.
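As an illustration of this adaptation (a hypothetical representation, not Quex's internal one), transitions can be stored per contiguous interval of code points rather than per code point, so a whole character class costs one entry instead of one entry per character:

/* Hypothetical interval-based transition entry: one record covers a
 * whole contiguous range of code points. */
typedef struct {
    unsigned from;    /* first code point of the interval           */
    unsigned to;      /* last code point of the interval, inclusive */
    int      target;  /* successor state                            */
} interval_transition;

/* Matching [a-zA-Z_] from one state needs 53 single-character
 * entries, but only three intervals: */
static const interval_transition out_of_state_0[] = {
    { 'A', 'Z', 1 },
    { '_', '_', 1 },
    { 'a', 'z', 1 },
};

/* Linear scan over the intervals; -1 means "no transition". */
static int next_state(const interval_transition *t, int n, unsigned c)
{
    int i;
    for (i = 0; i < n; ++i)
        if (c >= t[i].from && c <= t[i].to)
            return t[i].target;
    return -1;
}

With full Unicode, a single property class can span thousands of code points, so the interval representation pays off even more than in this ASCII-sized example.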

Instead of constructing a table-based lexical analyzer, where the transition information is stored in a data structure, Quex generates C/C++ code that performs the transitions directly. Direct coding creates lexical analyzers that structurally resemble typical hand-written lexical analyzers more closely than table-based lexers do. Direct coded lexers also tend to perform better than analogous table-based lexical analyzers.
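The following sketch contrasts the two styles for a toy machine that accepts [0-9]+ (illustrative C only; Quex's actual generated code is more involved):

#include <stdio.h>

/* Table-based style: transitions live in a data structure that a
 * generic loop interprets at run time. */
static const int transition[2][2] = {
    /* state 0: */ { -1, 1 },  /* column 0: non-digit, column 1: digit */
    /* state 1: */ { -1, 1 },
};

static int match_table(const char *p)
{
    int state = 0, accepted = 0;
    for ( ; *p != '\0'; ++p) {
        int is_digit = (*p >= '0' && *p <= '9');
        state = transition[state][is_digit];
        if (state < 0) break;
        accepted = 1;               /* state 1 is the accepting state */
    }
    return accepted;
}

/* Direct coded style: every state becomes a code label and every
 * transition a jump, as in a hand-written scanner. */
static int match_direct(const char *p)
{
STATE_0:
    if (*p >= '0' && *p <= '9') { ++p; goto STATE_1; }
    return 0;
STATE_1:
    if (*p >= '0' && *p <= '9') { ++p; goto STATE_1; }
    return 1;
}

int main(void)
{
    printf("%d %d\n", match_table("123"), match_direct("123")); /* 1 1 */
    printf("%d %d\n", match_table("abc"), match_direct("abc")); /* 0 0 */
    return 0;
}

The direct coded variant avoids the table lookup and the interpreting loop, which is one reason such lexers tend to run faster.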

Unicode input alphabets

Quex can handle input alphabets that span the complete Unicode code point range (0 to 10FFFFh). This is augmented by the ability to specify regular expressions that contain Unicode property expressions. For example, Unicode code points with the binary property XID_Start can be matched with the expression \P{XID_Start} or \P{XIDS}. Quex can also generate code that calls iconv or ICU (International Components for Unicode) to perform character conversion. Quex relies directly on the databases as they are delivered by the Unicode Consortium; updating to a new release of the standard consists only of copying the corresponding database files into Quex's database directory.
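For instance, a Unicode-aware identifier pattern might be written as follows (a minimal sketch; the mode name, token name, and the use of the XID_Continue property are assumptions for illustration):

mode UNICODE_ID :
{
    /* One XID_Start code point followed by any number of
     * XID_Continue code points. */
    \P{XID_Start}\P{XID_Continue}* => QUEX_TKN_IDENTIFIER(Lexeme);
}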

Lexical analysis modes

Like traditional lexical analyzers (e.g. Lex and Flex), Quex supports multiple lexical analysis modes in a lexer. In addition to pattern actions, Quex modes can specify event actions: code to be executed on events such as entering or exiting a mode, or whenever any match is found. Quex modes can also be related by inheritance, which allows modes to share common pattern and event actions.
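A minimal sketch of both features (mode and token names are invented; it assumes Quex's on_entry event handler together with the mode DERIVED : BASE inheritance syntax):

mode BASE :
{
    on_entry { printf("entered a mode\n"); }  /* event action; assumes
                                                 <cstdio> in the header */
    [ \t\n]+ { /* whitespace handling shared by all derived modes */ }
}

/* PROGRAM inherits BASE's whitespace pattern and event actions. */
mode PROGRAM : BASE
{
    "begin" => QUEX_TKN_BEGINSYM;
    "end"   => QUEX_TKN_ENDSYM;
}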

Sophisticated buffer handling

Quex provides a sophisticated mechanism for buffer handling and reload that is at the same time efficient and flexible. Quex provides interfaces that allow users to plug in virtually any character set converter. The converters are activated only on demand, that is, when a new buffer filling is required. By default, Quex can plug in the iconv library; by means of this backbone, Quex is able to analyze a huge set of character encodings.
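Conceptually, an on-demand buffer filling through iconv looks like the following sketch (plain POSIX iconv in C; this is not Quex's generated code, and the function and encoding names are illustrative):

#include <iconv.h>
#include <stddef.h>

/* Convert one chunk of raw UTF-8 input into the fixed-width UCS-4
 * encoding an analyzer might work on internally. Returns the number
 * of bytes produced, or -1 on error. */
static long fill_buffer_ucs4(const char *in, size_t in_len,
                             char *out, size_t out_capacity)
{
    /* Encoding names vary slightly between iconv implementations. */
    iconv_t cd = iconv_open("UCS-4LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char   *in_p     = (char *)in;
    char   *out_p    = out;
    size_t  in_left  = in_len;
    size_t  out_left = out_capacity;

    size_t r = iconv(cd, &in_p, &in_left, &out_p, &out_left);
    iconv_close(cd);
    if (r == (size_t)-1)
        return -1;
    return (long)(out_capacity - out_left);
}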

Example

Quex follows the syntax of the classical tools lex and flex for the description of regular expressions. The example in the article on Flex can be translated into Quex source code as follows:



header {
#include <cstdlib>   // C++ version of 'stdlib.h'; provides atoi()
#include <cstring>   // provides strdup()
}

define {
    digit  [0-9]
    letter [a-zA-Z]
}

mode X :
<skip: [ \t\n]>
{
"+" => QUEX_TKN_PLUS;
"-" => QUEX_TKN_MINUS;
"*" => QUEX_TKN_TIMES;
"/" => QUEX_TKN_SLASH;
"(" => QUEX_TKN_LPAREN;
")" => QUEX_TKN_RPAREN;
";" => QUEX_TKN_SEMICOLON;
"," => QUEX_TKN_COMMA;
"." => QUEX_TKN_PERIOD;
":=" => QUEX_TKN_BECOMES;
"=" => QUEX_TKN_EQL;
"<>" => QUEX_TKN_NEQ;
"<" => QUEX_TKN_LSS;
">" => QUEX_TKN_GTR;
"<=" => QUEX_TKN_LEQ;
">=" => QUEX_TKN_GEQ;
"begin" => QUEX_TKN_BEGINSYM;
"call" => QUEX_TKN_CALLSYM;
"const" => QUEX_TKN_CONSTSYM;
"do" => QUEX_TKN_DOSYM;
"end" => QUEX_TKN_ENDSYM;
"if" => QUEX_TKN_IFSYM;
"odd" => QUEX_TKN_ODDSYM;
"procedure" => QUEX_TKN_PROCSYM;
"then" => QUEX_TKN_THENSYM;
"var" => QUEX_TKN_VARSYM;
"while" => QUEX_TKN_WHILESYM;

{letter}({letter}|{digit})* => QUEX_TKN_IDENT(strdup(Lexeme));
{digit}+ => QUEX_TKN_NUMBER(atoi(Lexeme));
. => QUEX_TKN_UNKNOWN(Lexeme);
}


The brief token senders introduced by the "=>" operator set the token ID of the resulting token object to the token ID that follows the operator. The arguments inside the brackets set the contents of the token object. Note that skipping whitespace is achieved via skippers, which are optimized to pass over specific character sets quickly (see the "<skip: ...>" tag in the mode definition above). For more sophisticated token actions, C code sections can be provided, such as


...
{digit}+ {
    if( is_prime_number(Lexeme) )     ++prime_number_counter;
    if( is_fibonacci_number(Lexeme) ) ++fibonacci_number_counter;
    self.send(QUEX_TKN_NUMBER(atoi(Lexeme)));
}
...


which might be used to gather some statistics about the numbers that occur in the analyzed code.

See also

  • Lexical analysis
  • Descriptions of Thompson construction and Hopcroft optimization can be found in most textbooks on compiler construction, such as Compilers: Principles, Techniques, and Tools.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.