Flex lexical analyser
Encyclopedia
flex is a free software
Free software
Free software, software libre or libre software is software that can be used, studied, and modified without restriction, and which can be copied and redistributed in modified or unmodified form either without restriction, or with restrictions that only ensure that further recipients can also do...

 alternative to lex
Lex programming tool
Lex is a computer program that generates lexical analyzers . Lex is commonly used with the yacc parser generator. Lex, originally written by Mike Lesk and Eric Schmidt, is the standard lexical analyzer generator on many Unix systems, and a tool exhibiting its behavior is specified as part of the...

. It is frequently used with the free Bison
GNU bison
GNU bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification of a context-free language, warns about any parsing ambiguities, and generates a parser which reads sequences of tokens and decides whether the sequence conforms to the syntax...

 parser generator. Unlike Bison, flex is not part of the GNU Project
GNU Project
The GNU Project is a free software, mass collaboration project, announced on September 27, 1983, by Richard Stallman at MIT. It initiated GNU operating system development in January, 1984...

. Flex was written in C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

 by Vern Paxson
Vern Paxson
Vern Edward Paxson is a Professor of Computer Science at the University of California, Berkeley. He also works as an Internet researcher based at the International Computer Science Institute in Berkeley, California. His interests range from transport protocols to intrusion detection and worms...

 around 1987. He was translating a Ratfor
Ratfor
Ratfor is a programming language implemented as a preprocessor for Fortran 66. It provided modern control structures, unavailable in Fortran 66, to replace GOTOs and statement numbers.- Features :...

 generator, which had been led by Jef Poskanzer
Jef Poskanzer
Jeffrey A. Poskanzer is a computer programmer. He was the first person to post a weekly FAQ to Usenet. He developed the portable pixmap file format and Pbmplus to manipulate it. He owns the internet address acme.com , and worked on the team that ported A/UX...

.

A similar lexical scanner for C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

 is flex++, which is included as part of the flex package. At the moment, flex supports generating code only for C and C++ (see flex++). The generated code does not depend on any runtime
Runtime library
In computer programming, a runtime library is a special program library used by a compiler, to implement functions built into a programming language, during the execution of a computer program...

 or external library except for a memory allocator (malloc
Malloc
C dynamic memory allocation refers to performing dynamic memory allocation in the C via a group of functions in the C standard library, namely malloc, realloc, calloc and free....

 or a user-supplied alternative) unless the input also depends on it. This can be useful in embedded
Embedded system
An embedded system is a computer system designed for specific control functions within a larger system. often with real-time computing constraints. It is embedded as part of a complete device often including hardware and mechanical parts. By contrast, a general-purpose computer, such as a personal...

 and similar situations where traditional operating system
Operating system
An operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...

 or C runtime
C standard library
The C Standard Library is the standard library for the programming language C, as specified in the ANSI C standard.. It was developed at the same time as the C POSIX library, which is basically a superset of it...

 facilities may not be available.

Example lexical analyzer

This is an example of a scanner which does not make use of Flex (written in C) for the instructional programming language PL/0
PL/0
At least two programming languages are known as PL/0. One is a subset of IBM's general-purpose programming language PL/I.The other PL/0, covered here, is similar to but much simpler than the general-purpose programming language Pascal, intended as an educational programming language. It serves as...

.

The symbols recognized are: '+', '-', '*', '/', '=', '(', ')', ',', ';', '.', ':=', '<', '<=', '<>', '>', '>=';
numbers: 0-9 {0-9}; identifiers: a-zA-Z {a-zA-Z0-9} and keywords: begin, call, const, do, end, if, odd, procedure, then, var, while.

External variables used:

FILE *source /* The source file */
int cur_line, cur_col, err_line, err_col /* For error reporting */
int num /* Last number read stored here, for the parser */
char id[] /* Last identifier read stored here, for the parser */
Hashtab *keywords /* List of keywords */


External routines called:

error(const char msg[]) /* Report an error */
Hashtab *create_htab(int estimate) /* Create a lookup table */
int enter_htab(Hashtab *ht, char name[], void *data) /* Add an entry to a lookup table */
Entry *find_htab(Hashtab *ht, char *s) /* Find an entry in a lookup table */
void *get_htab_data(Entry *entry) /* Returns data from a lookup table */
FILE *fopen(char fn[], char mode[]) /* Opens a file for reading */
fgetc(FILE *stream) /* Read the next character from a stream */
ungetc(int ch, FILE *stream) /* Put-back a character onto a stream */
isdigit(int ch), isalpha(int ch), isalnum(int ch) /* Character classification */


External types:

Symbol /* An enumerated type of all the symbols in the PL/0 language */
Hashtab /* Represents a lookup table */
Entry /* Represents an entry in the lookup table */


Scanning is started by calling init_scan, passing the name of the source file. If the source file is successfully opened, the parser calls getsym repeatedly to return successive symbols from the source file.

The heart of the scanner, getsym, should be straightforward. First, whitespace is skipped. Then the retrieved character is classified. If the character represents a multiple-character symbol, additional processing must be done. Numbers are converted to internal form, and identifiers are checked to see if they represent a keyword.


int read_ch(void) {
int ch = fgetc(source);
cur_col++;
if (ch

'\n') {
cur_line++;
cur_col = 0;
}
return ch;
}

void put_back(int ch) {
ungetc(ch, source);
cur_col--;
if (ch

'\n') cur_line--;
}

Symbol getsym(void) {
int ch;

while ((ch = read_ch) != EOF && ch <= ' ')
;
err_line = cur_line;
err_col = cur_col;
switch (ch) {
case EOF: return eof;
case '+': return plus;
case '-': return minus;
case '*': return times;
case '/': return slash;
case '=': return eql;
case '(': return lparen;
case ')': return rparen;
case ',': return comma;
case ';': return semicolon;
case '.': return period;
case ':':
ch = read_ch;
return (ch

'=') ? becomes : nul;
case '<':
ch = read_ch;
if (ch

'>') return neq;
if (ch

'=') return leq;
put_back(ch);
return lss;
case '>':
ch = read_ch;
if (ch

'=') return geq;
put_back(ch);
return gtr;
default:
if (isdigit(ch)) {
num = 0;
do { /* no checking for overflow! */
num = 10 * num + ch - '0';
ch = read_ch;
} while ( ch != EOF && isdigit(ch));
put_back(ch);
return number;
}
if (isalpha(ch)) {
Entry *entry;
id_len = 0;
do {
if (id_len < MAX_ID) {
id[id_len] = (char)ch;
id_len++;
}
ch = read_ch;
} while ( ch != EOF && isalnum(ch));
id[id_len] = '\0';
put_back(ch);
entry = find_htab(keywords, id);
return entry ? (Symbol)get_htab_data(entry) : ident;
}

error("getsym: invalid character '%c'", ch);
return nul;
}
}

int init_scan(const char fn[]) {
if ((source = fopen(fn, "r")) NULL) return 0;
cur_line = 1;
cur_col = 0;
keywords = create_htab(11);
enter_htab(keywords, "begin", beginsym);
enter_htab(keywords, "call", callsym);
enter_htab(keywords, "const", constsym);
enter_htab(keywords, "do", dosym);
enter_htab(keywords, "end", endsym);
enter_htab(keywords, "if", ifsym);
enter_htab(keywords, "odd", oddsym);
enter_htab(keywords, "procedure", procsym);
enter_htab(keywords, "then", thensym);
enter_htab(keywords, "var", varsym);
enter_htab(keywords, "while", whilesym);
return 1;
}


Now, contrast the above code with the code needed for a flex generated scanner for the same language:


%{
  1. include "y.tab.h"

%}

digit [0-9]
letter [a-zA-Z]

%%
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return TIMES; }
"/" { return SLASH; }
"(" { return LPAREN; }
")" { return RPAREN; }
";" { return SEMICOLON; }
"," { return COMMA; }
"." { return PERIOD; }
":=" { return BECOMES; }
"=" { return EQL; }
"<>" { return NEQ; }
"<" { return LSS; }
">" { return GTR; }
"<=" { return LEQ; }
">=" { return GEQ; }
"begin" { return BEGINSYM; }
"call" { return CALLSYM; }
"const" { return CONSTSYM; }
"do" { return DOSYM; }
"end" { return ENDSYM; }
"if" { return IFSYM; }
"odd" { return ODDSYM; }
"procedure" { return PROCSYM; }
"then" { return THENSYM; }
"var" { return VARSYM; }
"while" { return WHILESYM; }
{letter}({letter}|{digit})* {
yylval.id = (char *)strdup(yytext);
return IDENT; }
{digit}+ { yylval.num = atoi(yytext);
return NUMBER; }
[ \t\n\r] /* skip whitespace */
. { printf("Unknown character [%c]\n",yytext[0]);
return UNKNOWN; }
%%

int yywrap(void){return 1;}


About 50 lines of code for flex versus about 100 lines of hand-written code.

Time complexity

A Flex lexical analyzer sometimes has time complexity in the length of the input. That is, it performs a constant number of operations for each input symbol. This constant is quite low: GCC
GNU Compiler Collection
The GNU Compiler Collection is a compiler system produced by the GNU Project supporting various programming languages. GCC is a key component of the GNU toolchain...

 generates 12 instructions for the DFA match loop. Note that the constant is independent of the length of the token, the length of the regular expression and the size of the DFA.

However, one optional feature of Flex can cause Flex to generate a scanner with non-linear performance: The use of the REJECT macro in a scanner with the potential to match extremely long tokens. In this case, the programmer has explicitly told flex to "go back and try again" after it has already matched some input. This will cause the DFA to backtrack to find other accept states. In theory, the time complexity is where is the length of the longest token (this reverts to if tokens are "small" with respect to the input size). The REJECT feature is not enabled by default, and its performance implications are thoroughly documented in the Flex manual.

Reentrancy

By default the scanner generated by Flex is not reentrant. This can cause serious problems for programs that use the generated scanner from different threads. To overcome this issue there are options that Flex provides in order to achieve reentrancy. A detailed description of these options can be found in the Flex manual.

Usage under non-Unix environments

Normally the generated scanner contains references to unistd.h header file which is Unix
Unix-like
A Unix-like operating system is one that behaves in a manner similar to a Unix system, while not necessarily conforming to or being certified to any version of the Single UNIX Specification....

 specific. To avoid generating code that includes unistd.h, %option nounistd should be used. Another issue is the call to isatty
Not a typewriter
In computer science "Not a typewriter" or ENOTTY is an error code defined in the errno.h found on many Unix systems. This code is used to indicate that an attempt has been made to use a non-TTY device as a TTY device.- Details :...

(a Unix library function), which can be found in the generated code. The %option never-interactive forces flex to generate code that doesn't use isatty. These options are detailed in the Flex manual.

Using flex from other languages

Flex can only generate code for C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

 and C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

. To use the scanner code generated by flex from other languages a language binding
Language binding
In computing, a binding from a programming language to a library or OS service is an API providing that service in the language.Many software libraries are written in systems programming languages such as C or C++...

 tool such as SWIG
SWIG
SWIG is an open source software tool used to connect computer programs or libraries written in C or C++ with scripting languages such as Lua, Perl, PHP, Python, R, Ruby, Tcl, and other languages like C#, Java, Modula-3, Objective Caml, Octave, and Scheme...

 can be used.
Flex++
Flex++ is a tool for creating a language parsing program. A parser generator creates a language parsing program. It is a general instantiation of the flex program.

These programs perform character parsing, and tokenizing via the use of a deterministic finite automata or Deterministic finite state machine
Deterministic finite state machine
In the theory of computation and automata theory, a deterministic finite state machine—also known as deterministic finite automaton —is a finite state machine accepting finite strings of symbols. For each state, there is a transition arrow leading out to a next state for each symbol...

 (DFA
Deterministic finite state machine
In the theory of computation and automata theory, a deterministic finite state machine—also known as deterministic finite automaton —is a finite state machine accepting finite strings of symbols. For each state, there is a transition arrow leading out to a next state for each symbol...

). A DFA (or NDFA
Nondeterministic finite state machine
In the automata theory, a nondeterministic finite state machine or nondeterministic finite automaton is a finite state machine where from each state and a given input symbol the automaton may jump into several possible next states...

) is a theoretical machine accepting regular language
Regular language
In theoretical computer science and formal language theory, a regular language is a formal language that can be expressed using regular expression....

s. These machines are a subset of the collection of Turing machine
Turing machine
A Turing machine is a theoretical device that manipulates symbols on a strip of tape according to a table of rules. Despite its simplicity, a Turing machine can be adapted to simulate the logic of any computer algorithm, and is particularly useful in explaining the functions of a CPU inside a...

s. DFAs are equivalent to read only right moving Turing Machines
Read only right moving Turing Machines
Read-only right moving Turing Machines are a particular type of Turing Machine.- Definition :The definition based on a single infinite tape defined to be a 7-tupleM= \langle Q, \Gamma, b, \Sigma, \delta, q_0, F \rangle where...

 or NDFAs
Nondeterministic finite state machine
In the automata theory, a nondeterministic finite state machine or nondeterministic finite automaton is a finite state machine where from each state and a given input symbol the automaton may jump into several possible next states...

. The syntax is based on the use of Regular expressions.

Flex provides two different ways to generate scanners. It primarily generates C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

 code to be compiled as opposed to C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

 libraries and code. Flex++, an extension of flex, is used for generating C++ code and classes. The Flex++ classes and code require a C++ compiler to create lexical and pattern-matching programs. Flex, the alternative language parser, defaults to generating a parsing scanner in C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

 code. The Flex++ generated C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

 scanner includes the header file FlexLexer.h, which defines the interfaces of the two C++
C++
C++ is a statically typed, free-form, multi-paradigm, compiled, general-purpose programming language. It is regarded as an intermediate-level language, as it comprises a combination of both high-level and low-level language features. It was developed by Bjarne Stroustrup starting in 1979 at Bell...

 generated classes.

Flex creators

Vern Paxson
Vern Paxson
Vern Edward Paxson is a Professor of Computer Science at the University of California, Berkeley. He also works as an Internet researcher based at the International Computer Science Institute in Berkeley, California. His interests range from transport protocols to intrusion detection and worms...

, with the help of many ideas and much inspiration from Van Jacobson
Van Jacobson
Van Jacobson is one of the primary contributors to the TCP/IP protocol stack which is the technological foundation of today’s Internet. He is renowned for his pioneering achievements in network performance and scaling....

.
See also
  • John Levine, flex & bison, O'Reilly and Associates
  • M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator
  • Alfred Aho, Ravi Sethi and Jeffrey Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley (1986). Describes the pattern-matching techniques used by flex (deterministic finite automata)

  • Lex
    Lex programming tool
    Lex is a computer program that generates lexical analyzers . Lex is commonly used with the yacc parser generator. Lex, originally written by Mike Lesk and Eric Schmidt, is the standard lexical analyzer generator on many Unix systems, and a tool exhibiting its behavior is specified as part of the...

  • Ragel
    Ragel
    Ragel is a finite-state machine compiler with output support for C, C++, C#, Objective-C, D, Java, Go and Ruby source code. It supports the generation of table or control flow driven state machines from regular expressions and/or state charts and can also build lexical analysers via the...

  • Quex
    Quex
    Quex is a lexical analyzer generator that creates C and C++ lexical analyzers. Significant features include the ability to generate lexical analyzers that operate on Unicode input, the creation of direct coded lexical analyzers and the use of inheritance relationships in lexical analysis modes.-...

  • yacc
    Yacc
    The computer program yacc is a parser generator developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler." It generates a parser based on an analytic grammar written in a notation similar to BNF.Yacc used to be available as...

  • GNU Bison
    GNU bison
    GNU bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification of a context-free language, warns about any parsing ambiguities, and generates a parser which reads sequences of tokens and decides whether the sequence conforms to the syntax...

  • Berkeley Yacc
    Berkeley Yacc
    Berkeley Yacc is a reimplementation of the Unix parser generator Yacc, originally written by Robert Corbett in 1990, designed for compatibility with Yacc. Due to its liberal license and because it was faster than the AT&T Yacc, it quickly became the most popular version of Yacc...


External links
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK