Lex is a
computer programA computer program is a sequence of instructions written to perform a specified task with a computer. A computer requires programs to function, typically executing the program's instructions in a central processor. The program has an executable form that the computer can use directly to execute...
that generates
lexical analyzerIn computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function which performs lexical analysis is called a lexical analyzer, lexer or scanner...
s ("scanners" or "lexers"). Lex is commonly used with the
yaccThe computer program yacc is a parser generator developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler." It generates a parser based on an analytic grammar written in a notation similar to BNF.Yacc used to be available as...
parser generator. Lex, originally written by
Mike LeskMichael E. Lesk is a computer programmer.In the 1960s, Michael Lesk worked for the SMART Information Retrieval System project, wrote much of its retrieval code and did many of the retrieval experiments, as well as obtaining a PhD in Chemical Physics....
and Eric Schmidt, is the standard lexical analyzer generator on many
UnixUnix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...
systems, and a tool exhibiting its behavior is specified as part of the
POSIXPOSIX , an acronym for "Portable Operating System Interface", is a family of standards specified by the IEEE for maintaining compatibility between operating systems...
standard.
Lex reads an input stream specifying the lexical analyzer and outputs
source codeIn computer science, source code is text written using the format and syntax of the programming language that it is being written in. Such a language is specially designed to facilitate the work of computer programmers, who specify the actions to be performed by a computer mostly by writing source...
implementing the lexer in the
C programming languageC is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
.
Though traditionally proprietary software, versions of Lex based on the original AT&T code are available as
open sourceThe term open source describes practices in production and development that promote access to the end product's source materials. Some consider open source a philosophy, others consider it a pragmatic methodology...
, as part of systems such as
OpenSolarisOpenSolaris was an open source computer operating system based on Solaris created by Sun Microsystems. It was also the name of the project initiated by Sun to build a developer and user community around the software...
and
Plan 9 from Bell LabsPlan 9 from Bell Labs is a distributed operating system. It was developed primarily for research purposes as the successor to Unix by the Computing Sciences Research Center at Bell Labs between the mid-1980s and 2002...
. Another popular
open sourceThe term open source describes practices in production and development that promote access to the end product's source materials. Some consider open source a philosophy, others consider it a pragmatic methodology...
version of Lex is
flexflex is a free software alternative to lex. It is frequently used with the free Bison parser generator. Unlike Bison, flex is not part of the GNU Project. Flex was written in C by Vern Paxson around 1987...
, the "fast lexical analyzer".
Structure of a lex file
The structure of a lex file is intentionally similar to that of a yacc file; files are divided up into three sections, separated by lines that contain only two percent signs, as follows:
Definition section
%%
Rules section
%%
C code section
- The definition section is the place to define macros and to import header file
Some programming languages use header files. These files allow programmers to separate certain elements of a program's source code into reusable files. Header files commonly contain forward declarations of classes, subroutines, variables, and other identifiers...
s written in CC is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
. It is also possible to write any C code here, which will be copied verbatim into the generated source file.
- The rules section is the most important section; it associates patterns with C statement
In computer programming a statement can be thought of as the smallest standalone element of an imperative programming language. A program written in such a language is formed by a sequence of one or more statements. A statement will have internal components .Many languages In computer programming...
s. Patterns are simply regular expressionIn computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...
s. When the lexer sees some text in the input matching a given pattern, it executes the associated C code. This is the basis of how lex operates.
- The C code section contains C statements and functions that are copied verbatim to the generated source file. These statements presumably contain code called by the rules in the rules section. In large programs it is more convenient to place this code in a separate file and link it in at compile
A compiler is a computer program that transforms source code written in a programming language into another computer language...
time.
Example of a lex file
The following is an example lex file for the
flexflex is a free software alternative to lex. It is frequently used with the free Bison parser generator. Unlike Bison, flex is not part of the GNU Project. Flex was written in C by Vern Paxson around 1987...
version of lex. It recognizes strings of numbers (integers) in the input, and simply prints them out.
/*** Definition section ***/
%{
/* C code to be copied verbatim */
- include
%}
/* This tells flex to read only one input file */
%option noyywrap
%%
/*** Rules section ***/
/* [0-9]+ matches a string of one or more digits */
[0-9]+ {
/* yytext is a string containing the matched text. */
printf("Saw an integer: %s\n", yytext);
}
.|\n { /* Ignore all other characters. */ }
%%
/*** C Code section ***/
int main(void)
{
/* Call the lexer, then quit. */
yylex;
return 0;
}
If this input is given to flex, it will be converted into a C file, lex.yy.c. This can be compiled into an executable which matches and outputs strings of integers. For example, given the input:
abc123z.!&*2gj6
the program will print:
Saw an integer: 123
Saw an integer: 2
Saw an integer: 6
Using Lex with parser generators
Lex and parser generators, such as
YaccThe computer program yacc is a parser generator developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler." It generates a parser based on an analytic grammar written in a notation similar to BNF.Yacc used to be available as...
or
BisonGNU bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification of a context-free language, warns about any parsing ambiguities, and generates a parser which reads sequences of tokens and decides whether the sequence conforms to the syntax...
, are commonly used together. Parser generators use a
formal grammarA formal grammar is a set of formation rules for strings in a formal language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax...
to parse an input stream, something which Lex cannot do using simple
regular expressionIn computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...
s (Lex is limited to simple
finite state automataA finite-state machine or finite-state automaton , or simply a state machine, is a mathematical model used to design computer programs and digital logic circuits. It is conceived as an abstract machine that can be in one of a finite number of states...
).
It is typically preferable to have a (Yacc-generated, say) parser be fed a token-stream as input, rather than having it consume the input character-stream directly. Lex is often used to produce such a token-stream.
Scannerless parsing refers to where a parser consumes the input character-stream directly, without a distinct lexer.
Lex and make
make is a utility that can be used to maintain programs involving lex. Make assumes that a file that has an extension of
.l is a lex source file. The make internal macro
LFLAGS can be used to specify lex options to be invoked automatically by make.
See also
- Flex lexical analyser
flex is a free software alternative to lex. It is frequently used with the free Bison parser generator. Unlike Bison, flex is not part of the GNU Project. Flex was written in C by Vern Paxson around 1987...
- Yacc
The computer program yacc is a parser generator developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler." It generates a parser based on an analytic grammar written in a notation similar to BNF.Yacc used to be available as...
- Ragel
Ragel is a finite-state machine compiler with output support for C, C++, C#, Objective-C, D, Java, Go and Ruby source code. It supports the generation of table or control flow driven state machines from regular expressions and/or state charts and can also build lexical analysers via the...
- Quex
Quex is a lexical analyzer generator that creates C and C++ lexical analyzers. Significant features include the ability to generate lexical analyzers that operate on Unicode input, the creation of direct coded lexical analyzers and the use of inheritance relationships in lexical analysis modes.-...
- Comparison of parser generators
This is a list of notable lexer generators and parser generators for various language classes.- Regular languages :- Deterministic context-free languages :-Parsing expression grammars, deterministic boolean grammars:...
External links