All Topics  
Lexical analysis

 

   Email Print
   Bookmark   Link






 

Lexical analysis



 
 
In computer science
Computer science

Computer science is the study of the theoretical foundations of information and computation, and of practical techniques for their implementation and application in computer systems....
, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. Programs performing lexical analysis are called lexical analyzers or lexers. A lexer is often organized as separate scanner and tokenizer functions, though the boundaries may not be clearly defined.

specification of a programming language
Programming language

A programming language is a machine-readable artificial language designed to express computations that can be performed by a machine, particularly a computer....
 will include a set of rules, often expressed syntactically, specifying the set of possible character sequences that can form a token or lexeme
Lexeme

A lexeme is an abstract Unit of Morphology Semantic analysis in linguistics, that roughly corresponds to a set of forms taken by a single word....
.






Discussion
Ask a question about 'Lexical analysis'
Start a new discussion about 'Lexical analysis'
Answer questions from other users
Full Discussion Forum



Encyclopedia


In computer science
Computer science

Computer science is the study of the theoretical foundations of information and computation, and of practical techniques for their implementation and application in computer systems....
, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. Programs performing lexical analysis are called lexical analyzers or lexers. A lexer is often organized as separate scanner and tokenizer functions, though the boundaries may not be clearly defined.

Lexical grammar

The specification of a programming language
Programming language

A programming language is a machine-readable artificial language designed to express computations that can be performed by a machine, particularly a computer....
 will include a set of rules, often expressed syntactically, specifying the set of possible character sequences that can form a token or lexeme
Lexeme

A lexeme is an abstract Unit of Morphology Semantic analysis in linguistics, that roughly corresponds to a set of forms taken by a single word....
. The whitespace characters are often ignored during lexical analysis.

Token

A token is a categorized block of text. The block of text corresponding to the token is known as a lexeme
Lexeme

A lexeme is an abstract Unit of Morphology Semantic analysis in linguistics, that roughly corresponds to a set of forms taken by a single word....
. A lexical analyzer processes lexemes to categorize them according to function, giving them meaning. This assignment of meaning is known as tokenization. A token can look like anything; it just needs to be a useful part of the structured text.

Consider this expression in the C programming language
C (programming language)

C is a general-purpose computer programming language originally developed in 1972 by Dennis Ritchie at the Bell Telephone Laboratories to implement the Unix operating system....
:
sum=3+2;
Tokenized in the following table:
lexemetoken type
sumIDENT
=ASSIGN_OP
3NUMBER
+ADD_OP
2NUMBER
;SEMICOLON


Tokens are frequently defined by regular expression
Regular expression

In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters....
s, which are understood by a lexical analyzer generator such as lex
Lex programming tool

In computer science, lex is a Computer program that generates lexical analysiss . Lex is commonly used with the yacc parser generator. Lex, originally written by Eric Schmidt and Mike Lesk, is the standard lexical analyzer generator on many Unix systems, and a tool exhibiting its behavior is specified as part of the POSIX standard....
. The lexical analyzer (either generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is called "tokenizing." If the lexer finds an invalid token, it will report an error.

Following tokenizing is parsing
Parsing

In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a sequence of lexical analysis#Token to determine their grammatical structure with respect to a given formal grammar....
. From there, the interpreted data may be loaded into data structures, for general use, interpretation, or compiling.

Scanner

The first stage, the scanner, is usually based on a finite state machine
Finite state machine

A finite state machine or finite state automaton or simply a state machine, is a model of behavior composed of a finite number of state s, transitions between those states, and actions....
. It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexeme
Lexeme

A lexeme is an abstract Unit of Morphology Semantic analysis in linguistics, that roughly corresponds to a set of forms taken by a single word....
s). For instance, an integer token may contain any sequence of numerical digit
Numerical digit

In mathematics and computer science, a digit is a symbol used in numerals , to represent numbers, in Positional notation numeral systems. The name "digit" comes from the fact that the 10 digits of the hands correspond to the 10 symbols of the common base 10 number system, i.e....
 characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the maximal munch
Maximal munch

In computer programming, the "maximal munch" principle is the rule that as much of the input as possible should be processed when creating some construct....
 rule). In some languages the lexeme creation rules are more complicated and may involve backtracking
Backtracking

Backtracking is a general algorithm for finding all solutions to some computational problem, that incrementally builds candidates to the solutions, and abandons each partial candidate c as soon as it determines that c cannot possibly be completed to a valid solution ....
 over previously read characters.

Tokenizer

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing
Parsing

In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a sequence of lexical analysis#Token to determine their grammatical structure with respect to a given formal grammar....
 input.

Take, for example, the following string. Unlike humans, a computer cannot intuitively 'see' that there are 9 words. To a computer this is only a series of 43 characters.

The quick brown fox jumps over the lazy dog


A process of tokenization could be used to split the sentence into word tokens. Although the following example is given as XML there are many ways to represent tokenized input: The quick brown fox jumps over the lazy dog A lexeme
Lexeme

A lexeme is an abstract Unit of Morphology Semantic analysis in linguistics, that roughly corresponds to a set of forms taken by a single word....
, however, is only a string of characters known to be of a certain kind (eg, a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. (Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.)

For example, in the source code of a computer program the string

net_worth_future = (assets - liabilities);


might be converted (with whitespace suppressed) into the lexical token stream:

NAME "net_worth_future" EQUALS OPEN_PARENTHESIS NAME "assets" MINUS NAME "liabilities" CLOSE_PARENTHESIS SEMICOLON

Though it is possible and sometimes necessary to write a lexer by hand, lexers are often generated by automated tools. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed or construct a state table for a finite state machine
Finite state machine

A finite state machine or finite state automaton or simply a state machine, is a model of behavior composed of a finite number of state s, transitions between those states, and actions....
 (which is plugged into template code for compilation and execution).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English
English language

English is a West Germanic language that originated in Anglo-Saxon England and has lingua franca status in many parts of the world as a result of the military, economic, scientific, political and cultural influence of the British Empire in the 18th, 19th and early 20th centuries and that of the United States from the mid 20th century onwa...
-based language, a NAME token might be any English alphabetical character or an underscore, followed by any number of instances of any ASCII alphanumeric character or an underscore. This could be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".

Regular expressions and the finite state machines they generate are not powerful enough to handle recursive patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." They are not capable of keeping count, and verifying that n is the same on both sides — unless you have a finite set of permissible values for n. It takes a full-fledged parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end. (see in the SCIP
Structure and Interpretation of Computer Programs

Structure and Interpretation of Computer Programs is a textbook published in 1985 about general computer programming concepts from MIT Press written by Massachusetts Institute of Technology professors Harold Abelson and Gerald Jay Sussman, with Julie Sussman....
 book)

The Lex programming tool
Lex programming tool

In computer science, lex is a Computer program that generates lexical analysiss . Lex is commonly used with the yacc parser generator. Lex, originally written by Eric Schmidt and Mike Lesk, is the standard lexical analyzer generator on many Unix systems, and a tool exhibiting its behavior is specified as part of the POSIX standard....
 and its compiler is designed to generate code for fast lexical analysers based on a formal description of the lexical syntax. It is not generally considered sufficient for applications with a complicated set of lexical rules and severe performance requirements; for instance, the GNU Compiler Collection
GNU Compiler Collection

The GNU Compiler Collection is a compiler system produced by the GNU Project supporting various programming languages. GCC is a key component of the GNU toolchain....
 uses hand-written lexers.

Lexer generator

Lexical analysis can often be performed in a single pass if reading is done a character at a time. Single-pass lexers can be generated by tools such as the classic flex
Flex lexical analyser

flex is a free software alternative to lex programming tool. It is frequently used with the free GNU bison parser generator. Flex was originally written in C by Vern Paxson around 1987....
.

The lex/flex family of generators uses a table-driven approach which is much less efficient than the directly coded approach. With the latter approach the generator produces an engine that directly jumps to follow-up states via goto statements. Tools like re2c and Quex
Quex

Quex is a Lexical analysis generator implemented in the Python programming language that creates C++ lexical analyzers. Significant features include the ability to generate lexical analyzers that operate on Unicode input, the creation of direct coded lexical analyzers and the use of inheritance relationships in lexical analysis modes....
 have proven (e.g. ) to produce engines that are between two to three times faster than flex produced engines. It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools.

The simple utility of using a scanner generator should not be discounted, especially in the developmental phase, when a language specification might change daily. The ability to express lexical constructs as regular expressions facilitates the description of a lexical analyzer. Some tools offer the specification of pre- and post-conditions which are hard to program by hand. In that case, using a scanner generator may save a lot of development time.

Lexical analyzer generators

  • Flex
    Flex lexical analyser

    flex is a free software alternative to lex programming tool. It is frequently used with the free GNU bison parser generator. Flex was originally written in C by Vern Paxson around 1987....
     - Alternative variant of the classic 'lex' (C/C++).
  • JLex
    JLex

    JLex is a Lexical analysis which produces Java language source code. It is implemented in Java. JLex was developed by Elliot Berk at Princeton University....
     - A Lexical Analyzer Generator for Java.
  • JFlex - a rewrite of JLex.
  • Quex
    Quex

    Quex is a Lexical analysis generator implemented in the Python programming language that creates C++ lexical analyzers. Significant features include the ability to generate lexical analyzers that operate on Unicode input, the creation of direct coded lexical analyzers and the use of inheritance relationships in lexical analysis modes....
     - (or 'Que?') A Mode Oriented Lexical Analyzer Generator for C++.
  • Ragel
    Ragel

    Ragel is a finite state machine compiler with output support for C programming language, C++, Objective-C, D , Java and Ruby source code. It supports the generation of table or control flow driven state machines from regular expressions and/or state charts and can also build Lexical analysiss via the longest-match method....
     - A state machine and lexical scanner generator with output support for C, C++, Objective-C, D, Java and Ruby source code


See also

  • List of parser generators
  • List of C Sharp lexer generators
    List of C Sharp lexer generators

    The following is a list of lexer generator or scanner generator written for the language C Sharp .** A Flex like generator in C#** A C# port from JFlex Project** A C# port from JLex project** Also a Lex-like generator in C#** LEX-like lexer generator in C#, generating C# code...
  • Parsing
    Parsing

    In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a sequence of lexical analysis#Token to determine their grammatical structure with respect to a given formal grammar....