@TrigonaMinima
Last active September 24, 2015 20:29
A description of Lex

Introduction

Lex is a computer program that generates lexical analyzers ("scanners" or "lexers"). Lex reads an input stream specifying the lexical analyzer and outputs source code implementing the lexer in the C programming language. Lex turns the user's expressions and actions (called source here) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input here) and perform the specified actions for each expression as it is detected. See Figure 1.

                 +-------+
      Source ->  |  Lex  |  -> yylex
                 +-------+

                +-------+
      Input ->  | yylex | -> Output
                +-------+

              An overview of Lex
                   Figure 1

For a trivial example, consider a program to delete from the input all blanks or tabs at the end of lines.

%%
[ \t]+$   ;

is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. This rule contains a regular expression which matches one or more instances of the characters, blank or tab, just prior to the end of a line. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied.

Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc. Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context free grammars, but require a lower level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate.

Lex generates a deterministic finite automaton from the regular expressions in the source. The automaton is interpreted, rather than compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of the rules is the size of the finite automaton, and therefore the size of the program generated by Lex.

In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine.

Lex Source

The structure of a Lex file is intentionally similar to that of a yacc file; files are divided into three sections, separated by lines that contain only two percent signs, as follows:

{definitions}
%%
{rules}
%%
{user subroutines in C}

The definition section defines macros and imports header files written in C. It is also possible to write any C code here, which will be copied verbatim into the generated source file.

The subroutines section contains C statements and functions that are copied verbatim to the generated source file. These statements presumably contain code called by the rules in the rules section.

The rules section associates regular expression patterns with C statements. When the lexer sees text in the input matching a given pattern, it will execute the associated C code. The rules form a table, in which the left column contains regular expressions and the right column contains actions (C code) to be executed when the expressions are recognized. For example, an individual rule might appear as

integer printf("found keyword INT");

to look for the string integer in the input stream and print the message "found keyword INT" whenever it appears. The end of the regex is indicated by the first blank or tab character. As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling. The Lex rules could be:

colour      printf("color");
mechanise   printf("mechanize");

Rules

As mentioned before, a rule specifies the C statements to be executed whenever the input matches the associated regex. Each row of the table is divided into two parts, namely,

Lex Regular Expressions

A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features).

The text characters include the letters a-z and A-Z and the digits; the operator characters are " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

Some examples of the regular expressions are,

  1. integer - matches the string 'integer' wherever it appears
  2. ab?c - matches either 'ac' or 'abc'
  3. a* - is any number of consecutive 'a' characters, including zero
  4. a+ - is one or more instances of 'a'
  5. [A-Za-z][A-Za-z0-9]* - indicates all alphanumeric strings with a leading alphabetic character.

Lex Actions

When a regular expression written as above is matched, Lex executes the corresponding action. There is a default action, which consists of copying the input to the output. This is performed on all strings not otherwise matched. Thus, the Lex user who wishes to absorb the entire input, without producing any output, must provide rules to match everything.

One of the simplest things that can be done is to ignore the input. Specifying a C null statement (;) as an action causes this result. A frequent rule is

[ \t\n] ;

which causes the three spacing characters (blank, tab, and newline) to be ignored. In more complex actions, the user will often want to know the actual text that matched some expression like [a-z]+. Lex leaves this text in an external character array named yytext. Thus, to print the name found, a rule like,

[a-z]+ printf("%s", yytext);

will print the string in yytext, placing the matched text on the output. This action is so common that it may be written as ECHO:

[a-z]+ ECHO;

is the same as the above. There are many more such routines provided by Lex, such as input(), output(), unput(), and yymore(). By default, these routines are provided as macro definitions, but the user can override them and supply private versions. These routines define the relationship between external files and internal characters, and must all be retained or modified consistently. They may be redefined to cause input or output to be transmitted to or from strange places, including other programs or internal memory, but the character set used must be consistent in all routines; a value of zero returned by input must mean end of file; and the relationship between unput and input must be retained or the Lex lookahead will not work.

Example

The following is an example Lex file for the flex version of Lex. It recognizes integers, strings and junk in the input, and simply prints them out.

    /*** Definition section ***/

    %{
    /* C code to be copied verbatim */
    #include <stdio.h>
    %}

    /* This tells flex to read only one input file */
    %option noyywrap

    %%
        /*** Rules section ***/
        /* [0-9]+ will match all the numbers. */
    [0-9]+  printf("Saw an integer\t: %s\n", yytext);

        /* [a-zA-Z]+ will match all the alphabetic words. */
    [a-zA-Z]+   printf("Saw a string\t: %s\n", yytext);

        /* [ \t\n]+ will match all the blank spaces. */
    [ \t\n]+   {}

        /* [^a-zA-Z0-9]+ will match all the characters except alphabets and numbers. */
    [^a-zA-Z0-9]+    printf("Saw junk\t: %s\n", yytext);

    %%
    /*** C Code section ***/

    int main(void)
    {
        /* Call the lexer, then quit. */
        yylex();
        return 0;
    }

Save the above code in a file (here, lexer.txt). After running the following command, a file named lex.yy.c will be generated.

lex lexer.txt

This C file can be compiled into an executable which matches the input against the rules and prints the recognized tokens. For example, given the input abc123z.!&*2gj6, it will print:

Saw a string    : abc
Saw an integer  : 123
Saw a string    : z
Saw junk        : .!&*
Saw an integer  : 2
Saw a string    : gj
Saw an integer  : 6