[Compiler] 04. Lexical Analysis - Tokens, Patterns, and Regular Expressions


Lexical Analysis is the first phase of compilation. Let us examine the process of reading the character stream of source code and breaking it into meaningful units called tokens.


1. Role of the Lexical Analyzer

The lexical analyzer (also called a lexer or scanner) performs the following tasks:

Source Program (Character Stream)
        |
        v
  +-------------------+
  |  Lexical Analyzer |  <---> Symbol Table
  +-------------------+
        |
        v
  Token Stream (passed to the parser)

Main Functions

  1. Token Recognition: Groups characters into token units.
  2. Whitespace/Comment Removal: Filters out characters unnecessary for the parser.
  3. Source Location Tracking: Records line and column numbers for error messages.
  4. Symbol Table Management: Registers identifiers in the symbol table.

Why Separate the Lexical Analyzer

  • Simplification of Design: The parser does not need to worry about whitespace and comments.
  • Improved Portability: Input device-dependent processing is isolated in the lexical analyzer.
  • Efficiency: Specialized techniques (finite automata) for token recognition can be applied.

2. Tokens, Patterns, and Lexemes

These three concepts are the core of lexical analysis.

Token

A token is an abstract name representing a lexical unit of a programming language.

Pattern

A pattern describes the form of strings that can constitute a token.

Lexeme

A lexeme is the actual string in the source code that matches a pattern.

Example

C source code: printf("Total = %d\n", score);

+----------+----------------------------+-----------------+
| Token    | Pattern                    | Lexeme Examples |
+----------+----------------------------+-----------------+
| id       | starts with a letter, then | printf, score   |
|          | letters/digits             |                 |
| num      | one or more digits         | 42, 3, 100      |
| literal  | characters between " "     | "Total = %d\n"  |
| lparen   | (                          | (               |
| rparen   | )                          | )               |
| comma    | ,                          | ,               |
| semi     | ;                          | ;               |
+----------+----------------------------+-----------------+

Types of Tokens

1. Keyword:     if, else, while, return, ...
2. Identifier:  user-defined names
3. Constant:    integers, floating-point numbers, string literals
4. Operator:    +, -, *, /, =, ==, !=, ...
5. Delimiter:   (, ), {, }, ;, ,, ...

3. Token Attributes

Since multiple lexemes can match a single token name, additional information is needed.

Source code: position = initial + rate * 60

Token stream:
  <id, "position">
  <assign_op>
  <id, "initial">
  <add_op>
  <id, "rate">
  <mul_op>
  <num, 60>

The second element of each pair is the attribute. Tokens such as assign_op, whose lexeme is always the same, need no attribute.

The attribute of an identifier is usually a pointer to a symbol table entry. The attribute of a number is its actual value.
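The token/attribute pairing can be sketched directly in code. This is a minimal illustration, not a fixed API; the names `Token` and `TokenType` are assumptions for the example.

```java
// Hypothetical sketch of a token paired with an optional attribute.
public class TokenDemo {
    enum TokenType { ID, NUM, ASSIGN_OP, ADD_OP, MUL_OP }

    // The token name is the abstract category; the attribute carries the
    // lexeme (for identifiers) or the decoded value (for numbers).
    record Token(TokenType type, Object attribute) {
        @Override public String toString() {
            return attribute == null
                ? "<" + type + ">"
                : "<" + type + ", " + attribute + ">";
        }
    }

    public static void main(String[] args) {
        Token id = new Token(TokenType.ID, "position");
        Token op = new Token(TokenType.ASSIGN_OP, null); // no attribute needed
        Token num = new Token(TokenType.NUM, 60);
        System.out.println(id + " " + op + " " + num);
        // prints: <ID, position> <ASSIGN_OP> <NUM, 60>
    }
}
```

In a full compiler the identifier's attribute would be the symbol-table pointer rather than the raw lexeme, but the pair structure is the same.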


4. Regular Expressions

Regular expressions (regex) are used to define token patterns.

Basic Operations

The three basic operations of regular expressions defined over an alphabet Sigma are:

1. Union:          r | s
   - Strings matching r or s

2. Concatenation:  rs
   - A string matching r followed by a string matching s

3. Closure:        r*
   - Zero or more repetitions of strings matching r

Basic Regular Expressions

Expression    Description                      Match Examples
----------------------------------------------------------------------------
a             The character 'a' itself         "a"
epsilon       Empty string                     ""
r | s         r or s                           If r is "a" and s is "b", then "a" or "b"
rs            r followed by s                  "ab"
r*            Zero or more repetitions of r    "", "a", "aa", "aaa", ...

Extended Notation

Abbreviations for convenience:

r+      = r r*            (one or more repetitions)
r?      = r | epsilon     (zero or one)
[a-z]   = a | b | ... | z (character class)
[^a-z]  = characters except a-z (negated character class)
.       = any single character

Regular Expression Examples

// Integer literal
digit  = [0-9]
digits = digit+
integer = digits

// Floating-point literal (the dot is the literal character '.',
// written quoted to distinguish it from the "any character" operator)
fraction = '.' digits | epsilon
exponent = (E | e)(+ | - | epsilon) digits | epsilon
float_num = digits fraction exponent

// Identifier
letter = [a-zA-Z_]
identifier = letter (letter | digit)*

// String literal (characters except double quotes inside double quotes)
string_lit = " [^"]* "
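The patterns above can be checked with java.util.regex, whose syntax differs slightly: `?` plays the role of `| epsilon`, and the literal dot must be escaped.

```java
// Quick sanity check of the token patterns using java.util.regex.
import java.util.regex.Pattern;

public class PatternDemo {
    static final Pattern IDENTIFIER =
        Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*");
    static final Pattern FLOAT_NUM =
        Pattern.compile("[0-9]+(\\.[0-9]+)?([Ee][+-]?[0-9]+)?");
    static final Pattern STRING_LIT =
        Pattern.compile("\"[^\"]*\"");

    public static void main(String[] args) {
        System.out.println(IDENTIFIER.matcher("_count1").matches());  // true
        System.out.println(IDENTIFIER.matcher("1abc").matches());     // false
        System.out.println(FLOAT_NUM.matcher("3.14e-2").matches());   // true
        System.out.println(STRING_LIT.matcher("\"Total = %d\\n\"").matches()); // true
    }
}
```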

5. Regular Definitions

To express complex regular expressions concisely, we assign names to regular expressions and reuse them.

// Regular definition for C identifiers
letter_  -> [a-zA-Z_]
digit    -> [0-9]
id       -> letter_ (letter_ | digit)*

// Regular definition for C integer constants
digit      -> [0-9]
nonzero    -> [1-9]
decimal    -> nonzero digit*
octal_dig  -> [0-7]
octal      -> 0 octal_dig*
hex_dig    -> [0-9a-fA-F]
hex        -> 0 (x|X) hex_dig+
integer    -> decimal | octal | hex

Note: Circular references are not allowed in regular definitions. Each name can only reference previously defined names.
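Regular definitions can be mimicked in ordinary code by naming each sub-pattern as a string and composing them in order, which makes the no-cycles rule concrete. A sketch:

```java
// Composing the C integer-constant definition from named sub-patterns.
import java.util.regex.Pattern;

public class RegularDefinitionDemo {
    public static void main(String[] args) {
        String digit    = "[0-9]";
        String nonzero  = "[1-9]";
        String decimal  = nonzero + digit + "*";
        String octalDig = "[0-7]";
        String octal    = "0" + octalDig + "*";
        String hexDig   = "[0-9a-fA-F]";
        String hex      = "0[xX]" + hexDig + "+";
        // Each name references only previously defined names -- no cycles.
        Pattern integer = Pattern.compile(decimal + "|" + octal + "|" + hex);

        for (String s : new String[]{"42", "0755", "0xFF", "9x"}) {
            System.out.println(s + " -> " + integer.matcher(s).matches());
        }
        // 42 -> true, 0755 -> true, 0xFF -> true, 9x -> false
    }
}
```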


6. Input Buffering

The lexical analyzer must read source code one character at a time, and buffering techniques are used to reduce I/O costs.

Double Buffering

           Buffer 1                           Buffer 2
 +---+---+---+---+---+---+---+---+  +---+---+---+---+---+---+---+---+
 | E | = | M | * | C | * | * |eof|  | 2 |eof|   |   |   |   |   |   |
 +---+---+---+---+---+---+---+---+  +---+---+---+---+---+---+---+---+
   ^               ^
   |               |
 lexemeBegin     forward

  • Two buffers are used alternately so that reading never stalls at a buffer boundary.
  • lexemeBegin: The start position of the current lexeme
  • forward: The position of the character currently being examined
  • A sentinel character eof is placed at the end of each buffer to simplify boundary checking.

Forward Pointer Movement with Sentinels

// Without sentinel (2 checks each time)
forward++;
if (forward >= bufferEnd) {
    reloadBuffer();          // Load next buffer
    forward = bufferStart;
}
if (*forward == EOF) {
    terminate();             // Handle end of input
}

// With sentinel (usually 1 check)
forward++;
if (*forward == EOF) {
    if (forward == bufferEnd) {
        reloadBuffer();  // Sentinel at buffer end
        forward = bufferStart;
    } else {
        terminate();     // Actual end of input
    }
}
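The sentinel idea can be sketched in Java as well. This is a simplified single-class illustration, assuming a '\0' sentinel in place of a real EOF byte and a made-up class name; it is not the buffering code of any particular compiler.

```java
// A minimal sentinel-buffer sketch: each refill is terminated by a '\0'
// sentinel, so the common-case advance performs only one comparison.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class SentinelBuffer {
    private final Reader source;
    private final char[] buffer;   // last slot reserved for the sentinel
    private int forward;
    private boolean atEnd = false;

    SentinelBuffer(Reader source, int size) throws IOException {
        this.source = source;
        this.buffer = new char[size + 1];
        reload();
    }

    private void reload() throws IOException {
        int n = source.read(buffer, 0, buffer.length - 1);
        if (n < 0) { n = 0; atEnd = true; }  // input exhausted
        buffer[n] = '\0';                    // sentinel ends the filled region
        forward = 0;
    }

    /** Returns the next character, or '\0' at the true end of input. */
    char next() throws IOException {
        char c = buffer[forward++];
        if (c != '\0') return c;  // the usual case: a single check
        if (atEnd) return '\0';   // sentinel means real end of input
        reload();                 // sentinel means end of buffer: refill
        return next();
    }

    public static void main(String[] args) throws IOException {
        SentinelBuffer buf = new SentinelBuffer(new StringReader("E=M*C**2"), 4);
        StringBuilder out = new StringBuilder();
        for (char c = buf.next(); c != '\0'; c = buf.next()) out.append(c);
        System.out.println(out);  // E=M*C**2
    }
}
```

A real lexer would keep two fixed halves and also track lexemeBegin, but the one-check hot path is the same.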

7. Token Recognition Examples

Let us examine the token recognition process for a simple language.

Relational Operator Recognition

relop -> <  |  <=  |  <>  |  >  |  >=  |  =

The recognition process is represented as a Transition Diagram:

start --'<'--> s1 --'='--> [return LE]
              |
              +--'>'--> [return NE]
              |
              +--other--> [return LT] (retract)

start --'='--> [return EQ]

start --'>'--> s2 --'='--> [return GE]
              |
              +--other--> [return GT] (retract)

retract means moving the forward pointer back one position. For example, if a character that is neither = nor > follows <, that character must be treated as part of the next token.
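The diagram translates almost line for line into code: inspect the character after `<` or `>`, and "retract" simply means not consuming it. The names below are illustrative.

```java
// The relop transition diagram written directly as a lookahead switch.
public class RelopDemo {
    enum Relop { LT, LE, NE, EQ, GT, GE }

    /** Recognizes a relational operator starting at pos; null if none. */
    static Relop relopAt(String s, int pos) {
        char c = s.charAt(pos);
        char next = pos + 1 < s.length() ? s.charAt(pos + 1) : '\0';
        switch (c) {
            case '<':
                if (next == '=') return Relop.LE;
                if (next == '>') return Relop.NE;
                return Relop.LT;  // retract: 'next' belongs to the next token
            case '=':
                return Relop.EQ;
            case '>':
                if (next == '=') return Relop.GE;
                return Relop.GT;  // retract
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(relopAt("a <= b", 2));  // LE
        System.out.println(relopAt("a < b", 2));   // LT
        System.out.println(relopAt("a <> b", 2));  // NE
    }
}
```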

Identifier and Keyword Recognition

start --letter--> s1 --letter|digit--> s1
                  |
                  +--other--> [check keyword table then return]

After recognizing an identifier, the keyword table is searched to check if it is a keyword. Reserved words like if, else, and while have the same form as identifiers and require separate handling.

// Keyword table initialization
Map<String, TokenType> keywords = new HashMap<>();
keywords.put("if", TokenType.IF);
keywords.put("else", TokenType.ELSE);
keywords.put("while", TokenType.WHILE);
keywords.put("return", TokenType.RETURN);

// After recognizing an identifier
String lexeme = getLexeme();
if (keywords.containsKey(lexeme)) {
    return new Token(keywords.get(lexeme), lexeme);
} else {
    return new Token(TokenType.ID, lexeme);
}

8. Implementing a Simple Lexical Analyzer

Let us implement a simple lexical analyzer in Java.

public class SimpleLexer {
    private String input;
    private int pos;

    public SimpleLexer(String input) {
        this.input = input;
        this.pos = 0;
    }

    public Token nextToken() {
        skipWhitespace();

        if (pos >= input.length()) {
            return new Token(TokenType.EOF, "");
        }

        char ch = input.charAt(pos);

        // Digit: integer literal recognition
        if (Character.isDigit(ch)) {
            return readNumber();
        }

        // Letter: identifier or keyword recognition
        if (Character.isLetter(ch) || ch == '_') {
            return readIdentifier();
        }

        // Operators and delimiters
        switch (ch) {
            case '+': pos++; return new Token(TokenType.PLUS, "+");
            case '-': pos++; return new Token(TokenType.MINUS, "-");
            case '*': pos++; return new Token(TokenType.STAR, "*");
            case '/': pos++; return new Token(TokenType.SLASH, "/");
            case '=':
                if (peek() == '=') {
                    pos += 2;
                    return new Token(TokenType.EQ, "==");
                }
                pos++;
                return new Token(TokenType.ASSIGN, "=");
            case '(':  pos++; return new Token(TokenType.LPAREN, "(");
            case ')':  pos++; return new Token(TokenType.RPAREN, ")");
            case ';':  pos++; return new Token(TokenType.SEMI, ";");
            default:
                throw new RuntimeException(
                    "Unexpected character: " + ch + " at position " + pos
                );
        }
    }

    private Token readNumber() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        return new Token(TokenType.NUM, input.substring(start, pos));
    }

    private Token readIdentifier() {
        int start = pos;
        while (pos < input.length() &&
               (Character.isLetterOrDigit(input.charAt(pos)) ||
                input.charAt(pos) == '_')) {
            pos++;
        }
        String lexeme = input.substring(start, pos);
        TokenType type = Keywords.lookup(lexeme);
        return new Token(type != null ? type : TokenType.ID, lexeme);
    }

    private void skipWhitespace() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
    }

    private char peek() {
        return (pos + 1 < input.length()) ? input.charAt(pos + 1) : '\0';
    }
}

Execution Example

Input: "x = 10 + y;"

Token stream:
  <ID, "x">
  <ASSIGN, "=">
  <NUM, "10">
  <PLUS, "+">
  <ID, "y">
  <SEMI, ";">
  <EOF, "">

9. Error Handling

Error handling strategies when the lexical analyzer encounters a character that does not match any pattern:

  1. Panic Mode: Skips characters until a matching pattern is found.
  2. Character Deletion: Deletes the incorrect character and continues.
  3. Character Insertion: Inserts a missing character (e.g., matching quote pairs).
  4. Character Replacement: Replaces an incorrect character with a correct one.

In practice, compilers continue analysis after recovery to report as many errors as possible at once.
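Panic mode can be sketched as follows: on a character that starts no token, skip forward to the next plausible token start and record a diagnostic instead of aborting. The token classification here is deliberately crude and the names are illustrative.

```java
// A panic-mode recovery sketch: collect errors and keep scanning.
import java.util.ArrayList;
import java.util.List;

public class PanicModeDemo {
    static boolean canStartToken(char c) {
        return Character.isLetterOrDigit(c) || c == '_'
            || "+-*/=();".indexOf(c) >= 0;
    }

    /** Splits input into crude tokens, collecting errors instead of throwing. */
    static List<String> scan(String input, List<String> errors) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (Character.isWhitespace(c)) { pos++; continue; }
            if (!canStartToken(c)) {
                int bad = pos;  // panic mode: skip to a valid token start
                while (pos < input.length()
                       && !canStartToken(input.charAt(pos))
                       && !Character.isWhitespace(input.charAt(pos))) pos++;
                errors.add("skipped '" + input.substring(bad, pos)
                           + "' at position " + bad);
                continue;
            }
            int start = pos;
            if (Character.isLetterOrDigit(c) || c == '_') {
                while (pos < input.length()
                       && (Character.isLetterOrDigit(input.charAt(pos))
                           || input.charAt(pos) == '_')) pos++;
            } else {
                pos++;  // single-character operator/delimiter
            }
            tokens.add(input.substring(start, pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        System.out.println(scan("x = 10 @# + y;", errors));
        System.out.println(errors);
        // [x, =, 10, +, y, ;]
        // [skipped '@#' at position 7]
    }
}
```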


Summary

Concept              Key Content
---------------------------------------------------------------------------
Token                Abstract name for a lexical unit (id, num, relop, etc.)
Pattern              Form of strings that constitute a token (regex)
Lexeme               Actual string that matches a pattern
Regular Expression   Three basic operations: union, concatenation, closure
Regular Definition   Naming regular expressions for reuse
Input Buffering      Efficient I/O with double buffers and sentinels
Transition Diagram   Token recognition process expressed as state transitions

Lexical analysis is the most fundamental phase of compilation, yet because it touches every character of the source it accounts for a significant portion of total compilation time, so efficient implementation matters. In the next article, we will cover finite automata, the machinery used to automatically generate lexical analyzers from regular expressions.