[Compiler] 05. Finite Automata and Lexical Analyzer Implementation

Finite Automata and Lexical Analyzer Implementation

Regular expressions are tools for specifying token patterns, and finite automata are execution models for recognizing them. This article covers NFA, DFA, and the conversions between them.

1. Finite Automata Overview

Finite automata are mathematical models for recognizing strings. There are two types:

NFA (Nondeterministic Finite Automaton)
DFA (Deterministic Finite Automaton)

Both recognize regular languages and have the same expressive power. However, their implementation characteristics differ.

Regular Expression <=> NFA <=> DFA

- NFA: Simple to construct but complex to execute
- DFA: Complex to construct but fast to execute

2. NFA (Nondeterministic Finite Automaton)

An NFA is defined as a 5-tuple: (S, Sigma, delta, s0, F)

S: A finite set of states
Sigma: Input alphabet
delta: Transition function (returns a set of states for a given state and input)
s0: Start state
F: Set of accepting states

Key Characteristics of NFA

Multiple transitions are possible for a single input.
Epsilon transitions: States can transition without consuming input.

NFA Example: `(a|b)*abb`

         a
    +--------+
    |        |
    v    a   |    b         b
--> 0 -----> 1 -----> 2 -----> ((3))
    |
    +--b--+
    |     |
    +<----+

Transition Table

State |  a      |  b
-----+--------+--------
  0  | {0, 1} | {0}
  1  | -      | {2}
  2  | -      | {3}
  3  | -      | -

From state 0, reading input a can go to either state 0 or state 1. This is what nondeterministic means.

String Acceptance by NFA

An NFA accepts an input string if at least one path exists from the start state to an accepting state.

Input: "aabb"

Possible paths:
  0 -a-> 0 -a-> 0 -b-> 0 -b-> 0  (not accepting state)
  0 -a-> 0 -a-> 1 -b-> 2 -b-> 3  (accepting state!) -> Accept
  0 -a-> 1 (no transition for next a) -> Fail

3. DFA (Deterministic Finite Automaton)

A DFA is a special case of NFA with the following constraints:

For each state and each input symbol, there is exactly one transition.
No epsilon transitions.

DFA Example: `(a|b)*abb`

         b
    +--------+
    |        |
    v    a   |    b         b
--> 0 -----> 1 -----> 2 -----> ((3))
    ^        |        |
    |   b    |   a    |   a
    +--------+        |
    |                 |
    +-----------------+

Transition Table

State |  a  |  b
-----+-----+-----
  0  |  1  |  0
  1  |  1  |  2
  2  |  1  |  3
  3  |  1  |  0

In a DFA, the transition for each state and each input is unique. This makes execution simple and fast.

DFA Simulation Algorithm

public boolean simulate(DFA dfa, String input) {
    int state = dfa.startState;

    for (char ch : input.toCharArray()) {
        state = dfa.transition[state][ch];
        if (state == DFA.ERROR) {
            return false;  // Error state
        }
    }

    return dfa.isAccepting(state);
}

DFA simulation takes O(n) time, proportional to input length.

4. Regular Expression to NFA: Thompson Construction

Thompson's Construction is a systematic method for converting a regular expression into an NFA.

Basic Building Blocks

1. Single character a:
   (i) --a--> ((f))

2. Epsilon:
   (i) --epsilon--> ((f))

Union: r | s

              epsilon     NFA(r)     epsilon
          +----------> (i_r) ...> (f_r) ----------+
          |                                        |
   (i) ---+                                        +---> ((f))
          |                                        |
          +----------> (i_s) ...> (f_s) ----------+
              epsilon     NFA(s)     epsilon

Concatenation: rs

   (i_r) --> NFA(r) --> (f_r/i_s) --> NFA(s) --> ((f_s))

The accepting state of r and the start state of s are merged.

Closure: r*

                    epsilon
              +-------------------+
              |                   |
   (i) --epsilon--> (i_r) --> NFA(r) --> (f_r) --epsilon--> ((f))
     |                                                        ^
     +-------------------epsilon------------------------------+

Thompson Construction Example: `(a|b)*abb`

Step-by-step construction:

Step 1: NFA for a
   (2) --a--> (3)

Step 2: NFA for b
   (4) --b--> (5)

Step 3: NFA for a|b
   (1) --eps--> (2) --a--> (3) --eps--> (6)
   (1) --eps--> (4) --b--> (5) --eps--> (6)

Step 4: NFA for (a|b)*
   (0) --eps--> (1) --...--> (6) --eps--> (7)
   (0) --eps--> (7)
   (6) --eps--> (1)

Step 5: Complete (a|b)*abb
   ... --eps--> (7) --a--> (8) --b--> (9) --b--> ((10))

The characteristics of Thompson construction are:

The number of NFA states is at most twice the length of the regular expression.
Each state has at most 2 outgoing transitions.
There is exactly one start state and exactly one accepting state.

5. NFA to DFA: Subset Construction

Subset Construction is an algorithm that converts an NFA to an equivalent DFA.

Core Idea

Each DFA state is a set of NFA states. All states the NFA can be in simultaneously are represented as a single DFA state.

Required Operations

epsilon-closure(s): Set of all states reachable from state s via epsilon transitions only
epsilon-closure(T): Union of epsilon-closure for each state in state set T
move(T, a): Set of states reachable from state set T via transitions on input a

Algorithm

Add epsilon-closure(s0) to Dstates and mark as "unprocessed"

while (unprocessed state T exists in Dstates) do
    Mark T as "processed"
    for (each input symbol a) do
        U = epsilon-closure(move(T, a))
        if (U is not in Dstates)
            Add U to Dstates and mark as "unprocessed"
        Dtran[T, a] = U
    end for
end while

Subset Construction Example

NFA states: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
NFA start: 0, NFA accept: 10

DFA state A = epsilon-closure(0) = {0, 1, 2, 4, 7}
  move(A, a) = {3, 8}
  epsilon-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
  move(A, b) = {5}
  epsilon-closure({5}) = {1, 2, 4, 5, 6, 7} = C

DFA state B = {1, 2, 3, 4, 6, 7, 8}
  move(B, a) = {3, 8}  -> B
  move(B, b) = {5, 9}
  epsilon-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D

DFA state C = {1, 2, 4, 5, 6, 7}
  move(C, a) = {3, 8}  -> B
  move(C, b) = {5}     -> C

DFA state D = {1, 2, 4, 5, 6, 7, 9}
  move(D, a) = {3, 8}  -> B
  move(D, b) = {5, 10}
  epsilon-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E

DFA state E = {1, 2, 4, 5, 6, 7, 10}  (contains 10 -> accepting state)
  move(E, a) = {3, 8}  -> B
  move(E, b) = {5}     -> C

Resulting DFA transition table:

State |  a  |  b  | Accept?
-----+-----+-----+--------
  A  |  B  |  C  |  X
  B  |  B  |  D  |  X
  C  |  B  |  C  |  X
  D  |  B  |  E  |  X
  E  |  B  |  C  |  O

6. DFA Minimization

The DFA produced by subset construction may have unnecessary states. DFA minimization finds the equivalent DFA with the minimum number of states.

Hopcroft Algorithm

1. Initial partition: {accepting states}, {non-accepting states}

2. Repeat: examine each group, and within the same group,
   separate states that transition to different groups for some input

3. Terminate when no more splits are possible

Minimization Example

Initial partition:
  G1 = {A, B, C, D}  (non-accepting)
  G2 = {E}           (accepting)

Checking input 'b':
  A -> C (G1), B -> D (G1), C -> C (G1), D -> E (G2)
  D goes to a different group, so split!

New partition:
  G1 = {A, B, C}
  G3 = {D}
  G2 = {E}

Checking input 'b':
  A -> C (G1), B -> D (G3), C -> C (G1)
  Split B!

Final partition:
  {A, C}, {B}, {D}, {E}

7. Lexical Analyzer Generation Using the Lex Tool

Lex (or Flex) is a tool that automatically generates a lexical analyzer when token patterns are described using regular expressions.

Lex Program Structure

Declarations
%%
Translation Rules
%%
Auxiliary Functions

Lex Program Example

%{
#include <stdio.h>
#include "y.tab.h"
%}

digit    [0-9]
letter   [a-zA-Z]

%%

if          { return IF; }
else        { return ELSE; }
while       { return WHILE; }

{letter}({letter}|{digit})*  { return ID; }
{digit}+                      { return NUM; }

"+"         { return PLUS; }
"-"         { return MINUS; }
"*"         { return TIMES; }
"/"         { return DIVIDE; }
"="         { return ASSIGN; }
"=="        { return EQ; }
"<"         { return LT; }
"<="        { return LE; }

"("         { return LPAREN; }
")"         { return RPAREN; }
";"         { return SEMI; }

[ \t\n]+    { /* ignore whitespace */ }
.           { printf("Unexpected: %s\n", yytext); }

%%

How Lex Works

Lex Specification (.l file)
       |
       v
  +----------+
  |   Lex    |  (Lex Compiler)
  +----------+
       |
       v
  lex.yy.c (C Source Code)
       |
       v
  +----------+
  | C Compiler|
  +----------+
       |
       v
  Lexical Analyzer (Executable)

Lex Rule Conflict Resolution

Rules when multiple patterns match simultaneously:

Longest Match: Selects the pattern with the longest match.
- If the input is ifx, it is recognized as the identifier ifx, not the keyword if.
Rule Priority: If lengths are equal, the rule listed first is selected.
- if is recognized as a keyword because the keyword rule appears before the identifier pattern.

8. Direct Conversion from Regular Expression to DFA

There is also a method to construct a DFA directly from a regular expression, bypassing Thompson construction and subset construction.

Key Functions

For node n of the regular expression's syntax tree, the following are computed:

nullable(n): Whether node n can generate the empty string
firstpos(n): Set of positions that can be the first position of strings generated by n
lastpos(n): Set of positions that can be the last position of strings generated by n
followpos(p): Set of positions that can follow position p

nullable rules:
  epsilon node     -> true
  position i node  -> false
  cat(c1, c2) node -> nullable(c1) AND nullable(c2)
  or(c1, c2) node  -> nullable(c1) OR nullable(c2)
  star(c1) node    -> true

This method bypasses the intermediate NFA step, yielding an efficient DFA with fewer states directly.

Summary

Concept	Key Content
NFA	Nondeterministic, epsilon transitions, simple to construct
DFA	Deterministic, one transition per input, fast execution (O(n))
Thompson Construction	Systematic conversion of regex to NFA
Subset Construction	Conversion of NFA to DFA (possible state explosion)
DFA Minimization	Constructing the equivalent minimum-state DFA
Lex	Automatic lexical analyzer generation from regex
Longest Match	Selects the longest pattern when multiple match

Finite automata are both the theoretical foundation and the core implementation of lexical analysis. In the next article, we will move to the syntax analysis phase and explore Top-Down parsing and LL parsers.

Finite Automata and Lexical Analyzer Implementation

1. Finite Automata Overview

2. NFA (Nondeterministic Finite Automaton)

Key Characteristics of NFA

NFA Example: (a|b)*abb

Transition Table

String Acceptance by NFA

3. DFA (Deterministic Finite Automaton)

DFA Example: (a|b)*abb

Transition Table

DFA Simulation Algorithm

4. Regular Expression to NFA: Thompson Construction

Basic Building Blocks

Union: r | s

Concatenation: rs

Closure: r*

Thompson Construction Example: (a|b)*abb

5. NFA to DFA: Subset Construction

Core Idea

Required Operations

Algorithm

Subset Construction Example

6. DFA Minimization

Hopcroft Algorithm

Minimization Example

7. Lexical Analyzer Generation Using the Lex Tool

Lex Program Structure

Lex Program Example

How Lex Works

Lex Rule Conflict Resolution

8. Direct Conversion from Regular Expression to DFA

Key Functions

Summary

NFA Example: `(a|b)*abb`

DFA Example: `(a|b)*abb`

Thompson Construction Example: `(a|b)*abb`