Skip to content
Published on

[Compiler] 05. Finite Automata and Lexical Analyzer Implementation

Authors

Finite Automata and Lexical Analyzer Implementation

Regular expressions are tools for specifying token patterns, and finite automata are execution models for recognizing them. This article covers NFA, DFA, and the conversions between them.


1. Finite Automata Overview

Finite automata are mathematical models for recognizing strings. There are two types:

  • NFA (Nondeterministic Finite Automaton)
  • DFA (Deterministic Finite Automaton)

Both recognize regular languages and have the same expressive power. However, their implementation characteristics differ.

Regular Expression <=> NFA <=> DFA

- NFA: Simple to construct but complex to execute
- DFA: Complex to construct but fast to execute

2. NFA (Nondeterministic Finite Automaton)

An NFA is defined as a 5-tuple: (S, Sigma, delta, s0, F)

  • S: A finite set of states
  • Sigma: Input alphabet
  • delta: Transition function (returns a set of states for a given state and input)
  • s0: Start state
  • F: Set of accepting states

Key Characteristics of NFA

  1. Multiple transitions are possible for a single input.
  2. Epsilon transitions: States can transition without consuming input.

NFA Example: (a|b)*abb

         a
    +--------+
    |        |
    v    a   |    b         b
--> 0 -----> 1 -----> 2 -----> ((3))
    |
    +--b--+
    |     |
    +<----+

Transition Table

State |  a      |  b
-----+--------+--------
  0  | {0, 1} | {0}
  1  | -      | {2}
  2  | -      | {3}
  3  | -      | -

From state 0, reading input a can go to either state 0 or state 1. This is what nondeterministic means.

String Acceptance by NFA

An NFA accepts an input string if at least one path exists from the start state to an accepting state.

Input: "aabb"

Possible paths:
  0 -a-> 0 -a-> 0 -b-> 0 -b-> 0  (not accepting state)
  0 -a-> 0 -a-> 1 -b-> 2 -b-> 3  (accepting state!) -> Accept
  0 -a-> 1 (no transition for next a) -> Fail

3. DFA (Deterministic Finite Automaton)

A DFA is a special case of NFA with the following constraints:

  1. For each state and each input symbol, there is exactly one transition.
  2. No epsilon transitions.

DFA Example: (a|b)*abb

         b
    +--------+
    |        |
    v    a   |    b         b
--> 0 -----> 1 -----> 2 -----> ((3))
    ^        |        |
    |   b    |   a    |   a
    +--------+        |
    |                 |
    +-----------------+

Transition Table

State |  a  |  b
-----+-----+-----
  0  |  1  |  0
  1  |  1  |  2
  2  |  1  |  3
  3  |  1  |  0

In a DFA, the transition for each state and each input is unique. This makes execution simple and fast.

DFA Simulation Algorithm

public boolean simulate(DFA dfa, String input) {
    int state = dfa.startState;

    for (char ch : input.toCharArray()) {
        state = dfa.transition[state][ch];
        if (state == DFA.ERROR) {
            return false;  // Error state
        }
    }

    return dfa.isAccepting(state);
}

DFA simulation takes O(n) time, proportional to input length.


4. Regular Expression to NFA: Thompson Construction

Thompson's Construction is a systematic method for converting a regular expression into an NFA.

Basic Building Blocks

1. Single character a:
   (i) --a--> ((f))

2. Epsilon:
   (i) --epsilon--> ((f))

Union: r | s

              epsilon     NFA(r)     epsilon
          +----------> (i_r) ...> (f_r) ----------+
          |                                        |
   (i) ---+                                        +---> ((f))
          |                                        |
          +----------> (i_s) ...> (f_s) ----------+
              epsilon     NFA(s)     epsilon

Concatenation: rs

   (i_r) --> NFA(r) --> (f_r/i_s) --> NFA(s) --> ((f_s))

The accepting state of r and the start state of s are merged.

Closure: r*

                    epsilon
              +-------------------+
              |                   |
   (i) --epsilon--> (i_r) --> NFA(r) --> (f_r) --epsilon--> ((f))
     |                                                        ^
     +-------------------epsilon------------------------------+

Thompson Construction Example: (a|b)*abb

Step-by-step construction:

Step 1: NFA for a
   (2) --a--> (3)

Step 2: NFA for b
   (4) --b--> (5)

Step 3: NFA for a|b
   (1) --eps--> (2) --a--> (3) --eps--> (6)
   (1) --eps--> (4) --b--> (5) --eps--> (6)

Step 4: NFA for (a|b)*
   (0) --eps--> (1) --...--> (6) --eps--> (7)
   (0) --eps--> (7)
   (6) --eps--> (1)

Step 5: Complete (a|b)*abb
   ... --eps--> (7) --a--> (8) --b--> (9) --b--> ((10))

The characteristics of Thompson construction are:

  • The number of NFA states is at most twice the length of the regular expression.
  • Each state has at most 2 outgoing transitions.
  • There is exactly one start state and exactly one accepting state.

5. NFA to DFA: Subset Construction

Subset Construction is an algorithm that converts an NFA to an equivalent DFA.

Core Idea

Each DFA state is a set of NFA states. All states the NFA can be in simultaneously are represented as a single DFA state.

Required Operations

epsilon-closure(s): Set of all states reachable from state s via epsilon transitions only
epsilon-closure(T): Union of epsilon-closure for each state in state set T
move(T, a): Set of states reachable from state set T via transitions on input a

Algorithm

Add epsilon-closure(s0) to Dstates and mark as "unprocessed"

while (unprocessed state T exists in Dstates) do
    Mark T as "processed"
    for (each input symbol a) do
        U = epsilon-closure(move(T, a))
        if (U is not in Dstates)
            Add U to Dstates and mark as "unprocessed"
        Dtran[T, a] = U
    end for
end while

Subset Construction Example

NFA states: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
NFA start: 0, NFA accept: 10

DFA state A = epsilon-closure(0) = {0, 1, 2, 4, 7}
  move(A, a) = {3, 8}
  epsilon-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
  move(A, b) = {5}
  epsilon-closure({5}) = {1, 2, 4, 5, 6, 7} = C

DFA state B = {1, 2, 3, 4, 6, 7, 8}
  move(B, a) = {3, 8}  -> B
  move(B, b) = {5, 9}
  epsilon-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D

DFA state C = {1, 2, 4, 5, 6, 7}
  move(C, a) = {3, 8}  -> B
  move(C, b) = {5}     -> C

DFA state D = {1, 2, 4, 5, 6, 7, 9}
  move(D, a) = {3, 8}  -> B
  move(D, b) = {5, 10}
  epsilon-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E

DFA state E = {1, 2, 4, 5, 6, 7, 10}  (contains 10 -> accepting state)
  move(E, a) = {3, 8}  -> B
  move(E, b) = {5}     -> C

Resulting DFA transition table:

State |  a  |  b  | Accept?
-----+-----+-----+--------
  A  |  B  |  C  |  X
  B  |  B  |  D  |  X
  C  |  B  |  C  |  X
  D  |  B  |  E  |  X
  E  |  B  |  C  |  O

6. DFA Minimization

The DFA produced by subset construction may have unnecessary states. DFA minimization finds the equivalent DFA with the minimum number of states.

Hopcroft Algorithm

1. Initial partition: {accepting states}, {non-accepting states}

2. Repeat: examine each group, and within the same group,
   separate states that transition to different groups for some input

3. Terminate when no more splits are possible

Minimization Example

Initial partition:
  G1 = {A, B, C, D}  (non-accepting)
  G2 = {E}           (accepting)

Checking input 'b':
  A -> C (G1), B -> D (G1), C -> C (G1), D -> E (G2)
  D goes to a different group, so split!

New partition:
  G1 = {A, B, C}
  G3 = {D}
  G2 = {E}

Checking input 'b':
  A -> C (G1), B -> D (G3), C -> C (G1)
  Split B!

Final partition:
  {A, C}, {B}, {D}, {E}

7. Lexical Analyzer Generation Using the Lex Tool

Lex (or Flex) is a tool that automatically generates a lexical analyzer when token patterns are described using regular expressions.

Lex Program Structure

Declarations
%%
Translation Rules
%%
Auxiliary Functions

Lex Program Example

%{
#include <stdio.h>
#include "y.tab.h"
%}

digit    [0-9]
letter   [a-zA-Z]

%%

if          { return IF; }
else        { return ELSE; }
while       { return WHILE; }

{letter}({letter}|{digit})*  { return ID; }
{digit}+                      { return NUM; }

"+"         { return PLUS; }
"-"         { return MINUS; }
"*"         { return TIMES; }
"/"         { return DIVIDE; }
"="         { return ASSIGN; }
"=="        { return EQ; }
"<"         { return LT; }
"<="        { return LE; }

"("         { return LPAREN; }
")"         { return RPAREN; }
";"         { return SEMI; }

[ \t\n]+    { /* ignore whitespace */ }
.           { printf("Unexpected: %s\n", yytext); }

%%

How Lex Works

Lex Specification (.l file)
       |
       v
  +----------+
  |   Lex    |  (Lex Compiler)
  +----------+
       |
       v
  lex.yy.c (C Source Code)
       |
       v
  +----------+
  | C Compiler|
  +----------+
       |
       v
  Lexical Analyzer (Executable)

Lex Rule Conflict Resolution

Rules when multiple patterns match simultaneously:

  1. Longest Match: Selects the pattern with the longest match.
    • If the input is ifx, it is recognized as the identifier ifx, not the keyword if.
  2. Rule Priority: If lengths are equal, the rule listed first is selected.
    • if is recognized as a keyword because the keyword rule appears before the identifier pattern.

8. Direct Conversion from Regular Expression to DFA

There is also a method to construct a DFA directly from a regular expression, bypassing Thompson construction and subset construction.

Key Functions

For node n of the regular expression's syntax tree, the following are computed:

  • nullable(n): Whether node n can generate the empty string
  • firstpos(n): Set of positions that can be the first position of strings generated by n
  • lastpos(n): Set of positions that can be the last position of strings generated by n
  • followpos(p): Set of positions that can follow position p
nullable rules:
  epsilon node     -> true
  position i node  -> false
  cat(c1, c2) node -> nullable(c1) AND nullable(c2)
  or(c1, c2) node  -> nullable(c1) OR nullable(c2)
  star(c1) node    -> true

This method bypasses the intermediate NFA step, yielding an efficient DFA with fewer states directly.


Summary

ConceptKey Content
NFANondeterministic, epsilon transitions, simple to construct
DFADeterministic, one transition per input, fast execution (O(n))
Thompson ConstructionSystematic conversion of regex to NFA
Subset ConstructionConversion of NFA to DFA (possible state explosion)
DFA MinimizationConstructing the equivalent minimum-state DFA
LexAutomatic lexical analyzer generation from regex
Longest MatchSelects the longest pattern when multiple match

Finite automata are both the theoretical foundation and the core implementation of lexical analysis. In the next article, we will move to the syntax analysis phase and explore Top-Down parsing and LL parsers.