- Authors

- Name
- Youngju Kim
- @fjvbn20031
Finite Automata and Lexical Analyzer Implementation
Regular expressions are tools for specifying token patterns, and finite automata are execution models for recognizing them. This article covers NFA, DFA, and the conversions between them.
1. Finite Automata Overview
Finite automata are mathematical models for recognizing strings. There are two types:
- NFA (Nondeterministic Finite Automaton)
- DFA (Deterministic Finite Automaton)
Both recognize regular languages and have the same expressive power. However, their implementation characteristics differ.
Regular Expression <=> NFA <=> DFA
- NFA: Simple to construct but complex to execute
- DFA: Complex to construct but fast to execute
2. NFA (Nondeterministic Finite Automaton)
An NFA is defined as a 5-tuple: (S, Sigma, delta, s0, F)
- S: A finite set of states
- Sigma: Input alphabet
- delta: Transition function (returns a set of states for a given state and input)
- s0: Start state
- F: Set of accepting states
Key Characteristics of NFA
- Multiple transitions are possible for a single input.
- Epsilon transitions: States can transition without consuming input.
NFA Example: (a|b)*abb
a
+--------+
| |
v a | b b
--> 0 -----> 1 -----> 2 -----> ((3))
|
+--b--+
| |
+<----+
Transition Table
State | a | b
-----+--------+--------
0 | {0, 1} | {0}
1 | - | {2}
2 | - | {3}
3 | - | -
From state 0, reading input a can go to either state 0 or state 1. This is what nondeterministic means.
String Acceptance by NFA
An NFA accepts an input string if at least one path exists from the start state to an accepting state.
Input: "aabb"
Possible paths:
0 -a-> 0 -a-> 0 -b-> 0 -b-> 0 (not accepting state)
0 -a-> 0 -a-> 1 -b-> 2 -b-> 3 (accepting state!) -> Accept
0 -a-> 1 (no transition for next a) -> Fail
3. DFA (Deterministic Finite Automaton)
A DFA is a special case of NFA with the following constraints:
- For each state and each input symbol, there is exactly one transition.
- No epsilon transitions.
DFA Example: (a|b)*abb
b
+--------+
| |
v a | b b
--> 0 -----> 1 -----> 2 -----> ((3))
^ | |
| b | a | a
+--------+ |
| |
+-----------------+
Transition Table
State | a | b
-----+-----+-----
0 | 1 | 0
1 | 1 | 2
2 | 1 | 3
3 | 1 | 0
In a DFA, the transition for each state and each input is unique. This makes execution simple and fast.
DFA Simulation Algorithm
public boolean simulate(DFA dfa, String input) {
int state = dfa.startState;
for (char ch : input.toCharArray()) {
state = dfa.transition[state][ch];
if (state == DFA.ERROR) {
return false; // Error state
}
}
return dfa.isAccepting(state);
}
DFA simulation takes O(n) time, proportional to input length.
4. Regular Expression to NFA: Thompson Construction
Thompson's Construction is a systematic method for converting a regular expression into an NFA.
Basic Building Blocks
1. Single character a:
(i) --a--> ((f))
2. Epsilon:
(i) --epsilon--> ((f))
Union: r | s
epsilon NFA(r) epsilon
+----------> (i_r) ...> (f_r) ----------+
| |
(i) ---+ +---> ((f))
| |
+----------> (i_s) ...> (f_s) ----------+
epsilon NFA(s) epsilon
Concatenation: rs
(i_r) --> NFA(r) --> (f_r/i_s) --> NFA(s) --> ((f_s))
The accepting state of r and the start state of s are merged.
Closure: r*
epsilon
+-------------------+
| |
(i) --epsilon--> (i_r) --> NFA(r) --> (f_r) --epsilon--> ((f))
| ^
+-------------------epsilon------------------------------+
Thompson Construction Example: (a|b)*abb
Step-by-step construction:
Step 1: NFA for a
(2) --a--> (3)
Step 2: NFA for b
(4) --b--> (5)
Step 3: NFA for a|b
(1) --eps--> (2) --a--> (3) --eps--> (6)
(1) --eps--> (4) --b--> (5) --eps--> (6)
Step 4: NFA for (a|b)*
(0) --eps--> (1) --...--> (6) --eps--> (7)
(0) --eps--> (7)
(6) --eps--> (1)
Step 5: Complete (a|b)*abb
... --eps--> (7) --a--> (8) --b--> (9) --b--> ((10))
The characteristics of Thompson construction are:
- The number of NFA states is at most twice the length of the regular expression.
- Each state has at most 2 outgoing transitions.
- There is exactly one start state and exactly one accepting state.
5. NFA to DFA: Subset Construction
Subset Construction is an algorithm that converts an NFA to an equivalent DFA.
Core Idea
Each DFA state is a set of NFA states. All states the NFA can be in simultaneously are represented as a single DFA state.
Required Operations
epsilon-closure(s): Set of all states reachable from state s via epsilon transitions only
epsilon-closure(T): Union of epsilon-closure for each state in state set T
move(T, a): Set of states reachable from state set T via transitions on input a
Algorithm
Add epsilon-closure(s0) to Dstates and mark as "unprocessed"
while (unprocessed state T exists in Dstates) do
Mark T as "processed"
for (each input symbol a) do
U = epsilon-closure(move(T, a))
if (U is not in Dstates)
Add U to Dstates and mark as "unprocessed"
Dtran[T, a] = U
end for
end while
Subset Construction Example
NFA states: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
NFA start: 0, NFA accept: 10
DFA state A = epsilon-closure(0) = {0, 1, 2, 4, 7}
move(A, a) = {3, 8}
epsilon-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
move(A, b) = {5}
epsilon-closure({5}) = {1, 2, 4, 5, 6, 7} = C
DFA state B = {1, 2, 3, 4, 6, 7, 8}
move(B, a) = {3, 8} -> B
move(B, b) = {5, 9}
epsilon-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D
DFA state C = {1, 2, 4, 5, 6, 7}
move(C, a) = {3, 8} -> B
move(C, b) = {5} -> C
DFA state D = {1, 2, 4, 5, 6, 7, 9}
move(D, a) = {3, 8} -> B
move(D, b) = {5, 10}
epsilon-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E
DFA state E = {1, 2, 4, 5, 6, 7, 10} (contains 10 -> accepting state)
move(E, a) = {3, 8} -> B
move(E, b) = {5} -> C
Resulting DFA transition table:
State | a | b | Accept?
-----+-----+-----+--------
A | B | C | X
B | B | D | X
C | B | C | X
D | B | E | X
E | B | C | O
6. DFA Minimization
The DFA produced by subset construction may have unnecessary states. DFA minimization finds the equivalent DFA with the minimum number of states.
Hopcroft Algorithm
1. Initial partition: {accepting states}, {non-accepting states}
2. Repeat: examine each group, and within the same group,
separate states that transition to different groups for some input
3. Terminate when no more splits are possible
Minimization Example
Initial partition:
G1 = {A, B, C, D} (non-accepting)
G2 = {E} (accepting)
Checking input 'b':
A -> C (G1), B -> D (G1), C -> C (G1), D -> E (G2)
D goes to a different group, so split!
New partition:
G1 = {A, B, C}
G3 = {D}
G2 = {E}
Checking input 'b':
A -> C (G1), B -> D (G3), C -> C (G1)
Split B!
Final partition:
{A, C}, {B}, {D}, {E}
7. Lexical Analyzer Generation Using the Lex Tool
Lex (or Flex) is a tool that automatically generates a lexical analyzer when token patterns are described using regular expressions.
Lex Program Structure
Declarations
%%
Translation Rules
%%
Auxiliary Functions
Lex Program Example
%{
#include <stdio.h>
#include "y.tab.h"
%}
digit [0-9]
letter [a-zA-Z]
%%
if { return IF; }
else { return ELSE; }
while { return WHILE; }
{letter}({letter}|{digit})* { return ID; }
{digit}+ { return NUM; }
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return TIMES; }
"/" { return DIVIDE; }
"=" { return ASSIGN; }
"==" { return EQ; }
"<" { return LT; }
"<=" { return LE; }
"(" { return LPAREN; }
")" { return RPAREN; }
";" { return SEMI; }
[ \t\n]+ { /* ignore whitespace */ }
. { printf("Unexpected: %s\n", yytext); }
%%
How Lex Works
Lex Specification (.l file)
|
v
+----------+
| Lex | (Lex Compiler)
+----------+
|
v
lex.yy.c (C Source Code)
|
v
+----------+
| C Compiler|
+----------+
|
v
Lexical Analyzer (Executable)
Lex Rule Conflict Resolution
Rules when multiple patterns match simultaneously:
- Longest Match: Selects the pattern with the longest match.
- If the input is
ifx, it is recognized as the identifierifx, not the keywordif.
- If the input is
- Rule Priority: If lengths are equal, the rule listed first is selected.
ifis recognized as a keyword because the keyword rule appears before the identifier pattern.
8. Direct Conversion from Regular Expression to DFA
There is also a method to construct a DFA directly from a regular expression, bypassing Thompson construction and subset construction.
Key Functions
For node n of the regular expression's syntax tree, the following are computed:
- nullable(n): Whether node n can generate the empty string
- firstpos(n): Set of positions that can be the first position of strings generated by n
- lastpos(n): Set of positions that can be the last position of strings generated by n
- followpos(p): Set of positions that can follow position p
nullable rules:
epsilon node -> true
position i node -> false
cat(c1, c2) node -> nullable(c1) AND nullable(c2)
or(c1, c2) node -> nullable(c1) OR nullable(c2)
star(c1) node -> true
This method bypasses the intermediate NFA step, yielding an efficient DFA with fewer states directly.
Summary
| Concept | Key Content |
|---|---|
| NFA | Nondeterministic, epsilon transitions, simple to construct |
| DFA | Deterministic, one transition per input, fast execution (O(n)) |
| Thompson Construction | Systematic conversion of regex to NFA |
| Subset Construction | Conversion of NFA to DFA (possible state explosion) |
| DFA Minimization | Constructing the equivalent minimum-state DFA |
| Lex | Automatic lexical analyzer generation from regex |
| Longest Match | Selects the longest pattern when multiple match |
Finite automata are both the theoretical foundation and the core implementation of lexical analysis. In the next article, we will move to the syntax analysis phase and explore Top-Down parsing and LL parsers.