Finite Automata and Lexical Analyzer Implementation
Regular expressions are tools for **specifying** token patterns, and finite automata are execution models for **recognizing** them. This article covers NFA, DFA, and the conversions between them.
1. Finite Automata Overview
Finite automata are mathematical models for recognizing strings. There are two types:
- **NFA (Nondeterministic Finite Automaton)**
- **DFA (Deterministic Finite Automaton)**
Both recognize regular languages and have the same expressive power. However, their implementation characteristics differ.
Regular Expression <=> NFA <=> DFA
- NFA: Simple to construct but complex to execute
- DFA: Complex to construct but fast to execute
2. NFA (Nondeterministic Finite Automaton)
An NFA is defined as a 5-tuple: (S, Sigma, delta, s0, F)
- **S**: A finite set of states
- **Sigma**: Input alphabet
- **delta**: Transition function (returns a **set of states** for a given state and input)
- **s0**: Start state
- **F**: Set of accepting states
Key Characteristics of NFA
1. **Multiple transitions** are possible for a single input.
2. **Epsilon transitions**: States can transition without consuming input.
NFA Example: `(a|b)*abb`
a
+--------+
| |
v a | b b
--> 0 -----> 1 -----> 2 -----> ((3))
|
+--b--+
| |
+<----+
Transition Table
State | a | b
-----+--------+--------
0 | {0, 1} | {0}
1 | - | {2}
2 | - | {3}
3 | - | -
From state 0, reading input `a` can go to either state 0 or state 1. This is what **nondeterministic** means.
String Acceptance by NFA
An NFA accepts an input string if **at least one path exists** from the start state to an accepting state.
Input: "aabb"
Possible paths:
0 -a-> 0 -a-> 0 -b-> 0 -b-> 0 (not accepting state)
0 -a-> 0 -a-> 1 -b-> 2 -b-> 3 (accepting state!) -> Accept
0 -a-> 1 (no transition for next a) -> Fail
3. DFA (Deterministic Finite Automaton)
A DFA is a special case of NFA with the following constraints:
1. For each state and each input symbol, there is **exactly one transition**.
2. **No epsilon transitions.**
DFA Example: `(a|b)*abb`
b
+--------+
| |
v a | b b
--> 0 -----> 1 -----> 2 -----> ((3))
^ | |
| b | a | a
+--------+ |
| |
+-----------------+
Transition Table
State | a | b
-----+-----+-----
0 | 1 | 0
1 | 1 | 2
2 | 1 | 3
3 | 1 | 0
In a DFA, the transition for each state and each input is **unique**. This makes execution simple and fast.
DFA Simulation Algorithm
public boolean simulate(DFA dfa, String input) {
int state = dfa.startState;
for (char ch : input.toCharArray()) {
state = dfa.transition[state][ch];
if (state == DFA.ERROR) {
return false; // Error state
}
}
return dfa.isAccepting(state);
}
DFA simulation takes **O(n)** time, proportional to input length.
4. Regular Expression to NFA: Thompson Construction
**Thompson's Construction** is a systematic method for converting a regular expression into an NFA.
Basic Building Blocks
1. Single character a:
(i) --a--> ((f))
2. Epsilon:
(i) --epsilon--> ((f))
Union: r | s
epsilon NFA(r) epsilon
+----------> (i_r) ...> (f_r) ----------+
| |
(i) ---+ +---> ((f))
| |
+----------> (i_s) ...> (f_s) ----------+
epsilon NFA(s) epsilon
Concatenation: rs
(i_r) --> NFA(r) --> (f_r/i_s) --> NFA(s) --> ((f_s))
The accepting state of r and the start state of s are merged.
Closure: r\*
epsilon
+-------------------+
| |
(i) --epsilon--> (i_r) --> NFA(r) --> (f_r) --epsilon--> ((f))
| ^
+-------------------epsilon------------------------------+
Thompson Construction Example: `(a|b)*abb`
Step-by-step construction:
Step 1: NFA for a
(2) --a--> (3)
Step 2: NFA for b
(4) --b--> (5)
Step 3: NFA for a|b
(1) --eps--> (2) --a--> (3) --eps--> (6)
(1) --eps--> (4) --b--> (5) --eps--> (6)
Step 4: NFA for (a|b)*
(0) --eps--> (1) --...--> (6) --eps--> (7)
(0) --eps--> (7)
(6) --eps--> (1)
Step 5: Complete (a|b)*abb
... --eps--> (7) --a--> (8) --b--> (9) --b--> ((10))
The characteristics of Thompson construction are:
- The number of NFA states is **at most twice** the length of the regular expression.
- Each state has **at most 2** outgoing transitions.
- There is **exactly one** start state and **exactly one** accepting state.
5. NFA to DFA: Subset Construction
**Subset Construction** is an algorithm that converts an NFA to an equivalent DFA.
Core Idea
Each DFA state is a **set** of NFA states. All states the NFA can be in simultaneously are represented as a single DFA state.
Required Operations
epsilon-closure(s): Set of all states reachable from state s via epsilon transitions only
epsilon-closure(T): Union of epsilon-closure for each state in state set T
move(T, a): Set of states reachable from state set T via transitions on input a
Algorithm
Add epsilon-closure(s0) to Dstates and mark as "unprocessed"
while (unprocessed state T exists in Dstates) do
Mark T as "processed"
for (each input symbol a) do
U = epsilon-closure(move(T, a))
if (U is not in Dstates)
Add U to Dstates and mark as "unprocessed"
Dtran[T, a] = U
end for
end while
Subset Construction Example
NFA states: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
NFA start: 0, NFA accept: 10
DFA state A = epsilon-closure(0) = {0, 1, 2, 4, 7}
move(A, a) = {3, 8}
epsilon-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
move(A, b) = {5}
epsilon-closure({5}) = {1, 2, 4, 5, 6, 7} = C
DFA state B = {1, 2, 3, 4, 6, 7, 8}
move(B, a) = {3, 8} -> B
move(B, b) = {5, 9}
epsilon-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D
DFA state C = {1, 2, 4, 5, 6, 7}
move(C, a) = {3, 8} -> B
move(C, b) = {5} -> C
DFA state D = {1, 2, 4, 5, 6, 7, 9}
move(D, a) = {3, 8} -> B
move(D, b) = {5, 10}
epsilon-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E
DFA state E = {1, 2, 4, 5, 6, 7, 10} (contains 10 -> accepting state)
move(E, a) = {3, 8} -> B
move(E, b) = {5} -> C
Resulting DFA transition table:
State | a | b | Accept?
-----+-----+-----+--------
A | B | C | X
B | B | D | X
C | B | C | X
D | B | E | X
E | B | C | O
6. DFA Minimization
The DFA produced by subset construction may have unnecessary states. **DFA minimization** finds the equivalent DFA with the minimum number of states.
Hopcroft Algorithm
1. Initial partition: {accepting states}, {non-accepting states}
2. Repeat: examine each group, and within the same group,
separate states that transition to different groups for some input
3. Terminate when no more splits are possible
Minimization Example
Initial partition:
G1 = {A, B, C, D} (non-accepting)
G2 = {E} (accepting)
Checking input 'b':
A -> C (G1), B -> D (G1), C -> C (G1), D -> E (G2)
D goes to a different group, so split!
New partition:
G1 = {A, B, C}
G3 = {D}
G2 = {E}
Checking input 'b':
A -> C (G1), B -> D (G3), C -> C (G1)
Split B!
Final partition:
{A, C}, {B}, {D}, {E}
7. Lexical Analyzer Generation Using the Lex Tool
**Lex** (or Flex) is a tool that automatically generates a lexical analyzer when token patterns are described using regular expressions.
Lex Program Structure
Declarations
%%
Translation Rules
%%
Auxiliary Functions
Lex Program Example
%{
#include <stdio.h>
#include "y.tab.h"
%}
digit [0-9]
letter [a-zA-Z]
%%
if { return IF; }
else { return ELSE; }
while { return WHILE; }
{letter}({letter}|{digit})* { return ID; }
{digit}+ { return NUM; }
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return TIMES; }
"/" { return DIVIDE; }
"=" { return ASSIGN; }
"==" { return EQ; }
"<" { return LT; }
"<=" { return LE; }
"(" { return LPAREN; }
")" { return RPAREN; }
";" { return SEMI; }
[ \t\n]+ { /* ignore whitespace */ }
. { printf("Unexpected: %s\n", yytext); }
%%
How Lex Works
Lex Specification (.l file)
|
v
+----------+
| Lex | (Lex Compiler)
+----------+
|
v
lex.yy.c (C Source Code)
|
v
+----------+
| C Compiler|
+----------+
|
v
Lexical Analyzer (Executable)
Lex Rule Conflict Resolution
Rules when multiple patterns match simultaneously:
1. **Longest Match**: Selects the pattern with the longest match.
- If the input is `ifx`, it is recognized as the identifier `ifx`, not the keyword `if`.
2. **Rule Priority**: If lengths are equal, the rule listed first is selected.
- `if` is recognized as a keyword because the keyword rule appears before the identifier pattern.
8. Direct Conversion from Regular Expression to DFA
There is also a method to construct a DFA directly from a regular expression, bypassing Thompson construction and subset construction.
Key Functions
For node n of the regular expression's syntax tree, the following are computed:
- **nullable(n)**: Whether node n can generate the empty string
- **firstpos(n)**: Set of positions that can be the first position of strings generated by n
- **lastpos(n)**: Set of positions that can be the last position of strings generated by n
- **followpos(p)**: Set of positions that can follow position p
nullable rules:
epsilon node -> true
position i node -> false
cat(c1, c2) node -> nullable(c1) AND nullable(c2)
or(c1, c2) node -> nullable(c1) OR nullable(c2)
star(c1) node -> true
This method bypasses the intermediate NFA step, yielding an efficient DFA with fewer states directly.
Summary
| Concept | Key Content |
| --------------------- | -------------------------------------------------------------- |
| NFA | Nondeterministic, epsilon transitions, simple to construct |
| DFA | Deterministic, one transition per input, fast execution (O(n)) |
| Thompson Construction | Systematic conversion of regex to NFA |
| Subset Construction | Conversion of NFA to DFA (possible state explosion) |
| DFA Minimization | Constructing the equivalent minimum-state DFA |
| Lex | Automatic lexical analyzer generation from regex |
| Longest Match | Selects the longest pattern when multiple match |
Finite automata are both the theoretical foundation and the core implementation of lexical analysis. In the next article, we will move to the syntax analysis phase and explore Top-Down parsing and LL parsers.
현재 단락 (1/246)
Regular expressions are tools for **specifying** token patterns, and finite automata are execution m...