Skip to content

필사 모드: [Compiler] 05. Finite Automata and Lexical Analyzer Implementation

English
0%
정확도 0%
💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.
원문 렌더가 준비되기 전까지 텍스트 가이드로 표시합니다.

Finite Automata and Lexical Analyzer Implementation

Regular expressions are tools for **specifying** token patterns, and finite automata are execution models for **recognizing** them. This article covers NFA, DFA, and the conversions between them.

1. Finite Automata Overview

Finite automata are mathematical models for recognizing strings. There are two types:

- **NFA (Nondeterministic Finite Automaton)**

- **DFA (Deterministic Finite Automaton)**

Both recognize regular languages and have the same expressive power. However, their implementation characteristics differ.

Regular Expression <=> NFA <=> DFA

- NFA: Simple to construct but complex to execute

- DFA: Complex to construct but fast to execute

2. NFA (Nondeterministic Finite Automaton)

An NFA is defined as a 5-tuple: (S, Sigma, delta, s0, F)

- **S**: A finite set of states

- **Sigma**: Input alphabet

- **delta**: Transition function (returns a **set of states** for a given state and input)

- **s0**: Start state

- **F**: Set of accepting states

Key Characteristics of NFA

1. **Multiple transitions** are possible for a single input.

2. **Epsilon transitions**: States can transition without consuming input.

NFA Example: `(a|b)*abb`

a

+--------+

| |

v a | b b

--> 0 -----> 1 -----> 2 -----> ((3))

|

+--b--+

| |

+<----+

Transition Table

State | a | b

-----+--------+--------

0 | {0, 1} | {0}

1 | - | {2}

2 | - | {3}

3 | - | -

From state 0, reading input `a` can go to either state 0 or state 1. This is what **nondeterministic** means.

String Acceptance by NFA

An NFA accepts an input string if **at least one path exists** from the start state to an accepting state.

Input: "aabb"

Possible paths:

0 -a-> 0 -a-> 0 -b-> 0 -b-> 0 (not accepting state)

0 -a-> 0 -a-> 1 -b-> 2 -b-> 3 (accepting state!) -> Accept

0 -a-> 1 (no transition for next a) -> Fail

3. DFA (Deterministic Finite Automaton)

A DFA is a special case of NFA with the following constraints:

1. For each state and each input symbol, there is **exactly one transition**.

2. **No epsilon transitions.**

DFA Example: `(a|b)*abb`

b

+--------+

| |

v a | b b

--> 0 -----> 1 -----> 2 -----> ((3))

^ | |

| b | a | a

+--------+ |

| |

+-----------------+

Transition Table

State | a | b

-----+-----+-----

0 | 1 | 0

1 | 1 | 2

2 | 1 | 3

3 | 1 | 0

In a DFA, the transition for each state and each input is **unique**. This makes execution simple and fast.

DFA Simulation Algorithm

public boolean simulate(DFA dfa, String input) {

int state = dfa.startState;

for (char ch : input.toCharArray()) {

state = dfa.transition[state][ch];

if (state == DFA.ERROR) {

return false; // Error state

}

}

return dfa.isAccepting(state);

}

DFA simulation takes **O(n)** time, proportional to input length.

4. Regular Expression to NFA: Thompson Construction

**Thompson's Construction** is a systematic method for converting a regular expression into an NFA.

Basic Building Blocks

1. Single character a:

(i) --a--> ((f))

2. Epsilon:

(i) --epsilon--> ((f))

Union: r | s

epsilon NFA(r) epsilon

+----------> (i_r) ...> (f_r) ----------+

| |

(i) ---+ +---> ((f))

| |

+----------> (i_s) ...> (f_s) ----------+

epsilon NFA(s) epsilon

Concatenation: rs

(i_r) --> NFA(r) --> (f_r/i_s) --> NFA(s) --> ((f_s))

The accepting state of r and the start state of s are merged.

Closure: r\*

epsilon

+-------------------+

| |

(i) --epsilon--> (i_r) --> NFA(r) --> (f_r) --epsilon--> ((f))

| ^

+-------------------epsilon------------------------------+

Thompson Construction Example: `(a|b)*abb`

Step-by-step construction:

Step 1: NFA for a

(2) --a--> (3)

Step 2: NFA for b

(4) --b--> (5)

Step 3: NFA for a|b

(1) --eps--> (2) --a--> (3) --eps--> (6)

(1) --eps--> (4) --b--> (5) --eps--> (6)

Step 4: NFA for (a|b)*

(0) --eps--> (1) --...--> (6) --eps--> (7)

(0) --eps--> (7)

(6) --eps--> (1)

Step 5: Complete (a|b)*abb

... --eps--> (7) --a--> (8) --b--> (9) --b--> ((10))

The characteristics of Thompson construction are:

- The number of NFA states is **at most twice** the length of the regular expression.

- Each state has **at most 2** outgoing transitions.

- There is **exactly one** start state and **exactly one** accepting state.

5. NFA to DFA: Subset Construction

**Subset Construction** is an algorithm that converts an NFA to an equivalent DFA.

Core Idea

Each DFA state is a **set** of NFA states. All states the NFA can be in simultaneously are represented as a single DFA state.

Required Operations

epsilon-closure(s): Set of all states reachable from state s via epsilon transitions only

epsilon-closure(T): Union of epsilon-closure for each state in state set T

move(T, a): Set of states reachable from state set T via transitions on input a

Algorithm

Add epsilon-closure(s0) to Dstates and mark as "unprocessed"

while (unprocessed state T exists in Dstates) do

Mark T as "processed"

for (each input symbol a) do

U = epsilon-closure(move(T, a))

if (U is not in Dstates)

Add U to Dstates and mark as "unprocessed"

Dtran[T, a] = U

end for

end while

Subset Construction Example

NFA states: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

NFA start: 0, NFA accept: 10

DFA state A = epsilon-closure(0) = {0, 1, 2, 4, 7}

move(A, a) = {3, 8}

epsilon-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B

move(A, b) = {5}

epsilon-closure({5}) = {1, 2, 4, 5, 6, 7} = C

DFA state B = {1, 2, 3, 4, 6, 7, 8}

move(B, a) = {3, 8} -> B

move(B, b) = {5, 9}

epsilon-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D

DFA state C = {1, 2, 4, 5, 6, 7}

move(C, a) = {3, 8} -> B

move(C, b) = {5} -> C

DFA state D = {1, 2, 4, 5, 6, 7, 9}

move(D, a) = {3, 8} -> B

move(D, b) = {5, 10}

epsilon-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E

DFA state E = {1, 2, 4, 5, 6, 7, 10} (contains 10 -> accepting state)

move(E, a) = {3, 8} -> B

move(E, b) = {5} -> C

Resulting DFA transition table:

State | a | b | Accept?

-----+-----+-----+--------

A | B | C | X

B | B | D | X

C | B | C | X

D | B | E | X

E | B | C | O

6. DFA Minimization

The DFA produced by subset construction may have unnecessary states. **DFA minimization** finds the equivalent DFA with the minimum number of states.

Hopcroft Algorithm

1. Initial partition: {accepting states}, {non-accepting states}

2. Repeat: examine each group, and within the same group,

separate states that transition to different groups for some input

3. Terminate when no more splits are possible

Minimization Example

Initial partition:

G1 = {A, B, C, D} (non-accepting)

G2 = {E} (accepting)

Checking input 'b':

A -> C (G1), B -> D (G1), C -> C (G1), D -> E (G2)

D goes to a different group, so split!

New partition:

G1 = {A, B, C}

G3 = {D}

G2 = {E}

Checking input 'b':

A -> C (G1), B -> D (G3), C -> C (G1)

Split B!

Final partition:

{A, C}, {B}, {D}, {E}

7. Lexical Analyzer Generation Using the Lex Tool

**Lex** (or Flex) is a tool that automatically generates a lexical analyzer when token patterns are described using regular expressions.

Lex Program Structure

Declarations

%%

Translation Rules

%%

Auxiliary Functions

Lex Program Example

%{

#include <stdio.h>

#include "y.tab.h"

%}

digit [0-9]

letter [a-zA-Z]

%%

if { return IF; }

else { return ELSE; }

while { return WHILE; }

{letter}({letter}|{digit})* { return ID; }

{digit}+ { return NUM; }

"+" { return PLUS; }

"-" { return MINUS; }

"*" { return TIMES; }

"/" { return DIVIDE; }

"=" { return ASSIGN; }

"==" { return EQ; }

"<" { return LT; }

"<=" { return LE; }

"(" { return LPAREN; }

")" { return RPAREN; }

";" { return SEMI; }

[ \t\n]+ { /* ignore whitespace */ }

. { printf("Unexpected: %s\n", yytext); }

%%

How Lex Works

Lex Specification (.l file)

|

v

+----------+

| Lex | (Lex Compiler)

+----------+

|

v

lex.yy.c (C Source Code)

|

v

+----------+

| C Compiler|

+----------+

|

v

Lexical Analyzer (Executable)

Lex Rule Conflict Resolution

Rules when multiple patterns match simultaneously:

1. **Longest Match**: Selects the pattern with the longest match.

- If the input is `ifx`, it is recognized as the identifier `ifx`, not the keyword `if`.

2. **Rule Priority**: If lengths are equal, the rule listed first is selected.

- `if` is recognized as a keyword because the keyword rule appears before the identifier pattern.

8. Direct Conversion from Regular Expression to DFA

There is also a method to construct a DFA directly from a regular expression, bypassing Thompson construction and subset construction.

Key Functions

For node n of the regular expression's syntax tree, the following are computed:

- **nullable(n)**: Whether node n can generate the empty string

- **firstpos(n)**: Set of positions that can be the first position of strings generated by n

- **lastpos(n)**: Set of positions that can be the last position of strings generated by n

- **followpos(p)**: Set of positions that can follow position p

nullable rules:

epsilon node -> true

position i node -> false

cat(c1, c2) node -> nullable(c1) AND nullable(c2)

or(c1, c2) node -> nullable(c1) OR nullable(c2)

star(c1) node -> true

This method bypasses the intermediate NFA step, yielding an efficient DFA with fewer states directly.

Summary

| Concept | Key Content |

| --------------------- | -------------------------------------------------------------- |

| NFA | Nondeterministic, epsilon transitions, simple to construct |

| DFA | Deterministic, one transition per input, fast execution (O(n)) |

| Thompson Construction | Systematic conversion of regex to NFA |

| Subset Construction | Conversion of NFA to DFA (possible state explosion) |

| DFA Minimization | Constructing the equivalent minimum-state DFA |

| Lex | Automatic lexical analyzer generation from regex |

| Longest Match | Selects the longest pattern when multiple match |

Finite automata are both the theoretical foundation and the core implementation of lexical analysis. In the next article, we will move to the syntax analysis phase and explore Top-Down parsing and LL parsers.

현재 단락 (1/246)

Regular expressions are tools for **specifying** token patterns, and finite automata are execution m...

작성 글자: 0원문 글자: 8,345작성 단락: 0/246