- Author: Youngju Kim (@fjvbn20031)
Lexical Analysis - Tokens, Patterns, and Regular Expressions
Lexical Analysis is the first phase of compilation. Let us examine the process of reading the character stream of source code and breaking it into meaningful units called tokens.
1. Role of the Lexical Analyzer
The lexical analyzer (Lexer, Scanner) performs the following tasks:
```
Source Program (Character Stream)
              |
              v
     +------------------+
     | Lexical Analyzer | <---> Symbol Table
     +------------------+
              |
              v
Token Stream (passed to the parser)
```
Main Functions
- Token Recognition: Groups characters into token units.
- Whitespace/Comment Removal: Filters out characters unnecessary for the parser.
- Source Location Tracking: Records line and column numbers for error messages.
- Symbol Table Management: Registers identifiers in the symbol table.
Why Separate the Lexical Analyzer
- Simplification of Design: The parser does not need to worry about whitespace and comments.
- Improved Portability: Input device-dependent processing is isolated in the lexical analyzer.
- Efficiency: Specialized techniques (finite automata) for token recognition can be applied.
2. Tokens, Patterns, and Lexemes
These three concepts are the core of lexical analysis.
Token
A token is an abstract name representing a lexical unit of a programming language.
Pattern
A pattern describes the form of strings that can constitute a token.
Lexeme
A lexeme is the actual string in the source code that matches a pattern.
Example
C source code: printf("Total = %d\n", score);
```
+----------+--------------------------+---------------------+
| Token    | Pattern                  | Lexeme Examples     |
+----------+--------------------------+---------------------+
| id       | starts with letter,      | printf, score       |
|          | letter/digit combination |                     |
| num      | one or more digits       | 42, 3, 100          |
| literal  | characters between " "   | "Total = %d\n"      |
| lparen   | (                        | (                   |
| rparen   | )                        | )                   |
| comma    | ,                        | ,                   |
| semi     | ;                        | ;                   |
+----------+--------------------------+---------------------+
```
Types of Tokens
1. Keyword: if, else, while, return, ...
2. Identifier: user-defined names
3. Constant: integers, floating-point numbers, string literals
4. Operator: +, -, *, /, =, ==, !=, ...
5. Delimiter: (, ), {, }, ;, ,, ...
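The five categories above map naturally onto an enum of token types. A minimal sketch, using the same member names that appear in the lexer code later in this article:

```java
// Token categories for a small C-like language. The member set
// mirrors the examples used later in this article.
enum TokenType {
    // Keywords
    IF, ELSE, WHILE, RETURN,
    // Identifiers and constants
    ID, NUM,
    // Operators
    PLUS, MINUS, STAR, SLASH, ASSIGN, EQ,
    // Delimiters
    LPAREN, RPAREN, SEMI,
    // End of input
    EOF
}
```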
3. Token Attributes
Since multiple lexemes can match a single token name, additional information is needed.
Source code: position = initial + rate * 60
Token stream:

```
<id, "position">
<assign_op>
<id, "initial">
<add_op>
<id, "rate">
<mul_op>
<num, 60>
```

The second element, where present, is the attribute. Operator tokens like assign_op carry no attribute because the token name alone identifies them.
The attribute of an identifier is usually a pointer to a symbol table entry. The attribute of a number is its actual value.
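One way to model these <token, attribute> pairs is a small token class whose attribute slot holds a symbol-table reference for identifiers, a numeric value for numbers, and null for attribute-less tokens. A minimal sketch (the class and field names are assumptions for illustration, not a standard API):

```java
// Sketch of a token carrying an optional attribute.
// Tokens without an attribute (e.g. assign_op) store null.
class AttributedToken {
    final String name;      // token name, e.g. "id", "num", "assign_op"
    final Object attribute; // symbol-table reference, numeric value, or null

    AttributedToken(String name, Object attribute) {
        this.name = name;
        this.attribute = attribute;
    }

    @Override
    public String toString() {
        return attribute == null
            ? "<" + name + ">"
            : "<" + name + ", " + attribute + ">";
    }
}
```

For example, `new AttributedToken("num", 60)` prints as `<num, 60>`, matching the token stream shown above.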
4. Regular Expressions
Regular expressions (regex) are used to define token patterns.
Basic Operations
The three basic operations of regular expressions defined over an alphabet Sigma are:
1. Union: r | s
- Strings matching r or s
2. Concatenation: rs
- A string matching r followed by a string matching s
3. Closure: r*
- Zero or more repetitions of strings matching r
Basic Regular Expressions
```
Expression   Description                     Match Examples
------------------------------------------------------------------------
a            the character 'a' itself        "a"
epsilon      the empty string                ""
r | s        r or s                          if r is "a", s is "b": "a" or "b"
rs           r followed by s                 "ab"
r*           zero or more repetitions of r   "", "a", "aa", "aaa", ...
```
Extended Notation
Abbreviations for convenience:
```
r+     = r r*                   (one or more repetitions)
r?     = r | epsilon            (zero or one)
[a-z]  = a | b | ... | z        (character class)
[^a-z] = characters except a-z  (negated character class)
.      = any single character
```
Regular Expression Examples
```
// Integer literal
digit   = [0-9]
digits  = digit+
integer = digits

// Floating-point literal ('.' here denotes the literal dot character,
// not the "any character" metacharacter)
fraction  = '.' digits | epsilon
exponent  = (E | e) (+ | - | epsilon) digits | epsilon
float_num = digits fraction exponent

// Identifier
letter     = [a-zA-Z_]
identifier = letter (letter | digit)*

// String literal (characters except double quotes, inside double quotes)
string_lit = " [^"]* "
```
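These patterns translate almost directly into `java.util.regex` syntax. A minimal sketch (the translation and the class name are mine, for illustration):

```java
import java.util.regex.Pattern;

public class TokenPatterns {
    // Regex equivalents of the definitions above.
    static final Pattern INTEGER    = Pattern.compile("[0-9]+");
    static final Pattern FLOAT_NUM  =
        Pattern.compile("[0-9]+(\\.[0-9]+)?([Ee][+-]?[0-9]+)?");
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*");
    static final Pattern STRING_LIT = Pattern.compile("\"[^\"]*\"");

    public static void main(String[] args) {
        System.out.println(INTEGER.matcher("42").matches());        // true
        System.out.println(FLOAT_NUM.matcher("3.14E+2").matches()); // true
        System.out.println(IDENTIFIER.matcher("_rate2").matches()); // true
        System.out.println(IDENTIFIER.matcher("2rate").matches());  // false
    }
}
```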
5. Regular Definitions
To express complex regular expressions concisely, we assign names to regular expressions and reuse them.
```
// Regular definition for C identifiers
letter_ -> [a-zA-Z_]
digit   -> [0-9]
id      -> letter_ (letter_ | digit)*

// Regular definition for C integer constants
digit     -> [0-9]
nonzero   -> [1-9]
decimal   -> nonzero digit*
octal_dig -> [0-7]
octal     -> 0 octal_dig*
hex_dig   -> [0-9a-fA-F]
hex       -> 0 (x|X) hex_dig+
integer   -> decimal | octal | hex
```
Note: Circular references are not allowed in regular definitions. Each name can only reference previously defined names.
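Because each name may reference only earlier names, a regular definition can be expanded bottom-up into one regex by plain string substitution. A sketch in Java mirroring the integer definition above (the class and constant names are illustrative):

```java
import java.util.regex.Pattern;

public class IntegerDefinition {
    // Build the regex bottom-up, in the same order as the
    // regular definition: each piece uses only earlier pieces.
    static final String DIGIT   = "[0-9]";
    static final String NONZERO = "[1-9]";
    static final String DECIMAL = NONZERO + DIGIT + "*";
    static final String OCTAL   = "0[0-7]*";
    static final String HEX     = "0[xX][0-9a-fA-F]+";
    static final Pattern INTEGER =
        Pattern.compile(DECIMAL + "|" + OCTAL + "|" + HEX);

    public static void main(String[] args) {
        System.out.println(INTEGER.matcher("42").matches());   // decimal
        System.out.println(INTEGER.matcher("052").matches());  // octal
        System.out.println(INTEGER.matcher("0x2A").matches()); // hexadecimal
        System.out.println(INTEGER.matcher("09").matches());   // none: false
    }
}
```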
6. Input Buffering
The lexical analyzer must read source code one character at a time, and buffering techniques are used to reduce I/O costs.
Double Buffering
```
         Buffer 1                        Buffer 2
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| E | = | M | * | C | * | * | 2 |eof| . | . | . | . | . | . | . |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
      ^                       ^
      |                       |
 lexemeBegin               forward
```
- Two buffers are used alternately to ensure continuous reading.
- lexemeBegin: the start position of the current lexeme.
- forward: the position of the character currently being examined.
- A sentinel character eof is placed at the end of each buffer to simplify boundary checking.
Forward Pointer Movement with Sentinels
```c
// Without sentinel (2 checks each time)
forward++;
if (forward >= bufferEnd) {
    reloadBuffer();   // load next buffer
}
if (*forward == EOF) {
    terminate();      // handle end of input
}

// With sentinel (usually 1 check)
forward++;
if (*forward == EOF) {
    if (forward == bufferEnd) {
        reloadBuffer();        // sentinel at buffer end
        forward = bufferStart;
    } else {
        terminate();           // actual end of input
    }
}
```
7. Token Recognition Examples
Let us examine the token recognition process for a simple language.
Relational Operator Recognition
relop -> < | <= | <> | > | >= | =
The recognition process is represented as a Transition Diagram:
```
start --'<'--> s1 --'='--> [return LE]
               |
               +--'>'--> [return NE]
               |
               +--other--> [return LT] (retract)

start --'='--> [return EQ]

start --'>'--> s2 --'='--> [return GE]
               |
               +--other--> [return GT] (retract)
```
retract means moving the forward pointer back one position. For example, if a character that is neither = nor > follows <, that character must be treated as part of the next token.
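The transition diagram above transcribes almost mechanically into code. In this sketch (class and token names are mine), retract corresponds to simply not consuming the lookahead character:

```java
public class RelopScanner {
    // Recognize a relational operator starting at input[pos],
    // following the transition diagram above. Returning without
    // consuming the lookahead is the "retract" action.
    static String relop(String input, int pos) {
        char c = input.charAt(pos);
        char next = (pos + 1 < input.length()) ? input.charAt(pos + 1) : '\0';
        switch (c) {
            case '<':
                if (next == '=') return "LE";
                if (next == '>') return "NE";
                return "LT"; // retract: next char belongs to the next token
            case '=':
                return "EQ";
            case '>':
                if (next == '=') return "GE";
                return "GT"; // retract
            default:
                throw new IllegalArgumentException("not a relop: " + c);
        }
    }

    public static void main(String[] args) {
        System.out.println(relop("<=", 0)); // LE
        System.out.println(relop("<5", 0)); // LT ('5' is left for the next token)
    }
}
```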
Identifier and Keyword Recognition
```
start --letter--> s1 --letter|digit--> s1
                  |
                  +--other--> [check keyword table then return]
```
After recognizing an identifier, the keyword table is searched to check if it is a keyword. Reserved words like if, else, and while have the same form as identifiers and require separate handling.
```java
// Keyword table initialization
Map<String, TokenType> keywords = new HashMap<>();
keywords.put("if", TokenType.IF);
keywords.put("else", TokenType.ELSE);
keywords.put("while", TokenType.WHILE);
keywords.put("return", TokenType.RETURN);

// After recognizing an identifier
String lexeme = getLexeme();
if (keywords.containsKey(lexeme)) {
    return new Token(keywords.get(lexeme), lexeme);
} else {
    return new Token(TokenType.ID, lexeme);
}
```
8. Implementing a Simple Lexical Analyzer
Let us implement a simple lexical analyzer in Java.
```java
public class SimpleLexer {
    private String input;
    private int pos;

    public SimpleLexer(String input) {
        this.input = input;
        this.pos = 0;
    }

    public Token nextToken() {
        skipWhitespace();
        if (pos >= input.length()) {
            return new Token(TokenType.EOF, "");
        }
        char ch = input.charAt(pos);

        // Digit: integer literal recognition
        if (Character.isDigit(ch)) {
            return readNumber();
        }

        // Letter: identifier or keyword recognition
        if (Character.isLetter(ch) || ch == '_') {
            return readIdentifier();
        }

        // Operators and delimiters
        switch (ch) {
            case '+': pos++; return new Token(TokenType.PLUS, "+");
            case '-': pos++; return new Token(TokenType.MINUS, "-");
            case '*': pos++; return new Token(TokenType.STAR, "*");
            case '/': pos++; return new Token(TokenType.SLASH, "/");
            case '=':
                if (peek() == '=') {
                    pos += 2;
                    return new Token(TokenType.EQ, "==");
                }
                pos++;
                return new Token(TokenType.ASSIGN, "=");
            case '(': pos++; return new Token(TokenType.LPAREN, "(");
            case ')': pos++; return new Token(TokenType.RPAREN, ")");
            case ';': pos++; return new Token(TokenType.SEMI, ";");
            default:
                throw new RuntimeException(
                    "Unexpected character: " + ch + " at position " + pos
                );
        }
    }

    private Token readNumber() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        return new Token(TokenType.NUM, input.substring(start, pos));
    }

    private Token readIdentifier() {
        int start = pos;
        while (pos < input.length() &&
               (Character.isLetterOrDigit(input.charAt(pos)) ||
                input.charAt(pos) == '_')) {
            pos++;
        }
        String lexeme = input.substring(start, pos);
        TokenType type = Keywords.lookup(lexeme);
        return new Token(type != null ? type : TokenType.ID, lexeme);
    }

    private void skipWhitespace() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
    }

    private char peek() {
        return (pos + 1 < input.length()) ? input.charAt(pos + 1) : '\0';
    }
}
```
Execution Example
Input: `"x = 10 + y;"`

Token stream:

```
<ID, "x">
<ASSIGN, "=">
<NUM, "10">
<PLUS, "+">
<ID, "y">
<SEMI, ";">
<EOF, "">
```
9. Error Handling
Error handling strategies when the lexical analyzer encounters a character that does not match any pattern:
- Panic Mode: Skips characters until a matching pattern is found.
- Character Deletion: Deletes the incorrect character and continues.
- Character Insertion: Inserts a missing character (e.g., matching quote pairs).
- Character Replacement: Replaces an incorrect character with a correct one.
In practice, compilers continue analysis after recovery to report as many errors as possible at once.
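Panic mode, the simplest of these strategies, can be sketched as a loop that discards characters until one that could begin a token appears. The starter set below is an assumption chosen to match the SimpleLexer above, for illustration only:

```java
public class PanicRecovery {
    // Panic-mode recovery sketch: skip characters until one that can
    // begin a token (letter, digit, underscore, or a known
    // operator/delimiter) is found, and return the new position.
    static int recover(String input, int pos) {
        while (pos < input.length() && !canStartToken(input.charAt(pos))) {
            pos++; // discard the offending character
        }
        return pos;
    }

    // Assumed token-starter set, matching the SimpleLexer example.
    static boolean canStartToken(char c) {
        return Character.isLetterOrDigit(c) || c == '_'
            || "+-*/=();".indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        // The invalid '@', '#', and the space before '1' are skipped.
        System.out.println(recover("@# 10", 0)); // 3
    }
}
```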
Summary
| Concept | Key Content |
|---|---|
| Token | Abstract name for a lexical unit (id, num, relop, etc.) |
| Pattern | Form of strings that constitute a token (regex) |
| Lexeme | Actual string that matches a pattern |
| Regular Expression | Three basic operations: union, concatenation, closure |
| Regular Definition | Naming regular expressions for reuse |
| Input Buffering | Efficient I/O with double buffers and sentinels |
| Transition Diagram | Token recognition process expressed as state transitions |
Lexical analysis is the most fundamental phase of compilation, yet because it touches every character of the source it accounts for a significant portion of total compilation time, so efficient implementation is important. In the next article, we will cover finite automata, the machinery that lexical-analyzer generators use to turn regular expressions into token recognizers.