Regular Expressions Complete Guide 2025: From Basics to Advanced Patterns and Real-World Applications
- Authors: Youngju Kim (@fjvbn20031)
1. Why Regex Still Matters
In 2025, regular expressions remain a core developer tool.
Input validation — The most concise way to validate email, URL, phone number, and credit card formats.
Log analysis — Extract specific patterns (error codes, IP addresses, timestamps) from multi-GB log files.
Text transformation — Used for code refactoring, data cleansing, and file format conversion.
AI prompt preprocessing — Remove unnecessary special characters and HTML tags before feeding input to LLMs.
Universal support — Built into virtually every programming language: JavaScript, Python, Java, Go, Rust, and more.
2. Basics: Literals and Metacharacters
Literal Matching
The most basic form of regex is literal matching — characters match themselves.
import re
# Literal matching
print(re.findall(r'hello', 'hello world hello'))
# ['hello', 'hello']
Core Metacharacters
| Metachar | Meaning | Example | Matches |
|---|---|---|---|
| . | Any single char (except newline) | h.t | hat, hit, hot |
| ^ | Start of string | ^Hello | Hello in "Hello world" |
| $ | End of string | world$ | world in "Hello world" |
| * | 0 or more repetitions | ab*c | ac, abc, abbc |
| + | 1 or more repetitions | ab+c | abc, abbc (not ac) |
| ? | 0 or 1 occurrence | colou?r | color, colour |
| \| | OR (alternation) | cat\|dog | cat or dog |
| [] | Character class | [aeiou] | one vowel |
| () | Grouping / Capture | (ab)+ | ab, abab |
| \ | Escape | \. | literal period |
Characters That Need Escaping
These metacharacters must be preceded by \ when used as literals:
. * + ? ^ $ | [ ] ( ) { } \
# Match a literal period
print(re.findall(r'www\.example\.com', 'visit www.example.com'))
# ['www.example.com']
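When the literal text comes from a variable rather than being typed into the pattern, escaping by hand is error-prone. A minimal sketch using Python's re.escape, with made-up sample strings:

```python
import re

# Hypothetical search term that happens to contain a metacharacter (".")
needle = "example.com"
haystack = "example.com exampleXcom"

# Unescaped: "." still means "any character", so both words match
unescaped = re.findall(needle, haystack)

# re.escape backslash-escapes characters that are special in a pattern
escaped = re.findall(re.escape(needle), haystack)

print(unescaped)  # ['example.com', 'exampleXcom']
print(escaped)    # ['example.com']
```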
3. Character Classes
Basic Character Classes
# [abc] — one of a, b, or c
re.findall(r'[aeiou]', 'hello world')
# ['e', 'o', 'o']
# [^abc] — any character NOT a, b, or c
re.findall(r'[^aeiou ]', 'hello world')
# ['h', 'l', 'l', 'w', 'r', 'l', 'd']
# [a-z] — range
re.findall(r'[A-Z]', 'Hello World')
# ['H', 'W']
# [0-9a-fA-F] — hex characters
re.findall(r'[0-9a-fA-F]+', 'color: #FF00AB')
# ['FF00AB']
Shorthand Character Classes
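| Shorthand | Equivalent (ASCII) | Meaning |
|---|---|---|
| \d | [0-9] | Digit |
| \D | [^0-9] | Non-digit |
| \w | [a-zA-Z0-9_] | Word character |
| \W | [^a-zA-Z0-9_] | Non-word character |
| \s | [ \t\n\r\f\v] | Whitespace |
| \S | [^ \t\n\r\f\v] | Non-whitespace |

In Python 3, \d, \w, and \s on str patterns also match Unicode digits, letters, and whitespace unless the re.ASCII flag is set.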
# Extract phone numbers
text = "Call 555-1234 or 800-555-6789"
print(re.findall(r'\d{3}-\d{3,4}(?:-\d{4})?', text))
# ['555-1234', '800-555-6789']
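The shorthand classes are wider than their ASCII equivalents when Python matches str patterns. The following sketch, using an invented sample string with full-width digits, shows the difference the re.ASCII flag makes:

```python
import re

# Sample text mixing ASCII digits and full-width digits (U+FF14, U+FF12)
text = "42 and ４２"

# Default (Unicode) mode: \d matches any Unicode decimal digit
unicode_digits = re.findall(r'\d+', text)

# re.ASCII restricts \d to [0-9]
ascii_digits = re.findall(r'\d+', text, re.ASCII)

print(unicode_digits)  # ['42', '４２']
print(ascii_digits)    # ['42']
```

For validation of things like IDs or port numbers, restricting to ASCII is usually the safer choice.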
POSIX Character Classes (Selected Languages)
[:alpha:] — Alphabetic characters
[:digit:] — Digits
[:alnum:] — Alphanumeric
[:space:] — Whitespace
[:upper:] — Uppercase letters
[:lower:] — Lowercase letters
[:punct:] — Punctuation
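POSIX classes are mainly used in UNIX tools like grep and sed, where they must appear inside a bracket expression, e.g. [[:digit:]]+. Python's re module and JavaScript do not support them; use \d, \w, etc. instead.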
4. Quantifiers
Basic Quantifiers
# * (0 or more)
re.findall(r'go*d', 'gd god good goood')
# ['gd', 'god', 'good', 'goood']
# + (1 or more)
re.findall(r'go+d', 'gd god good goood')
# ['god', 'good', 'goood']
# ? (0 or 1)
re.findall(r'colou?r', 'color colour')
# ['color', 'colour']
Exact Count Specifiers
# {n} — exactly n times
re.findall(r'\d{4}', '2025 is the year 12345')
# ['2025', '1234']
# {n,m} — between n and m times
re.findall(r'\d{2,4}', '1 12 123 1234 12345')
# ['12', '123', '1234', '1234']
# {n,} — n or more times
re.findall(r'\d{3,}', '1 12 123 1234')
# ['123', '1234']
Greedy vs Lazy Matching
By default, quantifiers are greedy — they match as much as possible. Appending ? makes them lazy.
text = '<b>bold</b> and <i>italic</i>'
# Greedy (default)
print(re.findall(r'<.*>', text))
# ['<b>bold</b> and <i>italic</i>'] — matches everything!
# Lazy (add ?)
print(re.findall(r'<.*?>', text))
# ['<b>', '</b>', '<i>', '</i>'] — matches each tag
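A lazy quantifier is one way to stop at the first closing bracket; a negated character class is another, and it avoids backtracking because the class can never run past a ">". A short sketch on the same sample text:

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# [^>]* can only consume characters that are not ">", so each match
# ends at the first ">" without any lazy quantifier
tags = re.findall(r'<[^>]*>', text)

print(tags)  # ['<b>', '</b>', '<i>', '</i>']
```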
Possessive Quantifiers
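Possessive quantifiers (*+, ++, ?+) never backtrack: once they have consumed input, they do not give it back. They are supported in Java, PCRE, and Python's re module from version 3.11 (on older Python versions, use the third-party regex module).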
// Java example
// Possessive quantifier — no backtracking
String pattern = "a++b"; // Consumes all 'a's, never gives back
5. Groups
Capturing Groups
# Basic capturing group
text = "2025-03-23"
match = re.match(r'(\d{4})-(\d{2})-(\d{2})', text)
if match:
    print(match.group(0))  # '2025-03-23' (full match)
    print(match.group(1))  # '2025' (year)
    print(match.group(2))  # '03' (month)
    print(match.group(3))  # '23' (day)
Non-Capturing Groups
(?:...) groups without capturing. This keeps group numbering uncluttered and avoids the small cost of storing a capture.
# Non-capturing group
text = "http://example.com https://secure.com"
print(re.findall(r'(?:https?://)(\S+)', text))
# ['example.com', 'secure.com'] — captures only the domain
Named Groups
# Named groups
text = "2025-03-23"
match = re.match(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', text)
if match:
    print(match.group('year'))   # '2025'
    print(match.group('month'))  # '03'
    print(match.group('day'))    # '23'
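Named groups are also available as a dictionary and in replacement strings. A small sketch, assuming the same date pattern; the variable names and the "Released ..." sample sentence are illustrative only:

```python
import re

date_re = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# groupdict() returns every named group as a dict
m = date_re.match("2025-03-23")
parts = m.groupdict()

# \g<name> refers to a named group inside the replacement string
reordered = date_re.sub(r'\g<day>/\g<month>/\g<year>', "Released 2025-03-23")

print(parts)      # {'year': '2025', 'month': '03', 'day': '23'}
print(reordered)  # 'Released 23/03/2025'
```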
Backreferences
# Find duplicate words
text = "the the quick brown fox fox"
print(re.findall(r'\b(\w+)\s+\1\b', text))
# ['the', 'fox']
# Match HTML tags (same tag name)
html = "<div>content</div> <span>text</span>"
print(re.findall(r'<(\w+)>.*?</\1>', html))
# ['div', 'span']
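Backreferences work in replacements too. As a sketch, the duplicate-word pattern can be extended to remove the repeats rather than just report them:

```python
import re

text = "the the quick brown fox fox"

# (\w+) captures a word; (?:\s+\1\b)+ matches one or more repeats of it.
# The replacement r'\1' keeps a single copy.
deduped = re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)

print(deduped)  # 'the quick brown fox'
```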
6. Anchors and Boundaries
Basic Anchors
# ^ — start of string
print(re.findall(r'^Hello', 'Hello World\nHello Again'))
# ['Hello'] — only the first
# $ — end of string
print(re.findall(r'end$', 'start to end'))
# ['end']
# Multiline mode
print(re.findall(r'^Hello', 'Hello World\nHello Again', re.MULTILINE))
# ['Hello', 'Hello'] — start of each line
Word Boundary (\b)
\b matches the position between a word character and a non-word character. It is zero-width and consumes no characters.
text = "cat catfish concatenate scattered"
# Without \b
print(re.findall(r'cat', text))
# ['cat', 'cat', 'cat', 'cat']
# With \b for exact word
print(re.findall(r'\bcat\b', text))
# ['cat']
# Word start only ("concatenate" is excluded: its "cat" is mid-word)
print(re.findall(r'\bcat\w*', text))
# ['cat', 'catfish']
7. Lookahead and Lookbehind
Lookahead and lookbehind are zero-width assertions. They check conditions without consuming characters.
Four Types
| Type | Syntax | Meaning |
|---|---|---|
| Positive Lookahead | (?=...) | Position followed by ... |
| Negative Lookahead | (?!...) | Position NOT followed by ... |
| Positive Lookbehind | (?<=...) | Position preceded by ... |
| Negative Lookbehind | (?<!...) | Position NOT preceded by ... |
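To make the four assertion types concrete, here is one small example of each. The sample strings are invented for illustration:

```python
import re

sizes = "50% 75 100%"
prices = "$100 200 $300"

# Positive lookahead: digits followed by "%"
followed_by_pct = re.findall(r'\d+(?=%)', sizes)

# Negative lookahead: digits NOT followed by "%" (or by another digit,
# which stops the engine from matching a shorter prefix such as "5")
not_followed_by_pct = re.findall(r'\d+(?![\d%])', sizes)

# Positive lookbehind: digits preceded by "$"
after_dollar = re.findall(r'(?<=\$)\d+', prices)

# Negative lookbehind: digits NOT preceded by "$" or by another digit
not_after_dollar = re.findall(r'(?<![$\d])\d+', prices)

print(followed_by_pct)      # ['50', '100']
print(not_followed_by_pct)  # ['75']
print(after_dollar)         # ['100', '300']
print(not_after_dollar)     # ['200']
```

The extra \d in the two negative assertions matters: without it the engine can backtrack to a shorter run of digits and still satisfy the assertion.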
Practical Example: Password Validation
def validate_password(password):
    """
    Password rules:
    - At least 8 characters
    - At least 1 uppercase letter
    - At least 1 lowercase letter
    - At least 1 digit
    - At least 1 special character
    """
    pattern = r'^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$'
    return bool(re.match(pattern, password))
print(validate_password("Abc123!@")) # True
print(validate_password("abc123")) # False (no uppercase, no special)
print(validate_password("Short1!")) # False (under 8 chars)
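Number Formatting (Lookbehind and Lookahead)
# Add commas to numbers (thousands separator).
# The lookbehind requires a digit on the left; the lookahead requires
# a multiple of three digits on the right, up to the end of the number.
def format_number(n):
    return re.sub(r'(?<=\d)(?=(\d{3})+(?!\d))', ',', str(n))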
print(format_number(1234567890))
# '1,234,567,890'
Negative Lookahead Usage
# Find .js files that are NOT .min.js
files = ["app.js", "app.min.js", "utils.js", "vendor.min.js"]
pattern = r'^(?!.*\.min\.js$).*\.js$'
for f in files:
    if re.match(pattern, f):
        print(f)
# app.js
# utils.js
8. 30 Practical Patterns
Email Patterns
# 1. Basic email validation
email_basic = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
# 2. Extract domain from email
email_domain = r'@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
# 3. Match Gmail addresses only
gmail_only = r'^[a-zA-Z0-9._%+-]+@gmail\.com$'
URL Patterns
# 4. Basic URL matching
url_basic = r'https?://[a-zA-Z0-9.-]+(?:/[^\s]*)?'
# 5. Extract domain from URL
url_domain = r'https?://([^/\s]+)'
# 6. Extract query parameters
query_param = r'[?&]([^=&]+)=([^&]*)'
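A quick sketch of the query-parameter pattern in use, next to the standard library's urllib.parse for comparison. The URLs are made-up examples; the main practical difference shown is that the regex returns raw text while parse_qsl also percent-decodes values:

```python
import re
from urllib.parse import urlsplit, parse_qsl

query_param = r'[?&]([^=&]+)=([^&]*)'

url = "https://example.com/search?q=regex&page=2"
pairs_regex = re.findall(query_param, url)
pairs_stdlib = parse_qsl(urlsplit(url).query)

# With an encoded value, only the stdlib parser decodes "%20" to a space
encoded_url = "https://example.com/?q=a%20b"
raw_value = re.findall(query_param, encoded_url)
decoded_value = parse_qsl(urlsplit(encoded_url).query)

print(pairs_regex)    # [('q', 'regex'), ('page', '2')]
print(pairs_stdlib)   # [('q', 'regex'), ('page', '2')]
print(raw_value)      # [('q', 'a%20b')]
print(decoded_value)  # [('q', 'a b')]
```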
IP Addresses
# 7. IPv4 address
ipv4 = r'\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b'
# 8. Private IP addresses
private_ip = r'\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}|192\.168\.\d{1,3}\.\d{1,3})\b'
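For validating a whole string (rather than finding addresses inside text), re.fullmatch is the natural entry point. The sketch below checks the IPv4 pattern against a few sample inputs and compares it with Python's ipaddress module; the helper names is_ipv4_regex and is_ipv4_stdlib are hypothetical:

```python
import re
import ipaddress

ipv4 = r'\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b'

def is_ipv4_regex(s):
    # fullmatch requires the entire string to be an address
    return re.fullmatch(ipv4, s) is not None

def is_ipv4_stdlib(s):
    # ipaddress raises a ValueError subclass on invalid input
    try:
        ipaddress.IPv4Address(s)
        return True
    except ValueError:
        return False

samples = ["192.168.1.1", "256.1.1.1", "1.2.3"]
results = {s: (is_ipv4_regex(s), is_ipv4_stdlib(s)) for s in samples}
print(results)
# {'192.168.1.1': (True, True), '256.1.1.1': (False, False), '1.2.3': (False, False)}
```

When the input is already isolated as a single string, the standard library check is often the simpler option; the regex is more useful for extraction from logs.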
Phone Numbers
# 9. US phone number
us_phone = r'(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
# 10. International E.164 format ("+", then at most 15 digits, first digit non-zero)
intl_phone = r'\+[1-9]\d{1,14}'
# 11. UK phone number
uk_phone = r'(?:\+44|0)\d{4}[\s-]?\d{6}'
Date and Time
# 12. YYYY-MM-DD
date_iso = r'\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])'
# 13. Multiple date formats
date_multi = r'\d{4}[/.-]\d{1,2}[/.-]\d{1,2}'
# 14. 24-hour time
time_24h = r'(?:[01]\d|2[0-3]):[0-5]\d(?::[0-5]\d)?'
# 15. ISO 8601 datetime
iso_datetime = r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})'
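These date patterns check the shape of the text, not whether the date exists on the calendar. A minimal sketch that combines the YYYY-MM-DD pattern with datetime for the second check; the helper name is_real_date is hypothetical:

```python
import re
from datetime import date

date_iso = r'\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])'

def is_real_date(s):
    # Step 1: cheap shape check with the regex
    if not re.fullmatch(date_iso, s):
        return False
    # Step 2: calendar check (rejects e.g. February 30)
    try:
        date.fromisoformat(s)
        return True
    except ValueError:
        return False

print(is_real_date("2025-03-23"))  # True
print(is_real_date("2025-02-30"))  # False: passes the regex, not the calendar
print(is_real_date("2025-13-01"))  # False: rejected by the regex
```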
Development Patterns
# 16. HTML tags
html_tag = r'<\/?[a-zA-Z][a-zA-Z0-9]*(?:\s[^>]*)?\/?>'
# 17. HEX color codes
hex_color = r'#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})\b'
# 18. Semantic versioning (SemVer)
semver = r'\bv?(\d+)\.(\d+)\.(\d+)(?:-([\w.]+))?(?:\+([\w.]+))?\b'
# 19. JSON keys
json_key = r'"([^"\\]*)"\s*:'
# 20. CSS class selectors
css_class = r'\.[a-zA-Z_][\w-]*'
CJK Characters
# 21. Korean characters (Hangul)
hangul = r'[\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F]+'
# 22. Japanese Hiragana
hiragana = r'[\u3040-\u309F]+'
# 23. Japanese Katakana
katakana = r'[\u30A0-\u30FF]+'
# 24. CJK Unified Ideographs (Kanji/Hanzi)
cjk = r'[\u4E00-\u9FFF]+'
# 25. Any CJK character
cjk_all = r'[\u3000-\u9FFF\uAC00-\uD7AF]+'
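A short usage sketch for the script-specific patterns, run against an invented mixed-script sentence. Python's re interprets the \uXXXX escapes itself, so raw strings work here:

```python
import re

hangul = r'[\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F]+'
hiragana = r'[\u3040-\u309F]+'
cjk = r'[\u4E00-\u9FFF]+'

text = "안녕하세요 hello 世界 こんにちは"

hangul_runs = re.findall(hangul, text)
hiragana_runs = re.findall(hiragana, text)
ideograph_runs = re.findall(cjk, text)

print(hangul_runs)     # ['안녕하세요']
print(hiragana_runs)   # ['こんにちは']
print(ideograph_runs)  # ['世界']
```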
Security and Validation
# 26. Strong password
strong_password = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$'
# 27. SQL injection suspicious patterns
sql_injection = r"(?i)(?:union\s+select|or\s+1\s*=\s*1|drop\s+table|insert\s+into)"
# 28. XSS suspicious patterns
xss_pattern = r'(?i)<script[^>]*>.*?</script>'
Other Useful Patterns
# 29. Normalize whitespace (collapse runs of whitespace into one space)
text = "too    many   spaces"
print(re.sub(r'\s+', ' ', text))
# 'too many spaces'
# 30. CSV field extraction (comma-separated, quoted support)
csv_field = r'(?:"([^"]*)"|([^,]*))'
9. Language-Specific Differences
JavaScript
// Literal syntax
const re1 = /hello/gi;
// Constructor syntax
const re2 = new RegExp('hello', 'gi');
// Key methods
'hello world'.match(/hello/); // ['hello']
'hello world'.replace(/hello/, 'hi'); // 'hi world'
/hello/.test('hello world'); // true
// Named Groups (ES2018+)
const match = '2025-03-23'.match(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
);
console.log(match.groups.year); // '2025'
// matchAll (ES2020+)
const text = 'cat bat sat';
for (const m of text.matchAll(/[a-z]at/g)) {
  console.log(m[0], m.index);
}
// Lookbehind (ES2018+)
'$100 200 $300'.match(/(?<=\$)\d+/g); // ['100', '300']
Python
import re
# Core methods
re.match(r'^hello', 'hello world') # Match at start only
re.search(r'hello', 'say hello') # Match anywhere
re.findall(r'\d+', 'a1 b2 c3') # All matches as list
re.sub(r'\d+', 'N', 'a1 b2') # Substitution: 'aN bN'
# Compilation (performance boost for repeated use)
pattern = re.compile(r'\d{3}-\d{4}')
pattern.findall('555-1234-5678')
# Flags
re.IGNORECASE # Case-insensitive
re.MULTILINE # ^ and $ match line boundaries
re.DOTALL # . matches newlines too
re.VERBOSE # Allow comments and whitespace
# VERBOSE example
phone_re = re.compile(r'''
(\d{3}) # area code
[-.\s]? # separator
(\d{3,4}) # middle digits
[-.\s]? # separator
(\d{4}) # last digits
''', re.VERBOSE)
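A common source of validation bugs is the choice between match, search, and fullmatch. The sketch below shows how the three differ on the same pattern, using sample strings chosen for illustration:

```python
import re

pattern = r'\d{3}-\d{4}'

# match: anchored at the start only, so trailing text is ignored
prefix_ok = re.match(pattern, '555-1234-5678') is not None

# fullmatch: the whole string must match, which is what validation needs
whole_ok = re.fullmatch(pattern, '555-1234-5678') is not None
exact_ok = re.fullmatch(pattern, '555-1234') is not None

# search: finds the pattern anywhere; match fails if it is not at index 0
found_anywhere = re.search(pattern, 'call 555-1234') is not None
found_at_start = re.match(pattern, 'call 555-1234') is not None

print(prefix_ok, whole_ok, exact_ok)   # True False True
print(found_anywhere, found_at_start)  # True False
```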
Java
import java.util.regex.*;
// Basic usage
Pattern pattern = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
Matcher matcher = pattern.matcher("Date: 2025-03-23");
if (matcher.find()) {
    System.out.println(matcher.group()); // "2025-03-23"
}
// Named Groups
Pattern datePattern = Pattern.compile(
    "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})"
);
Matcher m = datePattern.matcher("2025-03-23");
if (m.find()) {
    System.out.println(m.group("year")); // "2025"
}
// String methods
"hello world".matches("hello.*"); // true
"a1b2c3".replaceAll("\\d", "N"); // "aNbNcN"
"a,b,,c".split(",", -1); // ["a", "b", "", "c"]
Go
package main
import (
"fmt"
"regexp"
)
func main() {
    // Compile
    re := regexp.MustCompile(`\d{4}-\d{2}-\d{2}`)
    // Match check
    fmt.Println(re.MatchString("2025-03-23")) // true
    // Find
    fmt.Println(re.FindString("Date: 2025-03-23")) // "2025-03-23"
    // Find all
    fmt.Println(re.FindAllString("2025-03-23 and 2025-12-31", -1))
    // Named Groups
    re2 := regexp.MustCompile(
        `(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})`,
    )
    match := re2.FindStringSubmatch("2025-03-23")
    for i, name := range re2.SubexpNames() {
        if name != "" {
            fmt.Printf("%s: %s\n", name, match[i])
        }
    }
}
Feature Comparison
| Feature | JavaScript | Python | Java | Go |
|---|---|---|---|---|
| Lookahead | Yes | Yes | Yes | No |
| Lookbehind | Yes (ES2018+) | Yes (fixed-width only) | Yes | No |
| Named Groups | Yes (ES2018+) | Yes | Yes | Yes |
| Possessive | No | Yes (3.11+) | Yes | No |
| Atomic Groups | No | Yes (3.11+) | Yes | No |
| Unicode Categories | Yes (u flag) | No (regex module) | Yes | Yes |
| Recursive Patterns | No | No (regex module) | No | No |
| VERBOSE Mode | No | Yes | Yes (COMMENTS) | No |
10. Performance Optimization and ReDoS Prevention
Understanding Backtracking
Backtracking regex engines (the kind used by Python, JavaScript, Java, and PCRE) try alternative paths when a match attempt fails.
# Normal backtracking
# Pattern: a*b
# Input: "aaac"
# Tries: "aaa" + b? fail -> "aa" + b? fail -> "a" + b? fail -> "" + b? fail
# 4 backtracks total — no problem
Catastrophic Backtracking
import time
# Dangerous pattern!
dangerous_pattern = r'(a+)+b'
safe_pattern = r'a+b'
# Short input — both fast
text_short = 'a' * 10 + 'c'
# Long input — dangerous pattern takes exponential time
text_long = 'a' * 25 + 'c'
start = time.time()
re.match(safe_pattern, text_long)
print(f"Safe pattern: {time.time() - start:.4f}s")
# WARNING: The following may take a very long time!
# start = time.time()
# re.match(dangerous_pattern, text_long)
# print(f"Dangerous pattern: {time.time() - start:.4f}s")
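To see the growth without risking a hung process, the dangerous pattern can be timed on deliberately small inputs. This is a rough sketch: absolute timings depend on the machine and Python version, and time_match is a hypothetical helper name.

```python
import re
import time

def time_match(pattern, text):
    """Return the wall-clock seconds taken by one re.match call."""
    start = time.perf_counter()
    re.match(pattern, text)
    return time.perf_counter() - start

timings = {}
for n in (10, 14, 18, 20):
    text = 'a' * n + 'c'  # no 'b', so every attempt ultimately fails
    timings[n] = time_match(r'(a+)+b', text)
    safe = time_match(r'a+b', text)
    print(f"n={n:2d}  dangerous={timings[n]:.5f}s  safe={safe:.5f}s")
```

On a backtracking engine, each additional 'a' should roughly double the time for the dangerous pattern, while the safe pattern stays flat.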
Vulnerable ReDoS Patterns
# 1. Nested quantifiers
r'(a+)+' # Dangerous!
r'(a*)*' # Dangerous!
r'(a+)*' # Dangerous!
# 2. Overlapping alternatives
r'(a|a)+' # Dangerous!
r'(\d+|\d+\.)+' # Dangerous!
# 3. Safe replacements
r'a+' # Instead of (a+)+
r'a*' # Instead of (a*)*
r'\d+(?:\.\d+)*\.?' # Instead of (\d+|\d+\.)+ (same matches, no ambiguity)
ReDoS Prevention Guidelines
- No nested quantifiers: avoid patterns such as (a+)+ and (a*)*
- Eliminate overlapping alternatives: ensure OR alternatives do not match the same characters
- Use atomic groups: use (?>...) in engines that support it
- Limit input length: validate input length before applying regex
- Set timeouts: put time limits on regex execution
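# Timeout in Python (Unix only: SIGALRM is unavailable on Windows
# and signal handlers can only be installed from the main thread)
import re
import signal

def timeout_handler(signum, frame):
    raise TimeoutError("Regex timeout!")

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1)  # 1 second timeout
try:
    re.match(r'(a+)+b', 'a' * 30 + 'c')
except TimeoutError:
    print("Regex execution timed out.")
finally:
    signal.alarm(0)  # always cancel the pending alarm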
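The "limit input length" guideline is simpler to apply and works on every platform. A minimal sketch, assuming a hypothetical guarded_fullmatch helper and an arbitrary limit of 256 characters that should be tuned per field:

```python
import re

MAX_INPUT_LEN = 256  # assumed limit, not a recommendation for every field

email_basic = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

def guarded_fullmatch(pattern, text, max_len=MAX_INPUT_LEN):
    """Reject oversized input before the regex engine ever sees it."""
    if len(text) > max_len:
        return None
    return re.fullmatch(pattern, text)

print(bool(guarded_fullmatch(email_basic, "user@example.com")))          # True
print(bool(guarded_fullmatch(email_basic, "a" * 1000 + "@example.com")))  # False
```

A length cap bounds the worst case but does not remove it, so it works best alongside a pattern that has no nested quantifiers.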
11. Tools
regex101.com
The most popular online regex tester.
Key features:
- Real-time match highlighting
- Auto-generated pattern explanation
- Supports JavaScript, Python, Go, Java, and more
- Step-by-step match debugger
- Save and share patterns
grep and ripgrep
# grep basics
grep -E 'error|warning' /var/log/syslog
grep -P '(?<=user:)\w+' access.log # Perl-compatible regex
# ripgrep (rg) — faster alternative
rg 'TODO|FIXME' --type py
rg '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b' access.log
rg -e 'pattern1' -e 'pattern2' . # Multiple patterns
sed and awk
# sed — stream editor
# Convert date format: 2025/03/23 -> 2025-03-23
echo "2025/03/23" | sed 's/\//-/g'
# Mask email address
echo "user@example.com" | sed 's/\(.\).*@/\1***@/'
# u***@example.com
# awk — pattern matching + processing
# Print only 200 status code logs
awk '/HTTP\/[0-9.]+" 200/' access.log
# Count requests per IP
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head
IDE Usage
Most IDEs (VS Code, IntelliJ, Vim, etc.) support regex search and replace.
VS Code tips:
- Ctrl+H (Cmd+Option+F on macOS) for replace mode
- Click the regex icon (.*) to enable
- Use $1, $2 for backreferences in replacement
12. Interview Questions
Q1. What is the difference between .* and .*??
.* is greedy — it matches as many characters as possible. .*? is lazy — it matches as few characters as possible.
Q2. How do ^ and $ behave differently in multiline mode?
In default mode, ^ matches the start of the string and $ matches the end. In multiline mode (re.MULTILINE), they match the start and end of each line.
Q3. What is the difference between lookahead and lookbehind?
Lookahead (?=...) checks the pattern after the current position, while lookbehind (?<=...) checks the pattern before it. Both are zero-width assertions that do not consume characters.
Q4. What is ReDoS and how do you prevent it?
ReDoS is when an inefficient regex causes exponential backtracking, consuming excessive CPU. Prevent it by avoiding nested quantifiers, limiting input length, and setting execution timeouts.
Q5. What is the difference between capturing and non-capturing groups?
Capturing groups (...) store the matched string for later reference. Non-capturing groups (?:...) group without storing, providing a slight performance benefit.
Q6. What does \b do?
It matches a word boundary — the position between a word character (\w) and a non-word character, or between a word character and the start/end of the string.
Q7. What are the limitations of using regex for email validation?
Fully implementing the RFC 5322 standard requires an extremely complex pattern. In practice, use regex for basic format validation only and verify actual existence through confirmation emails.
Q8. What is the difference between re.match and re.search in Python?
re.match only attempts matching at the start of the string, while re.search scans the entire string for the first match.
Q9. How do you match a literal backslash in regex?
The regex itself needs \\ (an escaped backslash). In an ordinary string literal each of those backslashes must be escaped again, giving '\\\\'. In a Python raw string, r'\\' is enough.
Q10. Why should you not parse HTML with regex?
HTML is a context-free language, not a regular language. Nested tags, comments, CDATA sections, and other features cannot be correctly handled by regex. Use an HTML parser (BeautifulSoup, Cheerio, etc.) instead.
13. Quiz
Q1. What does the pattern \d{3}-\d{4} match in "010-1234-5678"?
It matches "010-1234" (the first occurrence). re.findall also returns only ['010-1234']: matches never overlap, so the scan resumes at "-5678", which contains no three-digit group followed by a hyphen. To match the full phone number, use \d{3}-\d{4}-\d{4}.
Q2. Why is (a+)+b a dangerous pattern?
When the input does not end with 'b' (e.g., "aaaaac"), the engine tries every combination of inner and outer group repetitions, causing exponential backtracking. With just 25 'a' characters, millions of attempts are needed. Simply replacing it with a+b solves the problem.
Q3. What is the difference between (?:...) and (?=...)?
(?:...) is a non-capturing group that consumes characters while grouping (it just does not capture). (?=...) is a lookahead that does not consume characters at all — it is a zero-width assertion that only checks a condition.
Q4. Why can Go not use lookahead?
Go's regexp package uses the RE2 syntax and engine design, which guarantees linear-time matching by not using backtracking. RE2 leaves out features it cannot support within that guarantee, including lookaround and backreferences.
Q5. Which of the following patterns match "abc"? a) [abc]+ b) [^abc]+ c) a.c d) a\bc
a) [abc]+ — matches all of "abc". b) [^abc]+ — matches characters NOT in a, b, c, so it fails. c) a.c — the . matches b, so it matches "abc". d) a\bc — \b is a word boundary, but there is no word boundary between 'a' and 'b', so it fails. The answers are a and c.
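As a follow-up to Quiz Q1: re.findall never returns overlapping matches, but wrapping the pattern in a capturing lookahead is a common way to collect overlapping candidates. A small sketch on the same input:

```python
import re

text = "010-1234-5678"

# Plain findall: matches cannot overlap, so only the first one is found
plain = re.findall(r'\d{3}-\d{4}', text)

# The lookahead is zero-width, so the engine advances one character at a
# time and the capture group reports every position where the pattern fits
overlapping = re.findall(r'(?=(\d{3}-\d{4}))', text)

print(plain)        # ['010-1234']
print(overlapping)  # ['010-1234', '234-5678']
```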
14. References
- Mastering Regular Expressions (Jeffrey Friedl) — The definitive regex book
- Regular-Expressions.info — https://www.regular-expressions.info/
- regex101 — https://regex101.com/ — Online regex tester
- Regexr — https://regexr.com/ — Visual regex learning
- MDN Web Docs: Regular Expressions — JavaScript regex reference
- Python re module docs — https://docs.python.org/3/library/re.html
- RE2 Syntax — https://github.com/google/re2/wiki/Syntax
- ReDoS Prevention — OWASP Guide
- ripgrep (rg) — https://github.com/BurntSushi/ripgrep
- Debuggex — https://www.debuggex.com/ — Regex visualization
- Regexper — https://regexper.com/ — Railroad diagrams
- Automata Theory — Regular languages and finite automata