Regular Expressions Complete Guide 2025: From Basics to Advanced Patterns and Real-World Applications

1. Why Regex Still Matters

In 2025, regular expressions remain a core developer tool.

Input validation — The most concise way to validate email, URL, phone number, and credit card formats.

Log analysis — Extract specific patterns (error codes, IP addresses, timestamps) from multi-GB log files.

Text transformation — Used for code refactoring, data cleansing, and file format conversion.

AI prompt preprocessing — Remove unnecessary special characters and HTML tags before feeding input to LLMs.

Universal support — Built into virtually every programming language: JavaScript, Python, Java, Go, Rust, and more.


2. Basics: Literals and Metacharacters

Literal Matching

The most basic form of regex is literal matching — characters match themselves.

import re

# Literal matching
print(re.findall(r'hello', 'hello world hello'))
# ['hello', 'hello']

Core Metacharacters

Metachar  Meaning                           Example   Matches
.         Any single char (except newline)  h.t       hat, hit, hot
^         Start of string                   ^Hello    Hello in "Hello world"
$         End of string                     world$    world in "Hello world"
*         0 or more repetitions             ab*c      ac, abc, abbc
+         1 or more repetitions             ab+c      abc, abbc (not ac)
?         0 or 1 occurrence                 colou?r   color, colour
|         OR (alternation)                  cat|dog   cat or dog
[]        Character class                   [aeiou]   one vowel
()        Grouping / Capture                (ab)+     ab, abab
\         Escape                            \.        literal period

Characters That Need Escaping

These metacharacters must be preceded by \ when used as literals:

. * + ? ^ $ | [ ] ( ) { } \
# Match a literal period
print(re.findall(r'www\.example\.com', 'visit www.example.com'))
# ['www.example.com']
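
When the text to match comes from somewhere else (user input, a config value), escaping by hand is error-prone. A minimal sketch using Python's re.escape follows; the search term and sample sentence are made up for illustration.

```python
import re

# Hypothetical user-supplied search term that happens to contain metacharacters
user_input = "1+1=2 (approx.)"
text = "We all know 1+1=2 (approx.) holds."

# Used as-is, the +, ., and parentheses are interpreted as regex syntax
unescaped_hits = re.findall(user_input, text)
print(unescaped_hits)
# []

# re.escape() backslash-escapes the special characters for us
escaped_hits = re.findall(re.escape(user_input), text)
print(escaped_hits)
# ['1+1=2 (approx.)']
```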

3. Character Classes

Basic Character Classes

# [abc] — one of a, b, or c
re.findall(r'[aeiou]', 'hello world')
# ['e', 'o', 'o']

# [^abc] — any character NOT a, b, or c
re.findall(r'[^aeiou ]', 'hello world')
# ['h', 'l', 'l', 'w', 'r', 'l', 'd']

# [a-z] — range
re.findall(r'[A-Z]', 'Hello World')
# ['H', 'W']

# [0-9a-fA-F] — hex characters
re.findall(r'[0-9a-fA-F]+', 'color: #FF00AB')
# ['c', 'FF00AB']  (the 'c' in 'color' is a hex character too)

Shorthand Character Classes

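Shorthand  Equivalent        Meaning
\d         [0-9]             Digit
\D         [^0-9]            Non-digit
\w         [a-zA-Z0-9_]      Word character
\W         [^a-zA-Z0-9_]     Non-word character
\s         [ \t\n\r\f\v]     Whitespace
\S         [^ \t\n\r\f\v]    Non-whitespace

In Python 3 these equivalents hold exactly only with the re.ASCII flag; by default \d, \w, and \s also match Unicode digits, letters, and whitespace.
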
# Extract phone numbers
text = "Call 555-1234 or 800-555-6789"
print(re.findall(r'\d{3}-\d{3,4}(?:-\d{4})?', text))
# ['555-1234', '800-555-6789']
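
The shorthand classes are Unicode-aware for str patterns in Python 3, which is easy to overlook. The sketch below compares the default behavior with re.ASCII; the sample text is invented for the demonstration.

```python
import re

# Sample text with two accented letters (written as escapes to avoid encoding surprises)
text = "na\u00efve caf\u00e9 42"

# Default (Unicode) matching: \w also covers accented letters
unicode_words = re.findall(r'\w+', text)
print(unicode_words)
# ['naïve', 'café', '42']

# re.ASCII restricts \w to [a-zA-Z0-9_]
ascii_words = re.findall(r'\w+', text, re.ASCII)
print(ascii_words)
# ['na', 've', 'caf', '42']
```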

POSIX Character Classes (Selected Languages)

Class      Meaning
[:alpha:]  Alphabetic characters
[:digit:]  Digits
[:alnum:]  Alphanumeric
[:space:]  Whitespace
[:upper:]  Uppercase letters
[:lower:]  Lowercase letters
[:punct:]  Punctuation

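POSIX classes are mainly used in UNIX tools like grep and sed, where they appear inside a bracket expression, for example [[:digit:]]+. Python's re module and JavaScript do not support them; use \d, \w, and so on instead.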


4. Quantifiers

Basic Quantifiers

# * (0 or more)
re.findall(r'go*d', 'gd god good goood')
# ['gd', 'god', 'good', 'goood']

# + (1 or more)
re.findall(r'go+d', 'gd god good goood')
# ['god', 'good', 'goood']

# ? (0 or 1)
re.findall(r'colou?r', 'color colour')
# ['color', 'colour']

Exact Count Specifiers

# {n} — exactly n times
re.findall(r'\d{4}', '2025 is the year 12345')
# ['2025', '1234']

# {n,m} — between n and m times
re.findall(r'\d{2,4}', '1 12 123 1234 12345')
# ['12', '123', '1234', '1234']

# {n,} — n or more times
re.findall(r'\d{3,}', '1 12 123 1234')
# ['123', '1234']

Greedy vs Lazy Matching

By default, quantifiers are greedy — they match as much as possible. Appending ? makes them lazy.

text = '<b>bold</b> and <i>italic</i>'

# Greedy (default)
print(re.findall(r'<.*>', text))
# ['<b>bold</b> and <i>italic</i>']  — matches everything!

# Lazy (add ?)
print(re.findall(r'<.*?>', text))
# ['<b>', '</b>', '<i>', '</i>']  — matches each tag

Possessive Quantifiers

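Possessive quantifiers (*+, ++, ?+) never backtrack: once they have consumed characters, they do not give them back. They are supported in Java, PCRE, and Python's re since version 3.11 (older Python versions need the third-party regex module).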

// Java example
// Possessive quantifier — no backtracking
String pattern = "a++b";  // Consumes all 'a's, never gives back

5. Groups

Capturing Groups

# Basic capturing group
text = "2025-03-23"
match = re.match(r'(\d{4})-(\d{2})-(\d{2})', text)
if match:
    print(match.group(0))  # '2025-03-23' (full match)
    print(match.group(1))  # '2025' (year)
    print(match.group(2))  # '03' (month)
    print(match.group(3))  # '23' (day)

Non-Capturing Groups

(?:...) groups without capturing. Nothing is stored for the group, so group numbering stays clean and there is a slight performance benefit.

# Non-capturing group
text = "http://example.com https://secure.com"
print(re.findall(r'(?:https?://)(\S+)', text))
# ['example.com', 'secure.com']  — captures only the domain
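
The choice between the two group types visibly changes what re.findall returns. A small sketch on an invented input:

```python
import re

text = "2025-03-23 and 2024-12-31"

# With a capturing group, findall returns only the group's content
years_only = re.findall(r'(\d{4})-\d{2}-\d{2}', text)
print(years_only)
# ['2025', '2024']

# With a non-capturing group, findall returns the whole match
full_dates = re.findall(r'(?:\d{4})-\d{2}-\d{2}', text)
print(full_dates)
# ['2025-03-23', '2024-12-31']
```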

Named Groups

# Named groups
text = "2025-03-23"
match = re.match(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', text)
if match:
    print(match.group('year'))   # '2025'
    print(match.group('month'))  # '03'
    print(match.group('day'))    # '23'
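
Named groups can also be read all at once and reused in replacements. The sketch below uses match.groupdict() and the \g<name> replacement syntax; the sample sentence is made up.

```python
import re

date_re = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# groupdict() returns every named group as a dictionary
parts = date_re.match("2025-03-23").groupdict()
print(parts)
# {'year': '2025', 'month': '03', 'day': '23'}

# \g<name> refers to a named group inside the replacement string
reordered = date_re.sub(r'\g<day>/\g<month>/\g<year>', "Released 2025-03-23")
print(reordered)
# Released 23/03/2025
```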

Backreferences

# Find duplicate words
text = "the the quick brown fox fox"
print(re.findall(r'\b(\w+)\s+\1\b', text))
# ['the', 'fox']

# Match HTML tags (same tag name)
html = "<div>content</div> <span>text</span>"
print(re.findall(r'<(\w+)>.*?</\1>', html))
# ['div', 'span']
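
Backreferences work in replacements too. As a rough sketch, the duplicate-word pattern above can be extended to collapse repeated words; DUP_RE is a name chosen here for illustration.

```python
import re

# A word followed by one or more repetitions of itself
DUP_RE = re.compile(r'\b(\w+)(\s+\1\b)+')

text = "the the quick brown fox fox"

# \1 in the replacement keeps a single copy of the repeated word
deduped = DUP_RE.sub(r'\1', text)
print(deduped)
# the quick brown fox
```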

6. Anchors and Boundaries

Basic Anchors

# ^ — start of string
print(re.findall(r'^Hello', 'Hello World\nHello Again'))
# ['Hello']  — only the first

# $ — end of string
print(re.findall(r'end$', 'start to end'))
# ['end']

# Multiline mode
print(re.findall(r'^Hello', 'Hello World\nHello Again', re.MULTILINE))
# ['Hello', 'Hello']  — start of each line

Word Boundary (\b)

\b matches the position between a word character and a non-word character. It is zero-width and consumes no characters.

text = "cat catfish concatenate scattered"

# Without \b
print(re.findall(r'cat', text))
# ['cat', 'cat', 'cat', 'cat']

# With \b for exact word
print(re.findall(r'\bcat\b', text))
# ['cat']

# Word start only
print(re.findall(r'\bcat\w*', text))
# ['cat', 'catfish']  (in 'concatenate', 'cat' does not start at a word boundary)

7. Lookahead and Lookbehind

Lookahead and lookbehind are zero-width assertions. They check conditions without consuming characters.

Four Types

Type                 Syntax     Meaning
Positive Lookahead   (?=...)    Position followed by ...
Negative Lookahead   (?!...)    Position NOT followed by ...
Positive Lookbehind  (?<=...)   Position preceded by ...
Negative Lookbehind  (?<!...)   Position NOT preceded by ...
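
The four types can be compared side by side on one input. This is a small sketch with an invented price string; the extra \b and \d guards in the negative cases are there to stop the engine from backtracking into a shorter run of digits.

```python
import re

text = "$100 200 $300 400USD"

# Positive lookahead: digits followed by "USD"
followed_by_usd = re.findall(r'\d+(?=USD)', text)
print(followed_by_usd)        # ['400']

# Negative lookahead: whole numbers NOT followed by "USD"
not_followed_by_usd = re.findall(r'\b\d+(?!\d|USD)', text)
print(not_followed_by_usd)    # ['100', '200', '300']

# Positive lookbehind: digits preceded by "$"
after_dollar = re.findall(r'(?<=\$)\d+', text)
print(after_dollar)           # ['100', '300']

# Negative lookbehind: whole numbers NOT preceded by "$"
not_after_dollar = re.findall(r'(?<![\$\d])\d+', text)
print(not_after_dollar)       # ['200', '400']
```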

Practical Example: Password Validation

def validate_password(password):
    """
    Password rules:
    - At least 8 characters
    - At least 1 uppercase letter
    - At least 1 lowercase letter
    - At least 1 digit
    - At least 1 special character
    """
    pattern = r'^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$'
    return bool(re.match(pattern, password))

print(validate_password("Abc123!@"))   # True
print(validate_password("abc123"))     # False (no uppercase, no special)
print(validate_password("Short1!"))    # False (under 8 chars)

Number Formatting (Lookbehind and Lookahead)

# Add commas to numbers (thousands separator)
def format_number(n):
    return re.sub(r'(?<=\d)(?=(\d{3})+(?!\d))', ',', str(n))

print(format_number(1234567890))
# '1,234,567,890'

Negative Lookahead Usage

# Find .js files that are NOT .min.js
files = ["app.js", "app.min.js", "utils.js", "vendor.min.js"]
pattern = r'^(?!.*\.min\.js$).*\.js$'
for f in files:
    if re.match(pattern, f):
        print(f)
# app.js
# utils.js

8. 30 Practical Patterns

Email Patterns

# 1. Basic email validation
email_basic = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

# 2. Extract domain from email
email_domain = r'@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'

# 3. Match Gmail addresses only
gmail_only = r'^[a-zA-Z0-9._%+-]+@gmail\.com$'

URL Patterns

# 4. Basic URL matching
url_basic = r'https?://[a-zA-Z0-9.-]+(?:/[^\s]*)?'

# 5. Extract domain from URL
url_domain = r'https?://([^/\s]+)'

# 6. Extract query parameters
query_param = r'[?&]([^=&]+)=([^&]*)'
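
As a usage sketch, pattern 6 can turn a query string into a dictionary. The URL is made up, and for production code the standard library's urllib.parse is likely the safer choice because it also handles percent-decoding and repeated keys.

```python
import re
from urllib.parse import parse_qs, urlsplit

query_param = r'[?&]([^=&]+)=([^&]*)'
url = "https://example.com/search?q=regex&lang=en&page=2"

# findall returns (key, value) tuples, which dict() accepts directly
params = dict(re.findall(query_param, url))
print(params)
# {'q': 'regex', 'lang': 'en', 'page': '2'}

# Standard-library alternative: values come back as lists
parsed = parse_qs(urlsplit(url).query)
print(parsed)
# {'q': ['regex'], 'lang': ['en'], 'page': ['2']}
```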

IP Addresses

# 7. IPv4 address
ipv4 = r'\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b'

# 8. Private IP addresses
private_ip = r'\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}|192\.168\.\d{1,3}\.\d{1,3})\b'
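
A possible way to use pattern 7 on log data is sketched below. The log line is invented, and the private/public classification is delegated to the standard ipaddress module rather than to pattern 8, which is a design choice of this sketch.

```python
import ipaddress
import re

ipv4 = r'\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b'

# Hypothetical log line: two valid addresses and one with an out-of-range octet
log = "10.0.0.5 - GET /index.html upstream=8.8.8.8 bad=999.1.1.1"

found = re.findall(ipv4, log)
print(found)
# ['10.0.0.5', '8.8.8.8']

# Let the ipaddress module decide which addresses are private
is_private = {ip: ipaddress.ip_address(ip).is_private for ip in found}
print(is_private)
# {'10.0.0.5': True, '8.8.8.8': False}
```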

Phone Numbers

# 9. US phone number
us_phone = r'(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'

# 10. International E.164 format (at most 15 digits, no leading zero)
intl_phone = r'\+[1-9]\d{1,14}'

# 11. UK phone number
uk_phone = r'(?:\+44|0)\d{4}[\s-]?\d{6}'

Date and Time

# 12. YYYY-MM-DD
date_iso = r'\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])'

# 13. Multiple date formats
date_multi = r'\d{4}[/.-]\d{1,2}[/.-]\d{1,2}'

# 14. 24-hour time
time_24h = r'(?:[01]\d|2[0-3]):[0-5]\d(?::[0-5]\d)?'

# 15. ISO 8601 datetime
iso_datetime = r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})'
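
These date patterns check the shape of the text, not the calendar: pattern 12 accepts 2025-02-30, for example. One way to close that gap, sketched here with a hypothetical helper named is_real_date, is to follow the regex with datetime parsing.

```python
import re
from datetime import date

date_iso = r'\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])'

def is_real_date(s):
    """Shape check with the regex first, then a calendar check with datetime."""
    if not re.fullmatch(date_iso, s):
        return False
    try:
        date.fromisoformat(s)
        return True
    except ValueError:
        return False

print(is_real_date("2025-03-23"))  # True
print(is_real_date("2025-02-30"))  # False (passes the regex, but not a real date)
print(is_real_date("2025-13-01"))  # False (rejected by the regex)
```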

Development Patterns

# 16. HTML tags
html_tag = r'<\/?[a-zA-Z][a-zA-Z0-9]*(?:\s[^>]*)?\/?>'

# 17. HEX color codes
hex_color = r'#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})\b'

# 18. Semantic versioning (SemVer); identifiers may contain letters, digits, and hyphens
semver = r'\bv?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+([0-9A-Za-z.-]+))?\b'

# 19. JSON keys
json_key = r'"([^"\\]*)"\s*:'

# 20. CSS class selectors
css_class = r'\.[a-zA-Z_][\w-]*'

CJK Characters

# 21. Korean characters (Hangul)
hangul = r'[\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F]+'

# 22. Japanese Hiragana
hiragana = r'[\u3040-\u309F]+'

# 23. Japanese Katakana
katakana = r'[\u30A0-\u30FF]+'

# 24. CJK Unified Ideographs (Kanji/Hanzi)
cjk = r'[\u4E00-\u9FFF]+'

# 25. Any CJK character
cjk_all = r'[\u3000-\u9FFF\uAC00-\uD7AF]+'

Security and Validation

# 26. Strong password
strong_password = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$'

# 27. SQL injection suspicious patterns
sql_injection = r"(?i)(?:union\s+select|or\s+1\s*=\s*1|drop\s+table|insert\s+into)"

# 28. XSS suspicious patterns (the s flag lets . span newlines inside the script body)
xss_pattern = r'(?is)<script[^>]*>.*?</script>'

Other Useful Patterns

# 29. Normalize whitespace (collapse multiple spaces)
text = "too    many   spaces"
print(re.sub(r'\s+', ' ', text))
# 'too many spaces'

# 30. CSV field extraction (comma-separated, quoted support)
csv_field = r'(?:"([^"]*)"|([^,]*))'
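
Applied with re.findall, a pattern like number 30 can also produce empty matches at the commas, so some anchoring helps. The sketch below uses a variant that requires each field to start at the beginning of the line or after a comma (an assumption of this example, not part of the original pattern) and cross-checks the result against the standard csv module, which remains the more robust tool for real CSV data.

```python
import csv
import re

# Variant of pattern 30: each field must follow the line start or a comma
csv_field = r'(?:^|,)(?:"([^"]*)"|([^,]*))'

line = 'a,"b,c",,d'

# Each match yields (quoted, unquoted); at most one of the two is non-empty
regex_fields = [quoted or unquoted for quoted, unquoted in re.findall(csv_field, line)]
print(regex_fields)
# ['a', 'b,c', '', 'd']

# Cross-check with the csv module
csv_fields = next(csv.reader([line]))
print(csv_fields)
# ['a', 'b,c', '', 'd']
```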

9. Language-Specific Differences

JavaScript

// Literal syntax
const re1 = /hello/gi;

// Constructor syntax
const re2 = new RegExp('hello', 'gi');

// Key methods
'hello world'.match(/hello/);       // ['hello']
'hello world'.replace(/hello/, 'hi'); // 'hi world'
/hello/.test('hello world');         // true

// Named Groups (ES2018+)
const match = '2025-03-23'.match(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
);
console.log(match.groups.year);  // '2025'

// matchAll (ES2020+)
const text = 'cat bat sat';
for (const m of text.matchAll(/[a-z]at/g)) {
  console.log(m[0], m.index);
}

// Lookbehind (ES2018+)
'$100 200 $300'.match(/(?<=\$)\d+/g);  // ['100', '300']

Python

import re

# Core methods
re.match(r'^hello', 'hello world')   # Match at start only
re.search(r'hello', 'say hello')     # Match anywhere
re.findall(r'\d+', 'a1 b2 c3')      # All matches as list
re.sub(r'\d+', 'N', 'a1 b2')        # Substitution: 'aN bN'

# Compilation (modest speedup and cleaner code for repeated use)
pattern = re.compile(r'\d{3}-\d{4}')
pattern.findall('555-1234-5678')    # ['555-1234']

# Flags
re.IGNORECASE  # Case-insensitive
re.MULTILINE   # ^ and $ match line boundaries
re.DOTALL      # . matches newlines too
re.VERBOSE     # Allow comments and whitespace

# VERBOSE example
phone_re = re.compile(r'''
    (\d{3})     # area code
    [-.\s]?     # separator
    (\d{3,4})   # middle digits
    [-.\s]?     # separator
    (\d{4})     # last digits
''', re.VERBOSE)
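
A short usage sketch for the VERBOSE pattern follows; the pattern is repeated so the snippet runs on its own, and the sample inputs are invented.

```python
import re

phone_re = re.compile(r'''
    (\d{3})     # area code
    [-.\s]?     # separator
    (\d{3,4})   # middle digits
    [-.\s]?     # separator
    (\d{4})     # last digits
''', re.VERBOSE)

# Whitespace and comments in the pattern are ignored; only the tokens matter
with_dashes = phone_re.search("Call 555-123-4567 today").groups()
print(with_dashes)
# ('555', '123', '4567')

# Separators are optional, so a plain digit run is split the same way
no_separators = phone_re.search("5551234567").groups()
print(no_separators)
# ('555', '123', '4567')
```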

Java

import java.util.regex.*;

// Basic usage
Pattern pattern = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
Matcher matcher = pattern.matcher("Date: 2025-03-23");
if (matcher.find()) {
    System.out.println(matcher.group());  // "2025-03-23"
}

// Named Groups
Pattern datePattern = Pattern.compile(
    "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})"
);
Matcher m = datePattern.matcher("2025-03-23");
if (m.find()) {
    System.out.println(m.group("year"));  // "2025"
}

// String methods
"hello world".matches("hello.*");   // true
"a1b2c3".replaceAll("\\d", "N");   // "aNbNcN"
"a,b,,c".split(",", -1);           // ["a", "b", "", "c"]

Go

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Compile
    re := regexp.MustCompile(`\d{4}-\d{2}-\d{2}`)

    // Match check
    fmt.Println(re.MatchString("2025-03-23"))  // true

    // Find
    fmt.Println(re.FindString("Date: 2025-03-23"))  // "2025-03-23"

    // Find all
    fmt.Println(re.FindAllString("2025-03-23 and 2025-12-31", -1))

    // Named Groups
    re2 := regexp.MustCompile(
        `(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})`,
    )
    match := re2.FindStringSubmatch("2025-03-23")
    for i, name := range re2.SubexpNames() {
        if name != "" {
            fmt.Printf("%s: %s\n", name, match[i])
        }
    }
}

Feature Comparison

Feature             JavaScript      Python (re)         Java            Go
Lookahead           Yes             Yes                 Yes             No
Lookbehind          Yes (ES2018+)   Yes (fixed width)   Yes             No
Named Groups        Yes (ES2018+)   Yes                 Yes             Yes
Possessive          No              Yes (3.11+)         Yes             No
Atomic Groups       No              Yes (3.11+)         Yes             No
Unicode Categories  Yes (u flag)    No (regex module)   Yes             Yes
Recursive Patterns  No              No (regex module)   No              No
VERBOSE Mode        No              Yes                 Yes (COMMENTS)  No

10. Performance Optimization and ReDoS Prevention

Understanding Backtracking

Backtracking regex engines (the kind used by Python, JavaScript, and Java) try alternative paths when a match attempt fails.

# Normal backtracking
# Pattern: a*b
# Input: "aaac"
# Tries: "aaa" + b? fail -> "aa" + b? fail -> "a" + b? fail -> "" + b? fail
# 4 backtracks total — no problem

Catastrophic Backtracking

import re
import time

# Dangerous pattern!
dangerous_pattern = r'(a+)+b'
safe_pattern = r'a+b'

# Short input — both fast
text_short = 'a' * 10 + 'c'

# Long input — dangerous pattern takes exponential time
text_long = 'a' * 25 + 'c'

start = time.time()
re.match(safe_pattern, text_long)
print(f"Safe pattern: {time.time() - start:.4f}s")

# WARNING: The following may take a very long time!
# start = time.time()
# re.match(dangerous_pattern, text_long)
# print(f"Dangerous pattern: {time.time() - start:.4f}s")
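
The growth can be observed safely on smaller inputs. The sketch below times both patterns for a few input sizes; the sizes are arbitrary choices, and the absolute timings depend on the machine, so none are shown. In a backtracking engine the dangerous pattern's time should roughly double with each extra 'a'.

```python
import re
import time

def time_match(pattern, n):
    """Time one failed match against 'a' * n + 'c'."""
    text = 'a' * n + 'c'
    start = time.perf_counter()
    result = re.match(pattern, text)
    return result, time.perf_counter() - start

results = []
for n in (12, 14, 16, 18, 20):
    dangerous, t_bad = time_match(r'(a+)+b', n)
    safe, t_ok = time_match(r'a+b', n)
    results.append((dangerous, safe))
    print(f"n={n:2d}  (a+)+b: {t_bad:.4f}s   a+b: {t_ok:.6f}s")
```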

Vulnerable ReDoS Patterns

# 1. Nested quantifiers
r'(a+)+'      # Dangerous!
r'(a*)*'      # Dangerous!
r'(a+)*'      # Dangerous!

# 2. Overlapping alternatives
r'(a|a)+'     # Dangerous!
r'(\d+|\d+\.)+' # Dangerous!

# 3. Safe replacements
r'a+'         # Instead of (a+)+
r'a*'         # Instead of (a*)*
r'\d+(?:\.\d+)*\.?'  # Instead of (\d+|\d+\.)+

ReDoS Prevention Guidelines

  1. No nested quantifiers — Avoid (a+)+, (a*)* patterns
  2. Eliminate overlapping alternatives — Ensure OR alternatives do not match the same characters
  3. Use atomic groups — Use (?>...) in engines that support it
  4. Limit input length — Validate input length before applying regex
  5. Set timeouts — Put time limits on regex execution
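# Timeout in Python (Unix only: SIGALRM does not exist on Windows,
# and the handler must be installed from the main thread)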
import signal

def timeout_handler(signum, frame):
    raise TimeoutError("Regex timeout!")

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1)  # 1 second timeout

try:
    re.match(r'(a+)+b', 'a' * 30 + 'c')
except TimeoutError:
    print("Regex execution timed out.")
finally:
    signal.alarm(0)
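
Guideline 4 (limit input length) is often the cheapest defense. A minimal sketch, assuming a hypothetical cap of 254 characters and a helper named is_valid_email built on pattern 1:

```python
import re

MAX_INPUT_LENGTH = 254  # hypothetical cap; pick one that suits the field being validated
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def is_valid_email(value, max_len=MAX_INPUT_LENGTH):
    """Reject over-long input before the regex engine ever sees it."""
    if len(value) > max_len:
        return False
    return EMAIL_RE.fullmatch(value) is not None

print(is_valid_email("user@example.com"))        # True
print(is_valid_email("a" * 10_000 + "@x.com"))   # False (too long, regex not run)
```

The length check runs in constant time, so even a pattern with poor worst-case behavior only ever sees bounded input.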

11. Tools

regex101.com

The most popular online regex tester.

Key features:

  • Real-time match highlighting
  • Auto-generated pattern explanation
  • Supports JavaScript, Python, Go, Java, and more
  • Step-by-step match debugger
  • Save and share patterns

grep and ripgrep

# grep basics
grep -E 'error|warning' /var/log/syslog
grep -P '(?<=user:)\w+' access.log  # Perl-compatible regex

# ripgrep (rg) — faster alternative
rg 'TODO|FIXME' --type py
rg '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b' access.log
rg -e 'pattern1' -e 'pattern2' .  # Multiple patterns

sed and awk

# sed — stream editor
# Convert date format: 2025/03/23 -> 2025-03-23
echo "2025/03/23" | sed 's/\//-/g'

# Mask email address
echo "user@example.com" | sed 's/\(.\).*@/\1***@/'
# u***@example.com

# awk — pattern matching + processing
# Print only 200 status code logs
awk '/HTTP\/[0-9.]+" 200/' access.log

# Count requests per IP
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head

IDE Usage

Most IDEs (VS Code, IntelliJ, Vim, etc.) support regex search and replace.

VS Code tips:

  • Ctrl+H (or Cmd+H) for replace mode
  • Click the regex icon (.*) to enable
  • Use $1, $2 for backreferences in replacement

12. Interview Questions

Q1. What is the difference between .* and .*??

.* is greedy — it matches as many characters as possible. .*? is lazy — it matches as few characters as possible.

Q2. How do ^ and $ behave differently in multiline mode?

In default mode, ^ matches the start of the string and $ matches the end. In multiline mode (re.MULTILINE), they match the start and end of each line.

Q3. What is the difference between lookahead and lookbehind?

Lookahead (?=...) checks the pattern after the current position, while lookbehind (?<=...) checks the pattern before it. Both are zero-width assertions that do not consume characters.

Q4. What is ReDoS and how do you prevent it?

ReDoS is when an inefficient regex causes exponential backtracking, consuming excessive CPU. Prevent it by avoiding nested quantifiers, limiting input length, and setting execution timeouts.

Q5. What is the difference between capturing and non-capturing groups?

Capturing groups (...) store the matched string for later reference. Non-capturing groups (?:...) group without storing, providing a slight performance benefit.

Q6. What does \b do?

It matches a word boundary — the position between a word character (\w) and a non-word character, or between a word character and the start/end of the string.

Q7. What are the limitations of using regex for email validation?

Fully implementing the RFC 5322 standard requires an extremely complex pattern. In practice, use regex for basic format validation only and verify actual existence through confirmation emails.

Q8. What is the difference between re.match and re.search in Python?

re.match only attempts matching at the start of the string, while re.search scans the entire string for the first match.

Q9. How do you match a literal backslash in regex?

The regex itself needs \\ (an escaped backslash). In a normal string literal each of those backslashes must be escaped again, giving \\\\. In Python, a raw string (r'\\') needs only \\.
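
A quick check of both spellings, using an invented Windows-style path:

```python
import re

path = r'C:\Users\demo'   # contains two literal backslashes

# Raw string: r'\\' is the two-character regex \\ (one literal backslash)
raw_parts = re.split(r'\\', path)
print(raw_parts)
# ['C:', 'Users', 'demo']

# Normal string: each backslash must be doubled again, giving four
plain_parts = re.split('\\\\', path)
print(plain_parts == raw_parts)
# True
```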

Q10. Why should you not parse HTML with regex?

HTML is not a regular language: its arbitrarily nested structure is beyond what regular expressions can describe. Nested tags, comments, CDATA sections, and other features cannot be handled correctly by regex. Use an HTML parser (BeautifulSoup, Cheerio, etc.) instead.


13. Quiz

Q1. What does the pattern \d{3}-\d{4} match in "010-1234-5678"?

It matches "010-1234" (the first occurrence). re.findall returns that same single result, because matches cannot overlap: once "010-1234" is consumed, the remaining "-5678" no longer fits the pattern. To match the full phone number, use \d{3}-\d{4}-\d{4}.
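
A quick check of how re.findall treats this input, plus one way to get overlapping matches when they are actually wanted (wrapping the pattern in a capturing lookahead):

```python
import re

text = "010-1234-5678"

# findall scans left to right and never reuses consumed characters
non_overlapping = re.findall(r'\d{3}-\d{4}', text)
print(non_overlapping)
# ['010-1234']

# A lookahead consumes nothing, so every start position is tried
overlapping = re.findall(r'(?=(\d{3}-\d{4}))', text)
print(overlapping)
# ['010-1234', '234-5678']

# The full number needs all three parts in the pattern
full_number = re.findall(r'\d{3}-\d{4}-\d{4}', text)
print(full_number)
# ['010-1234-5678']
```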

Q2. Why is (a+)+b a dangerous pattern?

When the input does not end with 'b' (e.g., "aaaaac"), the engine tries every combination of inner and outer group repetitions, causing exponential backtracking. With just 25 'a' characters, tens of millions of attempts are needed (on the order of 2^25). Simply replacing it with a+b solves the problem.

Q3. What is the difference between (?:...) and (?=...)?

(?:...) is a non-capturing group that consumes characters while grouping (it just does not capture). (?=...) is a lookahead that does not consume characters at all — it is a zero-width assertion that only checks a condition.

Q4. Why can Go not use lookahead?

Go's regexp package accepts RE2 syntax and, like RE2, guarantees matching time linear in the input size by not backtracking. Lookahead and lookbehind (like backreferences) are not supported under that guarantee, so they are left out.

Q5. Which of the following patterns match "abc"? a) [abc]+ b) [^abc]+ c) a.c d) a\bc

a) [abc]+ matches all of "abc".
b) [^abc]+ matches only characters other than a, b, and c, so it fails.
c) a.c matches "abc" because the . matches b.
d) a\bc fails: \b is a word boundary, and there is no word boundary between 'a' and 'b'.
The answers are a and c.


14. References

  1. Mastering Regular Expressions (Jeffrey Friedl): the definitive regex book
  2. Regular-Expressions.info: https://www.regular-expressions.info/
  3. regex101: https://regex101.com/ (online regex tester)
  4. Regexr: https://regexr.com/ (visual regex learning)
  5. MDN Web Docs: Regular Expressions (JavaScript regex reference)
  6. Python re module docs: https://docs.python.org/3/library/re.html
  7. RE2 Syntax: https://github.com/google/re2/wiki/Syntax
  8. ReDoS Prevention (OWASP Guide)
  9. ripgrep (rg): https://github.com/BurntSushi/ripgrep
  10. Debuggex: https://www.debuggex.com/ (regex visualization)
  11. RegExp Playground: https://regexper.com/ (railroad diagrams)
  12. Automata Theory: regular languages and finite automata