Regex From Zero to Confident: Character Classes, Quantifiers, Anchors, Groups, Lookarounds, and ReDoS

Introduction — Regex Is a Tiny Language
Literals and Metacharacters
Character Classes — "Any One of These"
Quantifiers — "How Many Times?"
Anchors and Boundaries — "Where?"
Groups and Capturing — Bundling and Remembering
Alternation — "This or That"
Lookarounds — Peeking Without Consuming
Greedy vs Lazy — The Appetite of Matching
Catastrophic Backtracking and ReDoS
When You Should Not Use Regex
A Few Practical Tips
Wrapping Up
References

Introduction — Regex Is a Tiny Language

The first time you see a regular expression (regex), it looks like the result of a cat walking across a keyboard. But regex is not random — it is a small, precise language for describing patterns in text. Understand a handful of building blocks and those intimidating symbols start to read like sentences.

This post builds regex up from the ground. We learn the pieces one at a time and then tackle the performance traps that trip people up in production. One rule up front: every regex pattern in this article lives inside inline code or a code block. Regex is full of special characters like braces and angle brackets, and leaving them loose in prose can break rendering. That habit stays useful when you write regex in your own code, too.

If you want to learn by building patterns yourself, keep this site's Regex Tester open as you read and paste patterns in to check them live.

Literals and Metacharacters

The simplest regex is just the text you want to find. The pattern cat finds the three consecutive letters "cat" inside a string. Ordinary characters like these are called literals.

What makes regex powerful is metacharacters. These carry special meaning rather than matching themselves. The common metacharacters are:

  . ^ $ * + ? ( ) [ ] { } | \

When you want to match one of these characters literally, you escape it with a backslash. For example, to match an actual period rather than the "any character" metacharacter, you write \. (a backslash followed by the dot). A pattern that matches the dot in a domain looks like example\.com.
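
As a rough illustration of escaping, here is a minimal sketch using Python's standard re module. The sample strings and variable names are made up for this example. It compares an unescaped dot with an escaped one, and shows re.escape, which escapes metacharacters in arbitrary text for you.

```python
import re

# An unescaped dot matches any character, so this pattern is too permissive.
loose = re.compile(r"example.com")
# Escaping the dot restricts it to a literal period.
strict = re.compile(r"example\.com")

print(bool(loose.search("exampleXcom")))   # True
print(bool(strict.search("exampleXcom")))  # False
print(bool(strict.search("example.com")))  # True

# Text that happens to contain metacharacters (hypothetical user input).
user_text = "1+1=2?"
haystack = "is 1+1=2? yes"

# Used as a raw pattern, "+" and "?" act as quantifiers, so nothing matches.
print(bool(re.search(user_text, haystack)))             # False
# re.escape backslash-escapes the metacharacters so the text matches itself.
print(bool(re.search(re.escape(user_text), haystack)))  # True
```

Reaching for re.escape is usually safer than escaping by hand whenever the literal text comes from somewhere you do not control.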

Character Classes — "Any One of These"

A character class is wrapped in square brackets and matches "one of the characters inside."

[aeiou] matches a single vowel.
[a-z] matches one lowercase letter. The hyphen denotes a range.
[a-zA-Z0-9] matches one alphanumeric character.
Putting a caret at the front, as in [^0-9], negates the class, matching a single non-digit character.

Common character classes have short shorthands.

\d is a digit, \D is a non-digit.
\w is a word character (letters, digits, underscore), \W is the opposite.
\s is whitespace (space, tab, newline, and so on), \S is the opposite.
. is any single character (except newline, by default).

For example, three digits can be written \d\d\d — though the quantifiers we cover next make it shorter.
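
To see the classes in action, the following sketch runs a few of them through Python's re.findall, which returns every non-overlapping match. The sample text is invented for illustration.

```python
import re

text = "Order A7 shipped to 42 Elm St."

# A bracketed class matches exactly one character from the set.
print(re.findall(r"[aeiou]", "regex"))   # ['e', 'e']
# A leading caret negates the class: anything that is not a digit or whitespace.
print(re.findall(r"[^0-9\s]", "a1 b2"))  # ['a', 'b']
# \d is shorthand for a single digit.
print(re.findall(r"\d", text))           # ['7', '4', '2']
# Two digits in a row: only "42" qualifies.
print(re.findall(r"\d\d", text))         # ['42']
# By default the dot does not match a newline.
print(re.findall(r".", "a\nb"))          # ['a', 'b']
```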

Quantifiers — "How Many Times?"

A quantifier says how many times the element right before it repeats.

* is zero or more (optional and repeatable).
+ is one or more (at least once).
? is zero or one (optional).
{n} is exactly n times.
{n,} is n or more times.
{n,m} is between n and m times.

Those three digits from before become the tidy \d{3}. If you want three to four digits, like the exchange part of a phone number, you write \d{3,4}. One or more word characters is \w+, and an optional protocol "s" is https? (that is, http or https).
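
The quantifier examples above can be checked with a short Python sketch. The input strings are arbitrary samples, not taken from any real data.

```python
import re

sample = "12 345 6789"

# Exactly three digits: "6789" yields "678" and leaves the "9" unmatched.
print(re.findall(r"\d{3}", sample))          # ['345', '678']
# Three to four digits: the quantifier takes four when it can.
print(re.findall(r"\d{3,4}", sample))        # ['345', '6789']
# One or more word characters.
print(re.findall(r"\w+", "one_two three"))   # ['one_two', 'three']
# The "?" makes the preceding "s" optional.
print(re.findall(r"https?", "http https"))   # ['http', 'https']
# "*" allows zero repetitions, so a lone "a" satisfies ab*.
print(bool(re.fullmatch(r"ab*", "a")))       # True
```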

Anchors and Boundaries — "Where?"

If the pieces so far decided "what" to match, anchors pin down "where." An anchor does not match a character — it matches a position.

^ is the start of the string (or line).
$ is the end of the string (or line).
\b is a word boundary — the seam between a word character and a non-word character.
\B is a position that is not a word boundary.

For example, ^\d+$ matches "a string made entirely of digits from start to finish." Without anchors, \d+ would also match the "123" inside "abc123def"; wrap it in a start anchor ^ and an end anchor $ and the whole string must be digits to match. In input validation, that difference is decisive.

The word boundary \b is handy too. \bcat\b matches the standalone word "cat" but not the "cat" inside "category" or "concatenate."
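
Here is a small Python sketch of anchors and word boundaries. One Python-specific detail worth knowing: by default $ also matches just before a single trailing newline, so for strict validation re.fullmatch is often the safer choice. The variable names are illustrative.

```python
import re

digits_only = re.compile(r"^\d+$")

print(bool(digits_only.search("12345")))      # True
print(bool(digits_only.search("abc123def")))  # False
# Without anchors, the digits inside the string are found.
print(re.findall(r"\d+", "abc123def"))        # ['123']

# In Python, "$" tolerates one trailing newline; fullmatch does not.
print(bool(digits_only.search("123\n")))      # True
print(bool(re.fullmatch(r"\d+", "123\n")))    # False

# \b keeps "cat" from matching inside longer words.
print(re.findall(r"\bcat\b", "cat category concatenate cat."))  # ['cat', 'cat']
```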

Groups and Capturing — Bundling and Remembering

Parentheses bundle several elements into a group. Groups do two things: they apply a quantifier to several characters at once, and they capture the matched portion so you can pull it out later.

(ab)+ matches "ab" repeated one or more times ("ababab" and so on). Without the group, the + applies only to the b, so ab+ matches an "a" followed by one or more "b"s, such as "abbbb", which is a completely different meaning.
The date-splitting pattern (\d{4})-(\d{2})-(\d{2}) captures year, month, and day as groups 1, 2, and 3. Your program can then read those capture groups out.

When you want to bundle without capturing, use a non-capturing group, written (?:...). For example, (?:https?://)? optionally bundles the protocol part without capturing it. Trimming unnecessary captures makes the pattern's intent clearer and can give a small performance win in some engines.

Many languages also support named groups. Writing (?<year>\d{4}) lets you retrieve by name instead of a numeric index, which reads much better.
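
The grouping ideas above might look like this in Python. Note one dialect difference: Python's re module spells named groups (?P<name>...) rather than (?<name>...). The sample sentence and URL are invented for the example.

```python
import re

# Grouping changes what the quantifier applies to.
print(bool(re.fullmatch(r"(ab)+", "ababab")))  # True
print(bool(re.fullmatch(r"ab+", "ababab")))    # False
print(bool(re.fullmatch(r"ab+", "abbbb")))     # True

# Numbered capture groups: year, month, day.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2024-03-15.")
print(m.groups())  # ('2024', '03', '15')

# Named groups (Python syntax) read better than numeric indexes.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Released on 2024-03-15.")
print(named.group("year"), named.group("month"))  # 2024 03

# The non-capturing group bundles the protocol without creating a capture.
host = re.search(r"(?:https?://)?(\w+)\.com", "https://example.com")
print(host.groups())  # ('example',)
```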

Alternation — "This or That"

The pipe symbol means alternation — "left side or right side."

cat|dog matches "cat" or "dog."
To limit the scope of the alternation, wrap it in a group. Compare these two:

^(cat|dog)$    → the whole string is "cat" or "dog"
^cat|dog$      → parsed as "^cat" or "dog$" (not what you meant)

Only the grouped first form means "the whole string is cat or dog." Without the group, the pipe's scope spreads across the entire pattern and you get something completely different.

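For several choices, list them: (jpg|jpeg|png|gif). Mind the order: a backtracking engine tries the alternatives left to right and stops at the first one that lets the overall match succeed, so when one alternative is a prefix of another, put the longer one first or anchor the end of the pattern.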
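
A quick Python sketch makes the scoping difference concrete. The test strings and variable names are chosen for illustration, and the last line shows why the order of alternatives can matter.

```python
import re

grouped = re.compile(r"^(cat|dog)$")
ungrouped = re.compile(r"^cat|dog$")

# Grouped: the whole string must be "cat" or "dog".
print(bool(grouped.search("cat")))      # True
print(bool(grouped.search("catalog")))  # False

# Ungrouped: "starts with cat" OR "ends with dog".
print(bool(ungrouped.search("catalog")))  # True
print(bool(ungrouped.search("hotdog")))   # True

# A file-extension check with an end anchor.
ext = re.compile(r"\.(jpg|jpeg|png|gif)$")
print(ext.search("photo.jpeg").group(1))  # jpeg

# Alternatives are tried left to right; the first that succeeds wins.
print(re.match(r"do|dog", "dog").group())  # do
```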

Lookarounds — Peeking Without Consuming

Lookarounds are a slightly more advanced tool. They check whether a pattern is present ahead or behind, without including (consuming) that part in the match. There are four.

Positive lookahead (?=...): if this comes next.
Negative lookahead (?!...): if this does not come next.
Positive lookbehind (?<=...): if this preceded.
Negative lookbehind (?<!...): if this did not precede.

A practical example: to find where to insert thousands separators into a number, you use a lookahead to find "a position with a multiple of three digits remaining after it." Or in password-rule validation, when you require "contains at least one digit," you use (?=.*\d). That fragment consumes no characters at all — it only checks the condition "there is a digit somewhere." Stack several conditions like ^(?=.*[a-z])(?=.*\d).{8,}$ and you validate "contains a lowercase letter, contains a digit, at least 8 characters" in one shot.
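
Both examples can be sketched in Python. The thousands-separator pattern below is one possible way to express "a position with a multiple of three digits remaining after it"; the helper name add_thousands and the sample inputs are hypothetical.

```python
import re

def add_thousands(digits: str) -> str:
    """Insert commas into a string of digits using only lookarounds."""
    # (?<=\d)         : there is a digit right before this position
    # (?=(?:\d{3})+$) : a multiple of three digits remains until the end
    # The match itself is empty, so re.sub inserts without removing anything.
    return re.sub(r"(?<=\d)(?=(?:\d{3})+$)", ",", digits)

print(add_thousands("1234567"))  # 1,234,567
print(add_thousands("1000"))     # 1,000
print(add_thousands("999"))      # 999

# Stacked lookaheads: a lowercase letter, a digit, and at least 8 characters.
password_ok = re.compile(r"^(?=.*[a-z])(?=.*\d).{8,}$")
print(bool(password_ok.search("abcdefg1")))  # True
print(bool(password_ok.search("abcdefgh")))  # False (no digit)
print(bool(password_ok.search("ab1")))       # False (too short)

# Negative lookbehind: numbers that are not preceded by a dollar sign.
print(re.findall(r"(?<!\$)\b\d+\b", "$5 7 $9 11"))  # ['7', '11']
```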

Greedy vs Lazy — The Appetite of Matching

Quantifiers have a hidden personality. By default they are greedy — they try to eat as much as possible. This trait is often a trap.

Say you try to grab an HTML tag with the pattern <.+>. On the string "<b>bold</b>", you probably expect just <b> to match, but the greedy .+ swallows as much as it can and grabs the entire <b>bold</b>. It ate everything from the first angle bracket to the last one.

The fix is to make the quantifier lazy by putting a ? after it. <.+?> eats "as little as possible," so it grabs just <b>. To summarize:

*, +, ?, {n,m} are greedy — as much as possible.
*?, +?, ??, {n,m}? are lazy — as little as possible.

For the record, a better approach in this example is a negated character class. <[^>]+> eats only "characters that are not a closing angle bracket," so it can never run past the first > in the first place. Designing away backtracking like this is the key to preventing the performance problem we look at next.
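
The three variants can be compared side by side in Python. This sketch uses a small HTML snippet of its own as the sample input.

```python
import re

html = "<b>bold</b>"

# Greedy: .+ runs to the last ">" in the string.
print(re.findall(r"<.+>", html))     # ['<b>bold</b>']
# Lazy: .+? stops at the first ">" that lets the match succeed.
print(re.findall(r"<.+?>", html))    # ['<b>', '</b>']
# Negated class: [^>]+ cannot cross a ">" at all, so no backtracking is needed.
print(re.findall(r"<[^>]+>", html))  # ['<b>', '</b>']
```

The lazy and negated-class versions return the same result here, but they reach it differently: the lazy form expands one character at a time and rechecks, while the negated class simply cannot overshoot.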

Catastrophic Backtracking and ReDoS

Many regex engines work by backtracking — when a match fails, they go back and try other possibilities. Usually this is fine, but a badly written pattern can make the number of possibilities to try explode exponentially with input length. This is catastrophic backtracking.

The classic dangerous pattern arises when a repetition nests inside another repetition with a fuzzy boundary. Give a pattern like (a+)+$ an input such as "aaaaaaaaaaX" where the match fails at the end, and the engine tries the countless ways of splitting the a's between the inner and outer groups until it effectively hangs. Add just a few characters to the input and the time explodes.

Exploiting this vulnerability to paralyze a service is a ReDoS (Regular expression Denial of Service) attack. With a single maliciously crafted input, an attacker can peg a server's CPU at 100%. ReDoS vulnerabilities have been found again and again in well-known libraries.
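
To get a feel for the growth, here is a minimal timing sketch against Python's backtracking re engine. The helper name time_failed_match is hypothetical, the input sizes are kept deliberately small so the script finishes quickly, and actual timings will vary by machine and Python version, so no expected numbers are shown.

```python
import re
import time

NESTED = re.compile(r"(a+)+$")

def time_failed_match(n: int) -> float:
    """Time a match attempt on n a's followed by an X, which must fail."""
    text = "a" * n + "X"
    start = time.perf_counter()
    result = NESTED.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing X guarantees failure
    return elapsed

# Each few extra characters should multiply the time noticeably.
# Keep n small: values in the high twenties can take seconds or more.
for n in (12, 15, 18):
    print(n, f"{time_failed_match(n):.4f}s")
```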

Here is how to defend against it.

Avoid nested quantifiers: be wary of structures where a repetition nests inside a repetition with a fuzzy overlap, like (a+)+ or (a*)*.
Be as specific as possible: a narrow character class like [^>] instead of a broad . leaves less room for backtracking.
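Anchor it down: a leading ^ stops the engine from retrying the match at every starting position, which leaves it less room to wander. Anchors alone do not cure a nested quantifier, though: (a+)+$ is anchored at the end and still blows up.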
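Consider linear-time engines: engines that do not backtrack and guarantee matching time linear in the input length, like RE2 (Google) or Rust's regex crate, rule out catastrophic backtracking by construction. The trade-off is that they drop features that need backtracking, such as backreferences and lookarounds.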
Time out on untrusted input: if your language or library supports it, put a timeout on regex execution.
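
As a sketch of the first defense, the nested pattern (a+)+$ can be flattened to a+$, which accepts the same strings but leaves the engine only one way to split the input. Python's standard re module does not take a timeout argument, so this example focuses on the rewrite. The sample strings and the hostile input length are arbitrary.

```python
import re

NESTED = re.compile(r"(a+)+$")  # ambiguous: many ways to split the a's
FLAT = re.compile(r"a+$")       # same set of accepted strings, one way to match

# Both patterns agree on every sample (anchored at the start via match()).
samples = ["a", "aaaa", "", "aaX", "Xaa"]
for s in samples:
    assert bool(NESTED.match(s)) == bool(FLAT.match(s))

# The flat pattern backs off one character at a time, so a long hostile
# input fails in roughly linear time instead of exponential time.
hostile = "a" * 50_000 + "X"
print(FLAT.match(hostile))  # None
```

Do not try the hostile input against NESTED; with this many characters it would effectively never finish.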

When You Should Not Use Regex

Regex is powerful but not a cure-all. The most famous counterexample is parsing HTML (or XML). HTML is a nested, recursive structure, and traditional regex fundamentally cannot express nesting of arbitrary depth. The attempt to force-parse HTML with regex was famously and vehemently warned against in a legendary Stack Overflow answer, and in practice it collapses on all sorts of edge cases. HTML should be handled with a dedicated parser (a DOM parser).
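
For comparison, here is roughly what the parser route looks like with Python's standard html.parser module. The LinkCollector class and the sample markup are made up for this sketch; a real project might prefer a fuller DOM library.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, however deeply they are nested."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<div><p>See <a href="https://example.com/docs">the <b>docs</b></a>.</p></div>')
collector.close()
print(collector.links)  # ['https://example.com/docs']
```

The parser tracks tag structure for you, so nested elements inside the link do not need any special handling.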

Other signs regex is the wrong tool:

When you must balance nested brackets or structure: matching opening and closing brackets to arbitrary depth is not regex's domain.
When the pattern becomes harder to understand than code: a few lines of explicit string-handling code often beat a 100-character regex that takes five minutes to read.
Formats that already have parsers: JSON, CSV, URLs, dates, and the like mostly have battle-tested dedicated parser libraries. Look for one before hand-rolling a regex.
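
The bracket-balancing case shows how little code the non-regex route can take. Below is a minimal stack-based sketch; the function name is_balanced is hypothetical, and it deliberately ignores complications such as brackets inside string literals.

```python
def is_balanced(text: str) -> bool:
    """Return True if (), [] and {} are properly nested in text."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != pairs[ch]:
                return False
    # Anything left on the stack was never closed.
    return not stack

print(is_balanced("f(a[1], {b: (c)})"))  # True
print(is_balanced("(]"))                 # False
print(is_balanced("(("))                 # False
```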

Regex shines brightest at "local, token-level pattern matching." Rough email-format validation, extracting fields from a log line, find-and-replace — these are regex's home turf.

A Few Practical Tips

Finally, some habits that help when you actually use regex.

Comments and extended mode: many languages support the x flag (extended mode), which lets you put whitespace and comments inside a regex and spread it across multiple lines. The more complex the pattern, the more maintainable this makes it.
Understand the flags: flags like case-insensitive (i), multiline (m, where ^ and $ apply per line), and dot-matches-newline (s) change the results significantly.
Precompile: if you reuse the same pattern, compile it once outside the loop and reuse it for better performance.
Pair it with tests: regex is subtle. Building tests with representative inputs and edge cases keeps you safe when you later change the pattern.
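
Several of these habits fit in one short Python sketch: a pattern compiled once with re.VERBOSE (Python's name for extended mode), reused inside a helper, plus a quick look at how flags change results. The names DATE and extract_dates and the sample lines are illustrative only.

```python
import re

# Compiled once at module level; whitespace and "#" comments are ignored
# inside the pattern because of re.VERBOSE.
DATE = re.compile(
    r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
    """,
    re.VERBOSE,
)

def extract_dates(lines):
    """Return one dict per date found, reusing the precompiled pattern."""
    return [m.groupdict() for line in lines for m in DATE.finditer(line)]

print(extract_dates(["built 2024-03-15", "no date here"]))
# [{'year': '2024', 'month': '03', 'day': '15'}]

# Flags change results: case-insensitive plus per-line anchors.
log = "Error\nerror"
print(re.findall(r"^error", log))                               # []
print(re.findall(r"^error", log, re.IGNORECASE | re.MULTILINE)) # ['Error', 'error']

# DOTALL lets the dot match a newline.
print(bool(re.search(r"a.b", "a\nb")))             # False
print(bool(re.search(r"a.b", "a\nb", re.DOTALL)))  # True
```

The assertions you would write in a test file are the same comparisons shown in the comments, which is a cheap way to pin down behavior before changing a pattern.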

The best way to cement what you learned is to solve problems. Check each concept with this site's Regex Quiz, and experiment with your own patterns in the Regex Tester.

Wrapping Up

Regex looks like a cipher, but in the end it is a combination of a few building blocks. Literals and character classes decide "what," quantifiers decide "how many," anchors decide "where," and groups, alternation, and lookarounds refine the structure. Add the difference between greedy and lazy, plus awareness of the catastrophic-backtracking trap, and you can already handle most real-world situations with confidence.

The most important lesson is restraint. Use regex for local pattern matching, not for parsing nested structures. Hold that line and regex stops being a dangerous incantation and becomes a dependable tool.

References

MDN: Regular expressions guide — https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions
regular-expressions.info — https://www.regular-expressions.info/
OWASP: Regular expression Denial of Service (ReDoS) — https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
Google RE2 engine — https://github.com/google/re2
Rust regex crate (linear-time guarantee) — https://docs.rs/regex/