Introduction — The Day jqwik Hit the Front Page Again
In the first half of 2026, jqwik — the property-based testing library for Java — hit the Hacker News front page. The trigger was not a new library feature but an experience report from a developer. Date-handling code written by an AI coding agent had passed every unit test, yet when run through jqwik property tests, the framework found an input that broke at a leap-year boundary within seconds. The comment section filled with testimonies — "I had the exact same experience with Hypothesis," "fast-check found a ten-year-old bug in our parser" — and a translated summary on GeekNews drew a long discussion of its own.
The timing was perfect. 2026 is the era in which AI agents write a substantial share of our code. With tools like Claude Code and Codex performing multi-hour tasks autonomously, the central question in the industry has become "how does a human verify that the code is actually correct?" AI can churn out code that passes a handful of examples all day long. The problem is the empty space between the examples. Property-based testing (PBT) is the technique that automatically explores exactly that empty space — which is why it is being rediscovered right now.
In this post we will lay out the core concepts of PBT, present a pattern catalog for discovering properties, and build working examples in three languages: Python (Hypothesis), Java (jqwik), and JavaScript (fast-check). We will then cover failure reproduction, state machine testing, CI integration, and the synergy with verifying AI-written code, all from a practitioner's perspective.
The Limits of Example-Based Testing
The tests we write every day are example-based.
def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
    assert add(0, 0) == 0
This approach has three problems.
1. The only inputs tested are the ones the author thought of. Inputs the author never imagined (empty strings, Unicode combining characters, integer overflow boundaries, February 29 in a leap year) never get tested at all.
2. When the implementer and the test author are the same person, they share the same blind spots. The case you did not consider while implementing is the case you will not consider while testing.
3. Examples only say "this input gives this output." They cannot express the general laws the code must obey.
Property-based testing flips the premise. Instead of concrete examples, you declare a property — something that must hold for every valid input — and the framework generates hundreds of random inputs hunting for a counterexample that breaks it.
Core Concepts — Properties, Generators, Shrinking
The execution flow of PBT looks like this.
+-------------+     +--------------+     +-----------+     +--------------+
| Define      | --> | Generator    | --> | Check the | --> | Pass: repeat |
| property    |     | produces     |     | property  |     | (default     |
| "for all x, |     | random       |     | (assert)  |     | 100 runs)    |
| P(x)"       |     | inputs       |     +-----+-----+     +--------------+
+-------------+     +--------------+           | fail
                                               v
                                    +---------------------+
                                    | Shrinking:          |
                                    | minimize the input  |
                                    | while preserving    |
                                    | the failure         |
                                    +----------+----------+
                                               v
                            report "minimal counterexample: x = 0"
- Property: a general law the code must obey. Things like "the sorted output has the same length as the input" or "decoding an encoding returns the original."
- Generator: the component that produces random inputs. Generators range from primitives like integers and strings up to composite structures like "a user object with a valid email," and the composite ones are built by combining simpler generators. Good frameworks deliberately over-sample boundary values (0, -1, empty string, NaN, max integer).
- Shrinking: once a counterexample is found, the process of reducing it to the smallest form a human can understand. The value of shrinking is reporting "fails on the empty string" instead of "fails on a string of length 847" — in fact, the quality of a PBT tool is largely the quality of its shrinking.
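To make the three concepts concrete, here is a minimal hand-rolled sketch of the generate, check, and shrink loop using only the Python standard library. It is not how Hypothesis, jqwik, or fast-check are implemented; `buggy_max`, `gen_int_list`, `shrink`, and `find_counterexample` are hypothetical names invented for this illustration, and the greedy shrinker is far simpler than what real frameworks do.

```python
import random


def buggy_max(xs):
    """Deliberately broken: starts from 0, so it is wrong when every element is negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def prop_max_is_member(xs):
    """Property: the maximum of a non-empty list is one of its elements."""
    return buggy_max(xs) in xs


def gen_int_list(rng):
    """Generator: a non-empty list of up to five small integers."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(1, 5))]


def shrink_candidates(xs):
    """Yield simpler variants: drop one element, or halve one element toward zero."""
    if len(xs) > 1:
        for i in range(len(xs)):
            yield xs[:i] + xs[i + 1:]
    for i, x in enumerate(xs):
        if x != 0:
            smaller = x // 2 if x > 0 else -(-x // 2)
            yield xs[:i] + [smaller] + xs[i + 1:]


def shrink(xs, prop):
    """Greedily accept any simpler variant that still fails the property."""
    improved = True
    while improved:
        improved = False
        for candidate in shrink_candidates(xs):
            if not prop(candidate):
                xs, improved = candidate, True
                break
    return xs


def find_counterexample(prop, gen, runs=100, seed=0):
    """Run the property on random inputs; return a shrunk failure or None."""
    rng = random.Random(seed)
    for _ in range(runs):
        value = gen(rng)
        if not prop(value):
            return shrink(value, prop)
    return None
```

Any all-negative list breaks the property, and the shrinker reduces such a list step by step: `shrink([-40, -7, -13], prop_max_is_member)` ends at `[-1]`, which is much easier to reason about than the list the generator first produced.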
How to Discover Properties — A Pattern Catalog
The biggest barrier to adopting PBT is "I have no idea what properties our code has." Fortunately, most properties are discovered through one of the following patterns.
| Pattern | Formula | Where it applies |
| --- | --- | --- |
| Round-trip | decode(encode(x)) == x | Serialization, compression, encryption, parser-printer |
| Invariant | A condition that stays true after the operation | Length preserved by sort, balance totals preserved |
| Idempotence | f(f(x)) == f(x) | Normalization, deduplication, UPSERT |
| Model comparison | Compare against a simple reference implementation | Optimized code vs naive code |
| Commutativity/associativity | f(a, b) == f(b, a) and friends | Merging, aggregation, CRDTs |
| Postcondition | A condition the result must satisfy | All search results contain the query |
| Oracle comparison | Compare against a trusted existing implementation | Standard library, legacy system |
| No-crash | Never crashes on valid input | The minimum property of every public API |
Master just round-trip and invariants and you can already apply PBT to half of real-world code. And the last row, the no-crash property, is the most underrated of all. The single property "no unhandled exception for any input" can unearth piles of bugs in parsers and input-validation code.
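As a rough illustration of how little code each pattern needs, the sketch below checks four rows of the table with nothing but the Python standard library. The helper names (`random_record`, `normalize`, `check_patterns`) are made up for this example, and a plain `random.Random` loop stands in for a real framework's generators, so there is no shrinking here.

```python
import json
import random


def random_record(rng):
    """Generate a small JSON-compatible dict with string keys and int or string values."""
    keys = ["id", "name", "qty", "note"]
    record = {}
    for key in rng.sample(keys, rng.randint(0, len(keys))):
        if rng.random() < 0.5:
            record[key] = rng.randint(-1000, 1000)
        else:
            # Include quotes, backslashes and newlines to stress the encoder
            record[key] = "".join(rng.choices('ab "\\\n', k=rng.randint(0, 5)))
    return record


def normalize(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return " ".join(text.split())


def check_patterns(runs=200, seed=42):
    rng = random.Random(seed)
    for _ in range(runs):
        # Round-trip: decode(encode(x)) == x
        record = random_record(rng)
        assert json.loads(json.dumps(record)) == record

        # Idempotence: f(f(x)) == f(x)
        text = "".join(rng.choices("ab \t\n", k=rng.randint(0, 12)))
        assert normalize(normalize(text)) == normalize(text)

        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 10))]
        # Invariant: sorting preserves length
        assert len(sorted(xs)) == len(xs)
        # Model comparison: built-in max against a naive sort-based reference
        if xs:
            assert max(xs) == sorted(xs)[-1]
    return runs
```

None of these assertions needs to know the expected output for a specific input, which is what makes them usable on generated data.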
Hands-on 1 — Python Hypothesis
We start with the perennial workhorse of production code: money math. Let us verify a function that applies a discount rate to a cart total and rounds to whole cents.
# cart.py
from decimal import Decimal, ROUND_HALF_UP


def apply_discount(total_cents: int, discount_percent: int) -> int:
    """Apply a discount rate (0-100) to a total in cents, rounding to whole cents."""
    if not 0 <= discount_percent <= 100:
        raise ValueError("discount_percent must be between 0 and 100")
    if total_cents < 0:
        raise ValueError("total_cents must be non-negative")
    discounted = Decimal(total_cents) * (Decimal(100 - discount_percent) / Decimal(100))
    return int(discounted.quantize(Decimal("1"), rounding=ROUND_HALF_UP))
The property tests look like this.
# test_cart.py
from hypothesis import given, strategies as st

from cart import apply_discount


@given(total=st.integers(min_value=0, max_value=10**12),
       pct=st.integers(min_value=0, max_value=100))
def test_discount_never_negative_and_never_exceeds_total(total, pct):
    result = apply_discount(total, pct)
    # Invariant 1: a discounted amount can never be negative
    assert result >= 0
    # Invariant 2: a discounted amount can never exceed the original total
    assert result <= total


@given(total=st.integers(min_value=0, max_value=10**12))
def test_zero_discount_is_identity(total):
    # Postcondition: a 0 percent discount is the identity function
    assert apply_discount(total, 0) == total


@given(total=st.integers(min_value=0, max_value=10**12))
def test_full_discount_is_zero(total):
    # Postcondition: a 100 percent discount is always zero
    assert apply_discount(total, 100) == 0


@given(total=st.integers(min_value=0, max_value=10**12),
       pct=st.integers(min_value=0, max_value=100))
def test_discount_is_monotonic(total, pct):
    # Invariant 3: a larger discount rate yields a result that is less than or equal
    if pct < 100:
        assert apply_discount(total, pct + 1) <= apply_discount(total, pct)
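Had the implementation used floating point instead of Decimal, binary rounding error could nudge a result across a half-cent boundary. Whether the invariants above notice depends on the exact float formula, so the dependable way to expose that kind of drift is a model-comparison property that checks the float version against the Decimal one. When a property does fail, shrinking means you are not handed a monstrous counterexample like "fails at total=10000000001, pct=33" but the minimal one a human can actually debug.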
Composite generators are simple too.
# A strategy that generates valid order objects
order_strategy = st.builds(
    dict,
    order_id=st.uuids().map(str),
    items=st.lists(
        st.builds(dict,
                  sku=st.text(alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", min_size=8, max_size=8),
                  qty=st.integers(min_value=1, max_value=999),
                  unit_price_cents=st.integers(min_value=1, max_value=10**7)),
        min_size=1, max_size=50,
    ),
)


# calculate_order_total is the function under test (its definition is not shown here)
@given(order=order_strategy)
def test_order_total_equals_sum_of_lines(order):
    total = calculate_order_total(order)
    expected = sum(i["qty"] * i["unit_price_cents"] for i in order["items"])
    assert total == expected
Hands-on 2 — Java jqwik
Running on the JUnit 5 platform, jqwik is annotation-driven and blends naturally into existing Java projects. Here is serialization verified with the round-trip pattern.
import java.util.List;

import net.jqwik.api.Arbitraries;
import net.jqwik.api.Arbitrary;
import net.jqwik.api.ForAll;
import net.jqwik.api.Property;
import net.jqwik.api.Provide;
import net.jqwik.api.constraints.Size;
import net.jqwik.api.constraints.StringLength;

import static org.assertj.core.api.Assertions.assertThat;

// CsvCodec is the class under test (not shown here)
class CsvCodecProperties {

    @Property
    void encodeThenDecodeIsIdentity(@ForAll @StringLength(max = 200) String field) {
        // Round-trip: any string decoded after encoding equals the original
        String encoded = CsvCodec.encodeField(field);
        String decoded = CsvCodec.decodeField(encoded);
        assertThat(decoded).isEqualTo(field);
    }

    @Property
    void encodedFieldNeverBreaksRowStructure(
            @ForAll @Size(min = 1, max = 10) List<@StringLength(max = 50) String> fields) {
        // Invariant: row structure survives commas/newlines/quotes inside fields
        String row = CsvCodec.encodeRow(fields);
        List<String> parsed = CsvCodec.parseRow(row);
        assertThat(parsed).isEqualTo(fields);
    }

    @Property
    void normalizeIsIdempotent(@ForAll String input) {
        // Idempotence: normalizing twice equals normalizing once
        String once = CsvCodec.normalize(input);
        String twice = CsvCodec.normalize(once);
        assertThat(twice).isEqualTo(once);
    }

    @Provide
    Arbitrary<String> koreanText() {
        // Custom generator: strings that include the Korean syllable range
        return Arbitraries.strings()
                .withCharRange('가', '힣')
                .withCharRange('a', 'z')
                .ofMaxLength(100);
    }

    @Property
    void roundTripWithKorean(@ForAll("koreanText") String field) {
        assertThat(CsvCodec.decodeField(CsvCodec.encodeField(field))).isEqualTo(field);
    }
}
If you have ever hand-written a CSV encoder, you can guess what happens: this round-trip property almost certainly finds the classic bugs — quotes inside quotes, a newline at the end of a field, distinguishing empty fields from null. Enumerating all those combinations with example-based tests is practically impossible.
Hands-on 3 — JavaScript fast-check
fast-check pairs naturally with Jest/Vitest. Here is the model-comparison pattern: checking an optimized function against a naive implementation.
import fc from "fast-check";
// Function under test; the module path is a placeholder for wherever it lives in your project
import { mergeIntervals } from "./mergeIntervals";
// describe/it are Jest globals; with Vitest, import them from "vitest" unless globals are enabled

// Reference implementation: slow but obviously correct
// Intervals are treated as half-open integer ranges [start, end)
function mergeIntervalsNaive(intervals: Array<[number, number]>): Array<[number, number]> {
  const points = new Set<number>();
  for (const [s, e] of intervals) {
    for (let i = s; i < e; i++) points.add(i);
  }
  // regroup into contiguous ranges (only usable on small domains)
  const sorted = [...points].sort((a, b) => a - b);
  const out: Array<[number, number]> = [];
  for (const p of sorted) {
    const last = out[out.length - 1];
    if (last && last[1] === p) last[1] = p + 1;
    else out.push([p, p + 1]);
  }
  return out;
}

describe("mergeIntervals", () => {
  it("optimized implementation always matches the naive one", () => {
    fc.assert(
      fc.property(
        fc.array(
          fc.tuple(fc.integer({ min: 0, max: 100 }), fc.integer({ min: 0, max: 100 }))
            .map(([a, b]) => (a <= b ? [a, b] : [b, a]) as [number, number]),
          { maxLength: 30 }
        ),
        (intervals) => {
          const fast = mergeIntervals(intervals);
          const slow = mergeIntervalsNaive(intervals);
          return JSON.stringify(fast) === JSON.stringify(slow);
        }
      )
    );
  });

  it("result intervals are always sorted and non-overlapping", () => {
    fc.assert(
      fc.property(
        fc.array(fc.tuple(fc.nat(1000), fc.nat(1000)), { maxLength: 50 }),
        (raw) => {
          const intervals = raw.map(([a, b]) => (a <= b ? [a, b] : [b, a]) as [number, number]);
          const merged = mergeIntervals(intervals);
          for (let i = 1; i < merged.length; i++) {
            // Invariant: end of previous interval < start of next
            if (merged[i - 1][1] >= merged[i][0]) return false;
          }
          return true;
        }
      )
    );
  });
});
The charm of model comparison is that you do not need to know "the right answer." With one slow-but-obviously-correct implementation in hand, you can verify against hundreds of inputs that the optimized version behaves identically. It is an especially strong regression guard for performance-optimization PRs.
Reproducing Failures — Fixed Seeds and the Counterexample Database
The classic worry about randomized testing: "what if it failed yesterday and passes today?" Modern PBT tools have solved this.
# Hypothesis: failing counterexamples are saved automatically under
# the .hypothesis/examples directory and retried first on the next run
# (the example database)
# You can also pin a specific counterexample as a permanent regression test
from hypothesis import example, given, strategies as st


@given(st.text())
@example("")  # explicitly pin counterexamples that failed in the past
@example("\x00")
def test_normalize_roundtrip(s):
    assert denormalize(normalize(s)) == s

// jqwik: the seed is printed on failure and can be replayed exactly
@Property(seed = "8723648723648")
void reproducesFailure(@ForAll String input) { /* ... */ }
// jqwik also stores failing samples in .jqwik-database and retries them first

// fast-check: the failure report prints seed and path
fc.assert(prop, { seed: 1042, path: "0:0:1" }); // rerun starting from that exact counterexample
The recommended operating procedure for CI: take the seed or counterexample printed in the failure log and bake it into the code as an example or seed, promoting it to a regression test. Done this way, randomness stops being a source of non-reproducible flaky failures and becomes a machine that automatically mines a permanent regression suite for you.
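The reason a printed seed is enough to replay a failure can be shown with a few lines of standard-library Python. This is only a sketch of the principle: `generate_inputs` and `run_property` are hypothetical helpers, and `random.Random` stands in for the seeded generators that the real frameworks use internally.

```python
import random


def generate_inputs(seed, count=5):
    """Produce a deterministic sequence of test inputs from a seed."""
    rng = random.Random(seed)
    return [rng.randint(-10**6, 10**6) for _ in range(count)]


def run_property(prop, seed, count=100):
    """Return (seed, failing_input) for the first failure, or None if every input passes."""
    for value in generate_inputs(seed, count):
        if not prop(value):
            return seed, value
    return None


def is_even(n):
    """A property that is obviously false for about half of all integers."""
    return n % 2 == 0


# The same seed always yields the same inputs, so the same failure comes back.
first_failure = run_property(is_even, seed=1042)
replayed_failure = run_property(is_even, seed=1042)
assert first_failure == replayed_failure
```

Because the inputs are a pure function of the seed, logging the seed in CI is equivalent to logging every generated input, at a fraction of the size.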
State Machine Testing — Stateful Testing
So far we have tested pure functions, but the hard bugs of production live in stateful code: caches, connection pools, shopping carts, DB layers. State machine testing generates random sequences of operations and runs the real implementation side by side with a simple model, asserting agreement at every step.
# A state machine test comparing an LRU cache against a dict model (Hypothesis)
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant

from lru import LRUCache

CAPACITY = 8


class LRUCacheMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.real = LRUCache(capacity=CAPACITY)
        self.model = {}  # model: a dict that remembers order (Python dicts preserve insertion order)

    @rule(key=st.integers(0, 20), value=st.integers())
    def put(self, key, value):
        self.real.put(key, value)
        self.model.pop(key, None)
        self.model[key] = value
        if len(self.model) > CAPACITY:
            oldest = next(iter(self.model))
            del self.model[oldest]

    @rule(key=st.integers(0, 20))
    def get(self, key):
        expected = self.model.get(key)
        if expected is not None:
            # refresh recency in the model as well
            self.model.pop(key)
            self.model[key] = expected
        assert self.real.get(key) == expected

    @invariant()
    def size_never_exceeds_capacity(self):
        assert self.real.size() <= CAPACITY


TestLRUCache = LRUCacheMachine.TestCase
When this test fails, Hypothesis shrinks the run down to a minimal operation sequence — something like "fails after put(3, 1), put(4, 2), get(3), put(5, 9)." Bugs like a missed LRU refresh or an off-by-one at the capacity boundary rarely survive this treatment. fast-check supports the same pattern with its commands API, and jqwik with action chains.
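The state machine above imports `LRUCache` from a module that is not shown. For readers who want something to run, here is one possible stand-in with the same `put`/`get`/`size` surface, plus a framework-free driver that applies the same model-versus-real idea with `random`. Both the class body and `run_random_operations` are assumptions made for illustration, not the implementation the test was written against, and the driver does no shrinking.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Hypothetical stand-in for the `lru.LRUCache` used by the state machine test."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)  # an update also refreshes recency
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # a hit refreshes recency
        return self._data[key]

    def size(self):
        return len(self._data)


def run_random_operations(steps=500, capacity=8, seed=0):
    """Drive the cache and a plain-dict model with the same random operations."""
    rng = random.Random(seed)
    real, model = LRUCache(capacity), {}
    for _ in range(steps):
        key = rng.randint(0, 20)
        if rng.random() < 0.5:
            value = rng.randint(-1000, 1000)
            real.put(key, value)
            model.pop(key, None)
            model[key] = value
            if len(model) > capacity:
                del model[next(iter(model))]  # oldest key comes first in a dict
        else:
            expected = model.get(key)
            if expected is not None:
                model[key] = model.pop(key)  # move the key to the most-recent end
            assert real.get(key) == expected
        assert real.size() == len(model) <= capacity
    return steps
```

Removing either `move_to_end` call is a quick way to see the comparison fail, which is exactly the "missed LRU refresh" class of bug.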
Where to Apply It First — Production Domains
Here are the domains where PBT is especially strong, in priority order.
| Domain | Recommended property patterns | Expected payoff |
| --- | --- | --- |
| Parsers/formatters | Round-trip, no-crash | Mass discovery of input corner cases |
| Serialization | Round-trip, schema compatibility | Early detection of cross-version breakage |
| Money/quantity math | Invariants, monotonicity, sum preservation | Blocks rounding/overflow bugs |
| Normalization/dedup | Idempotence | Blocks double-application bugs |
| Data structure implementations | Model comparison, state machines | Broad coverage of boundary cases |
| Concurrent code | State machines plus randomized interleaving | Helps detect hard-to-reproduce races |
| API input validation | No-crash, postconditions | Robustness from a security standpoint |
Conversely, there are domains where PBT is inefficient. Integration with external systems per se (calling a real payment gateway), pixel-level UI, and logic whose definition of "correct" is subjective (recommendation ranking) are poor fits, and forcing PBT onto them is rarely worth the effort.
CI Integration — Managing Runtime and Flakiness
The two worries about putting PBT into CI — runtime and flakiness — are tamed with configuration.
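# Hypothesis: separate profiles for local/CI/nightly (for example in conftest.py)
import os

from hypothesis import settings, HealthCheck

settings.register_profile("dev", max_examples=50)
settings.register_profile("ci", max_examples=200, deadline=None,
                          suppress_health_check=[HealthCheck.too_slow])
settings.register_profile("nightly", max_examples=2000)
# Registering a profile does not activate it; load the one named in the environment
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "dev"))
# at run time: HYPOTHESIS_PROFILE=ci pytest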
Operating principles:
1. Run a moderate number of examples at the PR gate (100-200) and a large number in nightly builds (thousands). Division of labor: nightly hunts for bugs, PR prevents regressions.
2. Remove time and network dependence from generators. The main culprit behind flakiness is not randomness but hidden nondeterminism (current time, external calls).
3. Configure reporters to log seeds and counterexamples on failure, and promote discovered counterexamples to examples as described above.
4. Per-case deadlines easily produce false failures from CI machine performance variance — disable them in CI or set them generously.
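Principle 2 is easiest to see in code. The sketch below assumes a hypothetical token-expiry check, `is_expired`, that takes the current time as an argument instead of reading the clock, so a test can generate `now` like any other input and get the same result on every machine.

```python
import random
from datetime import datetime, timedelta, timezone


def is_expired(issued_at, ttl_seconds, now):
    """Pure function: the caller passes `now` instead of the function reading the clock."""
    return now >= issued_at + timedelta(seconds=ttl_seconds)


def check_expiry_property(runs=200, seed=7):
    """Boundary property: valid one second before the deadline, expired exactly at it."""
    rng = random.Random(seed)
    epoch = datetime(2026, 1, 1, tzinfo=timezone.utc)
    for _ in range(runs):
        issued_at = epoch + timedelta(seconds=rng.randint(0, 10**7))
        ttl = rng.randint(1, 86_400)
        deadline = issued_at + timedelta(seconds=ttl)
        assert not is_expired(issued_at, ttl, deadline - timedelta(seconds=1))
        assert is_expired(issued_at, ttl, deadline)
    return runs
```

Had `is_expired` called `datetime.now()` internally, the same property could pass or fail depending on when CI happened to run it.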
Synergy with Verifying AI Code — The 2026 Angle
The core driver of this revival deserves its own section. AI-written code and PBT are structurally well matched.
1. The failure mode of AI code is "90 percent plausible plus 10 percent subtly wrong." Passing the demo examples while failing at boundary conditions is the typical pattern — and that is exactly the bug class PBT was designed to catch.
2. Properties are specifications, and specifications should stay in human hands. Even if you delegate the implementation to an agent, when a human writes the property definitions — "the round-trip must hold," "totals must be preserved" — the focus of review rises from line-by-line code reading to specification checking.
3. They become the automatic grader in an agent loop. Tell the agent "keep fixing until these property tests pass" and the property tests push counterexamples at it in your place, driving the loop. Crucially, overfitting (hardcoding to the tests) is far harder than with example tests, because you cannot hardcode your way past random inputs.
One caution: do not hand property-writing itself entirely to the AI. A property that shares the same misunderstanding as the implementation passes while being wrong together with it (the self-consistency trap). Properties should originate from requirements and domain knowledge — that is the part the human contributes.
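A small sketch of the overfitting point, using the three example cases from the `test_add` example earlier in this post. `add_overfitted` is a contrived implementation that memorizes those cases, and `passes_examples` and `passes_properties` are hypothetical checkers written with the standard library rather than a PBT framework.

```python
import random


def add_overfitted(a, b):
    """An 'implementation' that only memorizes the three example cases."""
    answers = {(2, 3): 5, (-1, 1): 0, (0, 0): 0}
    return answers.get((a, b), 0)


def add_correct(a, b):
    return a + b


def passes_examples(add):
    """The example-based test: three fixed input/output pairs."""
    return add(2, 3) == 5 and add(-1, 1) == 0 and add(0, 0) == 0


def passes_properties(add, runs=100, seed=0):
    """Identity (a + 0 == a) and commutativity checked on random integers."""
    rng = random.Random(seed)
    for _ in range(runs):
        a, b = rng.randint(-1000, 1000), rng.randint(-1000, 1000)
        if add(a, 0) != a or add(a, b) != add(b, a):
            return False
    return True
```

The memorizing version sails through the examples but cannot satisfy the identity law on inputs it has never seen, while the honest version passes both checks.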
Adoption Guide — Gradually, Into Existing Tests
No big-bang adoption is required. The recommended sequence:
1. Week 1: pick one function pair in your existing code where a round-trip holds (serialization, encoding) and add a single property test. Use this one test to validate tool installation and CI wiring.
2. Week 2: add the no-crash property to two or three public APIs. Finding an unexpected crash gives you the material to persuade the team.
3. Week 3: design invariant properties for one bug-prone module. Discussing "what laws must this module obey?" with the team has the side effect of a design review.
4. After that: every time a new bug is reported, add to the retro the question "what property would have caught this bug?" Property suites grow fastest out of bug postmortems.
Pitfalls — How to Fail at PBT
- Excessive generator complexity: when you start replicating business logic inside generators to produce valid inputs, that is a warning sign. A generator as complex as the implementation means you are testing the generator's bugs. The orthodox fix is "generate simple inputs and reach valid states through the public API" rather than generate-then-filter.
- Tautological properties: copying the implementation into the property (asserting f(x) equals the same formula as the implementation) verifies nothing. Properties must come from a different angle than the implementation — round-trips, model comparison.
- Filter abuse: a filter that throws away 99 percent of generated inputs slows the test and distorts the input distribution. Generate constructively instead (need even numbers? generate integers and multiply by two).
- Custom generation that ignores shrinking: map-based transformations keep shrinking working, but values built directly from external randomness do not shrink, leaving you with gigantic counterexamples. Stay inside the framework combinators.
- The 100 percent replacement fantasy: PBT is a complement to example tests, not a replacement. Representative examples that document intent and property tests that explore the space play different roles.
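The filter-abuse pitfall can be measured directly. The sketch below compares rejection sampling with constructive generation for even numbers; `evens_by_filter` and `evens_by_construction` are hypothetical helpers that use `random` in place of a framework's filter and map combinators, and each returns the values together with the number of draws it needed.

```python
import random


def evens_by_filter(rng, count):
    """Rejection sampling: draw integers and throw away the odd ones."""
    kept, draws = [], 0
    while len(kept) < count:
        n = rng.randint(-1000, 1000)
        draws += 1
        if n % 2 == 0:
            kept.append(n)
    return kept, draws


def evens_by_construction(rng, count):
    """Constructive generation: every draw is valid because it is built as 2 * k."""
    return [2 * rng.randint(-500, 500) for _ in range(count)], count
```

With a 50 percent acceptance rate the filter wastes roughly half its draws; with a filter that rejects 99 percent of inputs, the same loop would need about a hundred draws per usable value.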
Checklist
- [ ] Round-trip properties exist for function pairs where round-trips hold
- [ ] Public APIs have a "no crash on valid input" property
- [ ] Money/quantity code has invariants defined (no negatives, sum preservation, monotonicity)
- [ ] A procedure exists to promote discovered counterexamples to permanent regressions via example/seed
- [ ] CI profiles are separated (fast on PR, deep nightly)
- [ ] Generators are simpler than the implementation (if not, revisit the design)
- [ ] Stateful core components have state machine tests
- [ ] Property definitions originate from requirements/domain knowledge (not copied from the implementation)
- [ ] Passing property tests is part of the merge criteria for AI-written code
Closing
Property-based testing is an old technique that began with QuickCheck in 1999, but in 2026 we need it for a new reason. When humans wrote all the code, the problem was "inputs I did not think of." In the era of AI-written code, the problem is the mass production of "implementations nobody thought deeply about." Examples only confirm that an implementation looks plausible; properties confirm that it obeys the law.
No grand rollout is needed. Find one encode-decode pair in your codebase and add a single round-trip property. With high probability, that test will gift you a bug you did not know about within its first week.
References
- jqwik official site: https://jqwik.net/
- jqwik user guide: https://jqwik.net/docs/current/user-guide.html
- Hypothesis documentation: https://hypothesis.readthedocs.io/
- Hypothesis — What is property-based testing?: https://hypothesis.works/articles/what-is-property-based-testing/
- fast-check GitHub: https://github.com/dubzzz/fast-check
- fast-check documentation: https://fast-check.dev/
- The original QuickCheck paper (Claessen & Hughes, 2000): https://dl.acm.org/doi/10.1145/351240.351266
- Hypothesis GitHub: https://github.com/HypothesisWorks/hypothesis
- jqwik GitHub: https://github.com/jqwik-team/jqwik
- Hacker News (many PBT discussions): https://news.ycombinator.com/
- GeekNews: https://news.hada.io/