Why Passing Tests Are No Longer Enough

When the lead on a project says the rewrite is done because “100% of the tests pass,” Sanfilippo hears an alarm bell. To him, that line describes a seductive but fragile idea of software, as if a fixed suite could guarantee coverage of every possible state. His view is that this model is losing centrality, and that in its place a dirtier, more expensive, and more realistic way of verifying programs is emerging: running them through agents that behave like users, QA engineers, and even virtual developers. The promise is not perfection, but a higher probability of finding trouble before it reaches users.

100% test coverage is an illusion

Sanfilippo starts from an almost instinctive reaction to the Banne rewrite: the idea that a codebase is truly correct because “100% of the tests pass” seems reassuring only on the surface. His objection is clear: he is not talking about line coverage, which is already hard in a large project, but about state coverage, the part that really determines a system’s behavior.

We finished the rewrite and it’s complete because now 100% of the tests pass.
0:25

I’m talking about 100% state coverage. So basically it’s like saying: okay, now it works like the other one, but how do you know that’s true? It’s essentially impossible.
0:25

From there comes the broader point: for Sanfilippo, traditional testing is already full of holes and survives thanks to extra practices that do not belong to classic automated tests. Precisely because tests are not enough, large projects have QA teams that do integration, smoke tests, and checks closer to real software behavior.

If tests were like the mental model says, there would be no need for QA teams to do integration tests like this.
2:06

We know that our testing methodology is completely full of holes.
2:06

Agentic tests imitate users

The solution Sanfilippo imagines is a shift from fixed tests to agentic testing^*. Instead of always running the same checks, you describe a task in a markdown file, and an LLM is put to work as a QA engineer, calling tools, hitting API endpoints, inventing usage sequences, and trying to break the system the way a real person would.

I make a testing.md file, a markdown file where you tell it: in this text you’ll test this product, you’ll be a QA engineer and you’ll run testing sessions with tool calling.
4:23

It’s like exposing your system to users before users see it.
8:02

Sanfilippo gives a Redis example for arrays: instead of writing a few closed cases, he would let the prompt ask the LLM to invent use cases, stress arrays with Python programs, and scale loads from 10 to 100,000, from 1 million to 50 million entries. Then he would add replication, consistency across replicas, saving and reloading, a sequence that looks more like a lab experiment than a unit test.

You have to invent use cases, write Python programs that stress the arrays for that use case, and then start scaling it more and more.
5:03

Then you get into replication, you see whether the data is perfectly consistent across the two replicas, then you save, reload.
5:03

More noise, but also more realism

For Sanfilippo, the advantage of the method is that it does not seek abstract truth but a higher probability of catching the defect before release. The test is not identical every time, because the LLM samples differently and the system itself introduces nondeterminism, especially when timing, tool calls, and output parsing by coding agents are involved.

This test is not always different because the LLM already does sampling when it generates, so it changes.
6:18

You’ll probably catch it earlier, also because you can run this stuff continuously if you want.
8:28

That is why Sanfilippo describes a second agent that checks whether the bug is real, filtering out false positives and opening an email or automated issue if needed. The result resembles a small QA factory, with LLMs testing, other LLMs confirming, and the human team receiving only the strongest alerts.

You can run tests where LLMs are continuously testing your system. Then there’s another model session that checks whether the bug is really there.
8:28

If it really is, you get an email, an automatic issue gets opened, whatever you want.
8:28

Rust, rewrites, and hidden problems

The second half of the argument turns to the Rust rewrite of Banne^*. Sanfilippo says the move will surface “a lot of hidden problems” that have not been noticed yet, and he ties that prediction to software history: every rewrite promises cleanliness, but often brings new complexity and new problems to discover along the way.

The Banne rewrite in Rust, like all rewrites, will reveal that there are a lot of hidden problems they still haven’t noticed.
9:32

That is part of software history itself.
9:32

Here Sanfilippo also corrects a reading that, he says, has circulated badly in the comments: the jump from 600,000 to 1 million lines of code would not be attributable to Rust alone, because the previous code had also been written with AI help. His point is not to defend one technology against another, but to reject an analysis that confuses the language of the code with the quality of the final product.

If you use the fact that the Rust code now sucks as an argument, know that you missed the fact that the previous one was also written with AIs.
9:50

In his view, Rust rewrites also tend to become more verbose for another reason: LLMs, in trying to manage the language’s complexity, add extraction layers and intermediate steps that a skilled human might avoid. From there comes his broader criticism, which is not about one language but about the risk of ending up in a local minimum that is hard to escape without starting almost from scratch.

Rust rewrites end up being slightly more verbose for a number of reasons, but especially because LLMs try to counter certain Rust complexities with more extraction layers.
10:12

Rushing costs more than caution

Sanfilippo closes with a practical engineer’s thesis: when choosing a direction, it is worth investing a lot of time up front to pick the right thing and grow it organically, instead of using passing tests as sufficient proof that the work is good. His target is not rewrites themselves, but the temptation to confuse an operational signal with a substantive guarantee.

It’s worth investing a lot of time at the start to get the right thing and then grow that right thing organically.
11:06

It’s not that because we don’t see the efforts, the efforts aren’t there. Those efforts are there; we just don’t see them.
11:06

For Sanfilippo, the most concrete proof that code is weak is not a theoretical manifesto, but its awkwardness, the slowness of changing it, the effort the team faces when trying to modify it. In that sense, agentic testing does not replace human judgment; it tries to get ahead of it, putting software in front of an artificial audience before the real one arrives.

FAQ

What is agentic testing according to Sanfilippo?

It is testing in which an LLM simulates a QA engineer or a user, tries the product, uses tools, and attempts to break the system. Sanfilippo sees it as an increasingly important part of future testing.

Why aren’t 100% of tests enough?

Because, in Sanfilippo’s view, they do not truly cover every possible state of a complex piece of software. Green tests build confidence, but they do not prove the system will behave well in every real situation.

Why does he talk about Rust in this discussion?

He uses the Banne rewrite in Rust as an example of the limits of rewrites and traditional testing. In his view, a rewrite exposes bugs and often produces more verbose code.

Can LLMs replace human QA?

No, not according to Sanfilippo. They can imitate many QA tasks and surface problems earlier, while humans are still needed to judge ambiguous cases and false positives.

What is his criticism of the comments about the rewrite?

That many people blamed Rust for the growth in code size while ignoring that the previous version had also been written with AI. For him, the discussion was muddled both factually and logically.