A new era for software testing
Posted by Chrisszz 10 days ago
Comments
Comment by rglover 5 days ago
In theory. The only difference between today and "the aughts" is that we have machines that can spit out a ton of code very quickly.
Nothing has changed about the discipline or honesty around testing (you can skip automated tests even faster now if you wish). You can and should work with AI to write tests, but you have to know the difference between a good test and a "looks good on paper" test in order for it to truly be effective and raise the quality of what you're building.
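To make the "looks good on paper" distinction concrete, here is a minimal sketch. The `apply_discount` function and both tests are hypothetical, invented for illustration. The first test recomputes its expected value with the same formula as the code under test, so it can never disagree with it; the second checks answers worked out by hand.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Return the price after a percentage discount, rounded down to whole cents."""
    return price_cents * (100 - percent) // 100


def test_looks_good_on_paper() -> None:
    # Tautological: the expected value is derived from the same formula as the
    # implementation, so this passes even if the formula itself is wrong.
    price, percent = 1999, 15
    assert apply_discount(price, percent) == price * (100 - percent) // 100


def test_checks_known_answers() -> None:
    # Expected values computed by hand, independent of the implementation.
    assert apply_discount(1000, 10) == 900
    assert apply_discount(1999, 15) == 1699  # 1999 * 85 = 169915, floored to 1699
    assert apply_discount(500, 0) == 500
    assert apply_discount(500, 100) == 0


if __name__ == "__main__":
    test_looks_good_on_paper()
    test_checks_known_answers()
```

Both tests pass and both raise the coverage number, but only the second would fail if someone changed the rounding or the formula.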
Comment by onlyrealcuzzo 5 days ago
I've been building a compiler with LLMs for a memory safe language like Rust with near zero cost abstractions (no GC), but with WAY less cognitive overhead.
I can tell you right now:
1) It's 100x more than I could have achieved with zero compiler design experience.
2) I'm HIGHLY skeptical that LLMs can build something of this complexity (in some ways it's more difficult than implementing a Rust compiler), so the testing is quite robust: 3 different systems (unit, integration, fuzz tests), each with mutation testing and each with ~65-90% line coverage and ~50-80% branch coverage, for a combined ~99% line coverage and ~86% branch coverage.
There is ZERO chance I could get something even close to this level of "working" by myself ever - let alone with minimal effort.
The test is kind of simple: if LLMs can do this, they should be able to do just about anything. It is notoriously difficult to verify that a compiler actually works, rather than just kind of working sometimes.
People can say I'm wasting my time all they want.
But, one, it's been enlightening. I'm literally in awe of what they can do and have done.
Two, I've developed a bunch of tooling / metrics necessary to get them to be able to do something at this level of complexity without falling over themselves. And I think it can work at scale pretty easily.
Nearly all of the research behind the complexity metrics comes from the 80s or earlier.
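For readers unfamiliar with the mutation testing mentioned above, a rough sketch of the idea follows. It is not the commenter's tooling; `is_adult`, the hand-written mutant, and both case lists are hypothetical. A mutant is a small deliberate change to the code, and a test suite "kills" it if the suite passes on the original but fails on the mutant.

```python
from typing import Callable, List, Tuple

Case = Tuple[int, bool]


def is_adult(age: int) -> bool:
    """Original implementation."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Mutant: `>=` replaced by `>`, the kind of change a mutation tool would make."""
    return age > 18


# A suite that executes every line of is_adult but never touches the boundary.
WEAK_CASES: List[Case] = [(30, True), (5, False)]
# The same suite plus boundary cases.
STRONG_CASES: List[Case] = WEAK_CASES + [(18, True), (17, False)]


def suite_passes(fn: Callable[[int], bool], cases: List[Case]) -> bool:
    """Return True if fn gives the expected answer for every case."""
    return all(fn(age) == expected for age, expected in cases)


def mutant_killed(cases: List[Case]) -> bool:
    """A mutant is killed when the suite passes on the original and fails on the mutant."""
    return suite_passes(is_adult, cases) and not suite_passes(is_adult_mutant, cases)


if __name__ == "__main__":
    print("weak suite kills mutant:", mutant_killed(WEAK_CASES))
    print("strong suite kills mutant:", mutant_killed(STRONG_CASES))
```

The weak suite has full line coverage of `is_adult` and still lets the mutant survive, which is why coverage and mutation results are usually reported side by side.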
Comment by achierius 5 days ago
What you're thinking of is "no runtime" or "lightweight runtime", which does often mean "no garbage collector".
Comment by onlyrealcuzzo 5 days ago
When people think of "zero cost" they don't think about std::optional. They think about not having to manage memory lifetimes AND NOT having to pay for a Garbage Collector to do it for you. That was always the trade you made until Rust.
I add on some cost to locks to prevent deadlock, and some cost to loops to insert co-operative yields in concurrent contexts unless you turn it off.
Comment by 8note 5 days ago
huh? you can rotate and scale the ownership?
Comment by AlotOfReading 5 days ago
Comment by wavemode 4 days ago
Automated verifiability goes down once a software project incorporates things like:
- Concurrency
- Networking / distributed systems
- Visuals / animations
- Domain knowledge (e.g. banking, finance)
Comment by onlyrealcuzzo 4 days ago
Comment by mlmonkey 5 days ago
But not any more! Now I point the LLM to the code and order it to write unit tests, covering all edge cases, etc. I'd rather spend 3 hours arguing with the LLM than writing unit tests! :-D
Comment by dkn 5 days ago
Comment by mplanchard 5 days ago
Comment by dcastm 5 days ago
Comment by dkn 5 days ago
I instruct the LLM to follow TDD practices in certain areas, but otherwise prioritize integration style tests at the edges.
Comment by aplomb1026 5 days ago
Comment by spaceclay 5 days ago
Comment by zerr 5 days ago
Comment by kovek 5 days ago
Comment by pydry 3 days ago
If you find writing tests tedious enough that using an LLM to write them seems like a good idea, you're probably churning out repetitive tests, unnecessary tests, or tests which aren't great at catching bugs.
Comment by bob1029 5 days ago
https://playwright.dev/docs/release-notes#version-159
If you set this up correctly, you can have a main agent issue natural language testing instructions to this playwright agent which returns a natural language summary of what it experiences. This is the sort of thing where I begin to get interested in the idea of agents working while I sleep.
Comment by avensec 5 days ago
If your codebase is mature enough, please don't have a single Skill/Steering/Persona/Ruleset (or whatever) for your "QA Engineer." This is just the same "my behavioral file can one-shot the entire system build" kind of thinking that will give you expensive, marginal results as the system grows.
If you want to have success in this space, get really fine-grained. Every single test scope needs its own behavioral files.
Have your core behavioral file define some simple specifics around Test Pyramid, Test Purposes, checks for tautological tests, etc. Then get _really_ specific:
<test-type>-architect (plan)
<test-type>-engineer (execute)
<test-type>-resolver (problem solver, maintenance, how to manage a failure, etc.)
e.g., playwright-architect, etc.
Then create additional ones for Unit tests, API tests, contract tests, or any other required test layer for the SUT.
Overengineered? Maybe, depending on the size of your codebase. But for anything significant, you are codifying what humans and their skillsets do.
Comment by spaceclay 5 days ago
Comment by kulahan 5 days ago
Ten million blackboxes with ten billion tests or whatever. Otherwise it’s literally the blind leading the blind
Comment by simianwords 10 days ago
Two of the reasons I never liked writing tests are
- they didn’t seem to usually assert much internal logic
- they would have to be maintained along with the original code
I think scenario testing is much better instead because the actual way a person uses a feature hardly changes but the internals might change a lot.
So imagine I’m making an e-commerce website. There are lots of internal mechanisms. I’ll have an agent testing all the functionalities as if it were a customer. This gives me much much more confidence while writing code because it is more uncorrelated with the code.
Tomorrow I can change a lot of internals but the testing agent stays the same.
There’s something to note though: not all code can be scenario tested. Data engineering, for example, and other things where the feedback time is huge.
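A deterministic stand-in for the scenario idea above, as a sketch only: the comment describes an agent driving a real site, whereas this uses a hypothetical in-memory `Shop` class with invented item names and prices. The point it illustrates is the same, namely that the scenario only touches what a customer can do, so the internals can be rewritten without editing it.

```python
from typing import Dict


class Shop:
    """Hypothetical storefront; only the public methods matter to the scenario."""

    def __init__(self, catalog: Dict[str, int]) -> None:
        self._catalog = dict(catalog)      # item name -> price in cents
        self._cart: Dict[str, int] = {}    # item name -> quantity

    def add_to_cart(self, item: str, quantity: int = 1) -> None:
        if item not in self._catalog:
            raise KeyError(f"unknown item: {item}")
        self._cart[item] = self._cart.get(item, 0) + quantity

    def remove_from_cart(self, item: str) -> None:
        self._cart.pop(item, None)

    def checkout(self) -> int:
        """Return the amount charged in cents and empty the cart."""
        total = sum(self._catalog[item] * qty for item, qty in self._cart.items())
        self._cart.clear()
        return total


def scenario_customer_changes_mind_then_buys() -> int:
    # Written from the customer's point of view: browse, add, remove, pay.
    shop = Shop({"mug": 1200, "poster": 800})
    shop.add_to_cart("mug", 2)
    shop.add_to_cart("poster")
    shop.remove_from_cart("poster")
    total = shop.checkout()
    assert total == 2400        # two mugs, no poster
    assert shop.checkout() == 0  # cart is empty after paying
    return total


if __name__ == "__main__":
    scenario_customer_changes_mind_then_buys()
```

Swapping the cart dictionary for a list of line items, or moving prices into a database, would leave this scenario untouched.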
Comment by anthonypasq 5 days ago
i feel like im going insane
Comment by hugs 5 days ago
Comment by acdha 5 days ago
Comment by Daishiman 5 days ago
Comment by simoncion 5 days ago
People falling all over themselves to write docs for their pile-of-linear-algebra-with-a-smiley-face-painted-on-it [0] don't read the docs, no. People who give a shit about writing solid software that doesn't get them paged at three in the damn morning do.
[0] The face is there to provide social-trustworthiness signals to engage the human pack-bonding instinct, natch.
Comment by Daishiman 5 days ago
A decade ago I left a job and spent the last week thoroughly documenting every flow and code section of an app that I worked with, which was the core value proposition of the company. A couple of years later I asked around and nobody had even taken a look at it.
People just don't read, and there are actually good reasons for that, one of them being that documentation is outdated in most orgs and the effort to keep it up to date is greater than reading the code.
Comment by simoncion 5 days ago
Wow. What I said is true and reflects the experience of a lot of people. Amazing!
Comment by Daishiman 5 days ago
Comment by simianwords 5 days ago
Comment by acdha 5 days ago
Comment by dragonwriter 5 days ago
Comment by inigyou 5 days ago
Comment by righthand 5 days ago
Comment by dragonwriter 5 days ago
Comment by avensec 5 days ago
Comment by konart 5 days ago
How is scenario different from a behavior (as in Behavior-Driven Development)?
Gherkin and things like Cucumber are not something new, are they?
Comment by rahoulb 5 days ago
They write really good Gherkin features and then work inwards writing unit tests as they go - checking that they fail before implementation so it's actually testing something worthwhile.
And the code they ship is decent quality (not as good as me most of the time - but a LOT better than me when I'm tired or I'm pissed off about something or the work is really boring).
Comment by pbalau 5 days ago
Comment by righthand 5 days ago
Comment by hulitu 8 days ago
Are you an engineer? You must test your "creation". Or would you expect the microwave oven you just bought to be tested by your child while getting burned?
Comment by robotresearcher 5 days ago
Comment by marshalhq 5 days ago
Comment by onemoresoop 5 days ago
Comment by pfdietz 5 days ago
Comment by ahartmetz 3 days ago
- If you call the setter, the getter returns the same value - these are kinda bullshit and would be caught by the next level anyway
- Testing basic normal use
- Testing known difficulties of the implementation
- Exhaustive or randomized (if necessary) testing of the state space, ~= property-based testing
I expect AI to have very different levels of ability for these, not necessarily in strictly descending order as listed.
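The last level in the list above, randomized testing of the state space, can be sketched with nothing but the standard library. The `merge_sorted` function and the trial parameters are hypothetical; the property checked is that merging two sorted lists gives the same result as sorting their concatenation. Libraries such as Hypothesis automate the input generation and also shrink failing inputs, which this sketch does not attempt.

```python
import random
from typing import List


def merge_sorted(left: List[int], right: List[int]) -> List[int]:
    """Merge two already-sorted lists into one sorted list."""
    merged: List[int] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two tails is non-empty
    merged.extend(right[j:])
    return merged


def check_merge_property(trials: int = 500, seed: int = 0) -> int:
    """Compare merge_sorted against a trusted reference on random inputs."""
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    for _ in range(trials):
        left = sorted(rng.randint(-50, 50) for _ in range(rng.randint(0, 20)))
        right = sorted(rng.randint(-50, 50) for _ in range(rng.randint(0, 20)))
        assert merge_sorted(left, right) == sorted(left + right), (left, right)
    return trials


if __name__ == "__main__":
    check_merge_property()
```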
Comment by ptx 4 days ago
If so, if this is meant to imply that LLMs are just another step towards higher-level abstractions, the analogy doesn't quite work. Unlike a COBOL compiler, the LLM's output can't be predicted or reasoned about, so you can't really fix bugs in your program (i.e. your prompt) but only try to permute it haphazardly and hope for the best.
[1] https://ethw.org/Milestones:A-0_Compiler_and_Initial_Develop...
Comment by wrxd 10 days ago
Comment by simianwords 10 days ago
Unit tests and deterministic tests are hard to get right and need to be done at the correct boundary.
I have seen many people dogmatically pushing unit tests, but this often leads to very hard to maintain tests that mostly exist just to change along with the main code itself.
A good way to understand if your unit tests are good: are you changing them along with changing your actual code? Then it’s a bad test. I think the argument for “it’s just documentation” is weak.
Comment by fcarraldo 10 days ago
Of course, if you’re just watching Claude changing both and saying “LGTM” then it’s not very valuable.
Comment by skydhash 5 days ago
Unit tests are great for pure algorithms, like file formats, data encoding, crypto, etc.: everything with a spec that will rarely change. You write your tests once and basically never have to update them.
But for requirements that change often, like in enterprise settings or applications, maintaining a suite of unit tests is expensive. Integration tests are better because contracts between modules don’t change that much. Even if the suite is not exhaustive, it's useful enough to catch some failures.
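As an illustration of the "write once, rarely update" case for pure algorithms, here is a small sketch using run-length encoding. The `rle_encode` and `rle_decode` functions are hypothetical examples, not taken from any library. The tests pin down a few known encodings and a round-trip property, neither of which depends on how the functions are implemented.

```python
from typing import List, Tuple

Run = Tuple[str, int]


def rle_encode(text: str) -> List[Run]:
    """Collapse consecutive repeated characters into (character, count) pairs."""
    runs: List[Run] = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((ch, 1))              # start a new run
    return runs


def rle_decode(runs: List[Run]) -> str:
    """Expand (character, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in runs)


def test_rle() -> None:
    # Known answers fixed by the format itself.
    assert rle_encode("") == []
    assert rle_encode("aaabcc") == [("a", 3), ("b", 1), ("c", 2)]
    # Round trip: decoding an encoding must return the input unchanged.
    for sample in ["", "a", "abc", "aaabcc", "zzzzzzzz"]:
        assert rle_decode(rle_encode(sample)) == sample


if __name__ == "__main__":
    test_rle()
```

The encoder could be rewritten with `itertools.groupby` or moved to a different language binding and these tests would stay exactly as they are.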
Comment by simianwords 3 days ago
Yes this is what I'm trying to say.
Comment by npodbielski 5 days ago
Comment by wesselbindt 5 days ago
Comment by devin 5 days ago
Comment by jason_s 4 days ago
Comment by kofj 5 days ago
Comment by tomaspiaggio12 5 days ago