Agents that run while I sleep
Posted by aray07 6 hours ago
Comments
Comment by hermit_dev 1 minute ago
Comment by egeozcan 5 hours ago
The trick is just not mixing/sharing the context. Different instances of the same model don't recognize each other, so there's no temptation to be more compliant toward the other's work.
Comment by magicalist 5 hours ago
It helps, but it definitely doesn't always work, particularly as refactors go on and tests have to change. Useless tests start to grow in number, and important new things aren't tested or aren't tested well.
I've had both Opus 4.6 and Codex 5.3 recently tell me the other (or another instance) did a great job with test coverage and depth, only to find tests that just asserted the test harness had been set up correctly, while the functionality those tests used to cover was now checked only for existence, its behavior left virtually untested.
Reward hacking is very real and hard to guard against.
Comment by egeozcan 4 hours ago
The concept is:
Red Team (test writers): write tests without seeing the implementation. They define what the code should do based on specs/requirements only. Rewarded by test failures. A new test that passes immediately is suspicious, as it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced," and the barrier prevents them from writing tests pre-adapted to pass.
Green Team (implementers): write implementation to pass tests without seeing the test code directly. They only see test results (pass/fail) and the spec. Rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as a noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.
Refactor Team: improve code quality without changing behavior. They can see the implementation but are constrained by tests passing. Rewarded by nothing changing (pretty unusual in this regard). The reward is that all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give this team the skills needed to use them.
It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense.
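Concretely, the barrier can be nothing more than a filter on what goes into each team's context. A minimal sketch (the function, role names, and dict keys are all invented for illustration; real agent dispatch is omitted):

```python
# Hypothetical sketch: the barrier is just a filter on the artifacts that
# reach each team's prompt. Nothing here is a real API.

def build_context(role, spec, impl_files, test_files, test_results):
    """Return only the artifacts the given team is allowed to see."""
    if role == "red":       # test writers: spec and their own tests, never the implementation
        return {"spec": spec, "tests": test_files}
    if role == "green":     # implementers: spec, code, pass/fail results, never test source
        return {"spec": spec, "impl": impl_files, "results": test_results}
    if role == "refactor":  # refactorers: code and results, but not the spec
        return {"impl": impl_files, "results": test_results}
    raise ValueError(f"unknown role: {role}")
```

The coordinator would call something like this before dispatching each sub-agent, so a Green instance physically never receives assertion text, only failure messages.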
Comment by w4yai 2 hours ago
What kind of setup do you use ? Can you share ? How much does it cost ?
Comment by throwaway7783 8 minutes ago
It works wonderfully well. Costs about $200 USD per developer per month as of now.
Comment by aprdm 1 hour ago
Comment by mrbungie 28 minutes ago
Comment by dworks 2 hours ago
(I built it)
Comment by cheema33 28 minutes ago
Comment by _ink_ 2 hours ago
Comment by stavros 2 hours ago
Comment by canadiantim 1 hour ago
Comment by nojito 44 minutes ago
Comment by throwaway7783 7 minutes ago
Comment by canadiantim 27 seconds ago
Comment by canadiantim 8 minutes ago
Comment by tomtom1337 4 hours ago
And do you have any prompts to share?
Comment by throwaway7783 3 minutes ago
* There is a lot of duplication between A & B. Refactor this.
* Look at ticket X and give me a root cause
* Add support for three new types of credentials - Basic Auth, Bearer Token and OAuth Client Creds
Claude.md has stuff like "Here's how you run the frontend. Here's how you run the backend. This module supports the frontend. That module is batch jobs. Always start commit messages with the ticket number. Always run compile at the top level. When you make code changes, always add tests," etc.
Comment by xienze 2 hours ago
Comment by habinero 2 hours ago
Comment by gedy 1 hour ago
Comment by skybrian 4 hours ago
Comment by egeozcan 4 hours ago
To be clear, I don't do this. I never saw an agent cheat by peeking or something. I really did look through their logs.
I'd be very interested to see Claude Code and other tools support this pattern when dispatching agents, to be really sure.
Comment by achierius 3 hours ago
How do you know that it works then? Are you using a different tool that does support it?
Comment by skybrian 4 hours ago
Comment by ssk42 2 hours ago
Setting up a clean room is one of the only ways to do evals on agentic harnesses. It's especially relevant with Windsurf, which doesn't have an easy CLI start.
So how? The easiest answer, when allowed, is Docker: literally a new image per prompt. There are also flags for Claude to disable memory, and from there you can use -p to have it behave like a normal CLI tool. Windsurf requires the manual effort of starting it up in a new dir.
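A sketch of that Docker-per-prompt flow as a thin wrapper (the image name is a placeholder for whatever environment you've built; `claude -p` is the non-interactive mode mentioned above):

```python
import subprocess
import uuid

def build_cmd(prompt, image="agent-cleanroom"):
    """Assemble a one-shot container run. The image name is hypothetical.
    --rm throws the container away afterwards, so no state survives
    between prompts; claude -p runs a single non-interactive prompt."""
    return [
        "docker", "run", "--rm",
        "--name", f"cleanroom-{uuid.uuid4().hex[:8]}",
        image,
        "claude", "-p", prompt,
    ]

def run_clean_room(prompt):
    """Execute one prompt in a fresh container and return its stdout."""
    return subprocess.run(build_cmd(prompt), capture_output=True, text=True).stdout
```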
Comment by skybrian 25 minutes ago
Comment by lagrange77 5 hours ago
Is it really about rewards? I'm genuinely curious, because it's not an RL model.
Comment by gbnwl 4 hours ago
Comment by hexaga 3 hours ago
And with that comes reward hacking - which isn't really about looking for more reward but rather that the model has learned patterns of behavior that got reward in the train env.
That is, any kind of vulnerability in the train env manifests as something you'd recognize as reward hacking in the real world: making tests pass _no matter what_ (because the train env rewarded that behavior), being wildly sycophantic (because the human evaluators rewarded that behavior), etc.
Comment by lagrange77 53 minutes ago
Hm, as I understand it, parts of the training of e.g. ChatGPT could be called RL. But the subject being trained/fine-tuned is still a seq2seq next-token-predictor transformer neural net.
Comment by hexaga 11 minutes ago
Comment by magicalist 4 hours ago
Ha, good point. I was using it informally (you could handwave and call it an intrinsic reward if a model is well aligned to completing tasks as requested), but I hadn't really thought about it.
Searching around, it seems like I'm not alone, but it looks like "specification gaming" is also sometimes used, like: https://deepmind.google/blog/specification-gaming-the-flip-s...
Comment by nurettin 4 hours ago
Comment by SoftTalker 4 hours ago
Comment by gchamonlive 4 hours ago
Comment by bluGill 3 hours ago
The above is really hard. A lot of TDD 'experts' don't understand this and teach fragile tests that are not worth having.
Comment by 8note 47 minutes ago
Your implementation is your interface. It's a bit naive, or hating-your-users, to assume your tests are what your users care about. They're dealing with everything, regardless of what you've tested or not.
Comment by switchbak 1 hour ago
You can change an interface and not change the behaviour.
I have rarely heard an interpretation as rigid as this one.
Comment by magicalist 4 hours ago
Comment by joegaebel 1 hour ago
https://www.joegaebel.com/articles/principled-agentic-softwa... https://github.com/JoeGaebel/outside-in-tdd-starter
Comment by SequoiaHope 4 hours ago
[1] https://simonwillison.net/guides/agentic-engineering-pattern...
Comment by codybontecou 5 hours ago
Comment by elemeno 5 hours ago
You write a failing test for the new functionality that you’re going to add (which doesn’t exist yet, so the test is red). You then write the code until the test passes (that is, goes green).
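A toy illustration of that red/green cycle (the `slugify` function is invented for the example):

```python
# Step 1 (red): write the test first. Running it at this point fails,
# because slugify doesn't exist yet.
def test_slugify():
    assert slugify("Hello World") == "hello-world"

# Step 2 (green): write just enough code to make the test pass.
def slugify(title):
    return title.lower().replace(" ", "-")
```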
Comment by pastescreenshot 5 hours ago
Comment by huslage 3 hours ago
Comment by dworks 2 hours ago
Comment by dmd 5 hours ago
Comment by irishcoffee 5 hours ago
s/liberty/knowledge
Comment by osigurdson 3 hours ago
Comment by afro88 5 hours ago
Comment by Skidaddle 4 hours ago
Comment by aray07 5 hours ago
Comment by darkbatman 5 hours ago
Comment by recroad 4 hours ago
I also spend most of my time reviewing the spec to make sure the design is right. Once I'm done, the coding agent can take 10 minutes or 30 minutes. I'm not really in that much of a rush.
Comment by mjrbrennan 1 hour ago
Comment by genghisjahn 3 hours ago
Comment by paganel 3 hours ago
I still think that we, programmers, having to pay money in order to write code is a travesty. And I'm not talking about paying a license for the odd text editor or even for an operating system; I'm talking about day-to-day operations. I'm surprised there isn't a bigger push-back against this idea.
Comment by switchbak 1 hour ago
Comment by jeremyjh 2 hours ago
Comment by what 2 hours ago
Comment by mr-wendel 19 minutes ago
Fortunately, there was enough work to be done that the productivity increase didn't decrease my billable hours. Even if it had, I still would have done it. If it helps me help others, then it's good for my reputation. That's hard to put a price on, but absolutely worth what I paid in this case.
Comment by fwip 2 hours ago
Comment by xandrius 2 hours ago
Comment by the_af 41 minutes ago
Comment by eKIK 2 hours ago
It's usually not about the price, but more about the fact that a few megacorps and countries "own" the ability to work this way. This leads to some very real risks that I'm pretty sure will materialize at some point in time, including but not limited to:
- Geopolitical pressure - if some ass-hat of a president were hypothetically to decide "nuh uh, we don't like Spain, they're not being nice to us!", they could forbid AI companies from delivering their services to that specific country.
- Price hikes - if you can deliver "$100 worth of value" per hour, but "$1000 worth of value" per hour with the help of AI, then provider companies could still charge up to $899 per hour of usage and it'd still make "business sense" for you to use them since you're still creating more value with them than without them.
- Reduction in quality - I believe people who were senior developers _before_ they started using AI-assisted coding are usually still capable of producing high-quality output. However, every single person I know who "started coding" with tools like Claude Code produces horrible, horrible software, esp. from a security p.o.v. Most of them just build "internal tools" for themselves, and I highly encourage that. Others, though, have pursued developing and selling more ambitious software, just to get bitten by the fact that there's much more to software development than getting semi-correct output from an AI agent.
- A massive workload on some open source projects. We've all heard about projects closing down their bug bounty programs, declining AI generated PRs etc.
- The loss of the joy - some people enjoy it, some people don't.
We're definitely still in the early days of AI assisted / AI driven coding, and no one really knows how it'll develop...but don't mistake the bubble that is HN for universal positivity and acclaim of AI in the coding space :).
Comment by ge96 3 hours ago
Comment by aray07 2 hours ago
Comment by throwaway7783 10 minutes ago
Honestly, sometimes the harnesses, specs, and predefined structures for skills all feel like over-engineering. 99% of the time a bloody prompt will do. Claude Code is capable of planning, spawning sub-agents, writing tests and so on.
A Claude.md file with general guidelines about our repo has worked extraordinarily well, without any external wrappers, harnesses or special prompts. The MD file doesn't even have a specific structure, just instructions or notes in English.
Comment by bhouston 5 hours ago
https://benhouston3d.com/blog/the-rise-of-test-theater
You have to actively work against it.
Comment by joegaebel 31 minutes ago
I've written about this and have a POC here for those interested: https://www.joegaebel.com/articles/principled-agentic-softwa...
Comment by JBorrow 3 hours ago
Comment by jakewins 5 hours ago
Comment by aray07 4 hours ago
Comment by seanmcdirmid 5 hours ago
1. one agent writes/updates code from the spec
2. one agent writes/updates tests from identified edge cases in the spec.
3. a QA agent runs the tests against the code. When a test fails, it examines the code and the test (the only agent that can see both) to determine blame, then gives feedback to the code and/or test writing agent on what it perceives the problem as so they can update their code.
(repeat 1 and/or 2 then 3 until all tests pass)
Since the code can never fix itself to directly pass the test, and the test can never fix itself to accept the behavior of the code, you have some independence. The failure case is that the tests simply never pass, not that the test writer and code writer agents both land on the same incorrect understanding of the spec (which is very improbable, "happens before the heat death of the universe" improbable; it is much more likely that the spec isn't well grounded, is ambiguous or contradictory, or that the problem is too big for the LLM to handle, so the tests simply never wind up passing).
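With the model calls stubbed out as plain functions (every name here is hypothetical), the loop above might be sketched as:

```python
# Sketch of the three-agent loop. coder/tester/judge stand in for real
# model calls; only the judge (QA agent) ever sees both code and tests.

def qa_loop(spec, coder, tester, run_tests, judge, max_rounds=10):
    code = coder(spec, feedback=None)    # 1. code written from the spec
    tests = tester(spec, feedback=None)  # 2. tests from edge cases in the spec
    for _ in range(max_rounds):
        failures = run_tests(code, tests)
        if not failures:
            return code, tests           # all tests pass: done
        # 3. QA examines both sides, assigns blame, routes feedback
        blame, report = judge(code, tests, failures)
        if blame == "code":
            code = coder(spec, feedback=report)
        else:
            tests = tester(spec, feedback=report)
    raise RuntimeError("tests never passed; spec may be ambiguous or task too big")
```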
Comment by jeremyjh 3 hours ago
Comment by seanmcdirmid 3 hours ago
Comment by wesselbindt 1 hour ago
Comment by gedy 1 hour ago
Comment by hinkley 48 minutes ago
Comment by RealityVoid 5 hours ago
Comment by SoftTalker 5 hours ago
> Most teams don't [write tests first] because thinking through what the code should do before writing it takes time they don't have.
It's astonishing to me how much our industry repeats the same mistakes over and over. This doesn't seem like what other engineering disciplines do. Or is this just me not knowing what it looks like behind the curtain of those fields?
Comment by yurishimo 5 hours ago
I like to think that people writing actual mission-critical software try their absolute best to get it right before shipping, and that the rest of our industry exists in a totally separate world where a bug in the code is actually just not that big of a deal. Yeah, it might be expensive to fix, but usually it can be reverted or patched with only an inconvenience to the user and to the business.
It’s like the fines that multinational companies pay when breaking the law. If it’s a cost of doing business, it’s baked into the price of the product.
You see this also in other industries. OSHA violations on a residential construction site? I bet you can find a dozen if you really care to look. But 99% of the time, there are no consequences big enough for people to care so nobody wears their PPE because it “slows them down” or “makes them less nimble”. Sound familiar?
Comment by gonzalohm 3 hours ago
With other engineering professions, all projects are like that. You cannot "deploy a bridge to production" to see what happens and fix it after a few people have died.
Comment by tibbar 5 hours ago
Comment by InsideOutSanta 4 hours ago
So now people just ignore broken tests.
> Claude, please implement this feature.
> Claude, please fix the tests.
The only thing we've gained from this is that we can brag about test coverage.
Comment by hinkley 24 minutes ago
These are the only tests I've witnessed people delete outright when the requirements change. With anything more complex than this, they'll worry that there's some secondary assertion being implied by a test, so they can't just delete it.
Which really is just experience telling them that the code smells they see in the tests are actually part of the test.
meanwhile:
it("only has one shipping address", ...
is demonstrably a dead test when the story is "allow users to have multiple shipping addresses", as is a test that makes sure balances can't go negative when we decide to allow a 5-day grace period on account balances. But if it's just one of six asserts in the same massive test, then people get nervous and start losing time.
Comment by ForHackernews 3 hours ago
Comment by hinkley 22 minutes ago
Comment by mattmanser 4 hours ago
But hey, we're just supposed to let the AIs run wild and rewrite everything on every change, so maybe that's a heretical view.
Comment by aray07 5 hours ago
Comment by jc-myths 18 minutes ago
I've also given it explicit rules like "never use placeholder images, always generate real assets" — and it just... ignores them sometimes. Not always. Sometimes. Which is worse, because you can't trust it but you also can't not use it.
The 80% it writes is fine. The problem is you still have to verify 100% of it.
Comment by itissid 2 hours ago
One example I have been experimenting with is using Learning Tests [1]. The idea is that when something new is introduced into the system, the agent must execute a high-value test to teach itself how to use this piece of code. Because these should be high leverage, i.e. they can really help anyone understand the code base better, they should be exceptionally well chosen for AIs to iterate on. But again, this is just the expert-human-judgement complexity shifted to identifying these for the AI to learn from. In code bases that add millions of LoC of new features in days, this would require careful work by the human.
[1] https://anthonysciamanna.com/2019/08/22/the-continuous-value...
Comment by jdlshore 5 hours ago
TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice. It’s “red green refactor repeat”, and each step is only a handful of lines of code.
TDD is not “write the tests, then write the code.” It’s “write the tests while writing the code, using the tests to help guide the process.”
Thank you for coming to my TED^H^H^H TDD talk.
Comment by wnevets 5 hours ago
I would like to emphasize that feedback includes being alerted to breaking something you previously had working, in a seemingly unrelated/impossible way.
Comment by hinkley 20 minutes ago
Comment by hinkley 21 minutes ago
Comment by TonyAlicea10 4 hours ago
But review fatigue and the resulting apathy are real. Devs should instead be informed whether incorrect code for the feature or process they are working on would be high-risk to the business. Lower-risk processes can be LLM-reviewed and merged; higher-risk must be human-reviewed.
If the business you're supporting can't tolerate much incorrectness (at least until discovered), then guess what: you aren't going to get much speed increase from LLMs. I've written about and given conference talks on this over the past year. Teams can improve this problem at the requirements level: https://tonyalicea.dev/blog/entropy-tolerance-ai/
Comment by skyberrys 50 minutes ago
Comment by silentsvn 3 hours ago
The architecture we landed on: ingest goes through a certainty scoring layer before storage. Contradictions get flagged rather than silently stacked. Memories that get recalled frequently get promoted; stale ones fade.
It's early but the difference in agent coherence over long sessions is noticeable. Happy to share more if anyone's going down this path.
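I don't know what the real implementation looks like, but a toy version of the described ingest path (all thresholds and the scoring interface invented; a real certainty score would come from a model call) might be:

```python
# Toy sketch: certainty gate on ingest, contradictions flagged instead of
# silently stacked, recall counts drive promotion, unused memories decay.

class MemoryStore:
    MIN_CERTAINTY = 0.3   # ingest gate (invented threshold)
    KEEP_CERTAINTY = 0.8  # survives decay even if never recalled (invented)

    def __init__(self):
        self.memories = {}  # key -> {"value", "certainty", "recalls"}
        self.flagged = []   # contradictions held for review, not stored

    def ingest(self, key, value, certainty):
        existing = self.memories.get(key)
        if existing and existing["value"] != value:
            self.flagged.append((key, value, certainty))  # flag the conflict
            return False
        if certainty < self.MIN_CERTAINTY:
            return False
        self.memories[key] = {"value": value, "certainty": certainty, "recalls": 0}
        return True

    def recall(self, key):
        m = self.memories.get(key)
        if m is None:
            return None
        m["recalls"] += 1   # recall frequency drives promotion
        return m["value"]

    def decay(self):
        # stale, never-recalled, low-certainty memories fade out
        self.memories = {k: m for k, m in self.memories.items()
                         if m["recalls"] > 0 or m["certainty"] >= self.KEEP_CERTAINTY}
```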
Comment by daxfohl 3 hours ago
What if instead, the goal of using agents was to increase quality while retaining velocity, rather than the current goal of increasing velocity while (trying to) retain quality? How can we make that world come to be? Because TBH that's the only agentic-oriented future that seems unlikely to end in disaster.
Comment by rglover 3 hours ago
Comment by afro88 5 hours ago
Something I'm starting to struggle with: now that agents can do longer and more complex tasks, how do you review all the code?
Last week I did about 4 weeks of work over 2 days, first with long-running agents working against plans and checklists, then smaller task clean-ups, bugfixes and refactors. But all this code needs to be reviewed by me and members of my team. How do we do this properly? It's like 20k lines of changes over 30-40 commits. There's no proper solution to this problem yet.
One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.
Comment by lbreakjai 2 hours ago
Same as before. Small PRs, accept that you won't ship a month of code in two days. Pair program with someone else so the review is just a formality.
The value of the review is _also_ for someone else to check if you have built the right thing, not just a thing the right way, which is exponentially harder as you add code.
Comment by eikenberry 1 hour ago
Comment by akshaysg 5 hours ago
Redoing the work as smaller PRs might help with readability, but then you get the opposite problem: it becomes hard to hold all the PRs in your head at once and keep track of the overall purpose of the change (at least for me).
IMO the real solution is figuring out which subset of changes actually needs human review and focusing attention there. And even then, not necessarily through diffs. For larger agent-generated changes, more useful review artifacts may be things like design decisions or risky areas that were changed.
Comment by kwanbix 5 hours ago
Comment by woah 1 hour ago
Comment by kg 5 hours ago
If you find a big problem in commit #20 of #40, you'll have to potentially redo the last 20 commits, which is a pain.
You seem to be gated on your review bandwidth and what you probably want to do is apply backpressure - stop generating new AI code if the code you previously generated hasn't gone through review yet, or limit yourself to say 3 PRs in review at any given time. Otherwise you're just wasting tokens on code that might get thrown out. After all, babysitting the agents is probably not 'free' for you either, even if it's easier than writing code by hand.
Of course if all this agent work is helping you identify problems and test out various designs, it's still valuable even if you end up not merging the code. But it sounds like that might not be the case?
Ideally you're still better off, you've reduced the amount of time being spent on the 'writing the PR' phase even if the 'reviewing the PR' phase is still slow.
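The backpressure policy is easy to make mechanical; a sketch (the limit of 3 in-flight PRs comes from the comment, everything else is invented):

```python
def dispatch_with_backpressure(backlog, in_review, limit=3):
    """Start new agent tasks only while the review queue has room.
    backlog: queued task descriptions (mutated in place);
    in_review: PRs currently awaiting human review."""
    started = []
    while backlog and len(in_review) + len(started) < limit:
        started.append(backlog.pop(0))  # dispatch up to the review budget
    return started
```

Anything left in the backlog simply waits until a reviewer clears a slot, so tokens aren't spent on code that would just sit unreviewed.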
Comment by aray07 5 hours ago
I think we will need some kind of automated verification so humans are only reviewing the "intent" of the change. I started building a Claude skill for this (https://github.com/opslane/verify).
Comment by zer00eyz 5 hours ago
Code review is a skill, as is reading code. You're going to quickly learn to master it.
> It's like 20k lines of changes over 30-40 commits.
You run it, in a debugger and step through every single line along your "happy paths". You're building a mental model of execution while you watch it work.
> One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.
Not going to be a time saver, but next time you want to take nibbles and bites, and then merge the branches in (with the history). The hard lesson here is around task decomposition, inline documentation (cross-referenced) and digestible chunks.
But if you get step debugging running and do the hard thing of getting through reading the code you will come out the other end of the (painful) process stronger and better resourced for the future.
Comment by afro88 3 hours ago
Comment by logicchains 5 hours ago
Get an LLM to generate a list of things to check based on those plans (and pad that out yourself with anything important to you that the LLM didn't add), then have the agents check the codebase file by file for those things and report any mismatches to you. As well as some general checks like "find anything that looks incorrect/fragile/very messy/too inefficient". If any issues come up, ask the agents to fix them, then continue repeating this process until no more significant issues are reported. You can do the same for unit tests, asking the agents to make sure there are tests covering all the important things.
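That fix-and-recheck cycle, with the agent calls stubbed out as plain functions (all names hypothetical):

```python
# Sketch of the checklist review loop: sweep every file against the
# checklist, fix what's reported, repeat until a clean pass (or give up).
# check_file and fix stand in for real agent calls.

def checklist_review(files, check_file, fix, max_passes=5):
    for _ in range(max_passes):
        issues = [i for f in files for i in check_file(f)]
        if not issues:
            return True      # no significant issues reported: done
        for issue in issues:
            fix(issue)       # ask the agent to address each finding
    return False             # still dirty after max_passes sweeps
```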
Comment by overfeed 3 hours ago
To everyone who plans on automating themselves out of a job by taking the human element out: this is the endgame that management wants, replacing your (expensive and non-tax-optimized) labor with scalable opex.
Comment by hinkley 46 minutes ago
Comment by OsrsNeedsf2P 5 hours ago
Edit: I even have a skill called release-test that does manual QA for every bug we've ever had reported. It takes about 10 hours to run but I execute it inside a VM overnight so I don't care.
Comment by 8note 30 minutes ago
I let it run overnight against a Windows app I was working on, and that got it from mostly not working to mostly working.
The loop was:
1. look at the code and specs to come up with tests
2. predict the result
3. try it
4. compare the prediction against the result
5. file a bug report, or call it a success
And then switch to bug fixing, and go back around again. Worked really well in Gemini CLI with the giant context window.
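A sketch of that predict-then-compare pass, with the agent steps stubbed as plain functions (all names invented):

```python
# One overnight testing pass: for each test case, predict the outcome
# before running, execute, compare, and file a bug on any mismatch.
# predict/execute/file_bug stand in for real agent actions.

def overnight_pass(cases, predict, execute, file_bug):
    bugs = 0
    for case in cases:
        expected = predict(case)                 # 2. predict the result
        actual = execute(case)                   # 3. try it
        if actual != expected:                   # 4. compare
            file_bug(case, expected, actual)     # 5. file a bug report
            bugs += 1
    return bugs  # zero means a clean pass; otherwise switch to bug fixing
```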
Comment by Havoc 5 hours ago
Even better though: external test suites. I recently made an S3 server, of which the LLM made quick work for the MVP. Then I found a Ceph S3 test suite that I could run against it, and oh boy. It ended up working really well as TDD though.
Comment by aray07 5 hours ago
Comment by didgeoridoo 5 hours ago
Comment by vidimitrov 4 hours ago
But there's a second problem underneath that one. Acceptance criteria are ephemeral. You write them before prompting, Playwright runs against them, and then where do they go? A Notion doc. A PR comment. Nowhere permanent. Next time an agent touches that feature, it's starting from zero again.
The commit that ships the feature should carry the criteria that verified it. Git already travels with the code. The reasoning behind it should too.
Comment by dwaltrip 3 hours ago
Comment by vidimitrov 3 hours ago
Comment by rrvsh 2 hours ago
Comment by svstoyanovv 1 hour ago
Comment by storus 5 hours ago
Comment by 8note 28 minutes ago
Maybe it still sends you to the same valley, but there are so many parameters and dimensions that I don't think it's very likely without also being correct.
Comment by throwatdem12311 16 minutes ago
Comment by osigurdson 3 hours ago
Comment by lateforwork 5 hours ago
You can have Gemini write the tests and Claude write the code. And have Gemini do review of Claude's implementation as well. I routinely have ChatGPT, Claude and Gemini review each other's code. And having AI write unit tests has not been a problem in my experience.
Comment by xandrius 2 hours ago
Comment by aray07 5 hours ago
Comment by digitalPhonix 5 hours ago
That’s really putting the cart before the horse. How do you get to “merging 50 PRs a week” before thinking “wait, does this do the right thing?”
Comment by aray07 5 hours ago
Comment by keyle 1 hour ago
Don't get me wrong, I use agentic coding often, when I feel it's going to type it faster than me (e.g. a lot of scaffolding and filler code).
Otherwise, what's the point?
I feel the whole industry is having its "Look ma! no hands!" moment.
Time to mature up, and stop acting like sailing is going where the seas take you.
Comment by BeetleB 5 hours ago
Comment by joegaebel 28 minutes ago
Comment by simlevesque 5 hours ago
[1] https://code.claude.com/docs/en/devcontainer
If you want to try it just ask Claude to set it up for your project and review it after.
Comment by comradesmith 5 hours ago
It will probably comply, and at least if it does change the tests you can always revert those files to where you committed them
Comment by tavavex 5 hours ago
You could probably make a system-level restriction so the software physically can't modify certain files, but I'm not sure how well that's going to fly if the program fails to edit it and there's no feedback of the failure.
Comment by mgrassotti 5 hours ago
With this approach you can enforce that Claude cannot access specific files. It's a guarantee and will always work, unlike a prompt or Claude.md, which is just a suggestion that can be forgotten or ignored.
This post has an example hook for blocking access to sensitive files:
https://aiorg.dev/blog/claude-code-hooks#:~:text=Protect%20s...
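Such a hook can be a small script; here's a rough sketch. The protected paths are placeholders, and the payload field names and exit-code convention are assumptions to verify against the Claude Code hooks docs:

```python
#!/usr/bin/env python3
"""Hypothetical PreToolUse hook: refuse edits to protected files.
Wired up in .claude/settings.json under hooks -> PreToolUse with a
matcher like "Edit|Write". Check the hooks docs for the exact payload
shape and blocking contract before relying on this."""
import json
import sys

PROTECTED = ("tests/", "CLAUDE.md")  # placeholder list of paths to freeze

def should_block(path, protected=PROTECTED):
    """True if the path falls under any protected prefix or directory."""
    return any(path.startswith(p) or f"/{p}" in path for p in protected)

def main():
    payload = json.load(sys.stdin)  # hook input arrives as JSON on stdin
    path = payload.get("tool_input", {}).get("file_path", "")
    if should_block(path):
        print(f"blocked: {path} is protected", file=sys.stderr)
        sys.exit(2)  # non-zero "block" exit tells Claude Code to refuse the edit

# Claude Code would invoke this per matching tool call:
# if __name__ == "__main__": main()
```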
Comment by BeetleB 5 hours ago
Comment by paxys 5 hours ago
Comment by pfortuny 5 hours ago
One could even make zero-knowledge test development this way.
Comment by aray07 5 hours ago
Comment by SatvikBeri 5 hours ago
Comment by BeetleB 5 hours ago
Comment by SatvikBeri 4 hours ago
Comment by kubb 5 hours ago
Comment by dboreham 5 hours ago
Comment by jaggederest 5 hours ago
Comment by throwyawayyyy 5 hours ago
Comment by lbreakjai 2 hours ago
Outage is the easy failure mode. I can work around a service that's up 80% of the time, but is 100% correct. A service that's up 100% of the time but is 80% correct is useless.
Comment by foundatron 4 hours ago
I've been building OctopusGarden (https://github.com/foundatron/octopusgarden), which is basically a dark software factory for autonomous code generation and validation. A lot of the techniques were inspired by StrongDM's production software factory (https://factory.strongdm.ai/). The autoissue.py script (https://github.com/foundatron/octopusgarden/blob/main/script...) does something really close to what others in this thread are describing with information barriers. It's a 6-phase pipeline (plan, review plan, implement, cold code review, fix findings, CI retry) where each phase only gets the context it actually needs. The code review phase sees only the diff. Not the issue, not the plan. Just the diff. That's not a prompt instruction, it's how the pipeline is wired. Complexity ratings from the review drive model selection too, so simple stuff stays on Sonnet and complex tasks get bumped to Opus.
On the test freezing discussion, OctopusGarden takes a different approach. Instead of locking test files, the system treats hand-written scenarios as a holdout set that the generating agent literally never sees. And rather than binary pass/fail (which is totally gameable, the specification gaming point elsewhere in this thread is spot on), an LLM judge scores satisfaction probabilistically, 0-100 per scenario step. The whole thing runs in an iterative loop: generate, build in Docker, execute, score, refine. When scores plateau there's a wonder/reflect recovery mechanism that diagnoses what's stuck and tries to break out of it.
The point about reviewing 20k lines of generated code is real. I don't have a perfect answer either, but the pipeline does diff truncation (caps at 100KB, picks the 10 largest changed files, truncates to 3k lines) and CI failures get up to 4 automated retry attempts that analyze the actual failure logs. At least overnight runs don't just accumulate broken PRs silently.
Also want to shout out Ouroboros (https://github.com/Q00/ouroboros), which comes at the problem from the opposite direction. Instead of better verification after generation, it uses Socratic questioning to score specification ambiguity before any code gets written. It literally won't let you proceed until ambiguity drops below a threshold. The core idea ("AI can build anything, the hard part is knowing what to build") pairs well with the verification-focused approaches everyone's discussing here. Spec refinement upstream, holdout validation downstream.
Comment by tayo42 5 hours ago
Comment by dzuc 5 hours ago
Comment by emirhan_demir 3 hours ago
Comment by monooso 4 hours ago
Comment by fragmede 5 hours ago
Comment by joegaebel 21 minutes ago
Comment by apsdsm 2 hours ago
If you don't trust the agent to do it right in the first place, why do you trust it to implement your tests properly? Nothing but turtles here.
Comment by dune-aspen 2 hours ago
Comment by broDogNRG 5 hours ago
Comment by LingoChat 5 hours ago
Comment by webpolis 3 hours ago
Comment by iam_circuit 1 hour ago
Comment by rob 1 hour ago
Comment by ekropotin 1 hour ago
The thing is, LLMs are probabilistic, and the probability of an incorrect final output grows with both the number of turns made and the number of agents run simultaneously. In practice it means you almost never end up with the desired result after a long loop.
Comment by zazibar 1 hour ago
Comment by frenchtoast8 46 minutes ago