Levels of Agentic Engineering
Posted by bombastic311 16 hours ago
Comments
Comment by captainkrtek 26 minutes ago
The idea that Claude/Cursor are the new high level programming language for us to work in introduces the problem that we're not actually committing code in this "natural language", we're committing the "compiled" output of our prompting. Which leaves us reviewing the "compiled code" without seeing the inputs (eg: the plan, prompt history, rules, etc.)
Comment by vidimitrov 5 hours ago
When the author talks about "codifying" lessons, the instinct for most people is to update the rules file. That works fine for conventions - naming patterns, library preferences, relatively stable stuff. But there's a different category of knowledge that rules files handle poorly: the why behind decisions. Not what approach was chosen, but what was rejected and why the tradeoff landed where it did.
"Never use GraphQL for this service" is a useful rule to have in CLAUDE.md. What's not there: that GraphQL was actually evaluated, got pretty far into prototyping, and was abandoned because the caching layer had been specifically tuned for REST response shapes, and the cost of changing that was higher than the benefit for the team's current scale. The agent follows the rule. It can't tell when the rule is no longer load-bearing.
The place where this reasoning fits most naturally is git history - decisions and rejections captured in commit messages, versioned alongside the code they apply to. Good engineers have always done this informally. The discipline to do it consistently enough that agents can actually retrieve and use it is what's missing, and structuring it for that purpose is genuinely underexplored territory.
At level 7, this matters more than people expect. Background agents running across sessions with no human-in-the-loop have nothing to draw on except whatever was written down. A stale rules file in that context doesn't just cause mistakes - it produces confident mistakes.
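One way to make commit-message reasoning retrievable by an agent is to put it in trailer-style lines at the end of the message. This is a minimal sketch only: the trailer keys (`Decision`, `Rejected`, `Reason`, `Revisit-When`) and the function name are hypothetical, not an established convention, and the example message is made up from the GraphQL scenario above.

```python
import re

# Hypothetical trailer keys; git itself does not define or enforce these.
DECISION_KEYS = {"Decision", "Rejected", "Reason", "Revisit-When"}

# A trailer line looks like "Key: value".
TRAILER_RE = re.compile(r"^([A-Za-z][A-Za-z-]*):\s*(.+)$")


def parse_decision_trailers(message: str) -> dict[str, list[str]]:
    """Collect decision trailers from the last paragraph of a commit message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return {}
    found: dict[str, list[str]] = {}
    for line in paragraphs[-1].splitlines():
        match = TRAILER_RE.match(line.strip())
        if match and match.group(1) in DECISION_KEYS:
            found.setdefault(match.group(1), []).append(match.group(2).strip())
    return found


# Illustrative commit message (invented for this sketch).
example = """Keep REST for the orders service

Prototyped GraphQL; the caching layer is tuned for REST response shapes.

Decision: keep REST
Rejected: GraphQL
Reason: cache rework costs more than it saves at current scale
Revisit-When: caching layer is replaced
"""

print(parse_decision_trailers(example))
```

An agent could run something like this over the history for a file and see not only the rule but the condition under which it stops being load-bearing.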
Comment by sd9 2 hours ago
"Where most [X] [Y]" is an up and coming LLM trope, which seems to have surfaced fairly recently. I have no idea why, considering most claims of that form are based on no data whatsoever.
Comment by solarkraft 1 hour ago
> The discipline to do it consistently enough that agents can actually retrieve and use it is what's missing, and structuring it for that purpose is genuinely underexplored territory
Because I somewhat agree that discipline may be missing, but I don’t believe it to be a groundbreaking revelation that it’s actually quite easy to tell the LLM to put key reasoning that you give it throughout the conversation into the commits and issue it works on.
Comment by svstoyanovv 1 hour ago
Is this fundamentally different from using a ghostwriter, an editor, or a highly advanced compiler? If I am doing the heavy lifting of context engineering and knowledge discovery, it feels restrictive to say I shouldn't utilize an LLM to structure the final output. Yet, the internet still largely views any AI-generated text as inherently "un-human" or low-effort.
Comment by sd9 1 hour ago
And because the cost of generating the comments is so low, there’s no longer an implicit stamp of approval from the author. It used to be the case that you could kind of engage with a comment in good faith, because you knew somebody had spent effort creating it so they must believe it’s worth time. Even on a semi-anonymous forum like HN, that used to be a reliable signal.
So a lot of the old heuristics just don’t work on LLM-generated comments, and in my experience 99% of them turn out to be worthless. So the new heuristic is to avoid them and point them out to help others avoid them.
I would much rather just read the prompt.
Comment by blackcatsec 1 hour ago
This has severe ramifications for internet communications in general on forums like HN and others, where it seems LLM-written comments are sneaking in pretty much everywhere.
It's also very, very dangerous :/ Because the structure of the writing falsely implies authority and trust where there shouldn't be, or where it's not applicable.
Comment by smallnix 2 hours ago
Comment by vidimitrov 2 hours ago
Comment by mzg 6 hours ago
If software engineering is enough of a solved problem that you can delegate it entirely to LLM agents, what part of it remains context-specific enough that it can’t be better solved by a general-purpose software factory product? In other words, if you’re a company that is using LLMs to develop non-AI software, and you’ve built a sufficient factory to generate that software, why don’t you start selling the factory instead of whatever you were selling before? It has a much higher TAM (all of software)
Comment by hakanderyal 6 hours ago
Comment by 2001zhaozhao 5 hours ago
If you could get a dark factory working when others don't have one, you can make much more money using it than however much you can make selling it
Comment by tkiolp4 2 hours ago
Comment by antonvs 3 hours ago
So far, we haven’t seen much to suggest that LLMs can (yet) replace sales and most of the related functions.
Comment by whattheheckheck 4 hours ago
Comment by glhast 5 hours ago
Feels like K8s cult, overly focused on the cleverness of _how_ something is built versus _what_ is being built.
Comment by pydry 6 hours ago
Comment by dist-epoch 6 hours ago
And when they are fully dark factories, yes, a LOT of software companies will just disappear: they will be dis-intermediated by Codex/Claude Code.
Comment by david_iqlabs 20 minutes ago
Comment by jjmarr 6 hours ago
It's very powerful and agents can create dynamic microbenchmarks and evaluate what data structure to use for optimal performance, among other things.
I also have validation layers that trim hallucinations with handwritten linters.
I'd love to find people to network with. Right now this is a side project at work on top of writing test coverage for a factory. I don't have anyone to talk about this stuff with so it's sad when I see blog posts talking about "hype".
Comment by moosehater 3 hours ago
Do you ever take the time to validate what one of the agents produces by going to the docs? Or is all debugging/changing of the code done via LLMs/agents?
I'm more like level 2 right now and genuinely curious if you feel like learning continues for you (besides with agentic orchestration, etc.) And if not, whether or not you think that matters.
Comment by jjmarr 2 hours ago
> Do you ever take the time to validate what one of the agents produces by going to the docs? Or is all debugging/changing of the code done via LLMs/agents?
I divide my work into vibecoding PoC and review. Only once I have something working do I review the code. And I do so through intense interrogation while referencing the docs.
> I'm more like level 2 right now and genuinely curious if you feel like learning continues for you (besides with agentic orchestration, etc.)
Level 8 only works in production for a defined process where you don't need oversight and the final output is easy to trust.
For example, I made a code review tool that chunks a PR and assigns rule/violation combos to agents. This got a 20% reduction in time to merge and catches 10x as many issues as any other agent because it can pull context. And the output is easy to incorporate since I have a manager agent summarize everything.
Likewise, I'm working on an automatic performance tool right now that chunks code, assigns agents to make microbenchmarks, and tries to find optimization points. The end result should be easy to verify since the final suggestion would be "replace this data structure with another, here's a microbenchmark proving so".
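To make the "microbenchmark proving so" idea concrete, here is a minimal sketch of the kind of evidence such a suggestion could carry. It is not the tool described above; the function names and the choice of list versus set membership are assumptions picked for illustration.

```python
import timeit


def compare_membership(n: int = 5000, repeats: int = 200) -> dict[str, float]:
    """Time a worst-case membership test against a list and a set of n ints."""
    as_list = list(range(n))
    as_set = set(as_list)
    target = n - 1  # last element: the list has to scan everything to find it
    return {
        "list": timeit.timeit(lambda: target in as_list, number=repeats),
        "set": timeit.timeit(lambda: target in as_set, number=repeats),
    }


def suggest(timings: dict[str, float]) -> str:
    """Name the structure with the lowest measured time."""
    winner = min(timings, key=timings.get)
    return f"use {winner}"


if __name__ == "__main__":
    timings = compare_membership()
    print(timings, suggest(timings))
```

The point of the shape is that a reviewer can rerun the benchmark rather than trust the agent's claim.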
Comment by moosehater 2 hours ago
Also would be interested in an example of "validation layers that trim hallucinations with handwritten linters" but understand if that's not something you can share. Either way, thanks for responding!
Comment by jjmarr 1 hour ago
For code review, AI doesn't want to output well-formed JSON and oftentimes doesn't leave inline suggestions cleanly. So there's a step where the AI must call a script that validates the JSON and checks if applying the suggestion results in valid code, then fixes the code review comments until they do.
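A validation step like that could look roughly like the sketch below. Everything specific here is an assumption: the comment schema (`line`, `suggestion`), the function name, and the choice of Python's `ast.parse` as the "is it valid code" check, which only works if the reviewed file is Python.

```python
import ast
import json


def validate_review(raw: str, source: str) -> list[str]:
    """Return a list of problems; an empty list means the review is usable."""
    try:
        comments = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(comments, list):
        return ["top level must be a list of comments"]

    problems = []
    lines = source.splitlines()
    for i, comment in enumerate(comments):
        if not isinstance(comment, dict):
            problems.append(f"comment {i}: must be an object")
            continue
        line_no = comment.get("line")
        suggestion = comment.get("suggestion")
        if not isinstance(line_no, int) or not isinstance(suggestion, str):
            problems.append(f"comment {i}: needs integer 'line' and string 'suggestion'")
            continue
        if not 1 <= line_no <= len(lines):
            problems.append(f"comment {i}: line {line_no} out of range")
            continue
        # Apply the suggestion as a replacement for that one line, then parse.
        patched = lines[: line_no - 1] + suggestion.splitlines() + lines[line_no:]
        try:
            ast.parse("\n".join(patched))
        except SyntaxError:
            problems.append(f"comment {i}: suggestion does not parse")
    return problems
```

The returned problem list is what would be fed back to the agent so it can fix its comments and call the script again.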
Comment by jessmartin 5 hours ago
Would be happy to swap war stories.
<myhnusername>@gmail.com
Comment by whattheheckheck 4 hours ago
Comment by nimasadri11 6 hours ago
> Look at your app, describe a sequence of changes out loud, and watch them happen in front of you.
The problem a lot of the time is that either you don't know what you want, or you can't communicate it (and usually you can't communicate it properly because you don't know exactly what you want). I think this is going to be the bottleneck very soon (for some people, it already is). I am curious what your thoughts are on this. Where do you see it going, and how do you think we can prepare for and address it? Or do you not see it as an issue?
Comment by smallnix 2 hours ago
Comment by holtkam2 5 hours ago
Level 12: agent superintelligence - single entity doing everything
Level 13: agent superagent, agenting agency agentically, in a loop, recursively, mega agent, agentic agent agent agency super AGI agent
Level 14: A G E N T
Comment by zenoprax 4 hours ago
Comment by clickety_clack 4 hours ago
Comment by dweinus 1 hour ago
Comment by stale2002 1 hour ago
Comment by ftkftk 6 hours ago
Comment by tkiolp4 2 hours ago
Like imagine if you could go back in time and servlets and applets are the big new thing. You wouldn’t like to spend your time learning about those technologies, but your boss would be constantly telling you that it is the future. So boring.
Comment by hansonkd 1 hour ago
Comment by kantselovich 1 hour ago
Also, I’m struggling to take it to the multi-agent level, mostly because things depend on each other in the project: most changes cut across UI, protocol and the server side, so it's not clear how agents would merge incompatible versions.
Verification is a tricky part as well. All tests could be passing, including end-to-end integration and visual tests, but my verification still catches things like data not being persisted or crypto signatures not being verified.
Comment by eikenberry 6 hours ago
Comment by dist-epoch 6 hours ago
Comment by eikenberry 5 hours ago
Comment by philipp-gayret 2 hours ago
I've experimented with agent teams. However, the current implementation (in Claude Code) burns tokens. I used 1 prompt to spin up a team of 9+ agents: Claude Code used up about 1M output tokens. Granted, it was a long, very long-horizon task (it kept itself busy for almost an hour uninterrupted). But 1M+ output tokens is excessive. What I also find is that for parallel agents, the UI is not good enough yet when you run it in the foreground. My permission management is set up so that I almost never get interrupted, but that took a lot of investment to get there. Most users will likely run agent teams in an unsafe fashion. From my point of view the devex for agent teams does not really exist yet.
Comment by Aperocky 3 hours ago
That's a smell for where the author and maybe even the industry is.
Agents don't have any purpose or drive like humans do; they are probabilistic machines, so eventually they are limited by the finite amount of information they carry. Maybe that's what's blocking level 8, or blocking it from working like a large human organization.
Comment by efsavage 6 hours ago
I think eventually 4-8 will be collapsed behind a more capable layer that can handle this stuff on its own, maybe I tinker with MCP settings and granular control to minmax the process, but for the most part I shouldn't have to worry about it any more than I worry about how many threads my compiler is using.
Comment by taude 1 hour ago
Comment by lherron 6 hours ago
Comment by ramesh31 5 hours ago
I thought level 8 was a joke until Claude Code agent teams. Now I can't even imagine being limited to working with a single agent. We will be coordinating teams of hundreds by year's end.
Comment by bigwheels 3 hours ago
https://factory.strongdm.ai/techniques
Techniques covered in-depth + Attractor open source implementations:
https://factory.strongdm.ai/products/attractor#community
https://github.com/search?q=strongdm+attractor&type=reposito...
https://github.com/strongdm/attractor/forks
I'm continuing to study and refine my approach to leverage all this.
Comment by CuriouslyC 3 hours ago
Spec driven development can reduce the amount of re-implementation that is required due to requirements errors, but we need faster validation cycles. I wrote a rant about this topic: https://sibylline.dev/articles/2026-01-27-stop-orchestrating...
Comment by politelemon 6 hours ago
Comment by Arainach 3 hours ago
Until you build an AI oncaller to handle customer issues in the middle of the night (and, depending on your product, an AI who can be fired if customer data is corrupted/lost), no team should be willing to remove the "human reviews code" step.
For a real product with real users, stability is vastly more important than individual IC velocity. Stability is what enables TEAM velocity and user trust.
Comment by osigurdson 2 hours ago
Comment by smy20011 7 hours ago
Comment by jakejmnz 1 hour ago
Comment by sjkoelle 7 hours ago
Comment by jackby03 6 hours ago
Comment by ramoz 3 hours ago
I spend a great deal of my time planning and assessing/reviewing through various mechanisms. I think I do codify in ways when I create a skill for any repeated assessment or planning task.
> To be clear, planning as a general practice isn't going away. It's just changing shape. For newer practitioners, plan mode remains the right entry point (as described in Levels 1 and 2). But for complex features at Level 7, "planning" looks less like writing a step-by-step outline and more like exploration: probing the codebase, prototyping options in worktrees, mapping the solution space. And increasingly, background agents are doing that exploration for you.
I mean, it's worth noting that a lot of plan modes are shaped to do the Socratic discovery before creating plans. For any user level. Advanced users probably put a great deal of effort (or thought) into guiding that process themselves.
> ralph loops (later on)
Ralph loops have been nothing but a dramatic mess for me, honestly. They disrupt the assessment process where humans are needed. Otherwise, don't expect them to go craft an extensive PRD without massive issues that are hard to review.
- It would seem that this is a harness problem in terms of how they keep an agent working and focused on specific tasks (in relation to model capability), but maybe not something a user should initiate on their own.
Comment by C0ldSmi1e 5 hours ago
Comment by dolebirchwood 5 hours ago
Maybe it's just me, but I don't see the appeal in verbal dictation, especially where complexity is involved. I want to think through issues deliberately, carefully, and slowly to ensure I'm not glossing over subtle nuances. I don't find speaking to be conducive to that.
For me, the process of writing (and rewriting) gives me the time, space, and structure to more precisely articulate what I want with a more heightened degree of specificity. Being able to type at 80+ wpm probably helps as well.
Comment by wild_egg 2 hours ago
Stream of consciousness typing for me is still slower and causes me to buffer and filter more and deliberately crafting a perfect prompt is far slower still.
LLMs are great at extracting the essence of unstructured inputs and voice lets me take best advantage of that.
Voice output, on the other hand, is completely useless unless perhaps it can play at 4x speed. But I need to be able to skim LLM output quickly and revisit important points repeatedly. Can't see why I'd ever want to serialize and slow that down.
Comment by ramesh31 5 hours ago
This is increasingly untrue with Opus 4.6. Claude Max gives you enough tokens to run ~5-10 agents continuously, and I'm doing all of my work with agent teams now. Token usage is up 10x or more, but the results are infinitely better and faster. Multi-agent team orchestration will be to 2026 what agents were to 2025. Much of the OP article feels 3-6 months behind the times.
Comment by measurablefunc 6 hours ago