Don't trust large context windows
Posted by computersuck 3 days ago
Comments
Comment by dofm 3 days ago
But I am struggling to put into words how alarming I find the comments on threads like this — all sorts of good-natured anecdotes about how XYZ works for them that are more like the suggestions in pet care or cookery threads on Facebook.
(Or worse still, like any Facebook 3D printing group: anyone who prints but wants to understand what is actually going on will know what I mean, I think)
Any shared sense of rigour is just completely torpedoed by the LLM world, particularly the cloud LLM world it seems, and we are reduced to cargo culting. Nobody is any more right or wrong than anyone else.
Have you tried cleaning your context with dawn dish soap, letting it dry and then adding a layer of glue stick?
--
ETA: I don't want to sound so mean about people who try to help, here or in facebook groups. I guess I just find these threads so different to threads on more or less any other topic, where someone's suggestion can be debated or refined by other commenters and then someone will explain a thing about how bash history selections work that will change your entire life. With these threads they devolve to "isn't it weird that threatening it works?"
Comment by 01100011 3 days ago
But damn, agents are amazing and I'm enjoying being a "thought process designer". I'm not going back. Even if AI development stops today my career will never be the same.
Comment by nk91 3 days ago
I’m working on a tiny agent harness at home to learn and the process of taking human speech and turning it into agent tool calls that output something generally deterministic depending on how the tool is defined is so interesting.
One of the big takeaways is you really only have to rely on the non-determinism<->determinism translation layer once when you switch between the two domains. You can obviously rely on it more if you want, and that’s probably faster because determinism is hard, but you don’t need to do that.
Comment by port11 3 days ago
Comment by nk91 3 days ago
For voice related things you have a lot of turn of phrase scenarios that can make no sense unless you know. Phrasing like “Put Larry on the horn” makes sense to someone familiar with old lingo for phone calls. Someone else might think of a war horn, someone else a music class.
All of those are wildly different situations. It’s not hard to see how one oops between two non deterministic things can quickly go off the rails.
The fact we can get away with so much non-determinism->non-determinism recursion is frankly amazing when you realize how easy it is to imprecisely describe what it is you’re thinking.
Comment by port11 3 days ago
npx tsc
bash tsc
bash npx tsc
npm run build
…
I’m not an expert at all on the subject matter, but is it impossible to train a model that calls tools in a (quasi-)deterministic way?
Comment by newterminus 2 days ago
Comment by coldtea 2 days ago
Would just make the answer to the same exact prompt X repeatedly the same.
It wouldn't change the fact that prompt X', functionally indistinguishable from X aside from small phrasing changes, can give a totally different looking answer.
Comment by newterminus 1 day ago
Comment by ACCount37 3 days ago
Can't help but feel like a lot of people who are deep in IT made it there because they hated working with humans.
Comment by benashford 3 days ago
There was always some of this in the tech world, long before LLMs came along.
I've sat in so many meetings when decisions were made based on "that's what _slightly more prestigious company_ does" rather than objective measurable criteria. (And the evidence that the thing in question wasn't universally followed by _slightly more prestigious company_ carried surprisingly little weight).
Comment by dofm 3 days ago
But people are now individually acting this way on their desks on an hour by hour basis. LLMs make cargo-culting inevitable because they are inscrutable and opaque.
There is always this sense in the LLM-proponent world that LLMs are at any moment as bad as they are ever going to be; line goes up.
But it seems clear that the gap between perceived and measurable productivity is still likely spent in poking entrails with a stick.
We are so used to probabilistic tools that have significant setup time before they become valuable and save us loads of time that we're at risk of repeatedly writing off that setup time without seeing the rewards, believing that one day it will actually work out that way.
(Which is most recognisable from the early JS frontend frameworks era.)
Meantime here we have an article that shows that a thing (longer context windows) that people thought would functionally solve a problem so we would get the value from all that setup does not, in fact, very meaningfully kick it down the road, and the comments are still full of entrails-and-stick work.
Comment by CommieBobDole 3 days ago
Heck, even the 'benchmarks' are mostly somebody's attempt to crystallize their vibes with varying amounts of success.
Comment by esperent 3 days ago
Have you ever tried doing evals on moderately complex but bounded tasks?
I spent some time doing it when testing these "token reducing" tools like Headroom, RTK etc. as well as customizing my Pi tools. What I found interesting was that despite LLMs being non-deterministic, for a given toolset and prompt the results were highly consistent for a given eval, across multiple models (I tested at the time using GPT 5.4 mini, 5.5, 5.3 codex and Gemini 3 flash), initially running sets of 5 evals on each task but, once I realized how consistent the results were, dropping to sets of 3.
Aside: in my tests, RTK and Headroom made the overall context use higher for roughly equivalent results. The context use for those specific toolcalls went down but the number of model turns and overall context use went up.
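A minimal sketch of the bookkeeping behind repeated evals like these, assuming each run of a task reduces to a pass/fail result. The function name `summarize_runs`, the model names and the task name are hypothetical and not taken from any of the tools mentioned.

```python
from collections import defaultdict

def summarize_runs(results):
    """results: iterable of (model, task, passed) tuples, one per eval run.

    Returns {(model, task): (pass_rate, consistent)}, where `consistent`
    is True when every run in the set produced the same outcome.
    """
    grouped = defaultdict(list)
    for model, task, passed in results:
        grouped[(model, task)].append(bool(passed))
    summary = {}
    for key, outcomes in grouped.items():
        pass_rate = sum(outcomes) / len(outcomes)
        summary[key] = (pass_rate, len(set(outcomes)) == 1)
    return summary

# Hypothetical data: sets of 3 runs per (model, task) pair.
runs = [
    ("model-a", "rename-symbol", True),
    ("model-a", "rename-symbol", True),
    ("model-a", "rename-symbol", True),
    ("model-b", "rename-symbol", True),
    ("model-b", "rename-symbol", False),
    ("model-b", "rename-symbol", True),
]
summary = summarize_runs(runs)
```

If most sets come back consistent, shrinking the set size (5 down to 3, as described) loses little information; if they do not, the pass rate is the number to compare.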
Comment by dofm 3 days ago
Comment by topspin 3 days ago
Consider that this shared sense of rigour you have in mind is illusory, and LLMs and their context struggles are simply revealing this. I see precious little rigour in any of the 'tech' world I've lived in for decades. The tools proliferate, paradigms emerge and die and reemerge, and whatever stick you consider using to measure any of it has competitors with different units. Past the physics of power and signaling, and the prevailing cost of a silicon wafer, we are almost all, relative to a small number of much older disciplines, muddlers of various degrees of skill.
I've found dealing with context limits relatively easy: specify and confine. LLMs need clear specifications and strong guidance to produce good work.
But that's just my current muddling take on the practice. Perhaps, 90 days from now, even this burden will be gone, and a simple prompt will generate world class operating systems, programming languages and a formal basis in mathematics for both.
Comment by coldtea 2 days ago
Just because it was nowhere near perfect, and a lot of "religion" and hand waving was involved, doesn't mean it was illusory.
It's enough that it existed quite more than it does in the LLM era for what the parent said to make sense.
Comment by iugtmkbdfil834 3 days ago
Comment by api 3 days ago
A lot of this is a result of systems having long ago exceeded the complexity threshold of things people can hold in their heads. There are too many layers, subsystems, languages, APIs, all glued together. Attempts at radical simplification fail because each of those layers and subsystems has features or behaviors someone needs, and a lot of it isn’t even documented.
AI takes this to the extreme. I’ve already learned that certain models have “personalities.” Some are more likely to go with you on magical journeys into hallucination while others are more critical. Some are better at detail while others seem better at abstraction but fall over on detail. Some are better instruction followers. All their quirks are complex and the systems themselves are impossible to understand.
Computer systems are becoming organic, biological.
Comment by dofm 3 days ago
But those technologies are layers, and there are reliable things that sometimes bubble across the boundaries — type hints, better code patterns to trigger compiler optimisation, interesting tricks with key column selection — and someone with expertise from that layer below can explain why, and their advice will always work in situations that are sufficiently similar.
You are right about AI personalities. Obvious even with the open weights models. Gemma and Qwen write code and documentation like people from different cultures. Because I guess they are a bit like that.
Comment by ACCount37 3 days ago
All "personality traits" within an LLM are entangled. So when you mid-train or post-train on ESL texts, or run RLHF using people from a given culture, you risk bleeding some of the related cultural traits into the LLM itself. A lot of the resulting "personality" is downstream from different AI teams picking different datasets and training signals.
RLAF is more of a "funhouse mirror" distortion - it takes existing traits and twists them, sometimes amplifies them to comical extremes. Weird can become weirder. A verbal tic can become a style signature. That's part of the reason why AI writing has changed so dramatically from the GPT-4 era to now.
Comment by tonyarkles 3 days ago
> Have you tried cleaning your context with dawn dish soap
I don’t do the glue stick thing at all because I don’t need to, but Dawn really seems to do a good job at getting my Bambu build plate working again. I didn’t seek it out specifically, I already had some for doing dishes. IPA hadn’t worked so I tried Dawn and it has gotten me back having prints stick multiple times now. Not quite up to N=30 yet.
Comment by darkwater 3 days ago
Comment by skybrian 3 days ago
Comment by xnx 2 days ago
Some of the explanation is that these systems are hard to understand and changing fast. Another part of the explanation is that AI has removed so much of the need for the expertise that people have that they contrive complications so they can feel they are still doing something.
Comment by conditionnumber 3 days ago
Comment by tgv 3 days ago
Comment by data-ottawa 3 days ago
The LLM providers fine tune the models with some kind of information retrieval tasks, but to do so you must provide some non relevant context to bootstrap the session for the long context tasks.
It would be very easy to do this in ways that train the sequence model to treat early history as noisier than it really is, or to weaken its relationship to late context.
You’re also probably stacking more contexts together with long contexts (start with task A, then detour to solving B and C before you can complete A).
Training sequence lengths probably decay super linearly with length creating far fewer samples at long length during training.
Comment by krackers 1 day ago
The deepseek v4 paper talks about one variant of this (related to failures) and how they mitigate it.
>During preemption, we pause the inference engine and save the KV cache of unfinished requests. Upon resumption, we use the persisted WALs and saved KV cache to continue decoding. Even when a fatal hardware error occurs, we can re-run the prefill phase using the persisted tokens in WAL to reconstruct the KV cache.
>Importantly, it is mathematically incorrect to regenerate unfinished requests from scratch, as this introduces length bias. Because shorter responses are more likely to survive interruption, regenerating from scratch makes the model more prone to producing shorter sequences whenever an interruption occurs. If the inference stack is batch-invariant and deterministic, this correctness issue could also be addressed by regenerating with a consistent seed for the pseudorandom number generator used in the sampler. However, this approach still incurs the extra cost of re-running the decoding phase, making it far less efficient than our token-granular WAL method.
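The length bias in the quoted passage can be illustrated with a toy simulation. This is only a sketch under stated assumptions: response lengths are drawn uniformly from 1 to 200 tokens and every decoding step has an independent 1% chance of being interrupted. Both numbers and all function names are made up for illustration and do not come from the paper.

```python
import random

def sample_length(rng):
    # Hypothetical response-length distribution: uniform over 1..200 tokens.
    return rng.randint(1, 200)

def finish_with_restart(rng, p_interrupt):
    """Regenerate from scratch after every interruption."""
    while True:
        length = sample_length(rng)
        # A response survives only if none of its `length` steps is interrupted.
        if all(rng.random() >= p_interrupt for _ in range(length)):
            return length

def finish_with_resume(rng, p_interrupt):
    """Resume the same response after an interruption (persisted tokens)."""
    # Interruptions cost time but never change which response gets finished.
    return sample_length(rng)

def mean_length(policy, n=5000, p_interrupt=0.01, seed=0):
    rng = random.Random(seed)
    return sum(policy(rng, p_interrupt) for _ in range(n)) / n

restart_mean = mean_length(finish_with_restart)
resume_mean = mean_length(finish_with_resume)
```

Under these assumptions `restart_mean` comes out clearly below `resume_mean`: long responses are more likely to be hit by an interruption and replaced, so the finished set skews short even though the sampler itself never changed.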
Comment by claw-el 3 days ago
Comment by casey2 2 days ago
Comment by cactusplant7374 3 days ago
It will always be this way going forward. Everyone thinks differently about problems. In the past we had experts and only they could do the work at a high level. But now we have many people that are cranking out expert level solutions without much knowledge. Worrying about the minutia is a dying trend.
Edit: I see I touched a nerve. But that is how it is now. You can't fight reality.
Comment by foobarchu 2 days ago
Defeatism doesn't do anything positive for the world. You're trying to convince the people pushing for a marginally better world that they should give up because it'll never happen. That is not a useful contribution.
Comment by vitally3643 3 days ago
Because that's what OP is talking about. Superstition presented as factual advice instead of the technically rigorous and scientific fact that already exists.
You're being downvoted because you don't understand this fact, or indeed understand what you're saying at all.
I'll spell it out for you: technically and scientifically rigorous facts do actually exist, even in regards to LLMs. We can, in fact, obtain scientific and objective facts about how LLMs perform. It can be rigorously proven that certain context habits affect certain tasks positively or negatively. Your argument is that none of this matters more than superstition. And you're surprised that arguing to a room full of engineers and scientists that science is dead and superstition is the one true way forward gives you negative response.
Comment by cactusplant7374 3 days ago
> I'll spell it out for you
You are a rude and crude individual. I am not interested in discussing anything further with you.
Comment by Revisional_Sin 3 days ago
Comment by dofm 3 days ago
I usually don't have to worry about compiler optimisations because compiler experts do that; sometimes they appear in a thread about code and say "compiler guy here — if you write your code like this the compiler can optimise it".
And that person will be provably right (or wrong), in that context. And it'll be the same each time you run the test!
I just… ehh. You make a good point and I worry you are not wrong. It's all so different.
I like my 3D printing analogy much more than I wish I did.
Comment by orbital-decay 2 days ago
That said, there's no cargo cult in blindly using heuristics for certain fundamental LLM phenomena that have tons of good studies backing them (e.g. have no extra distractors, group and delimit pieces of the context, etc). If you want quantitative rigor, perform correct evals on your specific task and model.
Comment by jayd16 3 days ago
Comment by dindunuf 3 days ago
and second of all...
>Any shared sense of rigour is just completely torpedoed by the LLM world, particularly the cloud LLM world it seems, and we are reduced to cargo culting. Nobody is any more right or wrong than anyone else.
what "sense of rigour"? it's way too soon to put those rose-tinted glasses on.
Comment by JCTheDenthog 3 days ago
I don't think OP is claiming that prior to LLM coding everything in the software development world was super rigorous (I assume that's effectively what you mean with the "rose-tinted glasses" comment). But rigor was actually possible and in a deterministic way too, which is fundamentally impossible with LLMs. You can build all kinds of guardrails and processes around LLMs that make it somewhat approach rigor again, but it's still fundamentally based on a bunch of statistical probabilities instead of deterministic, repeatable results.
All of the methods I see to mitigate the fundamental and inherent issues of LLMs seem roughly equivalent to the kind of crap you see in astrology groups or palm reading etc. You need Venus and Mercury to be in alignment while Mars is retrograde if you want to be able to get the right results from your token predictor.
Comment by dofm 3 days ago
Comment by ninjalanternshk 3 days ago
Any software engineering practice that had enough review and feedback to work with humans should work more or less the same with AI coders.
It’s when someone tries replacing an entire team or an entire process with a single prompt that they get in trouble.
Comment by JCTheDenthog 3 days ago
Sure, but LLMs are non-deterministic in ways that no sane human ever would be. See the "Is it better to drive or walk to the carwash" scenario from a few months ago as one of many, many examples. Or a personal example I encountered just a week ago: I asked Claude (Opus 4.8 in case any of the "you aren't using the latest model that totally fixes that issue" types are interested) to convert a bunch of DB calls that currently use raw ADO.NET calls to use Dapper instead.
The projects in this repo were on .NET 4.8.1 and were still using the older format for the .csproj file instead of the newer (and far better) "SDK-style" format that Microsoft introduced a few years ago. It tried to use the dotnet CLI to add references to Dapper, even though the older format of .csproj doesn't work with that. The dotnet CLI returned errors about trying to add the package references for Dapper, which Claude completely ignored while continuing to try and convert the ADO.NET calls to Dapper. And at the end it tried building the project, which of course failed, and then it confidently informed me that the conversion had been completed successfully and that the build completed successfully and all tests were passing successfully, even though the output from the build it had done immediately prior clearly told the LLM otherwise.
A real human, despite being non-deterministic, would have caught the issue at multiple stages. They would have seen the error when trying to add the reference. If they ignored that then they would have seen the red squiggly lines all over the (deterministic) IDE telling them there was something wrong, along with autocomplete for Dapper calls not working. And if they continued to ignore those and managed to keep going anyways, they would have clearly seen that the build failed, with tons of errors specifically about references to Dapper failing to resolve. An LLM keeps going on its merry way in ways that effectively 0 humans would.
Comment by jayd16 3 days ago
They also don't learn, so they never get less unpredictable. You can't give the senior robot the production keys and expect it won't delete prod.
Comment by bob1029 3 days ago
I can keep the same high level conversation going for an entire day over a million LOC+ codebase without ever hitting meaningful token limits. No compaction or summarization tricks needed. I can burn 50 million tokens in recursive calls and still not touch 100k tokens in my root conversation thread.
There is some rework needed to "bootstrap" the agent each time it has to descend back into Narnia, but this is still far more efficient than carrying around one big flat context that tries to cover everything all the time.
Recursion is very effective at controlling token use, but it can only go so far. I've not observed any uplift for recursive depth beyond 1. I have seen the agent attempt it a few times, but the practical performance is simply not there. External symbolic recursion does not appear to be something the frontier models have been trained for. They are fantastic at emulating recursion in context, but we don't want that if we are trying to achieve a reduction in token use.
Comment by KronisLV 3 days ago
Comment by wrs 3 days ago
In addition, you can co-author a plan for a biggish chunk of work, divided into stages, have it launch a subagent for phase 1 and check its work, then ESC-ESC to go back to just after you wrote the plan and have it do phase 2. Repeat until done. This keeps the overall goal in the main context for the review, but clears out previous reviews. Kind of like a workflow but with more control.
Comment by KronisLV 3 days ago
It's happened multiple times where I give it a task before going to sleep and when I come back it's stuck halfway through on some stupid summary, where my only response needed is basically "Yeah, continue." even though I use Opus. Using workflows for the higher level planning helped with that and those annoying pauses no longer happen, perhaps due to the main conversation being much shorter and apparently not enough for the weights to nudge towards user confirmation.
Comment by gbro3n 3 days ago
Comment by verdverm 3 days ago
I imagine most harnesses should have a way to do this today; if they don't, get a new one. OpenCode, for example, is highly customizable, and Claude and VS Code both support a ton as well, including custom agents (though it's unclear if you can create custom top-level agents in claude-code)
https://opencode.ai/docs/agents/
https://code.claude.com/docs/en/sub-agents
https://code.visualstudio.com/docs/agent-customization/custo...
Comment by gbro3n 3 days ago
Comment by verdverm 3 days ago
the main agent would be very different, basically an orchestrator, and you are "loop engineering" it, and turning off all the things for this main agent besides being able to run subagents
for opencode:
https://opencode.ai/docs/agents/#permissions (what tools, mcp, etc...)
https://opencode.ai/docs/agents/#task-permissions (what subagents it can call)
https://opencode.ai/docs/agents/#additional (thinking effort)
Comment by bob1029 3 days ago
Comment by el_benhameen 3 days ago
Comment by hadlock 3 days ago
Comment by nijave 3 days ago
I see it pretty frequently in troubleshooting and data analysis flows where it will dump the data collection and aggregation into a sub agent then pull out a summarized result.
I'll do something similar where I have the main agent maintain context in a design doc/markdown file and update as it goes along. Then I can clear/restart/handoff at will
Comment by port11 3 days ago
Comment by Muromec 3 days ago
In a way it's a generalization of the spec-driven approach, but in addition to the formal spec the carryover buffer lives in the memory.
Comment by stogot 3 days ago
Comment by password4321 3 days ago
Comment by Jgrubb 3 days ago
Comment by ajmurmann 3 days ago
Comment by knollimar 2 days ago
Comment by ajmurmann 2 days ago
Comment by ViewTrick1002 3 days ago
Comment by loeg 3 days ago
AI vendors still need to compete with each other both in terms of token cost and competency. An agent that is costly and less effective by wasting tokens is less competitive.
Comment by kevincox 1 day ago
Makes me wonder if it would be best to have some sort of "fork" operation to start the new agent. Rather than starting from blank it inherits the existing context (which is already cached for evaluation) plus a bit on top for its specific task. Much like the system call there would essentially be two returns, the one in the agent says "You are the agent, perform the discussed work" and the parent gets the result produced by the agent.
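A rough sketch of what such a "fork" could look like at the message-list level. Everything here is hypothetical: `fork` and `echo_child` are illustrative names, the context is modelled as a plain list of (role, text) tuples, and whether a given provider would actually reuse its cache for the shared prefix is an assumption.

```python
def fork(parent_context, task, run_child):
    """Start a child from a copy of the parent's context plus a task suffix."""
    child_context = list(parent_context)  # shared prefix, copied for the child
    child_context.append(("system", "You are the agent, perform the discussed work"))
    child_context.append(("user", task))
    result = run_child(child_context)     # intermediate turns stay in the copy
    parent_context.append(("tool_result", result))  # parent sees only the result
    return result

def echo_child(context):
    # Hypothetical stand-in for a real model call.
    context.append(("assistant", "lots of intermediate work..."))
    return f"finished: {context[-2][1]}"

parent = [
    ("user", "Plan the migration"),
    ("assistant", "Plan: 1) schema 2) backfill"),
]
result = fork(parent, "Do step 1: schema", echo_child)
```

As with the system call, there are two "returns": the child is told it is the agent, and the parent receives only the child's result while its own context grows by a single entry.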
Comment by pavlus 1 day ago
* push("What I'm about to do"),
* pop("What I've achieved").
"Push" marks the position in the current context after the call and returns "Proceed"; "pop" erases everything after the matching call and replaces the "Proceed" in its result with whatever was passed as the argument to "pop", effectively pruning a long-winded head even inside one reasoning stream. In the end the model only sees that it decided to work on something and that it was already done, forgetting everything in between except what it itself decided was important.
Gemma 4 31B QAT successfully uses it when navigating a maze, marking positions at intersections, exploring them, navigating back and pushing again if necessary. Smaller models often fail to mark positions and forget to backtrack as well, instead they try to rely on themselves to track their paths and navigate back (and also fail).
I think it should work for long-running deep research tasks, but I was too lazy to test it, because it all required a lot of code to glue this up, since most tools and libs are not designed to work like that, and now I'll need even more code to test it, without a purposeful task.
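The push/pop scheme described above can be sketched over a plain message list. This is a minimal illustration, not the commenter's code: the class name `PrunableContext` and the (role, text) message shape are assumptions.

```python
class PrunableContext:
    """Message list with push/pop markers for pruning intermediate work."""

    def __init__(self):
        self.messages = []  # list of (role, text) tuples
        self._marks = []    # indices of the results of still-open push() calls

    def append(self, role, text):
        self.messages.append((role, text))

    def push(self, intent):
        # Record the call and a placeholder result; remember where the result sits.
        self.append("tool_call", f"push({intent!r})")
        self.append("tool_result", "Proceed")
        self._marks.append(len(self.messages) - 1)
        return "Proceed"

    def pop(self, achieved):
        # Erase everything after the matching push and replace its "Proceed"
        # result with the summary the model chose to keep.
        mark = self._marks.pop()
        del self.messages[mark + 1:]
        self.messages[mark] = ("tool_result", achieved)
        return achieved

ctx = PrunableContext()
ctx.append("user", "Find the exit of the maze.")
ctx.push("Explore the left branch at intersection A")
ctx.append("assistant", "Moved left, then up, hit a wall, went back...")
ctx.append("assistant", "Tried another corridor, also blocked.")
ctx.pop("Left branch at A is a dead end")
```

After the `pop`, the context holds only the user prompt, the `push` call and the summary. Because the marks form a stack, nested pushes prune from the innermost one outwards.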
Comment by Etheryte 3 days ago
Comment by bob1029 3 days ago
I don't bother warning it in the system prompt anymore. It's pointless. I let it bump its head as required. A few hundred tokens and the agent is back on track each time.
Comment by embedding-shape 3 days ago
Comment by andai 3 days ago
Comment by bob1029 3 days ago
[User] Actual human prompt
[Agent] Attempted use of tool & hand slap
[Agent] call(projection of user's prompt relative to discovered tool constraints)
["User"] Prompt from above call
[Agent] Legal tool use
[Agent] ... until satisfied
[Agent] return(summary that satisfies the prompt for this level of execution)
[Agent] Additional call() invokes possible depending on returned summary
[Agent] Final return(summary) from root ends this turn of conversation and user sees summary
[User] Next turn of conversation initiated by actual human
Comment by throwaway314155 3 days ago
Comment by WithinReason 3 days ago
Comment by bob1029 3 days ago
The only tools permissible to root in my scheme are call() and return().
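A toy version of that call()/return() loop, as a sketch only. `run_agent` and `scripted_model` are hypothetical names, the "model" is a scripted stub rather than an LLM, and the depth limit of 1 is an assumption that mirrors the earlier remark about seeing no uplift beyond depth 1.

```python
def run_agent(prompt, model, depth=0, max_depth=1, sizes=None):
    """Run one level with a fresh context; only a summary returns to the caller."""
    context = [("user", prompt)]  # every level starts from a clean context
    while True:
        if sizes is not None:
            sizes[depth] = max(sizes.get(depth, 0), len(context))
        action, payload = model(context, depth)
        if action == "return":
            return payload
        if action == "call":
            if depth < max_depth:
                # The child's turns never enter this level's context.
                summary = run_agent(payload, model, depth + 1, max_depth, sizes)
                context.append(("tool_result", summary))
            else:
                # The "hand slap": recursion is not available this deep.
                context.append(("tool_result", "error: call() unavailable at this depth"))
            continue
        context.append(("assistant", payload))  # ordinary work at this level

def scripted_model(context, depth):
    # Hypothetical stand-in for an LLM: root delegates, the child works.
    if depth == 0:
        if context[-1][0] == "user":
            return ("call", "List the failing tests and their causes")
        return ("return", f"Done: {context[-1][1]}")
    if len(context) < 3:
        return ("work", f"tool output {len(context)}")
    return ("return", "2 tests fail due to a missing fixture")

sizes = {}
answer = run_agent("Why is CI red?", scripted_model, sizes=sizes)
```

The root context only ever holds the prompt and the returned summaries, which is why it can stay small while the children burn through tokens.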
Comment by WithinReason 3 days ago
Comment by kelnos 3 days ago
(And I rarely fill the context window that far anyway when working on a single task, or a series of tasks that are related enough to warrant the same context; more typical is anywhere between 200k and 600k or so.)
I'm not saying that no one ever has this experience, but it's odd to me that some people see it so often that it warrants giving it a name.
Comment by Bolwin 3 days ago
Personally I consider < 60k to be the smart zone for opus. This is worse for opus 4.7 and 4.8 cause of the more granular tokenizer
Comment by eterm 3 days ago
60k isn't much bigger than the system prompt.
Comment by Bolwin 3 days ago
Plus I've found that the only time models go above 100k tokens anyway is when they've started looping at which point it's much better to go back anyway.
Anecdotally most models know their recall is terrible (or have been trained to act as such), that's why they constantly reread files before editing or while reasoning.
Comment by danielbln 3 days ago
Comment by qsera 3 days ago
Comment by kuboble 3 days ago
It seems that people have different workflows or repos, or memories or prompts or expectations.
Comment by diab0lic 3 days ago
Comment by kuboble 3 days ago
I read it as a model's performance being random, and observed differences in opinion being the result of overinterpreting random outcomes.
I think however that some people seem to be always lucky which indicates that it is not random but rather some fixed differences between people and their environments.
Comment by qsera 2 days ago
Comment by embedding-shape 3 days ago
I think that's the issue, rather than 60K being small.
Most of the actual edits/changes I request to codex are solved within 100-150K tokens, beyond 200K I'd definitively try to restart the session as soon as I could as all models are horrible once you get across ~20% of the total context size. And this is while working on +million LOC codebases.
Problem I guess is that there is no solid and concrete evidence of this (to me [and others seemingly] obvious) degradation, but should be easy to prove, yet no one has time to sit down and show it :)
But the likelihood of a model getting minor details wrong once you're above some magical threshold between 15-20%, seems to skyrocket, and I hit that issue sufficient amount of times that now my workflow is trying to prevent that.
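That rule of thumb is easy to turn into a check. A minimal sketch, assuming the harness exposes a running token count; `should_restart` is a hypothetical helper and the 20% default is just the anecdotal threshold from this comment, not a measured value.

```python
def should_restart(tokens_used, context_limit, threshold=0.20):
    """Return True once the session has used `threshold` of the context window."""
    if context_limit <= 0:
        raise ValueError("context_limit must be positive")
    return tokens_used / context_limit >= threshold

# With a 1M-token window: 150K is still under 20%, 200K is not.
under = should_restart(150_000, 1_000_000)
over = should_restart(200_000, 1_000_000)
```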
Comment by rtpg 3 days ago
I routinely get claude to do things pretty decently and finish up easily in the 4-5 digit range of tokens. It seems to be doing the right kind of thing to not waste its time looking at 1000 files.
Comment by da_grift_shift 3 days ago
"YOU'RE HOLDING IT WRONG!"
Comment by RugnirViking 3 days ago
Comment by perching_aix 3 days ago
Comment by nijave 3 days ago
I usually see this when the context gets "tainted" as I call it. The model gets stuck on a bad path and there's no way to bring it back without clearing the context and starting again.
Frequently it'll be something as small as 1 sentence of a prompt many messages ago.
When cases like that happen, I reset the context and try to be explicit about assumptions and requirements to keep it off the "tainted" path. Other times it's actually useful and agents will do things they normally wouldn't do once the state is tainted. For instance, if you're testing a chat bot's ability to stay on topic, you can seed the context early with what you want it to do. It generally will refuse initially but later on in the conversation it will still silently take that seeded context into account almost "subconsciously" and become more likely to do the thing it originally refused.
Comment by dd8601fn 2 days ago
Overreacting seems to put them back on track, but they’ll “forget” again pretty quickly.
It really depends more on the thing you’re expecting it to “remember” and distance from the last wrist slap.
Comment by CjHuber 3 days ago
Comment by embedding-shape 3 days ago
Comment by CjHuber 3 days ago
Comment by wg0 3 days ago
Comment by properbrew 3 days ago
Do you have any old documentation that it's picking up and referencing? If you set all claude settings back to default do you see the same issue?
Comment by arcanemachiner 3 days ago
Comment by throwaway314155 3 days ago
Which drugs?
Comment by justinclift 3 days ago
Comment by aeonik 3 days ago
Comment by nijave 3 days ago
Comment by fullstackchris 3 days ago
100k tokens "by lunch" is also not my finding, the newer models will hit that already right in the initial exploratory phase
Comment by stavros 3 days ago
Comment by arcanemachiner 3 days ago
Comment by csomar 3 days ago
Comment by trapexit 3 days ago
Comment by csomar 3 days ago
Can you imagine even a junior making such a mistake?
Comment by HarHarVeryFunny 3 days ago
Different models, and versions of models, use different types of attention, which affects their long-context performance, and no doubt also do different amounts/types of long context training.
Different agents build context differently and implement context compaction differently.
Unless someone else is using the same model as you, the same agent/harness as you, and doing very similar tasks, then there is no reason to suppose that their experience of model behavior relating to context size is going to be the same as yours.
Comment by kelnos 3 days ago
Relax, I acknowledged this in my comment...
Comment by tyleo 3 days ago
Comment by pdantix 3 days ago
opus 4.5 would start failing tool calls when approaching its 200k limit, opus 4.6 could get to ~300k before getting confused, opus 4.7 i could stretch to around 400k before the dumb zone started, and with opus 4.8 i've had sessions get over 500k comfortably.
admittedly we only had limited time with fable, but i had a couple sessions get into 800-900k just fine.
Comment by asd88 3 days ago
Comment by saberience 2 days ago
I mean, I really, really see intelligence tank at a certain amount of context usage. I always start a new session when any implementation work is starting or when starting a new plan.
So I clean context before writing a plan, I clean context before any implementation of a plan. My first prompt is always putting enough of my own context, copy and pastes of docs, etc, to ensure the plan creation is good. Once the plan is made I clean the context and get Opus to implement said plan.
Out of all the methodologies I've tried, this seems to be the best in terms of output quality.
Comment by cyanydeez 3 days ago
Comment by SwellJoe 3 days ago
But, this is also why so-called "memory" systems are usually a mistake that makes the models dumber. They don't have memory, they only have context, and every irrelevant fact you shove into the context is less context for the problem. Fewer distractions, better results.
The way to have the agent remember things is to have it document its work, like a human developer would do if they wanted their project to be friendly to other developers working on it. Good developer docs with an index page and a good plan with checklists, in concise Markdown files, checked in to the repo is the ideal memory for models and the ideal docs you need to figure out WTF the model has been up to. Helps with code review, too, whether by humans or another model. There's no down side.
Comment by endless1234 3 days ago
Comment by SwellJoe 3 days ago
But, it does a good job following existing conventions in a codebase, as long as they're really consistent. So the more actively you enforce that consistency the more likely it is to do the right thing without memories or prompting.
I don't like "never do" or "always do" type rules in AGENTS.md or in memory, as it often over-interprets them and ties itself in knots trying to satisfy an impossible set of goals.
Comment by wood_spirit 3 days ago
Comment by justinclift 3 days ago
Comment by mountainriver 3 days ago
Comment by lordgrenville 3 days ago
Can't speak to how good those tests are, but they can't be worse than anecdotal evidence for something as vague/subjective as LLM performance.
Comment by nijave 3 days ago
In the Chroma results, they look at Sonnet 4 which was also terrible in my experience. The same prompt that worked perfectly in Sonnet 4.5 would fail miserably in Sonnet 4
Would be good to see newer tests with both SOTA and open weight. The SOTA ones always seem to follow directions and stay on topic better but it'd be good to have some data to back it up.
Comment by bhy 3 days ago
Comment by kristianc 3 days ago
Comment by nopurpose 3 days ago
Comment by kristianc 3 days ago
- Work Mode - HITL/AFK
- Problem Statement
- Who It Affects - Primary / Secondary User
- User Stories
- Business Case
- Why Now
- Success Criteria
- In Scope/Out of Scope (Out of Scope v. important)
- Thinnest Slice (This I've found super valuable, means you max out the amount of 'product' for your buck and avoid diminishing marginal returns or overbuilding. Often I will build this)
- Eigenfeature - What is the larger feature we _could_ build (but probably won't) which would solve for this use case and other stuff I might not have thought of
- Technical Notes
- Deps
- Schema Changes
- Risks
- Final Recommendation [go / no go, including on scope]
There's a note in my Claude / Agents MD which says no net new feature gets introduced without this and I get it to move through a pipeline of folders (active, approved, shipped, proposed etc). All runs in a system of MD files and have even created a little MD Kanban from the metadata!
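A minimal sketch of how a Markdown Kanban could be derived from such a folder pipeline. The stage names, the folder layout, and the `kanban` function are assumptions for illustration; the setup described above builds its board from file metadata, which is not shown here, so this version only looks at which stage folder each file sits in.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical stage folders, in pipeline order.
STAGES = ["proposed", "approved", "active", "shipped"]


def kanban(paths):
    """Group Markdown files by their parent stage folder and render a board."""
    board = defaultdict(list)
    for path in map(PurePosixPath, paths):
        # Only count .md files that sit directly inside a known stage folder.
        if path.suffix == ".md" and path.parent.name in STAGES:
            board[path.parent.name].append(path.stem)

    lines = []
    for stage in STAGES:
        lines.append(f"## {stage} ({len(board[stage])})")
        lines.extend(f"- {name}" for name in sorted(board[stage]))
    return "\n".join(lines)
```

Moving a file between folders is then the only state change needed; the board is regenerated from the directory listing.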
Comment by magicalhippo 3 days ago
I then start a fresh conversation, make it analyze the design document and code, and for larger changes, generate a high-level implementation document which includes concrete phases or steps. I review this plan and iterate if necessary.
Then for each phase I make it generate a detailed plan for that phase and save it alongside the other documents. Once the phase is over, I make it write a summary of what was done, decisions made and the reasons for them. That is typically also a good point to compact the model's context.
These documents give additional context when I make another model do code review, and help illuminate drift or gaps from the main design document.
Comment by SeriousM 3 days ago
This flow works for my needs, building idea demos, prototypes or tools for my own sake. I don't let agent code in our main code base where everything is still hand tailored. That's a conscious decision.
I noticed that the cheaper models (flash, ...) are quite hard to hold back from changing files. A question about possible options sometimes results in "yes, I'll go with option A" without asking back. Frontier models on the other hand love to plan and deliberately ask for your consent.
I use pi.dev with almost no skills at all to understand how models really work and "feel" to work with.
Comment by da_grift_shift 3 days ago
Comment by mg 3 days ago
Personally, I already see LLMs and agents as blackboxes. I give each feature request to multiple LLMs and then compare the results. I don't manually use "sessions" at all. I just look at the outcome. When I dislike it, I "git reset --hard", change my prompts and restart the feature request.
To have an ongoing sense of which agents perform best, I keep a log and calculate an Elo score of which agents meet my demands best. This score is important to me, not so much how the agent achieves it.
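For anyone who wants to try the same bookkeeping, here is a minimal sketch of the standard Elo update applied to a log of pairwise comparisons. The log format (a list of `(winner, loser)` agent names), the starting rating of 1500, and the K-factor of 32 are assumptions, not details from the comment above.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b); score_a is 1 for a win, 0.5 draw, 0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rank_agents(log, start: float = 1500.0) -> dict:
    """Replay a log of (winner, loser) pairs and return the final ratings."""
    ratings = {}
    for winner, loser in log:
        r_w = ratings.get(winner, start)
        r_l = ratings.get(loser, start)
        ratings[winner], ratings[loser] = update_elo(r_w, r_l, 1.0)
    return ratings
```

Each feature request given to two agents yields one log entry: whichever result you kept is the winner.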
Comment by hypfer 3 days ago
Comment by loehnsberg 3 days ago
Comment by mg 3 days ago
In an interactive session, adding "Fine, but make the button red" after the model generated a first solution more than doubles the tokens used, because the model now gets not only the original code and the feature request but also the updated code plus the change request as input tokens.
Sending a feature request to an LLM and then sending the feature request again with "The button shall be red" only doubles the tokens used.
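The arithmetic behind that comparison can be sketched as follows. This counts input tokens only and ignores caching and output tokens; the function names and the example numbers are made up for illustration.

```python
def interactive_input_tokens(code: int, request: int, output: int, change: int) -> int:
    """Two turns in one session: the second turn resends everything so far."""
    first_turn = code + request
    second_turn = code + request + output + change
    return first_turn + second_turn


def restart_input_tokens(code: int, request: int, change: int) -> int:
    """Two independent runs: the second one only adds the extra requirement."""
    first_run = code + request
    second_run = code + request + change
    return first_run + second_run


# Illustrative sizes: 10k tokens of code, a 200-token request,
# 3k tokens of generated code, a 20-token follow-up.
interactive = interactive_input_tokens(10_000, 200, 3_000, 20)
restart = restart_input_tokens(10_000, 200, 20)
print(interactive, restart, interactive - restart)
```

Under these assumptions the gap between the two approaches is exactly the size of the first response, which is resent as input in the interactive case.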
Comment by jgilias 3 days ago
Comment by mg 3 days ago
I wrote my own agent, and it sends data to LLMs in this order: "General Prompts (How to write good code)" + "The Code" + "The Feature Request". This means the KV cache will be used even when the feature request changes.
And output tokens are usually way less than the input tokens.
So I think that my approach is very lightweight on token usage compared to an interactive session.
It would be interesting to measure it for the other agents out there. Sending a feature request two times vs an interactive session.
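A rough sketch of why the ordering matters for cache reuse: if the part that changes goes last, two prompts share a long identical prefix. This compares characters rather than tokens and says nothing about how any particular provider implements prefix caching; `build_prompt` and `shared_prefix_len` are hypothetical helpers.

```python
def build_prompt(general: str, code: str, request: str) -> str:
    """Stable parts first, the part that changes between runs last."""
    return "\n\n".join([general, code, request])


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count
```

With the request placed first instead, the two prompts diverge from the first character and nothing can be reused.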
Comment by ryan_glass 3 days ago
Comment by Tepix 3 days ago
Comment by Chirono 3 days ago
Comment by redox99 3 days ago
Comment by Raphael_Amiard 3 days ago
Comment by cactusplant7374 3 days ago
Comment by cyanydeez 3 days ago
Comment by justinclift 3 days ago
Asking because I could guess that approach would be ok for the types of front end work that doesn't require much security or other validation.
But it sounds like it wouldn't be suitable for work in regulated industries or anything that needs to have extreme care taken.
?
Comment by perching_aix 3 days ago
Comment by mg 3 days ago
Comment by WilcoKruijer 3 days ago
[1] https://pi.dev/
Comment by wood_spirit 3 days ago
I do my own framework and spend a lot of time trying to debug this and it’s not so much the context size in hard numbers but rather the probability that there is debris or wrong directions in the window that are drowning out the things the user thinks are important.
This manifests as the llm going back to the thing that failed when it tried it just before the last approach, etc. The frequency of things in the context window gives them weight even if they are the wrong things.
I have a lot of tricks like not giving the llm lots of tools but rather giving it a tool it can use to search for tools etc.
But the bigger solution is in process where you use something like superpowers to force the llm through stages and you control the context that carries forward.
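The "tool to search for tools" trick might look roughly like this. It is a sketch only: the `Tool` type, the registry contents, and the keyword scoring are all invented for illustration, and a real harness would expose `search_tools` to the model as its one always-present tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


# Hypothetical registry; only search_tools itself is shown to the model up front.
REGISTRY = [
    Tool("read_file", "Read a file from the working directory"),
    Tool("run_tests", "Run the project test suite and report failures"),
    Tool("git_diff", "Show uncommitted changes in the repository"),
]


def search_tools(query: str, registry=REGISTRY, limit: int = 3) -> list[str]:
    """Return names of tools whose name or description matches the query words."""
    words = query.lower().split()
    scored = []
    for tool in registry:
        text = f"{tool.name} {tool.description}".lower()
        score = sum(1 for word in words if word in text)
        if score:
            scored.append((score, tool.name))
    # Best match first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]
```

The full descriptions of the matching tools only enter the context once the model has asked for them.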
Comment by deliciousturkey 3 days ago
Comment by ausbah 2 days ago
Comment by doginasuit 3 days ago
It is not that agents can't function with a large context window, they can if that information generally has a desirable signal (like a large initial document or a well-focused session). Mistakes and the confusing signals that come out of fixing mistakes are why performance degrades. I start to trust the context window less not as a matter of size but the amount of friction we run into. The friction can be random but it is more often an issue with the path that I have us on.
Comment by nottorp 3 days ago
That’s what I did intuitively anyway.
Comment by nuc1e0n 3 days ago
If you don't point out what's wrong I find the LLM will go into great technical detail which consumes a lot of tokens, but not 'see the wood for the trees'.
It seems to me human beings also have mechanisms to compact context, which may be why we can forget what we came into a room for when going through doorways. I think it would be interesting to research which markers we use to compartmentalize our thinking.
Comment by schipperai 3 days ago
Comment by faeyanpiraat 3 days ago
Maybe I could achieve better and quicker results with keeping the context in the proper zone, but trying it will have to wait until the next project.
Comment by mcapodici 3 days ago
Comment by PeterStuer 3 days ago
Admittedly I have been doing this as a precaution, based on anecdotal evidence, not because I had bad experiences with longer context deterioration myself.
In the brief time I had access to Fable 5, it went on long running tasks (>45 mins) into the 30-40% zone without apparent context coherence problems.
Comment by afc 3 days ago
In essence, we run many short agent loops, generating their prompts dynamically from structured data. Each loop advances the state in a small step towards the final goal.
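A minimal sketch of that shape, assuming the structured data is a goal plus a list of steps. `call_model` is a stand-in for whatever actually runs the agent, and the state layout is invented for illustration.

```python
def build_prompt(state: dict) -> str:
    """Generate a small prompt from the structured state, not from chat history."""
    done = [s["title"] for s in state["steps"] if s["done"]]
    todo = next(s for s in state["steps"] if not s["done"])
    return (
        f"Goal: {state['goal']}\n"
        f"Already done: {', '.join(done) or 'nothing'}\n"
        f"Do only this step: {todo['title']}"
    )


def run(state: dict, call_model) -> dict:
    """Advance one step per loop; every loop starts from a fresh prompt."""
    while any(not s["done"] for s in state["steps"]):
        prompt = build_prompt(state)
        result = call_model(prompt)  # hypothetical: runs one short agent loop
        step = next(s for s in state["steps"] if not s["done"])
        step["result"] = result
        step["done"] = True
    return state
```

The context never grows beyond one prompt's worth; what carries forward is the state, not the transcript.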
Comment by RandyRanderson 3 days ago
It seems obvious. Moreover, in a simple model, it seems like whatever tokens you do add have to have MORE information than the average in the existing window.
In a non-trivial model (and this is the model I would choose), since you are adding them to the end, they likely have to have MUCH more information.
Proof, as always, is left as an exercise for the reader.
Comment by rsanek 3 days ago
Not really tho right? Since we got to 1m context in mid 2025 nearly no one has gone higher.
Comment by torginus 3 days ago
When it comes to source code, I feel like LLMs could just as well work with something like minified source code. If an LLM is trained well on programming, I think there's no reason why a variable should be represented by more than a single token. Comments can be discarded, etc. In fact, considering that embeddings for LLMs are very rich, I think common ops could be reduced to a single token.
Imo that's why LLMs are so good at reverse engineering. A lot of the time, assembly (with symbols) is pretty close to the source code, but compressed and encoded, and if you're familiar with the patterns of your compiler, reversing it is not that difficult.
Anyways, context engineering could be a huge boon to input token curation imo (and maybe it already is)
Comment by _def 3 days ago
Comment by elcritch 3 days ago
Comment by cubefox 3 days ago
This would be really easy to measure. Just take some standard benchmarks, but fill up the context beforehand. Is the benchmark performance degraded? If so, by how much?
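The padding half of such an experiment is indeed small. Here is a sketch, assuming filler is measured in whitespace-separated words rather than real tokens; `pad_prompt` and `accuracy` are hypothetical helpers, and the model call and the benchmark itself are left out.

```python
from itertools import cycle, islice


def pad_prompt(task: str, filler: str, filler_words: int) -> str:
    """Prepend roughly filler_words words of filler text to a benchmark task."""
    words = filler.split()
    if not words or filler_words <= 0:
        return task
    # Repeat the filler as often as needed to reach the requested length.
    padding = " ".join(islice(cycle(words), filler_words))
    return f"{padding}\n\n{task}"


def accuracy(answers, expected) -> float:
    """Fraction of answers that exactly match the expected ones."""
    correct = sum(a == e for a, e in zip(answers, expected))
    return correct / len(expected)
```

Running the same tasks at several `filler_words` settings and comparing `accuracy` against the unpadded baseline would give the degradation curve; what kind of filler to use is the open design question.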
Comment by Bolwin 3 days ago
Each relevant thing is basically a rule. Trying to do something with 500 rules is what's hard.
If you take a standard benchmark and just prepend a random book to it, it will not capture that
Comment by cubefox 3 days ago
Comment by brunoluiz 3 days ago
Also, some colleagues were playing around with RTK (https://github.com/rtk-ai/rtk), which decreases the number of tokens used by tool calls. It seems an interesting idea, but I am pretty sure there are many caveats. Still, I believe that if these types of tools prove to be efficient enough, perhaps harnesses will have them natively.
Comment by kuboble 3 days ago
A few of the best sessions I have ever had with claude went into 700-800k territory.
I frequently reach 400-600k without visible (to me) signs of quality regression.
Comment by da-x 3 days ago
Comment by amunozo 3 days ago
Comment by walthamstow 3 days ago
Comment by steveridout 3 days ago
For example, it may be the case that a long context full of useful information relevant to the task is completely fine, perhaps even beneficial. And if the context contains a bunch of unrelated tangents and conflicting instructions, then it will be detrimental.
Have there been studies on what makes models get dumber? To what extent is context length to blame vs context quality?
Comment by tmp10423288442 3 days ago
Comment by daishi55 3 days ago
I use opus 1m context all day every day at work and I simply have never encountered this. I don’t even think about context windows anymore I just let it do what it wants re compaction. Hard for me to understand where this article is coming from.
Comment by petesergeant 3 days ago
Comment by OutOfHere 2 days ago
Comment by mightyham 3 days ago
Comment by k__ 3 days ago
I had the impression, models would get inconsistent after just 3000 words.
Comment by Der_Einzige 3 days ago
Comment by jackxlau 3 days ago
Comment by cowang 3 days ago
Comment by monster_truck 3 days ago
Comment by vlan121 3 days ago
Comment by dalemhurley 3 days ago
Comment by jsemrau 2 days ago
Comment by mock-possum 3 days ago
Comment by mystraline 3 days ago
What in the models causes this 'dumbing down'?
Comment by andrewshadura 3 days ago
Reminds me of the sign, "Do not dumb here. No dumb zone."
Comment by Febriss33 3 days ago
Comment by BrenBarn 3 days ago
Comment by woadwarrior01 3 days ago
Comment by carterschonwald 3 days ago
Comment by kage18 23 hours ago
Comment by Avenassh 3 days ago
Comment by teiji-tango 3 days ago
Comment by 3vo-ai 3 days ago
Comment by ashish296 3 days ago
Comment by jimmypk 3 days ago
Comment by Dollarland 3 days ago
Comment by ianhxu 3 days ago
Comment by haeseong 3 days ago
Comment by breakthematrix 3 days ago