Are the costs of AI agents also rising exponentially? (2025)
Posted by louiereederson 2 days ago
Comments
Comment by smusamashah 8 minutes ago
Comment by easygenes 5 hours ago
GPT-5 was shown as being on the costly end, surpassed by o3 at over $100/hr. I can't directly compare to METR's metrics, but a good proxy is the cost of the Artificial Analysis suite. GLM-5.1 costs less than half as much as GPT-5 to complete the suite and is dramatically more capable than both GPT-5 and o3.
So while their analysis is interesting, it points towards the frontier continuing to test the limits of acceptable pricing (as Mythos is clearly reinforcing), while the lagging 6-12 months of distillation and refinement continue to bring the cost of comparable capabilities down to much more reasonable levels.
Comment by avidphantasm 16 minutes ago
Comment by thelastgallon 9 hours ago
Comment by nopinsight 5 hours ago
That raises a question: if practical-tier inference commoditizes, how does any company justify the ever-larger capex to push the frontier?
OpenAI's pitch is that their business model should "scale with the value intelligence delivers." Concretely, that means moving beyond API fees into licensing and outcome-based pricing in high-value R&D sectors like drug discovery and materials science, where a single breakthrough dwarfs compute cost. That's one possible answer, though it's unclear whether the mechanism will work in practice.
Comment by zozbot234 5 hours ago
Comment by tibbar 5 hours ago
Comment by zozbot234 5 hours ago
Comment by boxedemp 5 hours ago
I think you're overestimating, or oversimplifying. Maybe both.
Comment by jurgenburgen 2 hours ago
Assuming you used o3, that would cost $58,800 per week. That's an expensive bet for only 50% odds in your favor.
Of course, the agents are only that good on benchmarks; in reality your odds are worse. Maybe roulette instead?
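For what it's worth, the weekly figure checks out if you assume an agent running around the clock at roughly $350/hr; that rate is inferred from the quoted total, not a published price:

```python
# Sanity check of the weekly cost figure. The $350/hr rate is an
# assumption inferred from the quoted total (o3 was cited upthread
# at "over $100/hr"), not a published price.
HOURLY_RATE = 350        # dollars per hour, assumed
HOURS_PER_WEEK = 24 * 7  # agent running continuously

weekly_cost = HOURLY_RATE * HOURS_PER_WEEK
print(weekly_cost)  # 58800
```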
Comment by raincole 5 hours ago
> I think you're overestimating, or oversimplifying
Yeah, if you only read the comments on HN and not the actual linked article, you'll come away with an oversimplified conclusion. Like, duh?
Comment by TeMPOraL 1 hour ago
Curiously, for most submissions it's the opposite - comments are much more useful and nuanced than the source being discussed.
Comment by boxedemp 5 hours ago
Comment by ting0 1 hour ago
Comment by EdvinPL 50 minutes ago
Seen this way, AI work is like a slot machine: will this work or not? Either way the casino gets paid, and the casino always wins.
Nevertheless, if the idea or product is very good (addressing a real market pain) and not that difficult to build, it can enable non-coders to "gamble" on the outcome with AI for dollars.
Sadly, from my experience hiring devs, hiring people is also a gamble...
Comment by ketzu 34 minutes ago
This is the weirdest example of "gambling" I have seen in my life. If you'd written "unprotected sex" I'd see the gambling part, but "extramarital sex" covers so much more than the tiny subset of "whose baby is it" (how many people are there having sex to gamble on who will be the father of a baby? 10?).
This made my day.
Comment by stavros 17 minutes ago
Comment by dang 12 hours ago
Measuring Claude 4.7's tokenizer costs - https://news.ycombinator.com/item?id=47807006 (309 comments)
Comment by greenmilk 11 hours ago
Comment by wsun19 10 hours ago
Comment by henry2023 8 hours ago
Comment by dannersy 2 hours ago
Comment by wavemode 9 hours ago
Where the long-term payoff still seems speculative is for companies doing training rather than just inference.
Comment by Gigachad 9 hours ago
Comment by hypercube33 8 hours ago
What I'm curious about is all the other stuff out there, such as the ARM and tensor chips.
Comment by raincole 7 hours ago
Comment by jagged-chisel 10 hours ago
Comment by quicklywilliam 10 hours ago
So: I buy that the cost of frontier performance is going up exponentially, but that doesn't mean there is a fundamental link. We also know that benchmark performance of much smaller/cheaper models has been increasing (as far as I know METR only looks at frontier models), so that makes me wonder if the exponential cost/time horizon relationship is only for the frontier models.
Comment by esperent 7 hours ago
Do we? Because elsewhere in the thread there are people claiming they are profitable on API billing and might be at least close to break-even on subscriptions, given that many people don't use their full allowance.
Comment by ai-x 5 hours ago
Step 1) Bubble callers are proven wrong in 2026, if not already (no excess capacity)
Step 2) "Models are not profitable" is proven wrong (when Anthropic files their S-1)
Step 3) FOMO and an actual bubble (say around 2028/29)
Comment by dminik 1 hour ago
I have no data to support this, but I think they just about break even on API usage and take an overall loss on subscriptions/free plans.
Comment by 2848484995 2 hours ago
Comment by lwhi 1 hour ago
Comment by agentifysh 9 hours ago
The difference is that current prices are heavily subsidized by OPM.
Once the narrative changes to something more realistic, I can see prices increasing across the board. Forget $200/month for Codex Pro; expect $1000/month or something similar.
So it's a race between new hardware supply and new paradigm shifts hitting the market vs. the tide going out in the financial markets.
Comment by jiggawatts 5 hours ago
For inference, there is already a 10x improvement possible over a setup based on NVIDIA server GPUs, but volume production, etc... will take a while to catch up.
During inference the model weights are static, so they can be stored in High Bandwidth Flash (HBF) instead of High Bandwidth Memory (HBM). Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.
NVIDIA GPUs are general purpose. Sure, they have "tensor cores", but that's a fraction of the die area. Google's TPUs are much more efficient for inference because they're mostly tensor cores by area, which is why Gemini's pricing is undercutting everybody else despite being a frontier model.
New silicon process nodes are coming from TSMC, Intel, and Samsung that should roughly double the transistor density.
There's also algorithmic improvements like the recently announced Google TurboQuant.
Not to mention that pure inference doesn't need the crazy fast networking that training does, or the storage, or pretty much anything other than the tensor units and a relatively small host server that can send a bit of text back and forth.
Comment by zozbot234 5 hours ago
Isn't reading from flash significantly more power intensive than reading DRAM? Anyway, the overhead of keeping weights in memory becomes negligible at scale, because you're running large batches and sharding a single model over large numbers of GPUs. (And that needs the crazy fast networking to make it work; you get too much latency otherwise.)
Comment by jiggawatts 3 hours ago
> becomes negligible at scale
Nothing is negligible at scale! Both the cost and the power draw of HBM are limiting factors for the hyperscalers, to the point that Sam Altman (famously!) cornered the market and locked in something like 40% of global RAM production, driving up prices for everyone.
> sharding a single model over large amounts of GPUs
A single host server typically has 4-16 GPUs directly connected to the motherboard.
A part of the reason for sharding models between multiple GPUs is because their weights don't fit into the memory of any one card! HBF could be used to give each GPU/TPU well over a terabyte of capacity for weights.
Last but not least, the context cache needs to be stored somewhere "close" to the GPUs. Across millions of users, that's a lot of unique data with a high churn rate. HBF would allow the GPUs to keep that "warm" and ready to go for the next prompt at a much lower cost than keeping it around in DRAM and having to constantly refresh it.
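As a rough illustration of the capacity argument (the model sizes, per-parameter byte widths, and the 80 GB HBM figure below are illustrative assumptions, not vendor specs):

```python
# Back-of-envelope weight-memory arithmetic: why large models get
# sharded across HBM-equipped GPUs, and why ~1 TB of HBF per card
# would change that. All figures are illustrative assumptions.
HBM_PER_GPU_GB = 80    # typical current server GPU (assumed)
HBF_PER_GPU_GB = 1024  # "well over a terabyte" per the comment

def weights_gb(params_billion, bytes_per_param):
    # 1e9 params at 1 byte each is ~1 GB, so GB ~= billions * bytes/param
    return params_billion * bytes_per_param

for params_b, bpp in [(70, 2), (405, 2), (1000, 1)]:
    need = weights_gb(params_b, bpp)
    gpus_needed = -(-need // HBM_PER_GPU_GB)  # ceiling division
    fits_one_hbf_card = need <= HBF_PER_GPU_GB
    print(f"{params_b}B params @ {bpp} B/param: {need} GB of weights -> "
          f"{gpus_needed} GPUs via HBM; fits one HBF card: {fits_one_hbf_card}")
```

Even the largest hypothetical case here (1000B params at 1 byte each) fits on a single terabyte-class HBF card, while needing a dozen-plus GPUs' worth of 80 GB HBM.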
Comment by zozbot234 3 hours ago
Flash has no idle power since it's non-volatile (whereas DRAM needs constant refresh), but the active power for reading a same-sized block is significantly higher for flash. You can still use flash profitably, but only for rather sparse and/or low-intensity reads. That probably fits things like MoE layers, if the MoE is sparse enough.
Also, you can't really use flash memory (especially soldered-in HBF) for ephemeral data like the KV context for a single inference, it wears out way too quickly.
Comment by adrian_b 2 hours ago
However, for old-style 1-bit per cell flash memory I do not see any reason for differences in power consumption for reading.
Different array designs and sense amplifier designs and CMOS fabrication processes can result in different power consumptions, but similar techniques can be applied to both kinds of memories for reducing the power consumption.
Of course, storing only 1 bit per cell instead of 3 or 4 greatly reduces the density and cost advantages of flash memory, but what remains may still be enough for what inference needs.
Comment by colechristensen 9 hours ago
128GB is all you need.
A few more generations of hardware and open models, and people will be pretty happy doing whatever they need to on their laptop locally, with big SOTA models left for special purposes. There will be a pretty big bubble burst when there aren't enough customers at the $1000/month per seat needed to sustain the enormous datacenter models.
Apple will win this battle and nvidia will be second when their goals shift to workstations instead of servers.
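A rough sketch of what the 128 GB claim implies; the overhead figure and the parameter counts below are illustrative assumptions, not specific models:

```python
# Which model sizes fit in 128 GB of unified memory, at common
# quantization widths? Overhead and sizes are illustrative assumptions.
UNIFIED_GB = 128
OVERHEAD_GB = 16  # OS, KV cache, activations (rough assumption)
budget_gb = UNIFIED_GB - OVERHEAD_GB

for params_b in (8, 70, 120, 400):
    for bits in (16, 8, 4):
        weights = params_b * bits / 8  # GB of weights at this precision
        if weights <= budget_gb:
            print(f"{params_b}B at {bits}-bit: {weights:.0f} GB -> fits")
```

Under these assumptions a 70B model fits at 8-bit and a 120B model at 4-bit, but anything in the 400B+ class still needs the datacenter.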
Comment by hypercube33 8 hours ago
Comment by MrBuddyCasino 6 hours ago
Comment by Tepix 5 hours ago
Comment by lookaround 8 hours ago
My guy, look around.
They are coming for personal compute.
Where are you going to get these 128GBs? Aquaman? [0]
The RAM makers are inexplicably tying their fate to a future that is all LLMs, everywhere.
Comment by naveen99 8 hours ago
Comment by adrianN 6 hours ago
Comment by bitwize 7 hours ago
End users will still get access to RAM. The cloud terminal they purchase from Apple, Google, Samsung, or HP will have all the RAM it will ever need directly soldered onto it.
Comment by xantronix 6 hours ago
Comment by bitwize 4 hours ago
Comment by seanmcdirmid 7 hours ago
Comment by foota 8 hours ago
Comment by matt3210 9 hours ago
Comment by siliconc0w 7 hours ago
Happy to run it on your repos for a free report: hi@repogauge.org
Comment by noosphr 7 hours ago
If they can do a task that takes 1 unit of computation for 1 dollar, they will cost 100 dollars for a 10-unit task and 10,000 for a 100-unit task.
Project costs from Claude Code bear this out in the real world.
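For what it's worth, the figures in the parent comment are superlinear: $1 for 1 unit, $100 for 10, $10,000 for 100 is cost growing with the square of task size, not linear pay-per-unit. A minimal model of that curve:

```python
# The quoted figures fit cost(n) = n**2 dollars for an n-unit task,
# i.e. quadratic growth in task size rather than linear pay-per-unit.
def cost_dollars(units):
    return units ** 2

for n in (1, 10, 100):
    print(f"{n} units -> ${cost_dollars(n)}")
```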
Comment by twaldin 2 hours ago
Comment by keepamovin 5 hours ago
Comment by chii 4 hours ago
That depends on the ability to produce supply at a saturation rate.
It did work for internet backhaul links, a la those dark fibres. However, I reckon those fibres are easier to manufacture than silicon chips.
I wonder if saturation is possible for AI-capable chips.
Comment by maxbeech 16 minutes ago
Comment by agdexai 1 hour ago
Comment by Zero_jester 25 minutes ago
Comment by loklok5 1 hour ago
Comment by pavelbuild 3 hours ago
Comment by samoladji 4 hours ago
Comment by linzhangrun 4 hours ago
Comment by srslyTrying2hlp 11 hours ago
Comment by totalmarkdown 11 hours ago