Running local models is good now
Posted by jfb 18 hours ago
Comments
Comment by c0rruptbytes 16 hours ago
You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow
You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes
You need a lot of memory to run these well, quantization makes tool calling weaker, so most run at 4 bit quants and are wondering why it kinda sucks and that's because you've essentially lobotomized the model (I recommend unsloth quants, i recommend 6bit for MoEs and 5bit for dense)
So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs
On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.
So are they good? not really. Do they work? yes
edit: just wanna clarify - i think open models are the future, i think they're super important, i'm contributing constantly to the ecosystem - i think people should play around with these models, i think people should use `pi` and learn how it all works - but don't download a model expecting it to be good out of the box, you will have to tune and configure a lot of stuff to replace a "coding agent" that most people are using models for
Comment by saghm 16 hours ago
The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.
Comment by rapind 15 hours ago
I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.
I want to pay. I don't want my data used for training. I want it to be open. I want it to be consistently up (more than Claude!). I want it to be fast. I don't want it to be subsidized as that's just an excuse for shitty quality. Deepseek flash knocks it out of the park on all of these except you're data is used in training. I'm fine with it being hosted since there's no way I'm using it 24/7, but data MUST be private.
Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.
Comment by milesvp 7 hours ago
Comment by horacemorace 4 hours ago
Comment by larodi 2 hours ago
Comment by aamoscodes 15 hours ago
Comment by fc417fc802 8 hours ago
More than that, they have various zero data retention options and provide a convenient json list of them.
Comment by larodi 2 hours ago
Comment by fc417fc802 2 hours ago
Plumbing you straight through would require nonstandard certificate juggling and they wouldn't be able to implement their core service of providing a standardized API nor could they transparently route your request to the fastest / cheapest / whatever provider on the fly nor could they implement transparent fallback nor could they implement their policy of not billing you if the response from the provider is invalid.
Also the chosen provider could fingerprint your network stack if you communicated directly. The routing service is acting as a proxy and for most providers fully anonymizes requests (it does send a stable uid to some of them though).
Comment by rapind 14 hours ago
Comment by darkmarmot 14 hours ago
Comment by rapind 14 hours ago
There'll probably need to be a threat of massive litigation should they fail to comply with such a policy.
Comment by rob74 57 minutes ago
Comment by naikrovek 8 hours ago
Maybe people will trust companies, but those companies will rarely deserve that trust. Anyone that pays attention sees breach announcements almost every day. Security is never a concern for these companies until it embarrasses them. Then, as soon as the negative attention fades, security again becomes the second to last priority.
Do not trust companies with any data that is important to you unless the effective management of that data is required by law, and the laws are comprehensive.
Comment by fc417fc802 8 hours ago
Comment by pessimizer 14 hours ago
I'm interested in this thought. There is significant motivation for providers to create a verifiable way for them not to deal with having access to client interactions with LLMs at all. Whatever standards and protocols have to be come up with in order to reassure clients.
Any good standards for privacy when interacting with LLMs could also trickle down to smaller providers, and everyone could offer guarantees. Even if the guarantee was literally just an insurance policy and a private court to decide if it pays out.
Comment by jen20 4 hours ago
Comment by kube-system 4 hours ago
Bedrock in fact does not train on your data. It was a big deal when it was announced that they share data with Anthropic for Fable, but even then it was gated away where you’d have to explicitly allow it.
Comment by rlkf 12 hours ago
You can run Qwen3 on OVH already:
<https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalo...>
Comment by johndough 11 hours ago
Does anyone know if OVH is ignoring the law here, or whether it does not apply for some reason?
Comment by nl 4 hours ago
There are much less (almost no) disclosure regulations on the deployer.
https://ethicalogic.com/articles/gpai-guide-roles-public-dat...
Comment by dofm 2 hours ago
Comment by dofm 10 hours ago
Not doubting you — just want to read it!
Comment by johndough 10 hours ago
The definition of a "genral-purpose AI model" is described in more detail in the "Guidelines on the scope of obligations for providers of general-purpose AI models under the AI Act": https://ec.europa.eu/newsroom/dae/redirection/document/11834...
Comment by saghm 14 hours ago
Comment by rapind 13 hours ago
For me, paying from $200 - $500 / month is reasonable if I can sustain a disruption free flow that doesn't require constant yak shaving. What I've found experimenting with DeepSeek on some open source library stuff is that it's actually going to cost me much less if I don't need frontier vibing (which I don't).
Comment by gaolei8888 13 hours ago
Comment by Bnjoroge 15 hours ago
Comment by djmips 4 hours ago
Comment by bel8 14 hours ago
I wonder if there are competent models trained purely on permissive open-source code like MIT or Apache 2.0.
Comment by yencabulator 14 hours ago
Comment by spockz 14 hours ago
Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.
I haven’t tried any tool that compresses the tokens yet.
Comment by echelon 13 hours ago
1. The hardware will eventually catch up.
2. This keeps the delta between frontier models smaller.
3. We can still fine tune and own the weights.
4. The models will be more useful, faster, and reliable.
RTX is hobbyist tier, not professional tier.
Gated cloud models from hyperscalers treat us like hobbyists in their own right.
We need equivalent scale models, but open.
Comment by zozbot234 13 hours ago
Comment by echelon 12 hours ago
This is what RunPod-type services are for.
For instance, ComfyUI is an abomination that can't do half of what Nano Banana and Seedance 2.0 can do. And you have to sit around and wait 10x longer for single results.
I can rent an H200 for $3.50 an hour. That's INSANELY cheap.
I do not understand this split between hosted APIs and rinky-dink local RTX models. Both suck.
The ideal solution is models we own run on RunPods leveraging H200s.
I can spend $100-200/day on compute making much more value with the model outputs.
----
edit: I want to respond to comments, but the damned HN rate limits keep me to five comments a day now because I'm a contrarian and say things that rile up the anti-AI folks.
You don't need to buy an H200. It's a depreciating asset. You rent one. It's cheap to rent.
Comment by spockz 12 hours ago
However, we need to use the tools that we have. Even if I wanted to buy a (bunch of) H200 for me and my colleagues and could get the expense approved, they are hard to source where we are.
Yes. You can rent them, but I’m not sure how that affects the IP discussion.
Moreover, not everyone is doing coding and video so we have different tasks that can fit quite well on relatively light laptops (Gemma et al), for relatively directed coding sessions we can make do with RTX cards, or a small step up, all the way to H200 in the workstation. Or pods thereof.
We have the graphics cards and laptops with MLX right now. The H200 will take a year at least to arrive. Better get used to run stuff locally.
Comment by zozbot234 12 hours ago
Comment by what 5 hours ago
That’s hardly contrarian here, lol.
Comment by echelon 5 hours ago
I swear, two thirds of the folks here just make comments that dunk on AI. They underestimate it, hate it, hate those that use it, etc. It's the "old angry man yells at cloud" trope.
I've had so many consecutive days of "-4" karma posts that HN is blocking me from commenting. And the comment retorts I get from these folks are absolute gems that will undoubtedly age like milk.
Comment by SR2Z 13 hours ago
Comment by dofm 10 hours ago
Comment by FridgeSeal 5 hours ago
Comment by MrLeap 13 hours ago
Comment by redmalang 14 hours ago
Comment by boppo1 9 hours ago
Comment by fc417fc802 7 hours ago
Comment by ryukoposting 14 hours ago
Comment by saghm 11 hours ago
Comment by aftbit 16 hours ago
If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.
Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.
Comment by ryan_glass 13 hours ago
Comment by dofm 15 hours ago
Ultimately if you skip over the opportunity to play with these models on your own machine you are losing out on a lot of really interesting educational opportunities — it helps make a lot of stuff feel more concrete in a way that only tinkering can.
But then I think once I had an idea of something that I was building against Gemma 4 or Qwen 3.6 I would be looking at openrouter etc., to stabilise it for the next tier of experimentation (and to get back a kind of multi-device access without tailscale/lm link etc.).
Are they good enough to replace what people seem to want to do with Claude? Maybe not. But it's an unparalleled learning opportunity.
Comment by EagnaIonat 14 hours ago
Even faster with the MLX builds.
Then when I need more heavy lifting I fire up a larger model.
IMHO the issue isn't the models. I've had OpenClaw give the same results as Claude using open models locally. Slower but does the job. Something that can do optimal model switching is what's needed.
Comment by aftbit 12 hours ago
Comment by EagnaIonat 3 hours ago
You can do coding and agentic fine. For coding I use qwen3.6:35b-mlx and agentic granite4.1:3b works fine.
These are the models I use.
- granite4.1:3b
- granite4.1:30b
- gpt-oss:20b
- gpt-oss:120b (less so now)
- mistral-small3.2
- qwen3.6:35b-mlx
There will always be use cases that don't sit on your laptop, but most of what can be done can be done locally, it just requires a good framework to sit on it.
Comment by azeirah 4 hours ago
It's worse at general tasks, but in the precise domain of coding I actually prefer to use it over my claude subscription because it has 0 latency (and no privacy concerns whatsoever).
Comment by girvo 10 hours ago
You're right that prefill kills perf, but shrug the GB10 has far more compute than it has memory bandwidth, so prefill isn't it's bottleneck.
Comment by htrp 9 hours ago
Comment by girvo 8 hours ago
I’m getting 40tk/s decode with 1000+tk/prefill with a 198B-A11B model on mine
Comment by EnPissant 1 hour ago
Comment by wincy 14 hours ago
Comment by layer8 14 hours ago
Comment by wincy 10 hours ago
Comment by girvo 10 hours ago
Comment by qudat 9 hours ago
Comment by jtbaker 14 hours ago
> but still not quite in the realm of Sonnet or DeepSeek 4 Flash
these are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4
Comment by trueno 13 hours ago
me thinks there's a lot of optimization strats we're currently leaving on the table just because the amount of things to explore and test are so expansive. but this one is super interesting targeting metal primarily and zeroing in on one model. instead of a one size fits all llama.cpp im very interested to see if theres a future where super tailor-made variants per model pans out to harnesses that can rapidly switch ultimately providing something akin to sonnet/early opus territory (that's my personal bench mark of good-enough i shall now cancel the hell out of this claude sub)
Comment by jtbaker 12 hours ago
Comment by aftbit 12 hours ago
Comment by jtbaker 12 hours ago
With this configuration (set up over the last month) I have been working on Python data processing tools, an internal Svelte 5/SvelteKit data intensive BI app, and some smaller Rust projects. It's been doing really well there.
Comment by monksy 12 hours ago
Comment by aftbit 12 hours ago
Prices will fall in the next few years. Maybe just play with the tiny toy models for now to learn how they work, then keep using API providers until they do.
Comment by eek2121 16 hours ago
Comment by mathisfun123 16 hours ago
Comment by zozbot234 16 hours ago
Comment by stemlord 11 hours ago
This isn't really good enough. Many of us need to get things done in a pinch and if our employers are already getting used to the idea of paying for enterprise subscriptions to cloud llm's then the local option needs to be good
Comment by wolvoleo 9 hours ago
Comment by greenavocado 16 hours ago
This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.
I don't have enough system RAM to properly handle the large context windows so I don't use local models.
# 1,257 tokens 17s 72.18 t/s
$env:CUDA_DEVICE_SCHEDULE = "SPIN"
cd D:\src\llama.cpp\
.\build\bin\Release\llama-server.exe `
--port 8080 `
--host 127.0.0.1 `
-m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
-fitt 2048 `
-c 98304 `
-n 32768 `
-fa on `
-np 1 `
--kv-unified `
-ctk q8_0 `
-ctv q8_0 `
-ctkd q8_0 `
-ctvd q8_0 `
-ctxcp 64 `
--mlock `
--no-warmup `
--spec-type draft-mtp `
--spec-draft-n-max 2 `
--spec-draft-p-min 0.1 `
--chat-template-kwargs '{\"preserve_thinking\": true}' `
--temp 0.6 `
--top-p 0.95 `
--top-k 20 `
--min-p 0.0 `
--presence-penalty 0.0 `
--repeat-penalty 1.0Comment by themanualstates 15 hours ago
Comment by halJordan 14 hours ago
Comment by greenavocado 13 hours ago
Comment by nateb2022 15 hours ago
Comment by boguscoder 3 hours ago
Comment by Terretta 15 hours ago
Also, funny lumping the M4 "and" the M5, I find them 15% to 45% different performance, depending.
And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 on single work at a time, outpaces both doing multiple work at a time.
Comment by ridiculous_leke 15 hours ago
Comment by mattmanser 15 hours ago
The Q4_K_XL bit for those not in the know.
Comment by stymaar 15 hours ago
Comment by embedding-shape 14 hours ago
Comment by c0rruptbytes 14 hours ago
Comment by greenavocado 15 hours ago
Comment by greenavocado 15 hours ago
Comment by stymaar 15 hours ago
Comment by greenavocado 15 hours ago
Comment by c0rruptbytes 14 hours ago
local models do involve some context engineering to get it okay, but it's not that rough
Comment by xlii 57 minutes ago
This really depends on how and what you're using. e.g. I can't suffer through slowness of inference on Macbook but I have gaming rig with quite powerful GPU and I squeeze ~130 t/s on Gemma or ~70t/s on Qwen.
Tuning is not optional as well. Qwen on temperatures > 0.5 is unusable for coding and I found sweet spot around 0.32 for coding. Speculative decoding on Gemma4 26B is a 30t/s difference between non-speculative.
The worst thing with local models is that I can't just give you a recipe, because what's the best params depends on your use case.
In the nutshell I'd compare local models to running game rig on Windows vs Linux. Linux works great if not better than Windows gaming, but you need to embrace some tweaking in order to get there. Is it there? It's not SOTA, that's for sure, but it's working reasonably well.
Comment by adam_arthur 16 hours ago
It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.
Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)
But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.
I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.
I agree that for coding/creation use cases, there's still not a compelling argument for local models.
But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.
Comment by dstryr 15 hours ago
Gemma will just stop mid-tool call. It's been slower and I've had to reduce context size to run it. Qwen3.6 27b has been rock solid using club 3090's single card setup for agentic use -- https://github.com/noonghunna/club-3090/blob/master/docs/SIN...
Comment by adam_arthur 15 hours ago
E.g. prompt A to achieve X, output in format Y. Use Y to do something in prompt B.
Agentic loops will underperform deterministic control flow pipelines (with non-determinism constrained to LLM calls).
Agents are more general, which is the main advantage. But inherently a more general solution will waste context on unnecessary reasoning.
Try asking the smaller Qwen models to output a JSON in a specific format. It basically can't do it consistently with a moderately sized prompt unless you constrain the token generation via GGML or are extremely repetitive and specific about it. (Thinking disabled)
Gemma 4 will do it correctly pretty much 100% of the time. (Thinking disabled)
Applies to other rule following as well in my experience.
Qwen may be better at toolcalling and certainly probably codegen.
It seems to me Google explicitly designed Gemma for edge device automation, and didn't fine tune for agentic or coding use cases.
Comment by trouve_search 15 hours ago
I'm really surprised how much slower a DGX spark is for the same price.
1. Here's my command.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \ --dtype auto \ --gpu-memory-utilization 0.95 \ --kv-cache-dtype fp8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --trust-remote-code \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --max-num-batched 16000 \ --max-model-len 64000 \ --max-num-seqs 12 --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'
Comment by adam_arthur 15 hours ago
You can run multiple instances of these models in parallel on the DGX Spark which somewhat mitigates the difference if your task is parallelizable.
But I'd take the simplicity of a single thread and higher throughput personally.
Overall of course still better to wait for next gen devices if you can.
Comment by diddid 10 hours ago
Comment by ozim 13 hours ago
I was expecting it would run Q8 in 50 tok/s.
I guess that’s good I stopped thinking about buying it because I would be disappointed.
Comment by girvo 10 hours ago
That said, I'm still running Step 3.7 Flash at ~40tk/s decode, 1000tk/s+ prefill on mine and its both very capable and fast enough
I got Gemma 31b to run on this at ~22tk/s decode at FP8 using MTP
Comment by gopher_space 15 hours ago
If you don’t need the machine to respond instantly (or explain your own business model to you) everything can be local and it’s been like that for a few years now.
Comment by msp26 15 hours ago
Comment by freehorse 13 hours ago
This is sadly also my experience. I wish we had some MoE models with a higher ratio of active parameters per total. My experience is that the newer MoE models that can run in a 64b laptop have too few active parameters to be useful outside narrower, specific tasks. Mixtral 8x7b was a 14b active parameter (56b total) MoE model a few years ago and was probably the best model one could run in that range for some time, but it is too old now.
I have been using the qwen 27b and it is great, but running a dense model like this in a macbook is a bit suboptimal, and i wish I could run sth faster than 15 tok/s.
Comment by c0rruptbytes 13 hours ago
I'm on a 48gb M5 Pro right now and it's been okay, a lot of my rough experiences have been with MLX and I'm finding that GGUFs are okay now
Comment by hnlmorg 14 hours ago
In fact it really feels like OpenAI models have taken a nose dive this week compared with Claude. At least for my specific workloads (these things are so variable it’s like trying to compare Google results…)
Comment by Stagnant 13 hours ago
Comment by amdivia 12 hours ago
The real lovely thing was getting 300+ Tok/s (Gemma 4 26B QAT + MTP at UD-Q4_K_XL) (at peak, I think I saw vram usage reach 21 GB of vram)
Comment by lmedinas 11 hours ago
Comment by andy_ppp 10 hours ago
Comment by wgd 10 hours ago
Comment by andy_ppp 9 hours ago
Comment by not_kurt_godel 9 hours ago
Comment by devilsdata 10 hours ago
Comment by pizzafeelsright 10 hours ago
and if remaining local, the hardware required to run multiple poor models could be better spent running better models.
I have attempted to orchestrate using different models, loading and unloading, but the speed is not there and by the time mistakes are discovered considering the lack of quick iteration the results become worthless unless the task is trivial.
Comment by heipei 16 hours ago
Comment by jstanley 16 hours ago
I don't care how many tokens per second of nonsense it can generate.
Comment by throwawayffffas 15 hours ago
Comment by notnullorvoid 15 hours ago
Comment by heipei 16 hours ago
It is probably not smart enough for "design this whole architecture of this complex system from scratch, make no mistakes", but that is not something I want from a coding tool anyway. I want a model that I can point to a file and tell it to make some changes to the file and related files. Or that I can ask to review a PR with regards to certain aspects.
My suggestion is to simply try it and see what it feels like.
Comment by lelanthran 14 hours ago
Well, you aren't going to give it a 20k line sec and have it churn out a full app after 4 hours hours.
But, you can get it to write code for you if you do the design.
Comment by myaccountonhn 16 hours ago
Comment by data-ottawa 16 hours ago
I find devstral (even though it’s weak generally) much better at writing and documentation than Opus. I’m actually now delegating all documentation to devstral and away from Claude, which makes a mess.
Comment by garciasn 16 hours ago
Comment by sgt101 13 hours ago
(geddit?)
Comment by CamperBob2 15 hours ago
The carpenter has to get up close and personal with the wood. He can't match the crew's throughput, but maybe that's not what he's trying to do.
Comment by c0rruptbytes 15 hours ago
you get a macbook for work, you run the macbook
they're not going to start giving GPUs to employees to run local models
Comment by FuriouslyAdrift 15 hours ago
Our GPU computer server cost $110k.
Comment by abalashov 3 hours ago
Comment by beadw 10 hours ago
Comment by peterlk 10 hours ago
Comment by ridiculous_leke 15 hours ago
Comment by smcleod 11 hours ago
Comment by EnPissant 2 hours ago
- Maximum intelligence per VRAM (you dont have much)
- Dense models can benefit from MTP to get an almost 2x speedup in decode (ie, a 27b dense model with mtp decodes at about the same speed as a MoE model with 14b active param model would). This is important because local llm rarely has parallel streams to batch together.
When running on large unified memory like Strix Halo or Spark Dgx, MoE models are usually best:
- You can get similar intelligence as a smaller dense model with fewer active params (to compensate for the slower memory) by throwing ram at the problem.
Comment by NamlchakKhandro 2 hours ago
If I can't customise it then I won't waste my time using it it getting use to it.
Claude code is trash, it's customisability is extremely shallow, open code, codex, copilot, Kiro, etc etc... all trash. Yes even open code..
If open code was so awesome then open claw would have been based on it... But it wasn't. That's should tell you everything you need to know.
Comment by locknitpicker 2 hours ago
You are somehow assuming cloud-based models are not painful.
I can tell you my past experience. I was using GPT 5.5 and Claude Opus interchangeably and I prompted them to implement a feature. I paid attention to the agent window and it was literally screwing up implementations, causing tests to fail, and going into test-fail-fix loops to clean up after itself. After a few minutes, it finally called it done. That run cost $0.60.
I went to review the code and only half of the source files complied with the instruction files. I prompted the model to clarify why it failed to comply with the instruction file. The model outputs "you are right, I should have complied with the instruction files. That prompt cost $0.30.
I prompted the model to proceed and apply the instruction file prompts. It went ahead and applied changes. Success. It cost $0.16.
I reviewed the code again. Only half of the sloppy code was touched up. I prompted it to fix the whole mess, not just a couple of files. It complied. One coin less in my purse.
So, around a third of the cost of a feature is spent on the model cleaning the mess it left in it's wake.
And this was a tiny feature with a plan, a solid set of instruction files.
Very expensive.
Are costs going down? I doubt so. OpenAI seems to still be spending 3 times it's revenue already.
In comparison, local models sound very good.
Comment by robomartin 13 hours ago
Laptop?
OK, I've made that mistake before. I understand modern laptops are powerful, but nobody wanting to do serious AI/ML work should be using a laptop for anything other than SSH or similar low-performance access into a proper system.
Years ago I fried two laptops just doing finite element analysis work running 18+ hours per day. It was one of those "I'm giving you all she's got, Captain!" workloads. They fried, even with powerful fans cooling them. I should have known better. Such workloads belong on purpose built systems.
Comment by atomicnumber3 14 hours ago
1) a "programming desktop" with a $500 upper mid range Ryzen (idr exact), 8GB VRAM Radeon card I bought solely for RuneScape, and 64GB ram
2) a maxed out Alienware 16 Area51, so it's a 5090 with 24GB vram and 64GB system ram. I bought it for gaming, of course.
I run qwen 3.6 35B A3B Q6 with 200k context window. I compare this to Claude pro max or whatever that I use at work.
The main difference between the machines is that the one with the RuneScape gpu does 10 TPS while the Alienware does 30-40tps. Both are fine though the 30-40tps is obviously a lot snappier.
I find with both models that:
- they do really well at "be a 30GB zip file of reddit and stackoverflow answers"
- they do really well at point fixing random bullshit errors that would otherwise waste my time (this is related to above of course)
- they do quite well at, given a pretty good specification of what you want, figuring it out, even if you've specified several steps needed
- they both cannot really be given a large ish task and left to just drive it on their own
The main difference between the two is with that last one, Claude is somewhat better and figuring SOMETHING out, but if Claude is having to figure it out, it's probably because I don't know what I want and it's very likely to not make a sane choice, and will generally produce slop given even the slightest amount of leash still.
I've also found that the boundary between "well specified small to medium thing" and "idk just do thing and figure it out" is the difference between you keeping control of the code and losing control. There's an "escape velocity" of AI use that, when you hit it, you're doomed to slop forever. (Or you have to deorbit... enjoy that). And while claude might have slightly higher velocity allowed while remaining suborbital, it's very diminishing returns.
So, are these models "worse" than Claude? Yeah. Am I looking forward to continued improvements? Yeah. But I now also have no desire to pay anthropic any amount of money, which has the nice side effect that i won't be helping them end up with so much money that they can distort our democracy.
Comment by everdrive 16 hours ago
Comment by throwawayffffas 15 hours ago
You generally want to run q8 or some kind of "6bit" quantization at least.
40GB of VRAM is the entry-point in my experience, you can run qwen 3.6 35b a3b with full context or qwen 27b with about 92k of context.
Before you get fully discouraged, you don't need 1 gpu with 40GBs you can use multiple cards, with minimum impact on performance.
Comment by zozbot234 15 hours ago
Comment by abalashov 15 hours ago
Comment by throwawayffffas 15 hours ago
Comment by abalashov 11 hours ago
Comment by throwawayffffas 18 minutes ago
Comment by ValdikSS 15 hours ago
Comment by trouve_search 15 hours ago
Comment by monegator 15 hours ago
Comment by greenavocado 16 hours ago
Comment by iwontberude 16 hours ago
Comment by dominotw 16 hours ago
i use it usecases like that latter and they are fine.
Comment by citizenpaul 13 hours ago
Comment by iwontberude 10 hours ago
Comment by DiabloD3 3 hours ago
Comment by iLoveOncall 1 hour ago
Comment by hypfer 17 hours ago
It is such a downgrade. I don't understand how that's even possible. The thing has so many strongly-held opinions I did not ever ask it for, talking just way too much and generally feeling somehow dumber.
Of course, being significantly larger, it will encode more knowledge, but that doesn't help me when I hate talking to it. And all that on top of the fact that talking with it costs real money.
I wonder what it might be that makes me hate it so much. Maybe because it doesn't see itself as a tool but almost an equal? As if its opinions would have weight.
Qwen too can act like an overeager intern, but if you tell it that it is an idiot, it will drop that ego. Not so much with Claude. In my experience, anyway.
Anyway, point is: full ack on that headline.
Comment by ggerganov 16 hours ago
[0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...
[1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...
Comment by trilogic 16 hours ago
Gerganov, hope you will consider developing further the CLI cause we suffering with the server.
Comment by jayGlow 15 hours ago
Comment by mft_ 12 hours ago
I’ve also tried OpenCode (similar but a bit less so) and Pi (fast but you have to add lots of features yourself which is a bit of a pain). Claude Code can also be pointed at a local model and works, but the default system prompt is huge. (~140k of text when I extracted mine, IIRC.)
Comment by vorticalbox 10 hours ago
Comment by trilogic 14 hours ago
About the harness depends on for what you need it, but basically for a universal unit of measure, Harness is multilayered and logic and domain specific dependent. I would definitely include Type of Hardware, Model parameters/knowledge, Model Intelligence, Model size/context, type of conversion, type and quantization (models comes with some default tools), but adding your (domain specific), skills, tools, memory, logs, security, Rag, Online search... (which as scary as they sound are mostly simple logics in a txt file, like if this do that).
The full pack is Harness 10, every missing thing lower the harness score.
To answer to your question I would definitely recommend smth like HugstonOne (or anyway llama.cpp CLI) with Qwen 3.6 35B finetuned/distill (deepseek 4 or claude 4.7) with none of the current coding agents out there that are screaming internet connection and proprietary API and data collection. DO this, if you can find a tool that you can download and choose a local model (of your choice in whatever folder locally) and load it ready for inference without any need of internet connection that is the tool you should aim for. Right now there is none out there.
Comment by kpw94 16 hours ago
Curious if you can share the prefill speed too?
I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models. I get around 150 tok/s prefill speed.
Reason I'm asking about the prefill is looking at my stats at work, I use between 20M to peaks of 300M input tokens daily. Some of those token are cached but in general, I seem to have roughly 500x more input tokens than output. So interested in prefill tok/s stats.
Huge Thank you for llama.cpp btw!!
Comment by ggerganov 15 hours ago
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB
| model | size | params | backend | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | -------- | --: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d512 | 3714.02 ± 10.85 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d1024 | 3684.86 ± 15.21 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d2048 | 3650.80 ± 8.53 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d8192 | 3473.88 ± 0.97 |
| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d32768 | 2754.69 ± 4.07 |
ggml_metal_device_init: GPU name: MTL0 (Apple M2 Ultra)
| model | size | params | backend | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | -------- | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d512 | 379.75 ± 0.21 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d1024 | 377.15 ± 0.35 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d2048 | 371.46 ± 0.91 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d8192 | 344.84 ± 0.41 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d32768 | 222.42 ± 5.29 |
Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions - basically things that are clear to me how to do, just want to automate them. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander at all in the codebase, so I rarely exceed the context window.Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0] as it allows me to iterate on the implementation much faster.
Comment by kpw94 15 hours ago
I do use it the same way as you're describing on personal projects at home, in a very crude manner (pasting code snippets in llama server web UI prompt. Next will attempt OpenCode)
At work I use it in similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineer you're delegating work to", where the agent will on its own do the implementation, commit the changes etc.
It does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills etc.
Comment by girvo 10 hours ago
This really is the secret to getting the most out of these models IMO. Pi is so damned good. I have a strongly tuned Pi for running Step 3.7 Flash (IQ4_XS) and Qwen 3.6 27B (FP8)
Also, thank you for llama.cpp mate :)
Comment by androiddrew 9 hours ago
Comment by celrod 16 hours ago
Comment by ggerganov 16 hours ago
[0] https://huggingface.co/ggerganov/presets/blob/main/preset.in...
Comment by girvo 10 hours ago
Comment by toddmorey 14 hours ago
Comment by fridder 16 hours ago
Comment by StevenWaterman 17 hours ago
Plus, I never have to worry about rate limits, quotas, or sitting in a queue during peak time. And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.
Running on 2x 3090, 500-1000tok/s prefill and 60tok/s output at Q6_K_XL with MTP on llama.cpp, 220k tokens context window (starts to get a bit dumb above 160k ish), no KV quantization
Comment by indoordin0saur 17 hours ago
For this reason I wonder if local models are a potential business opportunity. Provide the service to engineering teams to give them a pre-built and setup GPU rig they can run in a closet. No need to worry about all the things you mentioned and clients can rest-assured their data isn't disappearing into a sketchy data center. There might be regulatory reasons that make on-prem setups appealing as well.
Comment by amoshebb 17 hours ago
Comment by suncemoje 16 hours ago
Comment by indoordin0saur 16 hours ago
Comment by lelanthran 13 hours ago
If this mattered to them, they wouldn't be running so much in the cloud or in proprietary software that they have no ability to air-gap.
If companies ever cared about this, Windows would not be dominant on the desktop.
Comment by indoordin0saur 12 hours ago
As to why Windows is so dominant, I'm as clueless as you.
Comment by suncemoje 2 hours ago
I wonder if that's because they don't know better or because of a lack of trust or costs?
Comment by suncemoje 16 hours ago
Comment by cyanydeez 16 hours ago
Comment by hughw 16 hours ago
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_CONTEXT_LENGTH=180000
and that fits in 23GB.[edited for format]
Comment by giancarlostoro 17 hours ago
If open models can ever hold roughly 600k token windows, I'll be really excited, I found that around 300 ~ 400k of Claude reading through your codebase results in better outputs. I also have Claude read official docs instead of "guessing" as to how to do something.
Comment by StevenWaterman 17 hours ago
I think deepseek v4 pro has 1m context and does pretty well up to around 600k. But if you have the hardware to run that locally, you already know
Even then if there's a smaller model with 1M context, you'll need a ton of RAM to actually run it at full 1M. I guess that's why you don't see it too much. Anyone that could run Qwen 3.6 27B with 1m context would be better off running a much bigger model with smaller context instead, in the same amount of VRAM.
In terms of optimizing further, huge context + KV quantization sounds like a terrible idea, but there's some decent innovation in sparse attention, KV cache rotation allowing Q8 to perform nearly as well as full 16-bit precision, plus some ideas around offloading KV cache to system RAM (but I'm skeptical)
Comment by zozbot234 16 hours ago
Comment by 0xc133 16 hours ago
Comment by cyanydeez 16 hours ago
I think the way these models work excludes sane behaviors the larger the context gets as each token introduces potential ambiguities between "USER" and "SYSTEM" messages leading to all the catastrophic behaviors.
Anyway, with AMD395+ I'm finding ~100k is both speed and context usefulness unless it's scoped tightly. with opencode, I manage it with dynamic context pruning: https://github.com/Opencode-DCP/opencode-dynamic-context-pru... ; then anything I touch ends up being refactored so context doesn't get bloated with unecessary functions, etc.
Obviously, this isn't compatible with certain business codebases, so I can see why bloat meets bloat.
Comment by QuantumNoodle 16 hours ago
Comment by iamtheworstdev 15 hours ago
Comment by fluoridation 13 hours ago
AFAIUI, there'd be little advantage in having a higher speed inter-card connection, because the cards don't really talk to each other during inference. The loss of efficiency compared to a monolithic memory architecture comes from scheduling, not from data transfer.
Comment by Andrex 15 hours ago
Comment by epistasis 16 hours ago
OMG this is such an annoying property, just shut the hell up please, and be concise.
I suspect that this is an artifact of the thinking property, but please just summarize the thinking process far more concisely, where a single sentence answer is more than sufficient the frontier models seem devoted to going on to a minimum of 5 paragraphs and offering 3-5 new directions.
And requests to please only offer a single step at once, or single option at once, or to even stop eagerly offering future directions is really hard to prompt correctly.
And look, there I did exactly what I was complaining about...
Comment by bityard 16 hours ago
For example, the Claude web UI has an Instructions field where I have told it never to congratulate or praise me for asking questions. Earlier Copilot models used a ridiculous number of emoji and bullet lists when answering literally every prompt, I told it to knock that off and prefer detailed paragraphs in prose.
Local agents/frameworks/whatever all have their equivalents for overall user preferences.
Comment by epistasis 15 hours ago
Asking Claude for this provides incorrect instructions for me, so I'm guessing it moves around a lot.
Comment by illegalsmile 16 hours ago
Comment by epistasis 16 hours ago
Edit: also, how can I stop the LLM from all this fake glazing, as if every question I have is some sort of unique genius insight, it's so damn annoying. I just got the third straight round of this while merely trying to get summarization of a PDF:
> Good question — it gets right at a real tension in the paper. Let me check the current state of actual SV-imputation efforts, since this has moved since 2020.
Comment by bornfreddy 15 hours ago
Comment by frereubu 12 hours ago
Comment by derethanhausen 17 hours ago
Comment by hypfer 17 hours ago
Hmm. I think I might just fundamentally disagree with Anthropic about the idea of what a "tool" should be.
Comment by hypfer 2 hours ago
> But re-reading the comment:
> "In the real world however, it does not. Hence, in the future we might fail this check even if it works within this limited check."
> The comment says "we might fail this check even if it works" — implying the original intent was to always fail (return 1) as a conservative stance, leaving room to tighten it later. The new code does the opposite of what the comment describes.
> So either:
> 1) The comment needs updating to reflect the new intent, or
> 2) The logic is wrong and should still return 1 unconditionally if the original intent was preserved.
> Which is it — did you intentionally want to succeed when X works, making the comment outdated? Or should the return 1 be restored?
> User: did you see the "in the future"?
> Yes, fair point — "in the future we might fail this check" is forward-looking, not a statement of current intent. The comment is saying the door is left open to tightening the check later, not that it should currently always fail. So the current code is correct and the comment still holds. All good.
___
It reads/comprehends 5% of what is on screen but acts like it knows it all, bossing you around. Why is it like this. What's wrong with Anthropic and what's wrong with people _not_ getting incredibly angry at that unearned smartass.
Comment by KaoruAoiShiho 13 hours ago
Comment by kitd 17 hours ago
Comment by otabdeveloper4 12 hours ago
(In fact, "that colleague" might have even been the source of the RLHF training set.)
Comment by linuxhansl 7 hours ago
Comment by hypfer 2 hours ago
They are a lot faster, yes, but at least for my needs, they feel a lot dumber too.
It's a shame that the gemma4 that _would_ fit 24GB VRAM is MoE. :(
Comment by radium3d 17 hours ago
This is likely due to a combination of mass funding for the AI companies, but also they are trying to governmentally restrict which countries get access to these cards so certain countries get a head start. The only way to lock that down is to have them literally locked in their own GPU prisons (data centers). Third reason is it does make it possible to train the models faster by having them in the same data center connected directly. Having them distributed to everyone would slow down training considerably.
The current way to 'own' decent RAM and GPUs right now is through the stock market it seems.
Comment by giancarlostoro 17 hours ago
Comment by StevenWaterman 17 hours ago
Comment by giancarlostoro 16 hours ago
Comment by whythismatters 17 hours ago
Comment by giancarlostoro 16 hours ago
Comment by MostlyStable 17 hours ago
Comment by clickety_clack 17 hours ago
Comment by MostlyStable 16 hours ago
Comment by Scoundreller 17 hours ago
I’d think the volume for that category would be low but LLMs aren’t just for coding.
Comment by dghlsakjg 16 hours ago
Sure I could splash out a ton of money for a high ram Mac, but deepseek is so dirt cheap that I think depreciation on a high end machine costs more than my api spend.
Example of what I’m using it for: building a semantic database of podcast content (podcast discoverability sucks on an episode level). I need a cheap LLM, an embedder, a transcriber, none of which Claude will do.
My api costs for coding agents plus running apps are about ~$20/month, but I get more than just chat + Claude code.
If all I was doing was pumping an employers codebase through a coding agent, Claude would be the answer.
Comment by chrisweekly 17 hours ago
Comment by clickety_clack 16 hours ago
Comment by andix 15 hours ago
Comment by dackdel 16 hours ago
Comment by giancarlostoro 16 hours ago
My Mac only has 16GB of VRAM (20GB total - 8 is reserved for the OS) so I have to leave room for VRAM, I usually find a model that fits in 5 to 7 GB of VRAM and then max the context window as much as I can.
Comment by daemonologist 13 hours ago
Comment by pixelesque 14 hours ago
sudo sysctl iogpu.wired_limit_mb=18800
will allow you to use more, but you do need to leave a bit for the OS obviously!
Comment by giancarlostoro 14 hours ago
Comment by pixelesque 13 hours ago
Comment by dackdel 4 hours ago
Comment by indoordin0saur 17 hours ago
Comment by hypfer 17 hours ago
Works perfectly fine in llama.cpp throwing 70+t/s at me with 128k q8 K/V context when using the IQ4_NL quant + MTP at q4 MTP K/V.
Also leaving this here because you might find it useful: https://hypfer.github.io/will-it-fit-llama-cpp/
Comment by indoordin0saur 17 hours ago
Comment by all2 16 hours ago
Comment by dghlsakjg 16 hours ago
Comment by hypfer 16 hours ago
Crypto (to my knowledge at least) moved away from GPU mining. I guess you could maybe rent out GPU compute, but - being in germany - it's not worth the legal hassle. You could of course always commit tax fraud, though I wouldn't recommend that.
Comment by esseph 16 hours ago
Massive legal liability. Not worth it.
Comment by Rzor 10 hours ago
edit: nvm, I'm confusing models.
Comment by cdelsolar 17 hours ago
Comment by zerd 16 hours ago
[0] https://twelvetables.blog/comparing-claude-fable-5s-system-p...
Comment by chrisweekly 17 hours ago
Comment by ltononro 16 hours ago
Comment by dyauspitr 14 hours ago
Comment by calebm 15 hours ago
Comment by cmrdporcupine 15 hours ago
FWIW Codex/GPT models are way less this way. Maybe to a fault.
I'm setting up my DGX Spark to try Qwen 3.6 27B again, as I'm hearing a lot of good reviews. When I tried it some time ago it was still early for support in llama.cpp.
Comment by rmunn 17 hours ago
Comment by sathackr 17 hours ago
It won't happen with AI models either.
It's almost ingrained in the American business model now. Outsource everything. Nobody wants to manage a room full of servers when they can spend 2-3x as much and outsource that headache along with the responsibility for it.
Same will happen with AI. Whether that means paying Anthropic that premium or paying AWS.
I'm in a relatively small business, we recently had an outage related to our local infrastructure.
I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.
Everyone wants to shuck the chore and the responsibility.
Comment by preommr 16 hours ago
AI is different.
Cloud computing genuinely is cheaper on average. It's better than paying for cisco servers, and at scale, it's cheaper than managed platforms (ala Heroku), and it's a coin toss for when you're in the middle ground and constantly approaching the point of rebuilding poor-man versions of existing products but with very very expensive engineering salaries.
In contrast, local models offer dramatic savings, and are magnitude of orders better in certain aspects: like stability - the performance is all over the place with traditional AI companies as they divert compute to their next big thing.
The benefits to maintaining your own infrastructure are pretty moderate to low, with very high risk.
And also, alternate models are pretty easy to use and easy to swap out unlike the vendor lock-in that exists with cloud services.
Comment by codethief 14 hours ago
I agree. The other thing here is that, once you can run LLMs on a single piece of commodity hardware (whether that includes one GPU or several), the difference between cloud vs. on-premise LLMs will largely be about where your hardware is located. There will be very little software configuration involved (just an HTTP endpoint that talks to the GPU). This is decidedly different from cloud products where the moat of hyperscalers is largely in the software and services on top of the hardware, not the hardware itself. (Sure, GPUs will eventually break & need replacement, too, but there's no state to lose, so that's already orders of magnitude easier than replacing hard drives.)
Comment by rmunn 4 hours ago
Comment by 15155 13 hours ago
For some applications, sure. Availability is a large part of what one is paying for with cloud computing, but it's also something that not every business needs.
If you sacrifice availability and have a pure-compute use case (low durability requirements), on-prem can quickly end up cheaper for far better hardware.
Comment by richardwhiuk 15 hours ago
Comment by moregrist 11 hours ago
Comment by mcmoor 6 hours ago
Comment by RevEng 7 hours ago
Comment by spockz 14 hours ago
Comment by TkTech 16 hours ago
Same reason people pay for things through the AWS marketplace (like Vanta) instead of having to go through their invoicing process.
Comment by codethief 14 hours ago
Comment by mohamedkoubaa 10 hours ago
Comment by otabdeveloper4 11 hours ago
Renting a GPU server from a cloud and hosting your own llama.cpp is the path of least resistance.
Comment by wraptile 2 hours ago
AI is definitely different. Cloud compute is incredibly convenient to the point where even if AWS is more expensive it's just so _nice_. LLM models are much more abstract and while I can't easily swap AWS for Hetzner to save 80% of my costs I can absolutely get close to that for many of LLM tasks, even today.
I suspect Anthropic and gang all know that that's why they are buying up dev tools and shifting towards long-running agents because that's where they can get AWS's "nicesness" that they can charge for.
Comment by dreambuffer 17 hours ago
But AI is just weights, you can run a reasonably intelligent model at home, or on a few GPUs if you're a small-medium sized company, and it doesn't require dedicated maintenance.
Comment by pessimizer 14 hours ago
Comment by frobisher 1 hour ago
Do these apply to AI?
Comment by cheema33 17 hours ago
Same here. My job as a software dev does not require me to self-host services we need and use. Quite the opposite. But, I am reluctant to hand over all control to AWS or equivalent for several reasons that I will get into here.
I have found that Infrastructure as Code (IaC) and modern tools like opentofu, ansible, combined with frontier AI models and harnesses gives you superpowers in this space. Almost all of our self-hosted services are fully managed by these tools. e.g. We perform backups and test them more often now than we ever did before. Entirely because it is so much easier to do all of that now.
Comment by rapidfl 7 hours ago
And once the servers are in space, everything is fully out there.
Comment by Terr_ 14 hours ago
1. Individual dev machines
2. Shared local server
3. Shared server in corporate cloud
4. Third-party LLM SaaS provider
Even if you don't want your laptop melting, there are still some important differences between 3 and 4 in terms of data privacy and security.Comment by chris_money202 8 hours ago
Comment by matheusmoreira 9 hours ago
Which gives all the power to the big techs. I'll never understand why the average company seems to have no problem with this.
Comment by keeda 8 hours ago
I can see how it makes sense for companies, because money is "only money" but an ongoing operational distraction can be much more costly, as in, it can be detrimental to the success of the overall business.
Comment by akoboldfrying 9 hours ago
There's a reason most people pay other people to do these things for them.
Comment by derfurth 17 hours ago
Comment by sathackr 16 hours ago
The OS needs updates, file systems get corrupted.
Fans get dirty.
All the things that you need to deal with in hosting your own server infrastructure you have to deal with when hosting your own AI infrastructure (which runs on servers...)
Comment by ajb 15 hours ago
A lot of the reason people outsource normal software is its brittle security properties, not sure that even applies to an LLM - it can go and look up the latest security best practices just like an engineer can.
Comment by davidw 16 hours ago
Comment by otabdeveloper4 11 hours ago
AI company valuations won't survive if they're only for the "American business model".
Comment by mohamedkoubaa 10 hours ago
Comment by CamperBob2 15 hours ago
You know what gives me headaches? When I'm in the middle of a session and the model gets rug-pulled out from under me because somebody at the model provider didn't pay the Trump bill that month.
Or when someone at the model provider decides that the curve-fitting algorithm in my graphics package looks a little too much like Skynet for comfort.
Or when they do any number of other things to undermine my work for the sake of their business model, some of which I won't even notice until the damage is done.
The sad thing is, if you know how inference works, you know that it really is insanely wasteful for everybody to run it locally. If anything naturally belongs in the cloud, it's inference. But at the same time, what choice are we being given?
Comment by mohamedkoubaa 10 hours ago
Comment by CamperBob2 9 hours ago
for t in tokens_in_context
for p in model_weights
do something with p*t
The expensive part is fetching each weight from memory, which is why VRAM/HBM is such a big deal. Conceptually, for a huge, dense (non-MoE) model, the inner loop might run a trillion times for every token generated.Obviously that's not how it really works in practice, but the point is, if you are only running one prompt at a time, each weight gets fetched, applied to the token being processed, and then never touched again until the next token is processed.
So when you submit a prompt to a model that's running a bunch of other peoples' contexts concurrently, it can reuse each weight multiple times before moving on to the next one:
for p in model_weights
for u in users
for t in u's context
do something with p*t
The same is true in an agent-heavy scenario where you have several contexts in play at once.Worst case, in terms of energy efficiency, is a single user sitting around waiting for a single response. I don't feel like I'm explaining it well, but the core idea is that every time a weight is fetched from memory, you want to get as much work done as possible with it.
Comment by mohamedkoubaa 9 hours ago
Comment by starshadowx2 10 hours ago
Comment by aeonfox 5 hours ago
Nothing stopping turnkey OSS AI hardware being productised, including niceties like opt-in automated updates. If the trend continues of models becoming smaller and more capable for everyday use, it also derisks against obsolescence.
Comment by rapidfl 7 hours ago
Everybody owns a car, washer, TV, etc today. Maybe one could finance a server-box/trailer costing $20k, trade it in every 7 years for a newer model, etc. Many people are going to own a $20k Optimus.
Comment by fragmede 6 hours ago
Comment by jhonof 5 hours ago
Comment by indoordin0saur 17 hours ago
Comment by CamperBob2 15 hours ago
I think that's basically Geohot's business model at Tiny Corp.
Comment by storus 16 hours ago
Comment by nodja 12 hours ago
Comment by wuliwong 17 hours ago
If things change to token usage billing for everyone, maybe I'll be singing a different tune but on a subscription, I don't think it makes sense financially.
Fun? Yes. Financially sound? No.
Comment by mohamedkoubaa 10 hours ago
Comment by bityard 16 hours ago
What's interesting/exciting is that local models are _already_ quite good at tasks we never imagined AI _ever_ doing before ChatGPT hit the scene just a few short years ago.
We're also in an interesting point in time where companies are releasing the fruits of their research/labor (the LLMs) to the general public for free. For now, I think they see it in their best interest to gain mindshare and rapport, as well as advancing the state of the art in smaller LLMs ("a rising tide lifts all boats") but I fear and expect that these will dry up as the major players buy the minor players, and all will seek a return on their considerable investments in AI research.
Comment by cogman10 16 hours ago
Comment by regularfry 15 hours ago
Comment by cogman10 15 hours ago
That's what I mean by diminishing returns.
Comment by spockz 14 hours ago
We have set up something where you create a ticket, Make sure it contains enough information, and with the right tag added it will make a branch with PR for you which stays up to date based on updates to the ticket and comments on the PR.
It’s creepy in a way. But you also can’t really use local (as in workstation LLM) for that. Sure we could run something like a distributed task scheduler across all our engineer devices but just pushing it to copilot is easier.
Comment by mohamedkoubaa 10 hours ago
Comment by rimliu 2 hours ago
Comment by icoder 17 hours ago
Comment by bluGill 16 hours ago
Accountants are reasonably good at figuring this out - there are a lot of different things that need a large upfront investment before you can charge anything. People still debate if they are correct in this each case.
Comment by esailija 17 hours ago
Comment by 15155 13 hours ago
Comment by themaninthedark 17 hours ago
Comment by otterdude 17 hours ago
Comment by frollogaston 14 hours ago
Comment by fmap 2 hours ago
Imo the more interesting thing to focus on is that there are now several more labs with the expertise and capabilities to train trillion parameter models. That's a serious technical accomplishment and the main reason why open models are catching up to Anthropic and OpenAI (and local models are typically distillations of much larger models).
Who cares that they got some small amount of training data out of Claude. The crux is that the big US labs are not special, they just have a first mover advantage that's slowly shrinking as incremental progress becomes harder.
Comment by ActorNightly 13 hours ago
Comment by pessimizer 14 hours ago
And those are going to all be big enterprise companies that probably will set up LLM services entirely in-house, because they've got the headcount to utilize servers at 100%.
I wonder if there will be (or is currently) business in selling their compute while they're not working, to opposite time zones, etc.
What's left for the big providers will be the dregs of individual subscriptions and small businesses that at their least paranoid might let employees just use their own subscriptions for work.
Comment by sbmthakur 16 hours ago
Comment by pornel 12 hours ago
In a way, it's absolutely amazing that we've went from "Playing 'Set a Timer' on Apple Music" intelligence to something that may pass the Turing Test, but in practical terms the small models are still far from what I'd call "good" for more than a tech demo.
To me, 7B models are just a fuzzy echo of Wikipedia. Gemma models at 4 bit are too clumsy to even reliably generate JSON for tool calls or copy a line of code to apply a patch.
Qwen needs so much detail and babysitting to stop it from doom looping or losing the plot, that the instructions that I need to give are usually longer than the code I end up keeping.
Is there some magic prompt that I don't know? Do other people just have a lot more patience, or way lower expectations?
Comment by papersail 12 hours ago
Comment by cheschire 7 hours ago
We aren’t wealthy enough to have the hardware that would make this good.
The people who have the money to buy a spare maxed out Mac mini just don’t get it. I see lots of folks with RTX 6000’s in threads like these. Or any RTX card that ends in “90”.
Cloud AI is what allows the proles to participate in the broader AI conversation, but not these AI conversations.
Comment by monegator 52 minutes ago
Google (of all companies!) demonstrated you can get useful stuff with reasonable performance with model running local on their smartphones.
Depending on your expecations you can get the local models running on a recent enough laptop, you just need 16GB of ram to be comfortable. It certainly exceeded my expectations (but i don't use the LLM to write code, only to do the real boring stuff: docs.)
Comment by verdverm 10 hours ago
qwen/gemma in the 27/35B range @fp8 are better than gemini-2.5, but less than gemini-3.1, you can run DS4-flash @fp8 on two DGX spark, and things keep becoming better. DiffusionGemma came out recently with 4x token gen speeds.
tl;dr - the models you appear to be trying with are too small or too quant'd
Comment by embedding-shape 17 hours ago
But, for smaller more well-defined workflows, or as straight "edit this part to be like this exact" edits, they seem more than enough. Still waiting for them to become mature enough to be able to replace what we have as SOTA today, I'd say it's ready to be switched over then.
Speaking of local models, DiffusionGemma (and diffusion models in general) should not be slept on for local usage! Usually the problem locally is that the LLMs aren't efficiently making use of your hardware, unless you start batching requests and run many at the same time, but that require different approaches in general. Instead, diffusion models work much faster for individual prompts, and not by a small margin either.
Today I finally finished porting diffusiongemma-26B-A4B-it support from Transformers into Candle, and together with some optimizations I now have it basically flying with ~450 tok/s (~19 it/s) in Candle during inference, instead of ~180 tok/s (~11 it/s) from HF's Transformers library. Even using vLLM with similar sized LLMs, I don't think I've ever gotten past the ~250 tok/s threshold for single prompts, exciting stuff for local models :)
Comment by zozbot234 17 hours ago
Diffusion models can't really be trained beyond low-to-mid size and have lower quality than an equally sized, plain one-token-at-a-time model.
Comment by embedding-shape 16 hours ago
Comment by zozbot234 16 hours ago
Comment by embedding-shape 16 hours ago
I think people used to say the same about the 8B text-diffusion models too when they came out, like LLaDA. LLaDA2.0 seemingly claims 100B total / 6.1B active MoE diffusion (DiffusionGemma is also MoE). Not saying you're wrong about the current consensus, but it has a way of changing over time, might be a bit early to claim it's infeasible to scale them, especially considering the final artifact being much more suitable for local usage.
Comment by famouswaffles 13 hours ago
- consistently proven behind their auto-regressive counterparts in quality. Look at the dgemma benchmarks - pretty steep dropoffs and the more difficult the benchmark the worse the dropoff. That's not a good look and it's not like its some artifact of google's release. Every dllm is like this.
- And whose inference benefits are negated at scale. Transformers are still cheaper if you want to serve lots of users.
>"DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"
Put yourself in the shoes of all the labs, even open source ones. Why would you put much effort into this ?
Comment by embedding-shape 13 hours ago
But my entire point is about the reverse of this, the context of what I bring up is in single-user scenarios, which is where these diffusion models really make a large difference in performance.
Sure, I agree it's not a good fit for every single use case out there, everywhere. But after starting to play around with it closer myself, I think people are dismissing it a bit too quickly, at least if you're interested in running local models on your own hardware.
Comment by famouswaffles 12 hours ago
Since training models is currently a very expensive procedure, diffusion llms are destined to be relegated to the occasional research artifact at best. As things stand, making a serious commitment to them is basically the equivalent of throwing money into a fire pit and things are expensive enough as is.
Alternate Architectures that do a much better job matching transformers in quality have basically gone nowhere but you expect one that is basically worse in every way the labs care about won't ? I'm not trying to 'dismiss' dllms. I'm interested in them for the same reason you are. I'm just stating the factors at play plainly.
Comment by zozbot234 13 hours ago
Comment by iagooar 16 hours ago
The 27B is the smarter, more reliable one - but it is slower. The 35B is faster, still very smart but below 27B, a bit less reliable. The reason is the MoE - Mixture of Experts architecture, which only activates a subset of parameters, making the model much much faster.
I run the 27B on a MacBook Pro M5 Max + 40 GPU cores + 128GB RAM (well, on this beast I can have 27B + 35B in memory at the same time with headroom for all the other stuff). But because this is a laptop, it is not possible to run local LLMs all the time - it just gets too hot and too loud.
What excites me more: I run the 35B model on a MacMini M4 with 64GB RAM. It is fast, it gets a lot of work done (e.g. it scans, extracts and classifies my emails, it watches the mailbox all the time and does work). I also use it as my private Hermes assistant ("when is the next Starship launch?", "who is playing today at the World Cup? Give me some trivia").
Next step I am planning is a RTX Pro 6000 Blackwell workstation I can put in my basement. I want to run qwen really fast, with multiple threads / prompts / agents at once. And MAYBE if the budget allows, a 2x RTX Pro 6000 setup in order to run DeepSeek v4 flash on it (to run research on it).
Comment by jnaina 2 hours ago
Comment by Barbing 16 hours ago
Comment by nickthegreek 15 hours ago
Comment by dghlsakjg 16 hours ago
Comment by iagooar 15 hours ago
Comment by verdverm 10 hours ago
They all give slightly different results, you can dedup / fusion with heuristics / another agent
Comment by zerd 15 hours ago
Comment by iagooar 15 hours ago
Also, it will just be faster - and more fun too.
Comment by girvo 10 hours ago
But also because it's likely to hold most of its value into the next few years, based on the looks of things, too
Comment by sieste 51 minutes ago
Comment by k__ 11 minutes ago
In theory, other countries should be able to replicate that effort and improve it.
Comment by angry_octet 10 hours ago
Most other trades need to invest significantly in tools. If you want good tooling, you really want 64GB of GPU memory (e.g. 2x 5090) and 96GB of RAM. If I'm paying $200k for an expert engineer then $50k every other year for tooling seems pretty reasonable.
Comment by rsanek 9 hours ago
Comment by fragmede 7 hours ago
It would've been easy to spend $5k on Fable in the short week it was available. If that's the direction things are going (we can assume GPT-6 to be if similar class) $5k's not going to get you "best frontier models at effectively unlimited usage".
Comment by sosodev 17 hours ago
Comment by kristopolous 8 hours ago
I posted this yesterday https://github.com/day50-dev/petsitter
I use it with https://github.com/day50-dev/simple-llm-cli
And modify the "tricks" until my evals get to good numbers. It's a model by model basis.
This is what the larger firms are doing - they have custom prompts per model
Comment by sosodev 5 hours ago
Comment by kristopolous 4 hours ago
I haven't include more sophisticated ones because they are complicated and I wanted to avoid the friction
Comment by schmuhblaster 8 hours ago
It is quite astonishing to see how far local models have progressed, and I think that if you enjoy tinkering a bit, you can save a good bit of money (if you happen to have the hardware lying around anyways). Overall it’s still hard to beat the the cost/convenience combination of a cloud based model provider though.
[0] https://deepclause.substack.com/p/how-to-make-small-models-p...
Comment by phunterlau 1 hour ago
Comment by edg5000 4 hours ago
Comment by chrismarlow9 17 hours ago
Edit: Obviously you'll be using more tokens but this is the trade off for running a smaller model and running locally. Similar to time memory trade off but in token economics. Sorry I need more coffee
Comment by segmondy 16 hours ago
If you're resourceful, you can even run SOTA models. KimiK2.7, MiMo-V2.5/V2.5-Pro, MiniMax2.5/2.7/3, DeepSeekV3.1/v3.2/V4-Flash/V4Pro, GLM5.1, Step3.7-Flash, Qwen3.5-397B, Qwen3.5-122B, gpt-oss-120B
Comment by agile-gift0262 1 hour ago
do you find Qwen3.5-122B to be SOTA-level? I moved from it to Qwen3.6-27B (both Q8), and I prefer 3.6-27B, and it leaves me room to spare for other small models
Comment by 0xc0c0c0 17 hours ago
You can be far more ambiguous with your tasks with the larger proprietary models as opposed to the local models. You can achieve the similar results with local models but you need to be much more detailed in your prompt.
One of the biggest things about running these local models is that the harness matters almost just as much as the model too. Codex is optimized for GPT models, CC is optimized for Claude, Cursor has a great harness that works very well across these providers. It took me a couple of iterations of the different harnesses to find one that would work well with the smaller Qwen models to do local coding.
Comment by failbuffer 16 hours ago
Comment by ngxson 16 hours ago
The cloud-based models are fine for big and complex tasks, but the pricing is ridiculous for small stuff—like summarizing a discussion or fixing a small bug. And cloud and privacy have never been a good match.
As an example, this comment itself was written with the help of Qwen3.5-4B running locally with an extension on top of llama.cpp default web UI [1]. The extension injects my browser's context directly into the conversation, which allows me to summarize things and draft up comments quickly. Speed is pretty acceptable for the size: ~5s TTFT and ~100 t/s generation, all running on a Macbook M5.
And when I want to run bigger tasks, I don't just stick to one provider. Apart from well-known closed-weight providers like OpenAI or Anthropic, I also experiment with open-weight models like GLM-5.1, DeepSeek V4, and Qwen3.6-27B, which provide quite good results for the price.
I'd argue both have value, and I don't see why anyone needs to choose one exclusively. Anyone else doing this?
Comment by phainopepla2 16 hours ago
Comment by ngxson 16 hours ago
Also, a lot of my day-to-day tasks perform the same on both small and bigger models: summarize a web page, draft a response, translations, quick web search, etc.
Comment by phainopepla2 15 hours ago
I'm assuming privacy is not a concern since you mentioned using Deepseek already. The cost of V4 Flash for small tasks is so minuscule as to be almost free, and you don't have to deal with a churning laptop (or even buying a high-end laptop, for someone who doesn't already have one).
I guess what I'm really asking is, what's the advantage of using these small local models if privacy isn't a concern?
Comment by ngxson 15 hours ago
Depending on use cases, but for me I found 2 use cases where a local model is a must and not optional:
- Running offline without internet access: for example, I have this project that allow transcribe and summarize audio in real time. I already used it in some events where wifi is not available: https://github.com/ngxson/llama.cpp-realtime-audio-recap
- Handle private personal data, for example health records. This is the same category of "privacy" that you mentioned, but I just want to bring up the fact that people value their privacy differently.
Comment by coder543 12 hours ago
Huggingface's little parameter count badge seems unreliable.
Comment by delis-thumbs-7e 12 hours ago
There’s a lot of things we could use even quite small models for, which would not need an insane amount of computing power and memory, but too few of us is really researching them.
Comment by Tharre 16 hours ago
But for coding in a harness? In my experience it's unusable even for small projects. It just gets hard stuck at every little problem, wasting hundreds of thousands of tokens trying to make a convoluted solution work instead of doing the obvious thing. Or it will spend hours trying to reason through a fairly simple code flow, incrementally adding debug print statements, only to get confused by the output and then editing completely unrelated code that it convinced itself is the problem.
I've tried instead giving Sonnet the problem description and code and have it come up with a detailed plan that Qwen should implement, but doing that actually consumes a significant amount of tokens compared to just telling it to implement everything, and the results are honestly not that much better. There are just too often subtle issues with the plan that Qwen doesn't recognize when implementing, but make the resulting solution it comes up with unusable.
Comment by androiddrew 9 hours ago
ROCm stack is not for people though who aren’t willing to dig in and patch things themselves.
Comment by jnaina 2 hours ago
Plenty fast for coding work and for sharing with my OpenClaw setup.
Currently in the process of adding another external GPU (RTX 4090 with pipeline parallelism) via thunderbolt 5 to the Olares One box, for higher quantization, possibly 8-bit, larger context, better concurrency, more kv cache.
Comment by _doctor_love 17 hours ago
LOL - some of us have a budget
Comment by swatcoder 17 hours ago
If you're a professional that's confident in a positive return on the investment (optimal or not), or just a hobbyist with the luxury budget for a "shop" that cost is well within norms.
That's not everybody, of course, but it's not some inconceivable fantasy. A lot of people in the tech community here on HN, specifically, end up with pretty high discretionary budgets that they pour into stuff like this.
Comment by frollogaston 14 hours ago
Comment by swatcoder 13 hours ago
Most hobbyists and many professionals could end up far ahead financially by leveraging makerspaces, tool rentals, and co-op shops or even by hiring out a professional to prep certain intermediates for them, but they get psychological value -- as well as flexibility, reliability, and resale opportunity -- from having their own well-outfitted shop.
And they can afford that premium, so they do. At the scale of individuals and small shops, not everything that matters gets captured in financial models.
Comment by frollogaston 13 hours ago
Aside, physical tools tend to be financially advantageous to own if you're going to use them a lot. Even if the owner were targeting 0 profit, they'd have to charge more to factor in the cost of dealing with customers and increased risk of wear/damage by users who don't care as much.
Comment by swatcoder 13 hours ago
Most come with huge privacy concerns, total costs and availability are impossible to forecast very far out, and the specific behavior of frontier models in particular is not something anybody can rely on as those are subscription products that are subject behavior on their publisher's whims (whether from changing system prompts, new "safeguards", retired models, forced "updates", new regulations, etc).
It's quite hard to put a price on all that, and as more people find local models productive enough or develop curiosity to explore models, training, or harness-crafting in their own ways, the marginal cost of buying some shop hardware just sort of disappears into the budget noise for plenty enough people.
Comment by Gigachad 8 hours ago
Comment by amalcon 17 hours ago
Comment by AbsurdCensor 17 hours ago
Comment by amalcon 17 hours ago
Still cheaper than a new Mac. Maybe not cheaper than a used one.
Comment by AbsurdCensor 14 hours ago
Comment by tjwebbnorfolk 17 hours ago
Comment by techscruggs 17 hours ago
Comment by Shekelphile 17 hours ago
Comment by psychoslave 17 hours ago
Top 10% of global earners (~800M people) can afford a $2,000 device without major financial strain.
Top 25% (~2B people) could afford it with some budget adjustments.
Bottom 50% (~4B people) would find it prohibitively expensive.
So for a SV top income, maybe that might look more like the weekly pet brushing budget, but for most people out there this is not that much of a no-brainer.
Comment by disgruntledphd2 17 hours ago
Comment by weego 17 hours ago
Comment by frollogaston 14 hours ago
Comment by richwater 17 hours ago
Comment by themythfable 17 hours ago
Besides those with effectively unlimited budgets for their personal compute, local models are still a long ways off.
Though, that shouldn't be conflated with the value of open-source models, which can be used by cloud providers to significantly reduce cost of intelligence.
Comment by embedding-shape 17 hours ago
There are segments, everything from "Average person in world" to "Average creative professional using computers for work" and more on HN, with a wide range of costs for the hardware. HN probably skews towards the latter rather than the former, probably sitting with enterprise hardware next to them basically for fun, hard to make wider conclusions from what people here have or not.
Comment by sublinear 17 hours ago
It's just for gaming and AI now. Maybe not even gaming as much anymore.
Consider the perspective of someone who has a practically unlimited budget for PCs, doesn't game much anymore, and doesn't need AI to do their job. It's just part of getting older, and there are plenty of people in their late 30s and older on here.
Comment by anarticle 17 hours ago
Comment by p-e-w 17 hours ago
Comment by dofm 17 hours ago
Comment by minton 13 hours ago
Comment by fendy3002 13 hours ago
Comment by ios-contractor 2 hours ago
Comment by andwhatisthis 1 hour ago
Comment by anubhav200 17 hours ago
Comment by K0IN 10 hours ago
I really wonder when companies will start hosting theire model for everday tasks on prem, cause its good enough (and realative cheap), instead of paying subscriptions for all devs.
Comment by dejawu 16 hours ago
I don't vibe-code, but I do decide what to implement and what patterns to use (perhaps asking the model to analyze and give advice on this first), then I have it handle the nitty-gritty of the implementation itself. For this usage style, the latest local models are as good as having Claude at home.
I won't say it's been _easy_ (I ended up implementing my own harness to accommodate the idiosyncrasies of local models), but I will say that for the effort, having a coding agent that's essentially free to query as much as I want has been life-changing as a dev, especially when it comes to working on side projects. Knowing that my agent will never get worse in quality, suddenly cost more than it does now, or be suddenly made unavailable by external factors, was absolutely worth the trouble. And on top of all that, I can't believe it's as good as it is.
Comment by simonw 17 hours ago
These models are very capable, and use around 20-30GB of RAM while they are running.
Provided you have 64GB of RAM that leaves space for running other applications at the same time.
Comment by chrisweekly 17 hours ago
Comment by simonw 16 hours ago
I used to assume that anything GPT-4 equivalent or higher would need $30,000+ of server-class hardware.
That said... gemma-4-12b-qat is 7.15GB on disk so should run reasonably well in 16GB, that takes it down to MacBook Air territory https://lmstudio.ai/models/google/gemma-4-12b-qat
Comment by frollogaston 13 hours ago
Comment by bayshark 12 hours ago
Specs: qwen3_17b_base.Q6_K.gguf selora-v047-answer.f16.gguf selora-v047-automation.f16.gguf selora-v047-clarification.f16.gguf selora-v047-command.f16.gguf
The full base model and LoRA adapters are only 3.5GB
Capabilities include configuring for smart home setup to help with answers, clarifications, commands, and creating automations in Home Assistant. The models with the LoRA adapters were made with lean scripted data made specifically for Home Assistant. A lot of work was put into this, feel free to give it a try and happy for any feedback!
Comment by linuxhansl 7 hours ago
What is true is that it gets easier and faster to run local models. With QAT (quantization aware training), turboquant (or similar) K/V compression; what used to be impossible to run is now fairly easy.
I can run gemma4:26b-a4b-qat on my laptop with 20-30 tokens/s with a 256k context window. That was unthinkable just 6 months ago.
So the local models are "OK" for small'ish projects.
But it does not at all(!) compare to the frontier models. For a large project Claude's Opus 4.6+ just work, whereas local gemma tangles itself up, makes weird mistakes, and just can't handle it (for those cases it is faster if I do it myself).
If the trends continues, with 1.58bit QAT models, even better K/V compression, faster multi-token prediction et al, maybe soon it will be comparable.
Comment by infogulch 12 hours ago
The most "affordable" option is red v2 with 64GB GPU ram and costs $12,000. This is only ("only") 1.5x-3x the price of a beefy desktop (https://pcpartpicker.com/builds/), and could crush inference work even on bigger models. It could support coding tasks for a small team of developers, or run an AI agent for every person in your household...
Comment by pornel 9 hours ago
If you have $12K to spend, you may be better off with DGX Spark or a Mac with 128GB VRAM. That can (barely) fit DeepSeek V4 Flash.
Comment by gregwebs 15 hours ago
Claude code supports this by setting the model to "opusplan"- it will automatically use Opus for planning and sonnet for implementation. This was completely necessary with the fable release. I was able to do this with fable and it was necessary to avoid getting quickly rate limited. In settings.json:
"env": { "ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-fable-5" },
Obviously have that set to "claude-opus-4-8" now.
Comment by noveltyaccount 13 hours ago
Comment by richbradshaw 17 hours ago
Currently maxing out two Claude code accounts every x hours when working on large code migrations or setting up new iOS apps etc - most of time it’s fine but occasionally it’s mega frustrating!
Comment by simonw 17 hours ago
With 96GB I'd start with the Gemma 4 and Qwen 3.6 models. Any of those should work fine.
Comment by AbsurdCensor 17 hours ago
Comment by pizza234 17 hours ago
Comment by tpurves 6 hours ago
Comment by BenRacicot 6 hours ago
Comment by AgentMasterRace 1 hour ago
Comment by ptx 13 hours ago
How does that work? The script in the post references the file "docker-compose.sandbox.yml", but I don't anything about what that file does.
The post that this one links to, that it's based on, says that Pi doesn't do proper sandboxing.
Presumably bash can still execute other binaries, otherwise it would be fairly useless. What stops it from executing Python? Or opening a network connection and downloading Python?
Comment by huydotnet 16 hours ago
Comment by polotics 12 hours ago
Running the same prompt on both with the same .md memory state...
Gemini3.5 is more "intelligent" but Antigravity gets it to decide to go on tangents that are quite time and token-consuming I think. Nice casino machine.
Pi+Qwen3 (~80GB, llama.cpp) is like vibecoding about 1.5 years ago, when you had to babysit, structure your program to have self-contained chunks, and keep an eye on all the cross-cutting concerns to not trip it up. When it works it works fine and when it fails it's my job to ensure it fails fast.
The code is about 10'000 lines of Kotlin in total so it already takes some effort to keep it simple for the AI. It's not a slopped quantity of code, i got solid feature creep :^)
https://play.google.com/store/apps/details?id=com.sixteenam.... ...hat tip to the recent copycat squatter btw it's an honor!
Comment by xbmcuser 3 hours ago
Comment by ricardobayes 2 hours ago
Comment by ltononro 16 hours ago
The good news might be: opensource models are now good (enough) for day2day usage. But is it really? I feel that companies will always naturally strive for the best and use the SOTA (as long it is not too expensive).
I see OSS models being a good backbone for companies in the future that have validated workflows and could use those for privacy or to spare costs.
IDK, might have gone a little bit off-topic here.
Comment by lanycrost 3 hours ago
Comment by aquarious_ 15 hours ago
Comment by pjmlp 15 hours ago
> 64 GB RAM and 1TB storage
Ah ok, not something regular joe and jane happen to have lying around at home.
Additionally the whole configuration is still very much low level, bunch of CLI commands, and if the model doesn't fit for the task at hand, it starts allucinating, generating gibberish, whatever.
Comment by sparkling 13 hours ago
Comment by wxw 17 hours ago
To be fair, I think the labs are also interested in this (e.g OpenAI parameter golf). But the incentives are tricky. When the subsidies and tokenmaxxing era ends, local models will be essential.
Comment by andix 15 hours ago
Most of those models are also available via Openrouter and many other platforms. Dirt cheap, and much faster than on consumer GPUs. Perfect to try and compare the different options.
Comment by jlengrand 14 hours ago
Comment by hank808 9 hours ago
Comment by abalashov 15 hours ago
However, like many commenters, I don't really believe in vibe-coding, long-horizon agentic one-shot agentic coding, etc. and do not use LLMs for huge generation tasks that involve designing things end-to-end.
I also have an MBP with 128 GB of unified memory and do quite a bit of Qwen3.6-35B-A3B. No, it's not as smart as the aforementioned models, to say nothing of frontier, but many people seem pleasantly shocked by the number of banal tasks that do not require these.
Comment by noveltyaccount 10 hours ago
> “Our goal is to deliver unmetered intelligence to every home and every desk with Windows,” said Satya Nadella, chairman and CEO of Microsoft. “RTX Spark marks a real breakthrough towards that vision.”
Makes me optimistic that those two companies are going to keep investing in quality local models.
Comment by aliljet 17 hours ago
Comment by rsolva 16 hours ago
Comment by robertkarl 14 hours ago
Having a local Qwen check another Qwen's work increases the accuracy quite a bit at the cost of more latency. You can't have your cake and eat it too.
In benchmarking local models, I'm having success increasing even a 9B qwen's score on terminal-bench adjacent problems, just by asking it to plan and handing the plan back to qwen with a fresh context. Try it with Qwen3.5, unsloth Q4+, and a thinking budget of around 1024 tokens.
Comment by aidenn0 5 hours ago
Comment by kristopolous 4 hours ago
Also the R9700 rocm is 32gb, 1350, available now. It's like 1/3 the price of what 5090s go for and you can get the slimmer models for that price so you can pack more in.
If I had to build right this second I'd do small form factor strix halo with a Radeon card.
You can get all those parts in like 3 days, msrp, no hassles. the only thing you're paying out the nose for is the ran
Good news is mobo manufacturers are adding more slots so you don't have to get robbed paying for 32 or 64gb modules
Comment by valisvalis 16 hours ago
Comment by LolWolf 8 hours ago
Comment by cautiouscat 17 hours ago
The good old butt dyno!
I’ve been eyeing local models more and more with Anthropic squeezing more and more on the subscriptions. A few comments on HN had me waiting until they improved more but this article makes me wonder if I should reconsider that.
I’ve been doing some pretty niche development using a game and a script extender for said game. If these models can handle that, I’d feel good about switching.
Comment by cube00 17 hours ago
Comment by glaslong 16 hours ago
Comment by b3ing 14 hours ago
Comment by skittleson 9 hours ago
Comment by zx8080 8 hours ago
Does it really needs a GPU at 300Watts to do all that tasks?
Comment by daniban 16 hours ago
Comment by jszymborski 15 hours ago
Comment by jotato 16 hours ago
I have been considering getting the 58gb Mac Mini but that is a decent amount of money to spend without confirmation on a) how fast is it and b) will it work for well-defined tasks.
Comment by throwarayes 16 hours ago
I’d love it if model providers just let old models run and let us pay less, but the deprecation makes me want to look into local models.
Comment by fridder 16 hours ago
Comment by sn0n 6 hours ago
Comment by MrKoby07 14 hours ago
Comment by mohamedkoubaa 12 hours ago
Comment by anax32 17 hours ago
Running locally is the bar; it's hard to make these things a service which scales.
Comment by k__ 15 hours ago
I'd assume a Mac with 32-64GB memory would get some reasonable results.
Comment by ta-run 14 hours ago
After the recent changes to usage, I've spent an annoyingly long number of hours trying to get this to work.
Comment by WASDx 15 hours ago
Comment by walmas 9 hours ago
Comment by henryoman 3 hours ago
Comment by blobbers 14 hours ago
I've often wondered why the hype around apple neural core when 99% of software doesn't use them.
Comment by genxy 8 hours ago
https://github.com/ml-explore/mlx-lm
Having used half the systems that Vicki mentioned, mlx was the best balance between power and ease of use. Just a pip install away.
Comment by lthi747 12 hours ago
Comment by prlin 16 hours ago
Comment by wrxd 16 hours ago
Comment by stared 17 hours ago
Make some tests, and its 8 bit version runs at 30tok/s when using llama.cpp with MTP and run on Macbook Max M5. I have 128 GB, but but 64 GB is well enough. https://github.com/stared/benching-local-llms-on-apple-silic...
When using benchmarks, it gives more-or-less the level of SotA mid-late 2025.
Comment by iagooar 16 hours ago
Comment by Patchistry 4 hours ago
Comment by malkosta 16 hours ago
Comment by nikagrawal121 13 hours ago
Comment by ibizaman 17 hours ago
Comment by bthornbury 15 hours ago
I wouldn't rely on it for large stuff like codex though. I haven't tried out deepseek/kimi, if we could run those locally it would be great.
Comment by ridruejo 15 hours ago
Comment by xienze 17 hours ago
The other thing that people tend to gloss over is that you really do need to spend some $$$ on decent hardware. Yeah, you CAN run some 4-bit quant with heavily quantized cache on your 16GB card, but it's not going to be a great experience (I think this is where a lot of the "if you think it's gonna be any good, you're going to be disappointed" stuff comes from). Yes it's a lot of $$$ upfront but it's very much unknown when hardware prices are going to come back to reality. There's a lot of hopes and dreams that any minute now an H100 will be worth pennies because "that's how it's always been" w.r.t. computer hardware, but we are living in interesting times. So you can't just make the tired old assumptions that a Claude subscription over three years time will work out to be dramatically less than the value of some card three years from now. We STILL have basically anything with >=24GB VRAM appreciating in value, which is absolutely wild. What I'm saying is, the depreciation curve may very well be a lot less dramatic and fast than it used to be, going forward.
Comment by osigurdson 14 hours ago
Comment by fl4regun 16 hours ago
Comment by 0xbadcafebee 13 hours ago
Larger models just do more complex reasoning. But if you want them to be really good, you need a beefy Mac. They have the best combination of memory bandwidth and RAM to allow medium-sized models to run at speed. GPUs have less memory but more bandwidth, and AMD iGPUs have more memory but less bandwidth. The Mac is the best compromise on the market today.
Once you do have a beefy Mac, you want to run a dense model. This gives you the best possible result with the system you have. You can go MoE for faster results, use cutting-edge inference techniques, parameter tweaks, etc. But a basic dense model (at Q6 quant) on a big-ass mac will serve 90% of your coding needs.
Comment by wasimxyz 17 hours ago
Comment by frollogaston 14 hours ago
Comment by drchaim 16 hours ago
Comment by tennfown 15 hours ago
Comment by atulmy 14 hours ago
Comment by aleksandrm 11 hours ago
Comment by dakolli 7 hours ago
Comment by dakolli 7 hours ago
Comment by matrix12 12 hours ago
Comment by etoxin 1 hour ago
Comment by Computer0 10 hours ago
Comment by ZionBoggan 16 hours ago
Comment by Mr_Eri_Atlov 12 hours ago
Gemma 4 and Qwen3.6 27B aren't perfect, yet they are such a step forward from the previous generation that it's both feasible to get stuff done locally with patience and very likely that future releases will subvert cloud capabilities entirely.
Plus, they have definite reliability advantages over cloud models that can be wiped out by a government order or lobotomized to handle traffic surges.
Comment by jmyeet 13 hours ago
1. Memory bandwidth
2. VRAM size, which limits the size of a model you can use effectively. Yes you can swap but then you're taking a performance hit;
3. Raw FLOPS, including quantization.
Apple here is interesting because they have a shared memory model and you can buy Macs currently with up to 128GB of RAM (previously 256/612GB on Mac Studios, both discontinued). New M5 Mac Studios are expected in Q3 but that's not guaranteed. It may take until next year
Depending on the chip, Macs top out at ~900GB/s. A 5090 or 6000 Pro has 1800GB/s. A B100 is at like 3.2TB/s. A 5090 has, depending on how you count, 5-7x the FLOPS of a M5 Pro so a 5090 is still better than any current Max... except for the 32GB limit.
NVidia aggressively segment the market by limiting VRAM. The RTX 6000 Pro is basically a 5090 with slightly more CUDA cores and 96GB of VRAM instead of 32GB for $10-11k instead of $3k.
So let's project this into the future a little. The M6 Ultra/Max may well be 1TB+/s memory bandwidth with much higher FLOPS and thus actually be competitive for larger models. A 6090 in the current market will probably still have 32GB of VRAM if I had to guess. Maybe it goes up to 48GB.
But anyway I think we're only 2-3 years away from sub-$5000 hardware that does 100-300+tok/s on models larger than 31B. And that's going to be a game changer.
Comment by jingw222 15 hours ago
Comment by jauntywundrkind 13 hours ago
One caveat, I have absolutely no patience for a lot of subagent systems, like opencode, where the subagent is walled off and incommunicatable. My subagents really should be their own session, that i can deal with as I please, with some MessageChannel like offerings/tools available to them. Ideally with modes where messages auto-flow in and out, and modes where I can be a gate-monitor. https://developer.mozilla.org/en-US/docs/Web/API/MessageChan...
Not really super related but MCP has been working on Events for a while. That ability to respond fast would be great. https://github.com/modelcontextprotocol/experimental-ext-tri...
Asking local to be fast feels like an obvious folly, but given how much better small models have got, and seeing these models tune themselves for speed: I want to hope!
Comment by monegator 16 hours ago
So i've tested a couple, and the speed is finally impressive. My colleague uses paid tiers of claude and GPT, and the speed is comparable. Maybe even slightly faster on my end.
The problem is: i'm running the model on my work laptop, a 12th gen i5 with 16GB of RAM (which, you know, i asked to upgrade to 64, but that was right at the time of the great RAM shortage of the '20s) so i'm pretty limited in what i can use. And this is running alongside the usual suspects: Web browser hugging 1.5GB, MPLABX hugging 3, windows taking at least 5 just to sit idle, thermal throttled to 1GHz ... And yet its speed is comparable to a paid service. A lunch's worth of tokens vs a few cents of power.
So, what i found, what i fount... What i found is that i need AT LEAST 16k of context window, otherwise they will halt when i pass a small C file for analysis. And coding models will shit the bed with 4k. But we all know that, context size is King.
I found out that Qwen will keep looping while thinking, but that's not a surprise to you, either. But give it enough time and you will get an useful answer. I was hoping to using it as a better warning system for some languages, but i fear i need muuuch more context size, because i tried to feed a file that had a function with an endless loop:
At 4k context it almost shit the bed if i gave it just the offending function, then told it where to look at. At 16k context, with the whole file, it needed some guidance to what the problem was, and after 10-15 minutes of thinking it found the issue. Problem is, it kept second guessing itself for another 20 minutes on the same unrelated thing before giving the output. For which the fix was wrong, but the semanthic was correct. Good enough. Maybe it will be faster if i don't ask for a fix (which i didn't i just asked to look for a specific issue)
Wish i had 3 times the RAM so i can see what happens with more context.
Then i gave it the task to analyze a C file to make an API document. It took half an hour, but then i had a good starting point, which i had to keep changing because it would confuse commands with IDs and things like that.
This was the Qwen 3.5 9B model.
I then tested Gemma 4, being impressed at the tokens per second it gives on my Pixel 8A. Same tasks: same issues with short context, with long context it gave absolutely useless answers when looking at code, but it took 1/3 the time of qwen.
In producing documentation, instead, it was much faster, and it never hallucinated data. Good. in 15 minutes i had everything done.
Not bad for stuff running on a business laptop, while doing actual work.
Tomorrow i will try Qwen 3.6, let's see how it goes..
Comment by holoduke 15 hours ago
Comment by jkwang 19 minutes ago
Comment by pcell 5 hours ago
Comment by hottrends 10 hours ago
Comment by Littice 9 hours ago
Comment by aplomb1026 14 hours ago
Comment by eugmai86 14 hours ago
Comment by kordlessagain 17 hours ago
Comment by RishiByte 14 hours ago
Comment by Veer_Pratap08 16 hours ago
Comment by maxothex 16 hours ago
Comment by mrkn1 11 hours ago
Comment by azzzxcc123 15 hours ago
Comment by huflungdung 15 hours ago
Comment by Rekindle8090 15 hours ago
Comment by Lapsa 13 hours ago
Comment by iluvcommunism 17 hours ago
Comment by zrg 1 hour ago
Comment by fg137 16 hours ago
I closed the article after that.
The author has no idea what a privilege it is to have a machine like that for personal use, and how 99% of the population are not going to afford a setup like that.
Just some back-of-the-envelope maths will tell you that a $20/month Claude subscription makes much more sense financially.
Comment by orf 16 hours ago
What % of developers could afford an older MacBook model, second hand? Far, far more than 1%.
Comment by DiabloD3 3 hours ago
Comment by fg137 11 hours ago
I am pretty sure even among software engineers, much fewer than 1% are going to spend their money on that.
Most software engineers know how to spend their money responsibly.
Comment by orf 10 hours ago
Comment by fg137 9 hours ago
Comment by orf 9 hours ago
> could or will? much fewer than 1% are going to spend their money on that.
It’s ok to change your point, you don’t need to get combative.
Not that it makes any difference, given their ~10% market share.