Running local models is good now

Posted by jfb 18 hours ago

Comments

Comment by c0rruptbytes 16 hours ago

I don't know about good, I use a lot of local models and they're still pretty painful to run locally

You have dense models (qwen 27b, gemma 31b) who are pretty smart, but pretty slow

You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

You need a lot of memory to run these well, quantization makes tool calling weaker, so most run at 4 bit quants and are wondering why it kinda sucks and that's because you've essentially lobotomized the model (I recommend unsloth quants, i recommend 6bit for MoEs and 5bit for dense)

So you need a lot of compute to make the pre-fill fast, you need bandwidth to make the decode fast, you need a lot of memory to hold everything - lot of ifs

On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

So are they good? not really. Do they work? yes

edit: just wanna clarify - i think open models are the future, i think they're super important, i'm contributing constantly to the ecosystem - i think people should play around with these models, i think people should use `pi` and learn how it all works - but don't download a model expecting it to be good out of the box, you will have to tune and configure a lot of stuff to replace a "coding agent" that most people are using models for

Comment by saghm 16 hours ago

This is basically my experience as well. I have a moderately recent but high spec desktop (Radeon 6900 XT with 16 GB VRAM, Ryzen 9 7900X 12-core, 64 GB system RAM), and I tried out some recommended models with ollama a month or two ago. Anything not geared specifically towards coding seemed to struggled with actually making tool calls instead of just stating the actions they would take without making them (and trying to get help from them to explain what I needed to configure to change that behavior was useless; qwen refused to believe that it was running in ollama and insisted that it was running from the Alibaba cloud without access to my local system), and the models intended for coding were barely thinking faster than I could type (if they had any ability to show thinking at all).

The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.

Comment by rapind 15 hours ago

> The best "free" experience I've found is using OpenCode with Big Pickle.

I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.

I want to pay. I don't want my data used for training. I want it to be open. I want it to be consistently up (more than Claude!). I want it to be fast. I don't want it to be subsidized as that's just an excuse for shitty quality. Deepseek flash knocks it out of the park on all of these except you're data is used in training. I'm fine with it being hosted since there's no way I'm using it 24/7, but data MUST be private.

Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.

Comment by milesvp 7 hours ago

If you think your data isn’t being hoovered up I’d like to point out that every model is possible due to federal crimes committed to obtain the information they were trained on. Regardless of how much you are paying, your data is worth another petty civil infraction.

Comment by horacemorace 4 hours ago

A million times this. There is “private” as a corporate-legality licensing perspective. There is “private” as a human concept. The two are seemingly opposite, yet as all the money is focused on the former there’s no airtime left for the latter.

Comment by larodi 2 hours ago

The curiosity is that these companies somehow got around crimes and are above law (1) and these crimes mean something in a limited jurisdiction, like copyright laws of USA/Canada are not world’s (2). So it’s all cyberpunk at this point.

Comment by aamoscodes 15 hours ago

You can pay, and also use deepseek-v4-flash. OpenRouter even lets you "block" or limit your usage to providers that don't train on data. Since the weights are open, other companies are already serving the model on non-DeepSeek owned hardware: https://openrouter.ai/deepseek/deepseek-v4-flash

Comment by fc417fc802 8 hours ago

> OpenRouter even lets you "block" or limit your usage to providers that don't train on data.

More than that, they have various zero data retention options and provide a convenient json list of them.

Comment by larodi 2 hours ago

The fact OpenRouter strips https to reroute screams danger already.

Comment by fc417fc802 2 hours ago

What do you mean? Are you objecting that they communicate with the provider on your behalf? But how else would you design such a system?

Plumbing you straight through would require nonstandard certificate juggling and they wouldn't be able to implement their core service of providing a standardized API nor could they transparently route your request to the fastest / cheapest / whatever provider on the fly nor could they implement transparent fallback nor could they implement their policy of not billing you if the response from the provider is invalid.

Also the chosen provider could fingerprint your network stack if you communicated directly. The routing service is acting as a proxy and for most providers fully anonymizes requests (it does send a stable uid to some of them though).

Comment by rapind 14 hours ago

Good to know. I hadn't checks since early is DS4's launch when they were the only provide (I think maybe there was one other, but they also trained on your data). I see several private options now.

Comment by darkmarmot 14 hours ago

Hard to guarantee it's private if you don't keep it local... I don't have a lot of trust for companies in this space.

Comment by rapind 14 hours ago

Yes, but I think that'll change eventually. If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist. At least that's my theory.

There'll probably need to be a threat of massive litigation should they fail to comply with such a policy.

Comment by rob74 57 minutes ago

My company has all the code in a private GitLab instance (almost everything else is on AWS, but not GitLab), but they still use Cursor, so our internal code gets sent to whatever AI company the model I select in the dropdown belongs to. Scary if you think about it: if you use Cursor, you don't have to trust only one specific AI company, you have to trust all of them...

Comment by naikrovek 8 hours ago

> Yes, but I think that'll change eventually.

Maybe people will trust companies, but those companies will rarely deserve that trust. Anyone that pays attention sees breach announcements almost every day. Security is never a concern for these companies until it embarrasses them. Then, as soon as the negative attention fades, security again becomes the second to last priority.

Do not trust companies with any data that is important to you unless the effective management of that data is required by law, and the laws are comprehensive.

Comment by fc417fc802 8 hours ago

If your contract says there's no data retention and then a bunch of your retained data gets leaked in a breach presumably you have grounds for a lawsuit.

Comment by pessimizer 14 hours ago

> If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist.

I'm interested in this thought. There is significant motivation for providers to create a verifiable way for them not to deal with having access to client interactions with LLMs at all. Whatever standards and protocols have to be come up with in order to reassure clients.

Any good standards for privacy when interacting with LLMs could also trickle down to smaller providers, and everyone could offer guarantees. Even if the guarantee was literally just an insurance policy and a private court to decide if it pays out.

Comment by jen20 4 hours ago

I trust AWS in this space. I'm 100% sure that they will be precisely honoring the terms of service for Bedrock (I've never looked to see whether they claim to train on your data though).

Comment by kube-system 4 hours ago

You didn’t look because you subconsciously know you don’t need to. AWS has a solid track record, and the certifications and audits to back it up. and that’s why everyone trusts them including the most extreme of regulated industries.

Bedrock in fact does not train on your data. It was a big deal when it was announced that they share data with Anthropic for Fable, but even then it was gated away where you’d have to explicitly allow it.

Comment by rlkf 12 hours ago

> Basically I want Hetzner and OVH to run open model clouds

You can run Qwen3 on OVH already:

<https://www.ovhcloud.com/en/public-cloud/ai-endpoints/catalo...>

Comment by johndough 11 hours ago

I see that OVH offers Qwen3.5-397B-A17B, which is a bit surprising to me. I thought that EU providers had to comply with the AI act where you have to provide opt-out and information about the training data once the model is sufficiently large (over 10^23 FLOPs, likely the case here), but providing information is not possible since people who train those models only give vague information at best.

Does anyone know if OVH is ignoring the law here, or whether it does not apply for some reason?

Comment by nl 4 hours ago

OVH is acting as a "Deployer", not a "Provider", which have special meaning under the AI Act.

There are much less (almost no) disclosure regulations on the deployer.

https://ethicalogic.com/articles/gpai-guide-roles-public-dat...

Comment by dofm 2 hours ago

Pretty convenient, it must be noted, for a market that does not have any meaningful home grown models.

Comment by dofm 10 hours ago

Which law is that?

Not doubting you — just want to read it!

Comment by johndough 10 hours ago

Article 53 of the AI Act: https://ai-act-law.eu/article/53/

The definition of a "genral-purpose AI model" is described in more detail in the "Guidelines on the scope of obligations for providers of general-purpose AI models under the AI Act": https://ec.europa.eu/newsroom/dae/redirection/document/11834...

Comment by dofm 10 hours ago

Thanks, v. interesting.

Comment by zwaps 5 hours ago

Does not apply to oss models

Comment by dofm 2 hours ago

Does it not apply to hosting and running them for money? How would it not?

Comment by saghm 14 hours ago

I'm probably somewhat adjacent to you. I would be happy to pay, but I just don't want to pay any of the companies that are actually offering things right now. I had the $20/month sub for Claude for a couple months, until one day I kept inexplicably getting errors saying I hit the limit even though their site showed my usage at less than half for the session and 8% for the week, and it seemed silly to pay for something that couldn't even properly respect its own measurements. OpenAI sketches me out too much as a company, Cursor feels lackluster when I use it for work from the account they pay for (and now is getting acquired by maybe the only AI company even sketchier than OpenAI), and I wasn't particularly impressed with Gemini or Mistral Vibe either when I tried them on the free tiers either.

Comment by rapind 13 hours ago

I was paying around $500 / month on average between multiple providers for over a year. I cancelled one a while ago because of pretty bad service availability (Bet you guess who that is!), which by all reports hasn't improved much.

For me, paying from $200 - $500 / month is reasonable if I can sustain a disruption free flow that doesn't require constant yak shaving. What I've found experimenting with DeepSeek on some open source library stuff is that it's actually going to cost me much less if I don't need frontier vibing (which I don't).

Comment by gaolei8888 13 hours ago

who?

Comment by Bnjoroge 15 hours ago

You can specify which providers you want to serve your model in OpenRouter. Then you can chose US-based ones.

Comment by djmips 4 hours ago

Did you try Claude Fable?

Comment by 13 hours ago

Comment by bel8 14 hours ago

These competent open models you want to use were trained on data from people like you and me.

I wonder if there are competent models trained purely on permissive open-source code like MIT or Apache 2.0.

Comment by yencabulator 14 hours ago

MIT and Apache 2.0 both require attribution, so it's not like limiting to those would help in license compliance.

Comment by spockz 14 hours ago

For what it is worth, I’m on a similar machine. (9070XT,5900X) and found a lot of performance improvement over ollama by compiling llama.cpp and running with —no-mmap and —perf. The context is still quite small though. With online models I use contexts of at least 200k which is useful for longer running/more complicated commands.

Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.

I haven’t tried any tool that compresses the tokens yet.

Comment by echelon 13 hours ago

I would rather we give up the idea of running open models on RTX cards and instead focus on running much bigger open models on H200s.

1. The hardware will eventually catch up.

2. This keeps the delta between frontier models smaller.

3. We can still fine tune and own the weights.

4. The models will be more useful, faster, and reliable.

RTX is hobbyist tier, not professional tier.

Gated cloud models from hyperscalers treat us like hobbyists in their own right.

We need equivalent scale models, but open.

Comment by zozbot234 13 hours ago

H200s and other enterprise datacenter GPUs are completely overkill in any realistic single- or few-users inference scenario. They're hugely unbalanced towards compute capacity which will go almost entirely unused (i.e. wasted) unless you're running huge batches on a continued basis. I've argued many times that local inference engines should support batched inference on a somewhat smaller scale for a variety of reasons (especially given the unexpected effectiveness of SSD streamed inference with larger-than-RAM models), but even I don't think we can realistically go to 300x or so for real-time inference, which is the range that pencils out quite consistently from a simple roofline model of these datacenter cards.

Comment by echelon 12 hours ago

If you're doing professional work in coding or video, you can easily saturate a single H200.

This is what RunPod-type services are for.

For instance, ComfyUI is an abomination that can't do half of what Nano Banana and Seedance 2.0 can do. And you have to sit around and wait 10x longer for single results.

I can rent an H200 for $3.50 an hour. That's INSANELY cheap.

I do not understand this split between hosted APIs and rinky-dink local RTX models. Both suck.

The ideal solution is models we own run on RunPods leveraging H200s.

I can spend $100-200/day on compute making much more value with the model outputs.

----

edit: I want to respond to comments, but the damned HN rate limits keep me to five comments a day now because I'm a contrarian and say things that rile up the anti-AI folks.

You don't need to buy an H200. It's a depreciating asset. You rent one. It's cheap to rent.

Comment by spockz 12 hours ago

Sure, to approach frontier model quality locally we need to have more power. And H200s are a way to get there.

However, we need to use the tools that we have. Even if I wanted to buy a (bunch of) H200 for me and my colleagues and could get the expense approved, they are hard to source where we are.

Yes. You can rent them, but I’m not sure how that affects the IP discussion.

Moreover, not everyone is doing coding and video so we have different tasks that can fit quite well on relatively light laptops (Gemma et al), for relatively directed coding sessions we can make do with RTX cards, or a small step up, all the way to H200 in the workstation. Or pods thereof.

We have the graphics cards and laptops with MLX right now. The H200 will take a year at least to arrive. Better get used to run stuff locally.

Comment by zozbot234 12 hours ago

I'll definitely believe that for video generation models, but those are also very compute-intensive for rather middling results.

Comment by what 5 hours ago

> I’m a contrarian that says things that rile up the anti-AI folks

That’s hardly contrarian here, lol.

Comment by echelon 5 hours ago

Are we experiencing the same website?

I swear, two thirds of the folks here just make comments that dunk on AI. They underestimate it, hate it, hate those that use it, etc. It's the "old angry man yells at cloud" trope.

I've had so many consecutive days of "-4" karma posts that HN is blocking me from commenting. And the comment retorts I get from these folks are absolute gems that will undoubtedly age like milk.

Comment by SR2Z 13 hours ago

That GPU costs 25k which means you really should have a rack to put it in. It's not realistic.

Comment by dofm 10 hours ago

Pressure on small model quality and design is absolutely what is needed. There are still gains to be made.

Comment by FridgeSeal 5 hours ago

Ah yes, because of all the people at home with computers who have…checks notes…datacentre GPU’s lying around.

Comment by MrLeap 13 hours ago

There's a lot more professionals that have RTX cards than H200s. You're inevitably see more development and experimentation on things actual humans have lmao.

Comment by redmalang 14 hours ago

Try llama.cpp it seems to be a lot more performant and a lot more hackable. Also I'm surprised how substantial the impact of some of the inference configs (beyond just temp) can have, though this is much more model specific.

Comment by boppo1 9 hours ago

I have almost your system specs, how do they work for non-coding stuff like chat/knowledge/discussion? I've been using models to talk through social stuff I'm anxious about but dont want to annoy my friends with and it's been amazing, but I don't want to share that info with google/openai/anthropic anymore. I shouldn't have in the first place, but I couldn't help it, the exercise was too interesting.

Comment by fc417fc802 7 hours ago

You can test the open models for yourself using the various router services. Those also make it easy to use providers other than the major players.

Comment by ryukoposting 14 hours ago

I found that, with the heavily quantized Qwen3 models I can cram onto my 3060 Ti, telling the model to use its tools in the system prompt made it a lot more likely to actually do it. YMMV of course, but give it a shot.

Comment by saghm 11 hours ago

I did try this, and it was pretty hit-or-miss still. I even went as far as configuring context for Zed to inject into all conversations saying stuff like "If you need to read a file, call read_file NOW. Do not say you will read it", and it still didn't really make a huge difference.

Comment by aftbit 16 hours ago

IMO running local models "well" still requires an expensive hardware investment. You really want 96GB of VRAM on a modern Blackwell arch to run these models with decent KV cache. Trying to run them on a unified memory Mac, an AI Max AMD processor, or a DGX Spark-alike is really just asking for trouble. Prefill kills perf.

If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.

Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.

Comment by ryan_glass 13 hours ago

For a fraction of the price of 96GB vram, I built a desktop based on a supermicro server mobo and EPYC 9 series CPU, with just under 400GB rdimm ram (approx $4500 all in but this was before the ram price hike). Works really well for serving larger local modals at a decent enough speed (I consider anything more than 10 tokens/second usable and value accuracy over speed).

Comment by dofm 15 hours ago

FWIW I think it might be both.

Ultimately if you skip over the opportunity to play with these models on your own machine you are losing out on a lot of really interesting educational opportunities — it helps make a lot of stuff feel more concrete in a way that only tinkering can.

But then I think once I had an idea of something that I was building against Gemma 4 or Qwen 3.6 I would be looking at openrouter etc., to stabilise it for the next tier of experimentation (and to get back a kind of multi-device access without tailscale/lm link etc.).

Are they good enough to replace what people seem to want to do with Claude? Maybe not. But it's an unparalleled learning opportunity.

Comment by EagnaIonat 14 hours ago

Depends what you need the model to do. The recent granite4.1:3b just takes 2GB of memory and is fast. Results are pretty good and support tool calling. Barely a squeak out of the Mac laptop.

Even faster with the MLX builds.

Then when I need more heavy lifting I fire up a larger model.

IMHO the issue isn't the models. I've had OpenClaw give the same results as Claude using open models locally. Slower but does the job. Something that can do optimal model switching is what's needed.

Comment by aftbit 12 hours ago

Yeah it 100% depends what you want the model to do. Some tasks, like extraction, summarization, or simple tool calling (e.g. "turn on my desk lamp") are very doable with tiny models. Others, like coding or more advanced agentic workflows can demand much more powerful models. I was thinking from the lens of coding or running _big_ data extraction pipelines (think ~8 billion pages).

Comment by EagnaIonat 3 hours ago

> thers, like coding or more advanced agentic workflows can demand much more powerful models.

You can do coding and agentic fine. For coding I use qwen3.6:35b-mlx and agentic granite4.1:3b works fine.

These are the models I use.

- granite4.1:3b

- granite4.1:30b

- gpt-oss:20b

- gpt-oss:120b (less so now)

- mistral-small3.2

- qwen3.6:35b-mlx

There will always be use cases that don't sit on your laptop, but most of what can be done can be done locally, it just requires a good framework to sit on it.

Comment by azeirah 4 hours ago

Qwen 3.6 27B performs similarly to sonnet 4.5 (note I said 4.5, not 4.6) when it comes to coding. It runs amazingly well on my PC with a 7900xtx.

It's worse at general tasks, but in the precise domain of coding I actually prefer to use it over my claude subscription because it has 0 latency (and no privacy concerns whatsoever).

Comment by girvo 10 hours ago

> DGX Spark-alike is really just asking for trouble. Prefill kills perf.

You're right that prefill kills perf, but shrug the GB10 has far more compute than it has memory bandwidth, so prefill isn't it's bottleneck.

Comment by htrp 9 hours ago

I've seen the same, Sparks are great at non time-sensitive tasks. if you can set up a agentic loop that does not require human intervention, you can design around the memory bandwidth limitations

Comment by girvo 8 hours ago

The other benefit is that speculative decoding literally trades compute to make up for low bandwidth, so MTP/EAGLE/DFlash are unreasonably effective on the GB10 IMO, as long as your use case fits it.

I’m getting 40tk/s decode with 1000+tk/prefill with a 198B-A11B model on mine

Comment by EnPissant 1 hour ago

I thought MTP wasn't very useful on MoE models because the expert overlap for 2 tokens was too small.

Comment by wincy 14 hours ago

If I could just save up $6000 I could sell off my RTX 5090 for $4,000 and buy an RTX 6000 Blackwell Pro Workstation. I can fit models into the 32GB of vram but my context window ends up being tiny for any halfway capable model.

Comment by layer8 14 hours ago

Isn’t the RTX 6000 Blackwell Pro Workstation over $13000 now?

Comment by wincy 10 hours ago

Dang, that’s crazy. Last I checked they were $10,000. It seemed almost attainable to me as a mere mortal just last year. I’m glad I at least got enough vram and ram to play around a little bit with local models before all the prices went bananas.

Comment by girvo 10 hours ago

And rising. It's depressing.

Comment by qudat 9 hours ago

I feel like the claims come from wildly different personas and use cases. A 24gb vram, 5 year old titan run 27b at 30t/s and the results are good. I use sonnet and opus at my day job and they are more capable but I can still get the same out of qwen, I just need to be mindful of ctx

Comment by jtbaker 14 hours ago

> Trying to run them on a unified memory Mac

> but still not quite in the realm of Sonnet or DeepSeek 4 Flash

these are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4

Comment by trueno 13 hours ago

someone just put this on my radar yesterday, im about to try this today. how's your experience with it?

me thinks there's a lot of optimization strats we're currently leaving on the table just because the amount of things to explore and test are so expansive. but this one is super interesting targeting metal primarily and zeroing in on one model. instead of a one size fits all llama.cpp im very interested to see if theres a future where super tailor-made variants per model pans out to harnesses that can rapidly switch ultimately providing something akin to sonnet/early opus territory (that's my personal bench mark of good-enough i shall now cancel the hell out of this claude sub)

Comment by jtbaker 12 hours ago

I'm on the verge of cancelling my anthropic $20 plan since it's come out. On an M5 Max 128GB, hooked up to the pi.dev harness, I get in the neighborhood of 400-450tps prefill and 30-35tps generation. It is imminently usable and at times feels more stable than my previous CC setup. Occasionally there are things it struggles with that I will bounce back over to CC for, but it is highly usable. The future is bright for local models! As a tinkerer, it makes me really happy to have a local setup I can be just as productive in, and not have the token overlords ready to shut me down at any time.

Comment by aftbit 12 hours ago

That's DS4 Flash right? How does it feel in intelligence and speed compared to DS4 Flash hosted by Deepseek themselves or another API provider? I've been using API DS4 Flash for a lot of personal projects and have been quite impressed. I've spent $1 on building ~10 toy projects and gotten them all to work within the bounds of what I wanted without having to do much besides guide the model away from dumb loops.

Comment by jtbaker 12 hours ago

I'm using the DS4 flash IQ2 2-bit quant, per Salvadore's recommendations for my hardware in the repo. I haven't messed with the cloud hosted variant. The only other paid API I have messed with is a $20 Anthropic sub, primarily with whatever the latest version of Sonnet is. For the most part, this local configuration feels on par with that.

With this configuration (set up over the last month) I have been working on Python data processing tools, an internal Svelte 5/SvelteKit data intensive BI app, and some smaller Rust projects. It's been doing really well there.

Comment by monksy 12 hours ago

That RTX6000Pro you mentioned is $12k.

Comment by aftbit 12 hours ago

Yep - I'd say either that or 4x 5090 is a great entry point to running local models "well". Two of them would be even better. If you don't have $12-24k to spend, you can try your hand with tiny models or quants or slow speeds, but it will be a much more painful experience. You're already giving up a lot by dropping down from frontier models - you're giving up even more by trying to squeeze them into little RAM and compute.

Prices will fall in the next few years. Maybe just play with the tiny toy models for now to learn how they work, then keep using API providers until they do.

Comment by eek2121 16 hours ago

Not really, Qwen 27b offloads to a decent gaming GPU (RTX 4090 in my case) without needing tons of RAM.

Comment by mathisfun123 16 hours ago

can you give more info? llama.cpp vs vllm? config? i wanna try specifically this model

Comment by zozbot234 16 hours ago

Maybe we shouldn't be running these models on laptops with their thermally constrained form factor, and we shouldn't expect quick inference on a par with a large cloud-based platform either, at least not for near-SOTA model quality. It's still worth it to avoid becoming massively reliant on centralized services.

Comment by stemlord 11 hours ago

> It's still worth it to avoid becoming massively reliant on centralized services.

This isn't really good enough. Many of us need to get things done in a pinch and if our employers are already getting used to the idea of paying for enterprise subscriptions to cloud llm's then the local option needs to be good

Comment by wolvoleo 9 hours ago

For me I use only cloud for work. But I'd never trust any of my personal data to it.

Comment by greenavocado 16 hours ago

I have a 5070 12 GB laptop GPU and can hit 72 tokens per second in the first couple thousand tokens before dropping to mid-high 50s after about 15k context.

This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.

I don't have enough system RAM to properly handle the large context windows so I don't use local models.

  # 1,257 tokens 17s 72.18 t/s

  $env:CUDA_DEVICE_SCHEDULE = "SPIN"
  cd D:\src\llama.cpp\
  .\build\bin\Release\llama-server.exe `
    --port 8080 `
    --host 127.0.0.1 `
    -m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
    -fitt 2048 `
    -c 98304 `
    -n 32768 `
    -fa on `
    -np 1 `
    --kv-unified `
    -ctk q8_0 `
    -ctv q8_0 `
    -ctkd q8_0 `
    -ctvd q8_0 `
    -ctxcp 64 `
    --mlock `
    --no-warmup `
    --spec-type draft-mtp `
    --spec-draft-n-max 2 `
    --spec-draft-p-min 0.1 `
    --chat-template-kwargs '{\"preserve_thinking\": true}' `
    --temp 0.6 `
    --top-p 0.95 `
    --top-k 20 `
    --min-p 0.0 `
    --presence-penalty 0.0 `
    --repeat-penalty 1.0

Comment by themanualstates 15 hours ago

That’s useless without describing WHY you chose those flags, and how you did the optimisation…

Comment by halJordan 14 hours ago

The switches are all in the -h of llama.cpp (although the maintainers have a tendency to use the word in its definition). The actual values are essentially just what alibaba recommends. So you just need their model card. I would not call it highly optimized, more appropriately tuned.

Comment by greenavocado 13 hours ago

I found every possible flag and its description including CUDA related environment variables and went back and iterated with Claude Opus 4.8 High until every single flag mattered above the temp one.

Comment by nateb2022 15 hours ago

I get over 100 tok/s sustained on my M4 Max and M5 Max, in MacBook Pro's. LM Studio + MLX.

Comment by boguscoder 3 hours ago

Same experience on M4 Max .. but quality of qwen still leaves so much to be desired after getting used to virtually unlimited tokens at work. Many people on this (and similar) thread seem to believe local models would inevitably improve, and I want to believe this too, but I don’t see this ever happening without growing in size

Comment by Terretta 15 hours ago

With Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf?

Also, funny lumping the M4 "and" the M5, I find them 15% to 45% different performance, depending.

And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 on single work at a time, outpaces both doing multiple work at a time.

Comment by ridiculous_leke 15 hours ago

Can you comment on the quality and accuracy of it? People have managed to run Gemma 26b without GPU on old CPUs but I don't think quality is anywhere close to what Gemma 12b offers.

Comment by mattmanser 15 hours ago

That's a quant 4 which the thread OP specifically called out as rubbish.

The Q4_K_XL bit for those not in the know.

Comment by stymaar 15 hours ago

Anyone calling Qwen3.6-35B-A3B-Q4_K_XL “rubish” has no idea what they are talking about.

Comment by embedding-shape 14 hours ago

I'd agree that the quality degrades a lot between Q8 and Q4, borderline unusable as they start to fail with tool calling syntax even. Personally I'd say Q8 is as low as you want to go.

Comment by c0rruptbytes 14 hours ago

q4 isn't rubbish, but it's a compromise for a good value, q6 is essentially a no-compromise quantization and it's what i recommend for MoEs in my experience for agentic workflows

Comment by greenavocado 15 hours ago

He's probably calling me out for this comment https://news.ycombinator.com/item?id=48557579

Comment by greenavocado 15 hours ago

I typically find myself using a context of between 150-500k with GPT models so local models are simply not enough and I stopped using them.

Comment by stymaar 15 hours ago

That's way higher than their optimal ceiling (and absolutely suboptimal from a token cost point of view), why are you doing that?

Comment by greenavocado 15 hours ago

You're 100% right and its even severe than that: I daily drive on xhigh. I really try to avoid it, but when reconciling APIs across two large codebases you really start pressing north of 200k. I find myself topping out at 800k sometimes and that's with careful context management. I actually had to drop to GPT 5.4 for 1M context in my subscription because GPT 5.5 tops out at 272k. Hitting 800k context is better than repeatedly hitting let's say 200k out of 272k with multiple rounds of compaction. I run Can's snapcompact and while its better than normal compaction it still lobotomizes the model more than running with a very high context window.

Comment by c0rruptbytes 14 hours ago

large contexts degrade the performance - attention doesn't work will for large windows like that and cloud models are kind of hacking it

local models do involve some context engineering to get it okay, but it's not that rough

Comment by xlii 57 minutes ago

> I use a lot of local models and they're still pretty painful to run locally.

This really depends on how and what you're using. e.g. I can't suffer through slowness of inference on Macbook but I have gaming rig with quite powerful GPU and I squeeze ~130 t/s on Gemma or ~70t/s on Qwen.

Tuning is not optional as well. Qwen on temperatures > 0.5 is unusable for coding and I found sweet spot around 0.32 for coding. Speculative decoding on Gemma4 26B is a 30t/s difference between non-speculative.

The worst thing with local models is that I can't just give you a recipe, because what's the best params depends on your use case.

In the nutshell I'd compare local models to running game rig on Windows vs Linux. Linux works great if not better than Windows gaming, but you need to embrace some tweaking in order to get there. Is it there? It's not SOTA, that's for sure, but it's working reasonably well.

Comment by adam_arthur 16 hours ago

Gemma 4 is particularly good at pipeline/automation tasks.

It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.

Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)

But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.

I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.

I agree that for coding/creation use cases, there's still not a compelling argument for local models.

But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.

Comment by dstryr 15 hours ago

This is not my experience at all. Even the Nous Research guys have stated that "Qwen3.6-27B is the canonical local model to use Hermes Agent with" [https://old.reddit.com/r/LocalLLaMA/comments/1sz2y76/ama_wit...]. I am finding the same when used with Pi and OpenCode.

Gemma will just stop mid-tool call. It's been slower and I've had to reduce context size to run it. Qwen3.6 27b has been rock solid using club 3090's single card setup for agentic use -- https://github.com/noonghunna/club-3090/blob/master/docs/SIN...

Comment by adam_arthur 15 hours ago

I'm talking about automation generally, not agent loops.

E.g. prompt A to achieve X, output in format Y. Use Y to do something in prompt B.

Agentic loops will underperform deterministic control flow pipelines (with non-determinism constrained to LLM calls).

Agents are more general, which is the main advantage. But inherently a more general solution will waste context on unnecessary reasoning.

Try asking the smaller Qwen models to output a JSON in a specific format. It basically can't do it consistently with a moderately sized prompt unless you constrain the token generation via GGML or are extremely repetitive and specific about it. (Thinking disabled)

Gemma 4 will do it correctly pretty much 100% of the time. (Thinking disabled)

Applies to other rule following as well in my experience.

Qwen may be better at toolcalling and certainly probably codegen.

It seems to me Google explicitly designed Gemma for edge device automation, and didn't fine tune for agentic or coding use cases.

Comment by trouve_search 15 hours ago

On a 5090, gemma4 26B runs at 350TPS with the command below [1] and gemma4 31B is around 150TPS with a similar command.

I'm really surprised how much slower a DGX spark is for the same price.

1. Here's my command.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \ --dtype auto \ --gpu-memory-utilization 0.95 \ --kv-cache-dtype fp8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --trust-remote-code \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --max-num-batched 16000 \ --max-model-len 64000 \ --max-num-seqs 12 --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'

Comment by adam_arthur 15 hours ago

Yes, I'd recommend a 5090 over the DGX Spark if your goal is general automation.

You can run multiple instances of these models in parallel on the DGX Spark which somewhat mitigates the difference if your task is parallelizable.

But I'd take the simplicity of a single thread and higher throughput personally.

Overall of course still better to wait for next gen devices if you can.

Comment by diddid 10 hours ago

With the 5090 you need to buy the rest of the computer though, and the Dgx spark will run 1/4th as slow but use 1/5th the electricity. And the spark would be able to run things the 5090 just couldn’t, like the Qwen3.5 122b. Which is all just to say that for llm workflows there is no easy answer. And if you media generation it gets even more complicated.

Comment by ozim 13 hours ago

I was expecting DGX Spark to run Gemma 31b Q4 much faster.

I was expecting it would run Q8 in 50 tok/s.

I guess that’s good I stopped thinking about buying it because I would be disappointed.

Comment by girvo 10 hours ago

I love my Spark-alike, but they really aren't inference boxes IMO. They're experimentation boxes. A couple of 3080 20GB's for cheap from China, a 5090, an RTX Pro 6000 if you can swing the horrible cost: those are better choices IMO

That said, I'm still running Step 3.7 Flash at ~40tk/s decode, 1000tk/s+ prefill on mine and its both very capable and fast enough

I got Gemma 31b to run on this at ~22tk/s decode at FP8 using MTP

Comment by gopher_space 15 hours ago

In my mind it’s a question of knowing what you want to build and how to divide the project into tasks your local setup can handle.

If you don’t need the machine to respond instantly (or explain your own business model to you) everything can be local and it’s been like that for a few years now.

Comment by msp26 15 hours ago

Yep agreed completely. I couldn't imagine torturing myself with a small model for local coding. But Gemma 4 31B is so fucking good for a variety of language modelling tasks.

Comment by freehorse 13 hours ago

> You have MoE models (gemma 26b, qwen 35b, north mini code 30b) who are pretty fast, but make a lot of mistakes

This is sadly also my experience. I wish we had some MoE models with a higher ratio of active parameters per total. My experience is that the newer MoE models that can run in a 64b laptop have too few active parameters to be useful outside narrower, specific tasks. Mixtral 8x7b was a 14b active parameter (56b total) MoE model a few years ago and was probably the best model one could run in that range for some time, but it is too old now.

I have been using the qwen 27b and it is great, but running a dense model like this in a macbook is a bit suboptimal, and i wish I could run sth faster than 15 tok/s.

Comment by c0rruptbytes 13 hours ago

I would try a 6-bit MoE and maybe with unsloth's studio, they claim to have auto tool fixing which is where i see a lot of issues with MoEs

I'm on a 48gb M5 Pro right now and it's been okay, a lot of my rough experiences have been with MLX and I'm finding that GGUFs are okay now

Comment by hnlmorg 14 hours ago

To be honest even the cloud models are a hot mess at times. This week I’ve spent more time rejected code from OpenAI models than I have approving it.

In fact it really feels like OpenAI models have taken a nose dive this week compared with Claude. At least for my specific workloads (these things are so variable it’s like trying to compare Google results…)

Comment by Stagnant 13 hours ago

I've been using unsloth/gemma-4-31B-it-qat-GGUF daily for various small parsing and programming tasks using opencode and llama-server's front end. The past couple of weeks have made a big difference after google released the QAT variant and llama.cpp got support for MTP which means it is possible to now get 60-80 Tok/s with RTX 4090. The model fits in VRAM comfortably enough to keep it loaded even while browsing and having multiple programs.

Comment by amdivia 12 hours ago

110+ Tok/s as another data point on the RTX 5090 (Gemma 4 31B QAT + MTP at UD-Q4_K_XL) (at peak used 27 GB of vram)

The real lovely thing was getting 300+ Tok/s (Gemma 4 26B QAT + MTP at UD-Q4_K_XL) (at peak, I think I saw vram usage reach 21 GB of vram)

Comment by lmedinas 11 hours ago

the problem of that setup is that it will run out of context pretty quick. So for coding agent it will limit your workflow very fast.

Comment by andy_ppp 10 hours ago

I wonder if it is better to have a machine somewhere running a model for you maybe shared with a few others. I could probably justify a M6 Mac Studio with hopefully 256gb RAM and have a few people all with access to one agreed upon model. I think maybe laptops are too warm and clunky for this.

Comment by wgd 10 hours ago

The problem is that the moment you introduce shared remote hardware there's a slippery slope leading right back down to "just pay an inference host for model tokens". If you're transmitting your prompts over the internet to a trusted host you might as well just let that host be DeepInfra or together.ai or one of the many other providers already in that business.

Comment by andy_ppp 9 hours ago

I dunno, I probably need the web to be able to do work so why does it matter - taking the simple case - of running just myself on a Mac Studio at home or cooking my self on the go I'd probably rather have a cheaper laptop and dedicated hardware. I think for many this is about having control over the model and not about farming things out to a SAAS... what does the saying say opinions are like again.

Comment by not_kurt_godel 9 hours ago

I had some local model FOMO, trialed for a few days, and tentatively arrived at the same conclusion. I can get a better ROI on the time I spent waiting and dealing with poor quality by just programming by hand myself instead.

Comment by devilsdata 10 hours ago

Just to piggyback onto this comment; has anyone tried running multiple of these in conjunction? For example, having a Python script that has one of these orchestrate others, and offloads certain tasks to better/more powerful models, or even cloud models?

Comment by pizzafeelsright 10 hours ago

yes but then that defeats the purpose of 'local'

and if remaining local, the hardware required to run multiple poor models could be better spent running better models.

I have attempted to orchestrate using different models, loading and unloading, but the speed is not there and by the time mistakes are discovered considering the lack of quick iteration the results become worthless unless the task is trivial.

Comment by heipei 16 hours ago

Depends on what you mean by "local". On your Macbook, large dense models like Qwen 3.6 27B will be slow, sure. On a local workstation with a dedicated RTX card you can get > 100 tps, which is more than good enough to work with it, and faster than cloud models in many cases.

Comment by jstanley 16 hours ago

But how smart is it? All the people running local models never seem to mention that they are way dumber than cloud models.

I don't care how many tokens per second of nonsense it can generate.

Comment by throwawayffffas 15 hours ago

Qwen 3.6 35b a3b is about as good as sonnet 4.5. It varies but it's at that level.

Comment by notnullorvoid 15 hours ago

Quantized Gemma 4 26B is as smart or better than GPT 5 in most of my testing. Granted GPT 5 is nearly a year old at this point, but I can run Gemma 4 on a ~6 year old consumer GPU (RTX 3090) and get 140 t/s.

Comment by heipei 16 hours ago

It is smart enough that I use for all my coding tasks, and a lot of other mundane tasks.

It is probably not smart enough for "design this whole architecture of this complex system from scratch, make no mistakes", but that is not something I want from a coding tool anyway. I want a model that I can point to a file and tell it to make some changes to the file and related files. Or that I can ask to review a PR with regards to certain aspects.

My suggestion is to simply try it and see what it feels like.

Comment by lelanthran 14 hours ago

> But how smart is it? All the people running local models never seem to mention that they are way dumber than cloud models.

Well, you aren't going to give it a 20k line sec and have it churn out a full app after 4 hours hours.

But, you can get it to write code for you if you do the design.

Comment by myaccountonhn 16 hours ago

Its not going to be as good as Claude, but if you know what you're doing, it may be good enough to get your work done.

Comment by data-ottawa 16 hours ago

This is task dependent.

I find devstral (even though it’s weak generally) much better at writing and documentation than Opus. I’m actually now delegating all documentation to devstral and away from Claude, which makes a mess.

Comment by garciasn 16 hours ago

A highly skilled carpenter may be able to 'get work done' by banging nails in with a heavy-bottomed cocktail glass, doesn't mean it's not painful to do so when it is continuously breaking and leaving shards of glass all over the workshop for you to find every day for the rest of your life until you clean up the mess you made using the wrong tool for the job.

Comment by sgt101 13 hours ago

If someone comes into the workshop and takes all the tools (hello Donald) then having a cocktail glass to hand might be a bit of a lucky break.

(geddit?)

Comment by CamperBob2 15 hours ago

More like, a highly-skilled carpenter can work miracles with a $6 hammer from the hardware store, while the pros on the commercial crew are using fancy compressed-air tools.

The carpenter has to get up close and personal with the wood. He can't match the crew's throughput, but maybe that's not what he's trying to do.

Comment by c0rruptbytes 15 hours ago

I'm talking about the common use case that I think hacker news people have:

you get a macbook for work, you run the macbook

they're not going to start giving GPUs to employees to run local models

Comment by FuriouslyAdrift 15 hours ago

Kimi 2.6 or 2.8 is what we are playing with locally. They need 512GB to 1TB to run with full capabilities so that's not exactly "desktop"

Our GPU computer server cost $110k.

Comment by abalashov 3 hours ago

But boy, it must be glorious. I use these through OpenRouter and rarely bother with Claude anymore.

Comment by beadw 10 hours ago

I think you’re spot on. In my experience people confuse a models ability to solve some benchmark as a sign of its usefulness. Token throughput is often just as important from my personal usage. I am excited for more diffusion models to see how progress happens there.

Comment by peterlk 10 hours ago

Yes to diffusion models! Combo pipelines of generative and diffusion models have super interesting potential

Comment by ridiculous_leke 15 hours ago

A median laptop is no bueno for running a reliable model(which will be qwen 27b as per my reading here and r/localllama). Powerful macs would be prevalent in certain areas of the world but in rest of the world personal machines aren't always that powerful.

Comment by smcleod 11 hours ago

Those dense models are pretty fast with MTP now. 40-70TK/s depending on your machine, that's faster than cloud models (although not as smart obviously).

Comment by EnPissant 2 hours ago

When running on a GPU, dense models are shaping up to be the best way due to two things:

- Maximum intelligence per VRAM (you dont have much)

- Dense models can benefit from MTP to get an almost 2x speedup in decode (ie, a 27b dense model with mtp decodes at about the same speed as a MoE model with 14b active param model would). This is important because local llm rarely has parallel streams to batch together.

When running on large unified memory like Strix Halo or Spark Dgx, MoE models are usually best:

- You can get similar intelligence as a smaller dense model with fewer active params (to compensate for the slower memory) by throwing ram at the problem.

Comment by NamlchakKhandro 2 hours ago

Pi mono is king. Everything else is hypetrash.

If I can't customise it then I won't waste my time using it it getting use to it.

Claude code is trash, it's customisability is extremely shallow, open code, codex, copilot, Kiro, etc etc... all trash. Yes even open code..

If open code was so awesome then open claw would have been based on it... But it wasn't. That's should tell you everything you need to know.

Comment by locknitpicker 2 hours ago

> I don't know about good, I use a lot of local models and they're still pretty painful to run locally

You are somehow assuming cloud-based models are not painful.

I can tell you my past experience. I was using GPT 5.5 and Claude Opus interchangeably and I prompted them to implement a feature. I paid attention to the agent window and it was literally screwing up implementations, causing tests to fail, and going into test-fail-fix loops to clean up after itself. After a few minutes, it finally called it done. That run cost $0.60.

I went to review the code and only half of the source files complied with the instruction files. I prompted the model to clarify why it failed to comply with the instruction file. The model outputs "you are right, I should have complied with the instruction files. That prompt cost $0.30.

I prompted the model to proceed and apply the instruction file prompts. It went ahead and applied changes. Success. It cost $0.16.

I reviewed the code again. Only half of the sloppy code was touched up. I prompted it to fix the whole mess, not just a couple of files. It complied. One coin less in my purse.

So, around a third of the cost of a feature is spent on the model cleaning the mess it left in it's wake.

And this was a tiny feature with a plan, a solid set of instruction files.

Very expensive.

Are costs going down? I doubt so. OpenAI seems to still be spending 3 times it's revenue already.

In comparison, local models sound very good.

Comment by robomartin 13 hours ago

> On top of that, your laptop becomes a loud hot churning machine, it's uncomfortable to work with.

Laptop?

OK, I've made that mistake before. I understand modern laptops are powerful, but nobody wanting to do serious AI/ML work should be using a laptop for anything other than SSH or similar low-performance access into a proper system.

Years ago I fried two laptops just doing finite element analysis work running 18+ hours per day. It was one of those "I'm giving you all she's got, Captain!" workloads. They fried, even with powerful fans cooling them. I should have known better. Such workloads belong on purpose built systems.

Comment by atomicnumber3 14 hours ago

I largely don't disagree with you but come to a different conclusion. I have two systems:

1) a "programming desktop" with a $500 upper mid range Ryzen (idr exact), 8GB VRAM Radeon card I bought solely for RuneScape, and 64GB ram

2) a maxed out Alienware 16 Area51, so it's a 5090 with 24GB vram and 64GB system ram. I bought it for gaming, of course.

I run qwen 3.6 35B A3B Q6 with 200k context window. I compare this to Claude pro max or whatever that I use at work.

The main difference between the machines is that the one with the RuneScape gpu does 10 TPS while the Alienware does 30-40tps. Both are fine though the 30-40tps is obviously a lot snappier.

I find with both models that:

- they do really well at "be a 30GB zip file of reddit and stackoverflow answers"

- they do really well at point fixing random bullshit errors that would otherwise waste my time (this is related to above of course)

- they do quite well at, given a pretty good specification of what you want, figuring it out, even if you've specified several steps needed

- they both cannot really be given a large ish task and left to just drive it on their own

The main difference between the two is with that last one, Claude is somewhat better and figuring SOMETHING out, but if Claude is having to figure it out, it's probably because I don't know what I want and it's very likely to not make a sane choice, and will generally produce slop given even the slightest amount of leash still.

I've also found that the boundary between "well specified small to medium thing" and "idk just do thing and figure it out" is the difference between you keeping control of the code and losing control. There's an "escape velocity" of AI use that, when you hit it, you're doomed to slop forever. (Or you have to deorbit... enjoy that). And while claude might have slightly higher velocity allowed while remaining suborbital, it's very diminishing returns.

So, are these models "worse" than Claude? Yeah. Am I looking forward to continued improvements? Yeah. But I now also have no desire to pay anthropic any amount of money, which has the nice side effect that i won't be helping them end up with so much money that they can distort our democracy.

Comment by everdrive 16 hours ago

What counts as a lot of memory? What could someone do with 16 GB of RAM?

Comment by throwawayffffas 15 hours ago

Not much, the capable models won't fit unless you go with very low quantization but that leads to a lot of loss.

You generally want to run q8 or some kind of "6bit" quantization at least.

40GB of VRAM is the entry-point in my experience, you can run qwen 3.6 35b a3b with full context or qwen 27b with about 92k of context.

Before you get fully discouraged, you don't need 1 gpu with 40GBs you can use multiple cards, with minimum impact on performance.

Comment by zozbot234 15 hours ago

Modern inference engines can stream in weights from SSD in order to save on RAM, but this makes inference very slow, especially for the trivial single-session case. (Jury is still out on whether batching multiple sessions together can mitigate this well enough, but even then that's mostly helpful for the "running lots of inferences overnight and getting fresh results first thing in the morning" case. Which is interesting (the big third-party suppliers don't really offer a way of doing this at reasonable cost) but a bit of a niche.)

Comment by abalashov 15 hours ago

Not a ton. I'd say 64 GB minimal to play, 96-128 GB better.

Comment by throwawayffffas 15 hours ago

Nah, you can run the 24b - 35b class with between 90k and 256k of context with about 40GB and they are pretty good. Especially the MOE variants fit neatly in 40GB.

Comment by abalashov 11 hours ago

Yeah, but then you need RAM for the rest of your OS and applications. I'd say 64 to be comfortable in the sense to which most HN users are accustomed.

Comment by throwawayffffas 18 minutes ago

Sure sure, if you plan to run it on system ram instead of dedicated gpus then yeah you need an extra overhead to run your own stuff.

Comment by ValdikSS 15 hours ago

Gemma e2b, Gemma e4b. It's made for smartphones basically. You can run e2b with 8GB RAM.

Comment by trouve_search 15 hours ago

gemma 12B 4bit quant; try something with MTP and an AWQ quant

Comment by monegator 15 hours ago

gemma runs pretty well

Comment by greenavocado 16 hours ago

4 bit unsloth quants are good if you never ask for more than 20k context, use it as autocomplete on steroids, and never delegate serious questions to it

Comment by 13 hours ago

Comment by iwontberude 16 hours ago

They are good if you were clever enough to buy a powerful enough rig before memory went up. For everyone else I say just wait. M1 Ultra 128GB and higher is sufficient to run gemma4:31b-mlx or qwen3.6:35b-mlx with subagents. It’s only slow if you don’t know how to plan your work effectively.

Comment by dominotw 16 hours ago

maybe painful if you are using it like a chatbot. you are sitting there waiting for response. vs ambient ai like automatically classifying your family pics and discarding random things like parking floor number pic.

i use it usecases like that latter and they are fine.

Comment by citizenpaul 13 hours ago

They are still terrible at tool usage which loses 99% of the effectiveness of the agent. I've had to concede and use paid frontier models that can use tools or its not worth using agents....copy...paste....copy....paste....

Comment by iwontberude 10 hours ago

Your models aren’t big enough and they are forgetting about the tools. Try a larger model. If you can’t, then your rig was too underpowered anyways.

Comment by DiabloD3 3 hours ago

[dead]

Comment by iLoveOncall 1 hour ago

[dead]

Comment by hypfer 17 hours ago

After having been a happy user of Qwen3.6-27B for a few weeks, due to being away from the hardware, I'm currently forced to use Claude Sonnet 4.6

It is such a downgrade. I don't understand how that's even possible. The thing has so many strongly-held opinions I did not ever ask it for, talking just way too much and generally feeling somehow dumber.

Of course, being significantly larger, it will encode more knowledge, but that doesn't help me when I hate talking to it. And all that on top of the fact that talking with it costs real money.

I wonder what it might be that makes me hate it so much. Maybe because it doesn't see itself as a tool but almost an equal? As if its opinions would have weight.

Qwen too can act like an overeager intern, but if you tell it that it is an idiot, it will drop that ego. Not so much with Claude. In my experience, anyway.

Anyway, point is: full ack on that headline.

Comment by ggerganov 16 hours ago

I haven't spent a dime on cloud inference, so cannot make a direct comparison like you. But I can 100% attest to the fact that Qwen3.6-27B is a very capable local model for coding tasks. Over the last month and a half I've been using it almost daily, either on my M2 Ultra or on my RTX 5090 box. I use it for small mundane tasks at ggml-org [0] - nothing really impressive, but definitely a helpful tool for a maintainer. I think I would be using it much more, if I didn't have to spend a lot of my time on reviewing PRs. Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style. About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac. I definitely prefer running it on the RTX machine - it's so much faster. But for the sake of testing and getting wider experience with local configurations, I often run it on the Mac too.

[0] - https://github.com/search?q=%22Assisted-by%22+user%3Aggml-or...

[1] - https://github.com/ggml-org/llama.cpp/blob/master/.pi/gg/SYS...

Comment by trilogic 16 hours ago

I also confirm that local inference is on par with proprietary cloud services (with a bit of local setup, simple agents.md and some utils skills). This local models come with tools, that's mind blowing, considering that some months ago we had to .md tools ourselves. What makes a model worth even more is "Memory". We implemented that long ago. Last time I used proprietary services was 3 months ago, don´t really need it, my subscription is going blank.

Gerganov, hope you will consider developing further the CLI cause we suffering with the server.

Comment by jayGlow 15 hours ago

what are you using for memory with your local models? is there a specific harness you would recommend for local agents?

Comment by mft_ 12 hours ago

I’m using Hermes at the moment - it comes with lots of tools already baked in for the agent to use - for example web and browser access just worked, rather than having to mess around loads with config scripts and plugins.

I’ve also tried OpenCode (similar but a bit less so) and Pi (fast but you have to add lots of features yourself which is a bit of a pain). Claude Code can also be pointed at a local model and works, but the default system prompt is huge. (~140k of text when I extracted mine, IIRC.)

Comment by vorticalbox 10 hours ago

[dead]

Comment by trilogic 14 hours ago

I use HugstonOne (that backend a personalized version of llama.cpp). Implemented it´s own double layer memory that recall the full or partial previous session/file with an ON/OFF switch (which picks up where left off in CLI or Server or both same time) and another that reads back a % of current tab if memory switch is off doing checkpoints every certain tokens, summarizing and referring back to it when needed (recalled by certain logics). There is more to it when involving local RAG (making it tripple memory layer) but thats a long story.

About the harness depends on for what you need it, but basically for a universal unit of measure, Harness is multilayered and logic and domain specific dependent. I would definitely include Type of Hardware, Model parameters/knowledge, Model Intelligence, Model size/context, type of conversion, type and quantization (models comes with some default tools), but adding your (domain specific), skills, tools, memory, logs, security, Rag, Online search... (which as scary as they sound are mostly simple logics in a txt file, like if this do that).

The full pack is Harness 10, every missing thing lower the harness score.

To answer to your question I would definitely recommend smth like HugstonOne (or anyway llama.cpp CLI) with Qwen 3.6 35B finetuned/distill (deepseek 4 or claude 4.7) with none of the current coding agents out there that are screaming internet connection and proprietary API and data collection. DO this, if you can find a tool that you can download and choose a local model (of your choice in whatever folder locally) and load it ready for inference without any need of internet connection that is the tool you should aim for. Right now there is none out there.

Comment by kpw94 16 hours ago

> About the generation speed: ~100-150 t/s on the RTX 5090 and ~40 t/s on the Mac

Curious if you can share the prefill speed too?

I run locally on a crappy desktop (some AMD iGPU with Vulkan llama.cpp, 32 GB DDR4 RAM) for experimentation. I get 15 tok/s on generation for the qwen & gemma4 MoE models. I get around 150 tok/s prefill speed.

Reason I'm asking about the prefill is looking at my stats at work, I use between 20M to peaks of 300M input tokens daily. Some of those token are cached but in general, I seem to have roughly 500x more input tokens than output. So interested in prefill tok/s stats.

Huge Thank you for llama.cpp btw!!

Comment by ggerganov 15 hours ago

Here are the prefill speeds:

    Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB
  | model                          |       size |     params | backend  |  fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | -------- | --: | --------------: | -------------------: |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |   pp2048 @ d512 |      3714.02 ± 10.85 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d1024 |      3684.86 ± 15.21 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d2048 |       3650.80 ± 8.53 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 |  pp2048 @ d8192 |       3473.88 ± 0.97 |
  | qwen35 27B Q4_K - Medium       |  15.92 GiB |    27.32 B | CUDA     |   1 | pp2048 @ d32768 |       2754.69 ± 4.07 |

  ggml_metal_device_init: GPU name:   MTL0 (Apple M2 Ultra)
  | model                          |       size |     params | backend  | fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | -------- | -: | --------------: | -------------------: |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |   pp2048 @ d512 |        379.75 ± 0.21 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d1024 |        377.15 ± 0.35 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d2048 |        371.46 ± 0.91 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 |  pp2048 @ d8192 |        344.84 ± 0.41 |
  | qwen35 27B Q8_0                |  26.62 GiB |    26.90 B | MTL      |  1 | pp2048 @ d32768 |        222.42 ± 5.29 |

Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions - basically things that are clear to me how to do, just want to automate them. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander at all in the codebase, so I rarely exceed the context window.

Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0] as it allows me to iterate on the implementation much faster.

[0] https://github.com/ggml-org/llama.cpp/pull/19164

Comment by kpw94 15 hours ago

Thanks! Super helpful.

I do use it the same way as you're describing on personal projects at home, in a very crude manner (pasting code snippets in llama server web UI prompt. Next will attempt OpenCode)

At work I use it in similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineer you're delegating work to", where the agent will on its own do the implementation, commit the changes etc.

It does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills etc.

Comment by girvo 10 hours ago

> Currently, I have a very lightweight harness - the pi agent with everything stripped (`pi -nc --offline`) and a short system prompt [1] to align it a bit with my style

This really is the secret to getting the most out of these models IMO. Pi is so damned good. I have a strongly tuned Pi for running Step 3.7 Flash (IQ4_XS) and Qwen 3.6 27B (FP8)

Also, thank you for llama.cpp mate :)

Comment by androiddrew 9 hours ago

I have never heard of step 3.7 flash. Why do you like it? What rough spots have you encountered?

Comment by celrod 16 hours ago

What quant do you run it at? 32GB seems like cutting it close on the rtx 5090 if going 8b, but other commenters are saying 4b lobotomizes the model.

Comment by ggerganov 16 hours ago

As a baseline, I run all models in Q8 [0] because I want to be confident that when I observe a problem, the root cause is not due to the quantization. However, in this specific case, I use Q8 on the mac and Q4 on the RTX machine because the latter does not fit the full context at Q8. So far, I don't have conclusive evidence that the Q4 quantization affects the quality in a significant way for this model and the tasks that I am using it for.

[0] https://huggingface.co/ggerganov/presets/blob/main/preset.in...

Comment by girvo 10 hours ago

27B seems surprisingly resiliant to quantisation. Though my evals showed there was some impact to coding ability from 8 bit to 4 bit, it was less than I would've expected: and it was on task types that you've said above that you don't really do with these!

Comment by 15 hours ago

Comment by toddmorey 14 hours ago

For the curious, it looks like a PC with a RTX 5090 32GB graphics card will run you about $6,000.

Comment by fridder 16 hours ago

Not too shabby. I like the regular Qwen but prompt prefill on my m1max is slow as hell

Comment by StevenWaterman 17 hours ago

Yep, I daily drive Qwen3.6-27B (including for work), have done pretty much since it came out. IMO it's the only (small-ish, local) model worth using, if you can run it. It might not be as good as Opus at "add X large feature" but I don't want that in a model. I want to do the thinking while it does the typing. And Qwen 3.6 27B is perfectly good at that (while in my experience models like the 35A3B and gemma are significant downgrades)

Plus, I never have to worry about rate limits, quotas, or sitting in a queue during peak time. And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.

Running on 2x 3090, 500-1000tok/s prefill and 60tok/s output at Q6_K_XL with MTP on llama.cpp, 220k tokens context window (starts to get a bit dumb above 160k ish), no KV quantization

Comment by indoordin0saur 17 hours ago

> And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.

For this reason I wonder if local models are a potential business opportunity. Provide the service to engineering teams to give them a pre-built and setup GPU rig they can run in a closet. No need to worry about all the things you mentioned and clients can rest-assured their data isn't disappearing into a sketchy data center. There might be regulatory reasons that make on-prem setups appealing as well.

Comment by amoshebb 17 hours ago

This is, as far as I know, the business model of coys like mistral and cohere

Comment by suncemoje 16 hours ago

On-premise (1960-2010) -> Cloud (2010-2026) -> On-premise (2026+)?

Comment by indoordin0saur 16 hours ago

I think that's overstated, but the loss of trust companies have with the big AI players is pretty serious. Not a big deal if your app is for sharing cat videos, but if you're medical or wealth management or a government contractor or the like enterprise clients really like to see good data security policies.

Comment by lelanthran 13 hours ago

> Not a big deal if your app is for sharing cat videos, but if you're medical or wealth management or a government contractor or the like enterprise clients really like to see good data security policies.

If this mattered to them, they wouldn't be running so much in the cloud or in proprietary software that they have no ability to air-gap.

If companies ever cared about this, Windows would not be dominant on the desktop.

Comment by indoordin0saur 12 hours ago

There are a lot of government jobs I know of that are absolutely air-gapped. Your computer has basically no internet access, everything is stored on-prem. Hedge funds also tend to be extremely locked down, from what I saw when I interviewed. With certain data sets either having strict encryption-in-transit or a being stored in a quirky on-prem service. I can't imagine they're going to be dumping their data into Claude, etc.

As to why Windows is so dominant, I'm as clueless as you.

Comment by suncemoje 2 hours ago

> There are a lot of government jobs I know of that are absolutely air-gapped. Your computer has basically no internet access, everything is stored on-prem.

I wonder if that's because they don't know better or because of a lack of trust or costs?

Comment by suncemoje 16 hours ago

Agree. I also wonder how zero e.g., Claude Enterprise ZDR really is, and what their data pipeline actually looks like.

Comment by cyanydeez 16 hours ago

I think the next step to anyone but overbloated USA models is to follow https://chatjimmy.ai/ with one of the qwen models. If they can mass produce something at relative cost, these would be awesome sidecars.

Comment by hughw 16 hours ago

Just this morning I tweaked my single 3090 setup too:

  OLLAMA_FLASH_ATTENTION=1
  OLLAMA_KV_CACHE_TYPE=q8_0
  OLLAMA_CONTEXT_LENGTH=180000

and that fits in 23GB.

[edited for format]

Comment by giancarlostoro 17 hours ago

> (starts to get a bit dumb above 160k ish)

If open models can ever hold roughly 600k token windows, I'll be really excited, I found that around 300 ~ 400k of Claude reading through your codebase results in better outputs. I also have Claude read official docs instead of "guessing" as to how to do something.

Comment by StevenWaterman 17 hours ago

I think we'll get there. Right now it works for me, because I'm naturally pretty verbose in my prompts, and know the codebase well, so I know what it needs to look at. Plus subagents for anything exploratory.

I think deepseek v4 pro has 1m context and does pretty well up to around 600k. But if you have the hardware to run that locally, you already know

Even then if there's a smaller model with 1M context, you'll need a ton of RAM to actually run it at full 1M. I guess that's why you don't see it too much. Anyone that could run Qwen 3.6 27B with 1m context would be better off running a much bigger model with smaller context instead, in the same amount of VRAM.

In terms of optimizing further, huge context + KV quantization sounds like a terrible idea, but there's some decent innovation in sparse attention, KV cache rotation allowing Q8 to perform nearly as well as full 16-bit precision, plus some ideas around offloading KV cache to system RAM (but I'm skeptical)

Comment by zozbot234 16 hours ago

DeepSeek V4 (both Flash and Pro) has very good scaling of context length wrt. RAM use, so this is not an inherent limit of LLMs in general.

Comment by 0xc133 16 hours ago

With yarn and rope scaling arguments for llama.cpp you could run qwen3.6-27B with 1M context… if you have enough memory to store it.

Comment by cyanydeez 16 hours ago

I don't really think you're making reasonable decisions at that size; but I suppose if you're not allowed to refactor it, maybe.

I think the way these models work excludes sane behaviors the larger the context gets as each token introduces potential ambiguities between "USER" and "SYSTEM" messages leading to all the catastrophic behaviors.

Anyway, with AMD395+ I'm finding ~100k is both speed and context usefulness unless it's scoped tightly. with opencode, I manage it with dynamic context pruning: https://github.com/Opencode-DCP/opencode-dynamic-context-pru... ; then anything I touch ends up being refactored so context doesn't get bloated with unecessary functions, etc.

Obviously, this isn't compatible with certain business codebases, so I can see why bloat meets bloat.

Comment by QuantumNoodle 16 hours ago

Do you have any resources on hardware necessary for running models and tweaks? I see you mention 2x 3090 and I wanted to do more search on what hardware is satisfactory for what models.

Comment by iamtheworstdev 15 hours ago

are you running an NVLink? I have the same setup but no NVLink and it feels like it's best just splitting the 3090s to run separate models concurrently. But I also have no idea what I'm doing.

Comment by fluoridation 13 hours ago

It depends on what you're comparing. If the same model fits on the combined VRAM but not on a single contiguous VRAM, then it won't be faster to run two instances of it. If you're comparing a 23 GB model running duplicated vs a 46 GB model running split, then yeah, that will likely be faster, just because there's no synchronization between cards.

AFAIUI, there'd be little advantage in having a higher speed inter-card connection, because the cards don't really talk to each other during inference. The loss of efficiency compared to a monolithic memory architecture comes from scheduling, not from data transfer.

Comment by Andrex 15 hours ago

How long have you been using it?

Comment by epistasis 16 hours ago

> talking just way too much

OMG this is such an annoying property, just shut the hell up please, and be concise.

I suspect that this is an artifact of the thinking property, but please just summarize the thinking process far more concisely, where a single sentence answer is more than sufficient the frontier models seem devoted to going on to a minimum of 5 paragraphs and offering 3-5 new directions.

And requests to please only offer a single step at once, or single option at once, or to even stop eagerly offering future directions is really hard to prompt correctly.

And look, there I did exactly what I was complaining about...

Comment by bityard 16 hours ago

I'm not sure to what degree you can influence how a model thinks, but you can definitely hide the thinking tokens and tell the model how you want it to talk to you.

For example, the Claude web UI has an Instructions field where I have told it never to congratulate or praise me for asking questions. Earlier Copilot models used a ridiculous number of emoji and bullet lists when answering literally every prompt, I told it to knock that off and prefer detailed paragraphs in prose.

Local agents/frameworks/whatever all have their equivalents for overall user preferences.

Comment by epistasis 15 hours ago

Thanks for the reminder! For others looking for this setting, it is currently under User Menu (click your account name in the lower left), then "Settings", then the "General" tab there's an "Instructions for Claude" box.

Asking Claude for this provides incorrect instructions for me, so I'm guessing it moves around a lot.

Comment by illegalsmile 16 hours ago

That's why you have to give claude and others directives/.md at the beginning so it doesn't go off the deep end with suggestions.

Comment by epistasis 16 hours ago

Yeah, I've tried, and I'm sure somebody is going to say "skill issue" but it's not so easy to get the model to do that. Maybe it should be a SKILLS.md issue.

Edit: also, how can I stop the LLM from all this fake glazing, as if every question I have is some sort of unique genius insight, it's so damn annoying. I just got the third straight round of this while merely trying to get summarization of a PDF:

> Good question — it gets right at a real tension in the paper. Let me check the current state of actual SV-imputation efforts, since this has moved since 2020.

Comment by bornfreddy 15 hours ago

I didn't try telling to be concise and stop pampering me yet (but good idea, tomorrow), however I found that instead of me writing agent instructions, it works much better if I tell claude to write instructions for itself. I do check if they make sense of course, but its wording works much better than mine.

Comment by frereubu 12 hours ago

[dead]

Comment by derethanhausen 17 hours ago

I would not generalize based on experiences with Sonnet. The flagship models (Opus being the claude equivalent) are dramatically better.

Comment by hypfer 17 hours ago

Opus in my experience is equally unpleasant "character"-wise, but at least it actually gets stuff done more often, so it's at least slightly more earned at that. It's still a neurotic cargo-culting dogmatic idiot, but one that at least sometimes does produce deliverables instead of only bottom-tier HN-esque opinions.

Hmm. I think I might just fundamentally disagree with Anthropic about the idea of what a "tool" should be.

Comment by hypfer 2 hours ago

This morning I have been blessed with an example of the exact behavior that is so infuriating.

> But re-reading the comment:

> "In the real world however, it does not. Hence, in the future we might fail this check even if it works within this limited check."

> The comment says "we might fail this check even if it works" — implying the original intent was to always fail (return 1) as a conservative stance, leaving room to tighten it later. The new code does the opposite of what the comment describes.

> So either:

> 1) The comment needs updating to reflect the new intent, or

> 2) The logic is wrong and should still return 1 unconditionally if the original intent was preserved.

> Which is it — did you intentionally want to succeed when X works, making the comment outdated? Or should the return 1 be restored?

> User: did you see the "in the future"?

> Yes, fair point — "in the future we might fail this check" is forward-looking, not a statement of current intent. The comment is saying the door is left open to tightening the check later, not that it should currently always fail. So the current code is correct and the comment still holds. All good.

___

It reads/comprehends 5% of what is on screen but acts like it knows it all, bossing you around. Why is it like this. What's wrong with Anthropic and what's wrong with people _not_ getting incredibly angry at that unearned smartass.

Comment by KaoruAoiShiho 13 hours ago

Fable largely fixed the annoying chatterness so sucks that it's gone now.

Comment by kitd 17 hours ago

Funny that coding agents have personalities, including "that colleague" you want to avoid even if you know they're probably quite good at what they do!

Comment by otabdeveloper4 12 hours ago

That's exactly what RLHF is for.

(In fact, "that colleague" might have even been the source of the RLHF training set.)

Comment by linuxhansl 7 hours ago

How qwen3.6:27b compare to qwen3.6:35b-a3b (MoE) in your experience (if you tried). I find the dense models are way too slow on my H/W.

Comment by hypfer 2 hours ago

I don't really like them tbh.

They are a lot faster, yes, but at least for my needs, they feel a lot dumber too.

It's a shame that the gemma4 that _would_ fit 24GB VRAM is MoE. :(

Comment by radium3d 17 hours ago

If you think about it, they're splitting the power across millions of users. Essentially, these AI companies have YOUR hardware that YOU are paying (them) for in a cabinet at some data center. This means the hardware could easily be run locally for inference for these 'big' models. It's just a problem of dynamics-- RAM is being bought in bulk by these companies through these B200 style cards, instead of sold slowly through the open public markets.

This is likely due to a combination of mass funding for the AI companies, but also they are trying to governmentally restrict which countries get access to these cards so certain countries get a head start. The only way to lock that down is to have them literally locked in their own GPU prisons (data centers). Third reason is it does make it possible to train the models faster by having them in the same data center connected directly. Having them distributed to everyone would slow down training considerably.

The current way to 'own' decent RAM and GPUs right now is through the stock market it seems.

Comment by giancarlostoro 17 hours ago

There's a model on Huggingface where someone takes Qwen and makes it think Opus style, and that one seems to be decent, not sure if they have the 27B variant in that style. I do wonder if you can tweak your system prompt to force Qwen to behave better?

Comment by StevenWaterman 17 hours ago

You read the OP backwards, they said Sonnet is a downgrade from Qwen, and prefer Qwen's tone

Comment by giancarlostoro 16 hours ago

Sure, but my argument still holds, the idea is that Qwen reasons the way that Opus on High (what is now Max or whatever?) level thinking to reason about problems instead of its standard approach.

Comment by whythismatters 17 hours ago

Yes, Qwopus :) I've been pleasantly surprised by its quality

Comment by giancarlostoro 16 hours ago

Seen that one too, same guy I'm thinking of too, havent had a chance to try all of their models. For anyone curious I believe the username is Jackrong on huggingface? They've got several models out on there each focused on programming from different approaches.

Comment by MostlyStable 17 hours ago

Curious if you have tried custom instructions. I was never quite as unhappy with Claude's voice as you appear to be, but there were several things I didn't like. A custom prompt fixed almost all of them.

Comment by clickety_clack 17 hours ago

I think it would be very hard to convince someone to pay $100/mo to go back to Claude if they have a local model up and running, particularly now that model improvement has basically been stalled for the last 6 months. It’s so easy to set it up for yourself now too with things like LM studio. That said, there will always be unsophisticated users who can’t figure it out, so there will always be someone there to pay.

Comment by MostlyStable 16 hours ago

The person I was replying to specifically said that the Claude will "encode more knowledge" and that their problem was that they didn't like talking to Claude. It sounds like they think that Claude is at least slightly more functional. And the "not liking talking to it" is probably fixable. Someone for whom a local model works, and for whom the economics make sense, should absolutely run a local model and I wouldn't try to convince them otherwise. I'm sure it's the right choice for a lot of people. But not liking the personality of Claude is probably not a great reason on its own, given the minuscule amount of effort it takes to fix.

Comment by Scoundreller 17 hours ago

The third category are the occasional users that won’t have the hardware and won’t stomach a monthly fee for “unlimited” but are happy to pay-per-use.

I’d think the volume for that category would be low but LLMs aren’t just for coding.

Comment by dghlsakjg 16 hours ago

I’m probably the third category. I like experimenting and trying different models and techniques. I want api access for my own apps and Claude subscriptions don’t have that.

Sure I could splash out a ton of money for a high ram Mac, but deepseek is so dirt cheap that I think depreciation on a high end machine costs more than my api spend.

Example of what I’m using it for: building a semantic database of podcast content (podcast discoverability sucks on an episode level). I need a cheap LLM, an embedder, a transcriber, none of which Claude will do.

My api costs for coding agents plus running apps are about ~$20/month, but I get more than just chat + Claude code.

If all I was doing was pumping an employers codebase through a coding agent, Claude would be the answer.

Comment by chrisweekly 17 hours ago

Not everyone has the right hardware.

Comment by clickety_clack 16 hours ago

I guess I’m thinking of the $100/mo users, for whom it’s probably possible to get the right hardware.

Comment by andix 15 hours ago

Sonnet is extremely overpriced. It's a good model, but not worth the money Anthropic charges for it.

Comment by 17 hours ago

Comment by dackdel 16 hours ago

what kind of hardware do you need in order to run qwen3.6-27b

Comment by giancarlostoro 16 hours ago

Depends on which variant you pull down, but a single 5090 GPU (I know these are insanely expensive, but for context) could run either the Q8 or Q4_K_M version. It will not fit the 52GB version (BF16) on the other hand. So any modern Mac with a Pro or better processor and more than 52GB of RAM (don't forget VRAM for context window also matters!) would suffice, as someone else noted, probably a 128GB model would do the trick, and give you enough wiggle room to max out the context window.

My Mac only has 16GB of VRAM (20GB total - 8 is reserved for the OS) so I have to leave room for VRAM, I usually find a model that fits in 5 to 7 GB of VRAM and then max the context window as much as I can.

Comment by daemonologist 13 hours ago

The benefit of running the full precision version is negligible (probably not even measurable above the benchmark noise floor). Most common for cost-conscious users is to run something around 4-6 bits per weight, which would fit on a 24 or 32 GB card (as you mentioned).

Comment by pixelesque 14 hours ago

Note you can change the amount of shared (V)RAM reserved for the OS with:

sudo sysctl iogpu.wired_limit_mb=18800

will allow you to use more, but you do need to leave a bit for the OS obviously!

Comment by giancarlostoro 14 hours ago

Oh man! I had no idea I could do this at all! What do you usually tweak it to? I feel like 8 GB is probably still a reasonable amount to give the rest of the OS.

Comment by pixelesque 13 hours ago

I've got a 32 GB MBPro, and I set it to 27700, which I haven't seen a problem with so far.

Comment by dackdel 4 hours ago

thanks

Comment by iagooar 16 hours ago

I recommend MacBook M5 Max with 128 GB of RAM to run it comfortably and fast. If you have something like a regular M4, go with qwen3.6-35b-a3d - the Mixture of Expert architecture makes it run 2-3x faster than the 27b version.

Comment by dackdel 4 hours ago

thanks

Comment by sbmthakur 16 hours ago

I could run it on 7900 XT with 64k context. You could run it more comfortably on a 24 gb vram.

Comment by dackdel 3 hours ago

thanks

Comment by indoordin0saur 17 hours ago

Very curious what hardware you're running this on!

Comment by hypfer 17 hours ago

The same 24GB VRAM RTX 4090 I bought to play Cyberpunk 2077 with.

Works perfectly fine in llama.cpp throwing 70+t/s at me with 128k q8 K/V context when using the IQ4_NL quant + MTP at q4 MTP K/V.

Also leaving this here because you might find it useful: https://hypfer.github.io/will-it-fit-llama-cpp/

Comment by indoordin0saur 17 hours ago

Nice! Do you do anything with that compute when you're not actively using it? Is the crypto-mining hobby still worth it? I've also wondered if such expensive hardware can be rented back out to offset cost. Looks like these cards are going for as much as $4k nowadays.

Comment by all2 16 hours ago

There are services where you can hook your card up and rent it out to other users. I don't know what any of them are called, but they do exist.

Comment by dghlsakjg 16 hours ago

Salad.com is one. (I’m unaffiliated, just happened to come across it this week while looking for a cheap option)

Comment by hypfer 16 hours ago

I've paid ~2k€ in 2023. Since I'm usually sitting next to it, I'm only using it when I want to use it. It can get quite loud and warm.

Crypto (to my knowledge at least) moved away from GPU mining. I guess you could maybe rent out GPU compute, but - being in germany - it's not worth the legal hassle. You could of course always commit tax fraud, though I wouldn't recommend that.

Comment by esseph 16 hours ago

> I've also wondered if such expensive hardware can be rented back out to offset cost.

Massive legal liability. Not worth it.

Comment by Rzor 10 hours ago

Can you fix MTP-GEMMA-4-26B-A4B-IT? It says the weights are 0.5 GB in size.

edit: nvm, I'm confusing models.

Comment by 10 hours ago

Comment by cdelsolar 17 hours ago

What did you call me?

Comment by zerd 16 hours ago

I noticed Fable was quite a bit terser, and I think it's due to changes in the system prompt [0]. They're literally saying "just give me the TLDR" and "give brief updates". You can tweak a lot of that with an AGENTS.md.

[0] https://twelvetables.blog/comparing-claude-fable-5s-system-p...

Comment by chrisweekly 17 hours ago

Why Sonnet 4.6 not Opus?

Comment by ltononro 16 hours ago

Well but comparing with sonnet 4.6 instead of opus 4.6,.7 or .8 doesnt make a real point I mean, pay 200 USD/month (if you have that cash, or your company has it), might not justify using local at all (unless you have some reason to suspect about data leakage)

Comment by dyauspitr 14 hours ago

Why would I want some half assed coding assist tool. I want something that takes in a requirement and spits out a finished product. It’s not your equal, it’s better than you.

Comment by calebm 15 hours ago

sync/ack

Comment by cmrdporcupine 15 hours ago

The Anthropic models have always been annoying this way -- chatty/opinionated and Dunning-Krugerish. And love to run away and do things unprompted with me jamming my ESC ESC ESC key over and over so I can get a word in edgewise.

FWIW Codex/GPT models are way less this way. Maybe to a fault.

I'm setting up my DGX Spark to try Qwen 3.6 27B again, as I'm hearing a lot of good reviews. When I tried it some time ago it was still early for support in llama.cpp.

Comment by rmunn 17 hours ago

This is the kind of thing that Anthropic et al should be worried about. As it becomes easier and easier to run local models, the ceiling of what they'll be able to charge will get lower and lower. Not that nobody will be willing to pay $$$$$ per month, but a lot of people are going to multiply the per-month charge by 12 or 24 and say "Could I set up a local model for less than that, and have it pay for itself within a year or two?" And if a significant portion of customers decide to buy instead of rent, the companies whose business model is entirely centered around renting will suddenly find themselves hurting for customers.

Comment by sathackr 17 hours ago

The opposite of that has been happening for 20 years now with cloud compute.

It won't happen with AI models either.

It's almost ingrained in the American business model now. Outsource everything. Nobody wants to manage a room full of servers when they can spend 2-3x as much and outsource that headache along with the responsibility for it.

Same will happen with AI. Whether that means paying Anthropic that premium or paying AWS.

I'm in a relatively small business, we recently had an outage related to our local infrastructure.

I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.

Everyone wants to shuck the chore and the responsibility.

Comment by preommr 16 hours ago

> The opposite of that has been happening for 20 years now with cloud compute. It won't happen with AI models either.

AI is different.

Cloud computing genuinely is cheaper on average. It's better than paying for cisco servers, and at scale, it's cheaper than managed platforms (ala Heroku), and it's a coin toss for when you're in the middle ground and constantly approaching the point of rebuilding poor-man versions of existing products but with very very expensive engineering salaries.

In contrast, local models offer dramatic savings, and are magnitude of orders better in certain aspects: like stability - the performance is all over the place with traditional AI companies as they divert compute to their next big thing.

The benefits to maintaining your own infrastructure are pretty moderate to low, with very high risk.

And also, alternate models are pretty easy to use and easy to swap out unlike the vendor lock-in that exists with cloud services.

Comment by codethief 14 hours ago

> AI is different.

I agree. The other thing here is that, once you can run LLMs on a single piece of commodity hardware (whether that includes one GPU or several), the difference between cloud vs. on-premise LLMs will largely be about where your hardware is located. There will be very little software configuration involved (just an HTTP endpoint that talks to the GPU). This is decidedly different from cloud products where the moat of hyperscalers is largely in the software and services on top of the hardware, not the hardware itself. (Sure, GPUs will eventually break & need replacement, too, but there's no state to lose, so that's already orders of magnitude easier than replacing hard drives.)

Comment by rmunn 4 hours ago

There's also a difference in the cost of downtime. A server hosting your website or SaaS, if it's down for five minutes, costs you a lot of real revenue. So you plan for redundancy, you set up automatic failover so that if one node goes down the next node can handle the load while the first one reboots, and so on. But for the LLM that's just serving your local model? You can tell everyone "Hey, we're taking it down for a 15-minute window, so plan your lunch break while it's down". Unplanned downtime can interrupt what people were doing and cost you productivity and thus money, but it's a lot easier to schedule planned downtime and have people work on non-model-using tasks during those periods: the model is helpful, but not essential.

Comment by 15155 13 hours ago

> Cloud computing genuinely is cheaper on average.

For some applications, sure. Availability is a large part of what one is paying for with cloud computing, but it's also something that not every business needs.

If you sacrifice availability and have a pure-compute use case (low durability requirements), on-prem can quickly end up cheaper for far better hardware.

Comment by richardwhiuk 15 hours ago

There's no economic reason why running a model locally should be better than using a cloud hosted version.

Comment by moregrist 11 hours ago

“There is no reason anyone would want a computer in their home." - Ken Olson, Founder of Digital Equipment Corporation, in 1977

Comment by mcmoor 6 hours ago

In hindsight this is getting truer, what with the push of dumb terminal for everyone

Comment by RevEng 7 hours ago

You pay a 3x markup to rent a server through AWS than managing your own. You pay for convenience. At shall annals that's fine, but for large companies with their own datacenters, you generally do things in house.

Comment by spockz 14 hours ago

Sure there is. Keeping your IP in house.

Comment by TkTech 16 hours ago

For many companies (country-dependent) that's not really why they use cloud services vs purchasing. It's tax shenanigans and business process overhead. OpEx vs CapEx, and a small (%) bump in the huge AWS bill no one will even notice or a $30k+ invoice for hardware that has to go through rigorous review and 3 departments.

Same reason people pay for things through the AWS marketplace (like Vanta) instead of having to go through their invoicing process.

Comment by codethief 14 hours ago

Good point. Maybe there'll be companies that maintain your on-premise GPU cluster just like there are companies that service the coffee machine in your office?

Comment by mohamedkoubaa 10 hours ago

This is far more likely than everyone racking their own servers.

Comment by otabdeveloper4 11 hours ago

> on-premise GPU cluster

Renting a GPU server from a cloud and hosting your own llama.cpp is the path of least resistance.

Comment by wraptile 2 hours ago

> It won't happen with AI models either.

AI is definitely different. Cloud compute is incredibly convenient to the point where even if AWS is more expensive it's just so _nice_. LLM models are much more abstract and while I can't easily swap AWS for Hetzner to save 80% of my costs I can absolutely get close to that for many of LLM tasks, even today.

I suspect Anthropic and gang all know that that's why they are buying up dev tools and shifting towards long-running agents because that's where they can get AWS's "nicesness" that they can charge for.

Comment by dreambuffer 17 hours ago

It's just not comparable though is it? You need cloud services because it's physically impossible to use your single home computer as a server, CDN, load balancer, mass storage, security service, and distributed system.

But AI is just weights, you can run a reasonably intelligent model at home, or on a few GPUs if you're a small-medium sized company, and it doesn't require dedicated maintenance.

Comment by pessimizer 14 hours ago

If you're a medium-large company, you should definitely run your own AI because you can max out the CPUs more often. You're not only able to run privately and locally, but you're also able to run efficiently.

Comment by frobisher 1 hour ago

I suppose cloud won because: - nobody wants to deal with the networking stack on the internet - you want servers alive all the time - it's businesses running their software on servers to serve to customers

Do these apply to AI?

Comment by cheema33 17 hours ago

> I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.

Same here. My job as a software dev does not require me to self-host services we need and use. Quite the opposite. But, I am reluctant to hand over all control to AWS or equivalent for several reasons that I will get into here.

I have found that Infrastructure as Code (IaC) and modern tools like opentofu, ansible, combined with frontier AI models and harnesses gives you superpowers in this space. Almost all of our self-hosted services are fully managed by these tools. e.g. We perform backups and test them more often now than we ever did before. Entirely because it is so much easier to do all of that now.

Comment by rapidfl 7 hours ago

There is efficiency in the cloud model for models. So maybe there is a scope for Apple or an "Apple for AI" in the AI compute game - mainly from the perspective of privacy etc.

And once the servers are in space, everything is fully out there.

Comment by Terr_ 14 hours ago

IMO local-vs-cloud may be a misleading dichotomy, versus:

    1. Individual dev machines
    2. Shared local server
    3. Shared server in corporate cloud
    4. Third-party LLM SaaS provider

Even if you don't want your laptop melting, there are still some important differences between 3 and 4 in terms of data privacy and security.

Comment by chris_money202 8 hours ago

on prem cloud is harder because of the scale up and scale down requirements. If you are a growing business which most decent ones are, you constantly have to think about that.

Comment by matheusmoreira 9 hours ago

> Everyone wants to shuck the chore and the responsibility.

Which gives all the power to the big techs. I'll never understand why the average company seems to have no problem with this.

Comment by keeda 8 hours ago

It's a longstanding management principle, so old that people may not even say it explicitly any more, which states "focus on your core competencies," the corollary of which is "outsource anything that is not a core competency."

I can see how it makes sense for companies, because money is "only money" but an ongoing operational distraction can be much more costly, as in, it can be detrimental to the success of the overall business.

Comment by akoboldfrying 9 hours ago

Did you build your own house using tools that you forged from iron-rich ore yourself? Did you grow your own wheat to make bread for your lunchtime sandwich today?

There's a reason most people pay other people to do these things for them.

Comment by derfurth 17 hours ago

That's an interesting take, however there is no ongoing maintenance related to local models, maybe the only effort is giving more capable machines to the workforce; but yeah I can see how it might feel like a barrier.

Comment by sathackr 16 hours ago

The hardware, the power systems, the cooling systems. They need maintenance.

The OS needs updates, file systems get corrupted.

Fans get dirty.

All the things that you need to deal with in hosting your own server infrastructure you have to deal with when hosting your own AI infrastructure (which runs on servers...)

Comment by ajb 15 hours ago

However, you can get many of the benefits of a "local model" by outsourcing all the hardware maintenance but still using an open model. Guaranteed repeatability for one.

A lot of the reason people outsource normal software is its brittle security properties, not sure that even applies to an LLM - it can go and look up the latest security best practices just like an engineer can.

Comment by davidw 16 hours ago

Still though, perhaps the existence of low-margin, generic, cloud LLM's puts some downward pressure on the 'brand name' companies?

Comment by otabdeveloper4 11 hours ago

> in the American business model

AI company valuations won't survive if they're only for the "American business model".

Comment by mohamedkoubaa 10 hours ago

Exactly. American businesses aren't even particularly efficient or well run

Comment by CamperBob2 15 hours ago

outsource that headache along with the responsibility for it

You know what gives me headaches? When I'm in the middle of a session and the model gets rug-pulled out from under me because somebody at the model provider didn't pay the Trump bill that month.

Or when someone at the model provider decides that the curve-fitting algorithm in my graphics package looks a little too much like Skynet for comfort.

Or when they do any number of other things to undermine my work for the sake of their business model, some of which I won't even notice until the damage is done.

The sad thing is, if you know how inference works, you know that it really is insanely wasteful for everybody to run it locally. If anything naturally belongs in the cloud, it's inference. But at the same time, what choice are we being given?

Comment by mohamedkoubaa 10 hours ago

What about inference suggests it naturally belongs in the cloud?

Comment by CamperBob2 9 hours ago

Inference basically looks like this (neglecting a whole bunch of stuff):

    for t in tokens_in_context
        for p in model_weights
            do something with p*t

The expensive part is fetching each weight from memory, which is why VRAM/HBM is such a big deal. Conceptually, for a huge, dense (non-MoE) model, the inner loop might run a trillion times for every token generated.

Obviously that's not how it really works in practice, but the point is, if you are only running one prompt at a time, each weight gets fetched, applied to the token being processed, and then never touched again until the next token is processed.

So when you submit a prompt to a model that's running a bunch of other peoples' contexts concurrently, it can reuse each weight multiple times before moving on to the next one:

    for p in model_weights
        for u in users
           for t in u's context
              do something with p*t

The same is true in an agent-heavy scenario where you have several contexts in play at once.

Worst case, in terms of energy efficiency, is a single user sitting around waiting for a single response. I don't feel like I'm explaining it well, but the core idea is that every time a weight is fetched from memory, you want to get as much work done as possible with it.

Comment by mohamedkoubaa 9 hours ago

That makes a lot of sense, thank you. I think a pirate cloud of local models could make sense, but that would be regulated into oblivion

Comment by starshadowx2 10 hours ago

Earlier I was thinking it's maybe comparable to paying for Netflix vs torrenting and running Plex or something. For the majority of normal, mainstream users I feel like most would just pay for the thing that is already setup and ready for them. There'll still be all the more techy or determined types who will do it themselves, I just wonder what the percentages of both groups will be.

Comment by aeonfox 5 hours ago

> I feel like most would just pay for the thing that is already setup and ready for them

Nothing stopping turnkey OSS AI hardware being productised, including niceties like opt-in automated updates. If the trend continues of models becoming smaller and more capable for everyday use, it also derisks against obsolescence.

Comment by rapidfl 7 hours ago

if we get to the stage where the AI hardware is a more of a commodity and usability becomes 10x simpler, then people may buy their own hardware and run local models.

Everybody owns a car, washer, TV, etc today. Maybe one could finance a server-box/trailer costing $20k, trade it in every 7 years for a newer model, etc. Many people are going to own a $20k Optimus.

Comment by fragmede 6 hours ago

The car, TV, washer, and whatever humanoid robot finds product market fit physically need to be in my house, or close to it, in order for them to be useful to me. Thanks to the Internet, the data center doesn't need to be, like at all. Economy of scale says that renting a slice of time on the most expensive GPU supercomputer out there is going to be faster and also probably cheaper since I'd only be getting a slice while the server is serving multiple users.

Comment by jhonof 5 hours ago

Why don't you just buy a chromebook or equiv and do that now then?

Comment by indoordin0saur 17 hours ago

I'm curious when coding-heavy companies will start running their own on-prem AI clusters. Has anyone had the idea to sell something like 4 GPU machine an engineering team could throw in a closet somewhere and run whatever they want on it? I imagine this won't appeal to everybody but with the trust issues the hyperscalers have developed hoovering up people's data and using it to train their models, I imagine some will find value in a machine and model they have transparent control over including the option to walk over and unplug the thing.

Comment by CamperBob2 15 hours ago

Has anyone had the idea to sell something like 4 GPU machine an engineering team could throw in a closet somewhere and run whatever they want on it?

I think that's basically Geohot's business model at Tiny Corp.

Comment by storus 16 hours ago

They are working hard on you not being able to run a thing locally. OpenAI buys all RAM on the spot market, causing the rise of RAM/VRAM prices 6x, making GPUs and decent computers unreachable for the majority of the population. OK, some richer folks might be able to get a 512GB MacStudio or a single RTX Pro 6000 for 13k and be able to run some decent local models, but the vast majority will need to use API. And at some point Nvidia might say: "We don't sell that many 6000s, so let's just cancel them altogether as we can gain 4x profit on datacenter-only GPUs" and then they'll become unobtainium and no private person would ever be able to run anything decent (~1 year behind the frontier) locally.

Comment by nodja 12 hours ago

I wonder if this move will backfire on them. All the fabs are focusing on HBM and leaving DDR behind, if one of the big frontier labs folds all the memory fabs will be left holding a big bag of HBM memory. They won't have any other choice but sell for cheap so it wouldn't surprise me if we see a return of HBM in the consumer market in 3-5 years.

Comment by wuliwong 17 hours ago

These local models can do some of the work the non-frontier models can do but for me, that's not worth much. If I am just using Sonnet 4.6, I can pretty much work all day on the $20/month plan. And Sonnet is still a way more powerful model than a one you could self host on an M2 mac.

If things change to token usage billing for everyone, maybe I'll be singing a different tune but on a subscription, I don't think it makes sense financially.

Fun? Yes. Financially sound? No.

Comment by mohamedkoubaa 10 hours ago

What about when the gravy train stops and Sonnet is priced with some marine above the cost to provide it?

Comment by bityard 16 hours ago

The general consensus is that local models will continue to improve drastically, but hosted models will as well. There will _always_ be a pretty big gulf of capability between what you can do with a desk full of hardware at home vs a few racks of hardware in a datacenter. That seems to be the real "moat" of hosted models at this point in time: access to capital.

What's interesting/exciting is that local models are _already_ quite good at tasks we never imagined AI _ever_ doing before ChatGPT hit the scene just a few short years ago.

We're also in an interesting point in time where companies are releasing the fruits of their research/labor (the LLMs) to the general public for free. For now, I think they see it in their best interest to gain mindshare and rapport, as well as advancing the state of the art in smaller LLMs ("a rising tide lifts all boats") but I fear and expect that these will dry up as the major players buy the minor players, and all will seek a return on their considerable investments in AI research.

Comment by cogman10 16 hours ago

I believe there's a level of diminishing returns. Sure, SOTA will probably always benchmark better than local models. But do we need it? That's the question that the likes of OpenAI and Anthropic should be worried about.

Comment by regularfry 15 hours ago

The difference won't be in the individual tasks. It'll be in the scale of job they can take on and how you interact with the model. Think of pairing with a junior vs replacing a full delivery team, that's the sort of difference we'll be looking at. We'll be able to get closer to the latter by being more clever with harnesses, I reckon, but the frontier labs will run ahead because for any given harness trick they can lean harder on model smarts.

Comment by cogman10 15 hours ago

True, but my point is that if/when local models get to the point where they are capable of doing the "delivery team" work what's next? What can these bigger SOTA models offer? And especially what can they offer above and beyond what you might be able to get from much cheaper models which the open models are based on?

That's what I mean by diminishing returns.

Comment by spockz 14 hours ago

There is also the thing of workflow.

We have set up something where you create a ticket, Make sure it contains enough information, and with the right tag added it will make a branch with PR for you which stays up to date based on updates to the ticket and comments on the PR.

It’s creepy in a way. But you also can’t really use local (as in workstation LLM) for that. Sure we could run something like a distributed task scheduler across all our engineer devices but just pushing it to copilot is easier.

Comment by mohamedkoubaa 10 hours ago

It the model is as good as composer, has a decent harness around it, and isn't incompetent at tool calls - it'll be useful at least as a sub agent for most workflows in perpituity.

Comment by rimliu 2 hours ago

Nothing will improve drastically anymore. And when big ones run out of money to burn, who will train your local models?

Comment by icoder 17 hours ago

What I don't understand is that on one hand we read 'what they charge is much less than it costs them' and on the other hand this thread seems to suggest that 'what they charge is more than it would cost me'.

Comment by bluGill 16 hours ago

What it costs is tricky to measure. A large part of the costs are training the model. Once they have the model they are making a ton of profit from what they charge (or so we think - I haven't seen the numbers). However the sunk costs of getting the model need to be paid for and that means an accounting problem where we have to guess how much the model will be used in the future.

Accountants are reasonably good at figuring this out - there are a lot of different things that need a large upfront investment before you can charge anything. People still debate if they are correct in this each case.

Comment by esailija 17 hours ago

Bigger models that Antrophic want to sell cost disproportionately more (e.g. 100% more cost for 5% performance improvement) than small models you would use locally

Comment by 17 hours ago

Comment by 15155 13 hours ago

They have to provide the service at peak scale and high-availability, your local setup doesn't have those extremely expensive requirements.

Comment by themaninthedark 17 hours ago

Maybe that is why they are buying up as much hardware as they can? If their service is the only game in town.

Comment by otterdude 17 hours ago

Data Center providers are buying hardware, not anthropic. Certainly related but alot of the hardware purchased is just sitting in a warehouse waiting for a data center to get built.

Comment by frollogaston 14 hours ago

Anthropic isn't just renting out compute, they're renting out a closed model that's better than anything you can download for free. So they're rightfully focused on preventing others from distilling their model.

Comment by fmap 2 hours ago

It's in Anthropic's best interest to focus the conversation on "distillation".

Imo the more interesting thing to focus on is that there are now several more labs with the expertise and capabilities to train trillion parameter models. That's a serious technical accomplishment and the main reason why open models are catching up to Anthropic and OpenAI (and local models are typically distillations of much larger models).

Who cares that they got some small amount of training data out of Claude. The crux is that the big US labs are not special, they just have a first mover advantage that's slowly shrinking as incremental progress becomes harder.

Comment by ActorNightly 13 hours ago

Local models will never achieve "real" performance (i.e actual usage, not benchmarks) compared to frontier models.

Comment by pessimizer 14 hours ago

> but a lot of people are going to multiply the per-month charge by 12 or 24 and say "Could I set up a local model for less than that, and have it pay for itself within a year or two?" And if a significant portion of customers decide to buy instead of rent, the companies whose business model is entirely centered around renting will suddenly find themselves hurting for customers.

And those are going to all be big enterprise companies that probably will set up LLM services entirely in-house, because they've got the headcount to utilize servers at 100%.

I wonder if there will be (or is currently) business in selling their compute while they're not working, to opposite time zones, etc.

What's left for the big providers will be the dregs of individual subscriptions and small businesses that at their least paranoid might let employees just use their own subscriptions for work.

Comment by sbmthakur 16 hours ago

Someone was able to run gemma-4-26B-A4B on an i5-8500 with 32 gb ram with NO GPU. Granted this is an extreme example these MoE models are value for money for a lot of use cases.

https://www.reddit.com/r/LocalLLaMA/s/YontVNVRbL

Comment by pornel 12 hours ago

[meta] I wonder why people have such wildly different bar for what is "good" agentic coding?

In a way, it's absolutely amazing that we've went from "Playing 'Set a Timer' on Apple Music" intelligence to something that may pass the Turing Test, but in practical terms the small models are still far from what I'd call "good" for more than a tech demo.

To me, 7B models are just a fuzzy echo of Wikipedia. Gemma models at 4 bit are too clumsy to even reliably generate JSON for tool calls or copy a line of code to apply a patch.

Qwen needs so much detail and babysitting to stop it from doom looping or losing the plot, that the instructions that I need to give are usually longer than the code I end up keeping.

Is there some magic prompt that I don't know? Do other people just have a lot more patience, or way lower expectations?

Comment by papersail 12 hours ago

I had similar doubts. I think expectations differ because the workload differs. For small scripts, glue code, or simple CRUD changes, smaller models such as Qwen3.6-27B can work wonders than they do on a larger, messier code base.

Comment by cheschire 7 hours ago

Haves and have nots.

We aren’t wealthy enough to have the hardware that would make this good.

The people who have the money to buy a spare maxed out Mac mini just don’t get it. I see lots of folks with RTX 6000’s in threads like these. Or any RTX card that ends in “90”.

Cloud AI is what allows the proles to participate in the broader AI conversation, but not these AI conversations.

Comment by monegator 52 minutes ago

But cloud is what will enslave them to the corporation's will.

Google (of all companies!) demonstrated you can get useful stuff with reasonable performance with model running local on their smartphones.

Depending on your expecations you can get the local models running on a recent enough laptop, you just need 16GB of ram to be comfortable. It certainly exceeded my expectations (but i don't use the LLM to write code, only to do the real boring stuff: docs.)

Comment by verdverm 10 hours ago

There is a lower bar (that gets lower over time), but ime, the config you are describing is too low still.

qwen/gemma in the 27/35B range @fp8 are better than gemini-2.5, but less than gemini-3.1, you can run DS4-flash @fp8 on two DGX spark, and things keep becoming better. DiffusionGemma came out recently with 4x token gen speeds.

tl;dr - the models you appear to be trying with are too small or too quant'd

Comment by embedding-shape 17 hours ago

Show us the resulting code of using them! :) I want to use local models, I have the hardware for it, but while trying them out as replacements for GPT 5.5 xhigh or Opus or other SOTA models, they aren't quite ready to be replaced yet, sadly. The quality and bumps they encounter just slows down the workflow so much, even screwing up tool call syntax sometimes.

But, for smaller more well-defined workflows, or as straight "edit this part to be like this exact" edits, they seem more than enough. Still waiting for them to become mature enough to be able to replace what we have as SOTA today, I'd say it's ready to be switched over then.

Speaking of local models, DiffusionGemma (and diffusion models in general) should not be slept on for local usage! Usually the problem locally is that the LLMs aren't efficiently making use of your hardware, unless you start batching requests and run many at the same time, but that require different approaches in general. Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Today I finally finished porting diffusiongemma-26B-A4B-it support from Transformers into Candle, and together with some optimizations I now have it basically flying with ~450 tok/s (~19 it/s) in Candle during inference, instead of ~180 tok/s (~11 it/s) from HF's Transformers library. Even using vLLM with similar sized LLMs, I don't think I've ever gotten past the ~250 tok/s threshold for single prompts, exciting stuff for local models :)

Comment by zozbot234 17 hours ago

> Instead, diffusion models work much faster for individual prompts, and not by a small margin either.

Diffusion models can't really be trained beyond low-to-mid size and have lower quality than an equally sized, plain one-token-at-a-time model.

Comment by embedding-shape 16 hours ago

As mentioned, I've just finished the implementation and started playing around with it, seems to be doing similarly well inside of my own agent harness as similarly sized "traditional" LLMs. Of course, neither come close to SOTA models, but I suppose if we can figure out the scaling issues you mention, we'd get a bit closer. The performance just feels like it's too good to quickly ditch diffusion. Do you have more info what those "can't be trained beyond low/mid size" issues are in practice today?

Comment by zozbot234 16 hours ago

The issues around training diffusion models are well known among researchers. They're likely to not be feasibly scalable far beyond the 26B size of DiffusionGemma itself, and their lower quality compared to an equally-sized auto-regressive model (the usual one-token-at-a-time flow) is also a matter of broad consensus.

Comment by embedding-shape 16 hours ago

> They're likely to not be feasibly scalable far beyond the 26B size of DiffusionGemma itself

I think people used to say the same about the 8B text-diffusion models too when they came out, like LLaDA. LLaDA2.0 seemingly claims 100B total / 6.1B active MoE diffusion (DiffusionGemma is also MoE). Not saying you're wrong about the current consensus, but it has a way of changing over time, might be a bit early to claim it's infeasible to scale them, especially considering the final artifact being much more suitable for local usage.

Comment by famouswaffles 13 hours ago

Difficulty of scaling is not the only issue. Nobody is going to be particularly invested in scaling an architecture that has:

- consistently proven behind their auto-regressive counterparts in quality. Look at the dgemma benchmarks - pretty steep dropoffs and the more difficult the benchmark the worse the dropoff. That's not a good look and it's not like its some artifact of google's release. Every dllm is like this.

- And whose inference benefits are negated at scale. Transformers are still cheaper if you want to serve lots of users.

>"DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"

Put yourself in the shoes of all the labs, even open source ones. Why would you put much effort into this ?

Comment by embedding-shape 13 hours ago

> - And whose inference benefits are negated at scale. Transformers are still cheaper if you want to serve lots of users.

But my entire point is about the reverse of this, the context of what I bring up is in single-user scenarios, which is where these diffusion models really make a large difference in performance.

Sure, I agree it's not a good fit for every single use case out there, everywhere. But after starting to play around with it closer myself, I think people are dismissing it a bit too quickly, at least if you're interested in running local models on your own hardware.

Comment by famouswaffles 12 hours ago

I don't think you're really getting the point I'm trying to make. Everyone training llms regularly cares about serving users at scale and quality per compute invested. It's not just about OpenAI or Anthropic or Google. Qwen, Deepseek, Moonshot, whatever. They all care about it very much and basically can't afford to take a step back in those areas.

Since training models is currently a very expensive procedure, diffusion llms are destined to be relegated to the occasional research artifact at best. As things stand, making a serious commitment to them is basically the equivalent of throwing money into a fire pit and things are expensive enough as is.

Alternate Architectures that do a much better job matching transformers in quality have basically gone nowhere but you expect one that is basically worse in every way the labs care about won't ? I'm not trying to 'dismiss' dllms. I'm interested in them for the same reason you are. I'm just stating the factors at play plainly.

Comment by zozbot234 13 hours ago

Single user scenarios can also use MTP to make auto-regressive inference more compute-intensive with no loss of quality.

Comment by iagooar 16 hours ago

I love running two models locally: qwen3.6 27B 8bit (dense) and qwen3.6 35B 4bit (MoE).

The 27B is the smarter, more reliable one - but it is slower. The 35B is faster, still very smart but below 27B, a bit less reliable. The reason is the MoE - Mixture of Experts architecture, which only activates a subset of parameters, making the model much much faster.

I run the 27B on a MacBook Pro M5 Max + 40 GPU cores + 128GB RAM (well, on this beast I can have 27B + 35B in memory at the same time with headroom for all the other stuff). But because this is a laptop, it is not possible to run local LLMs all the time - it just gets too hot and too loud.

What excites me more: I run the 35B model on a MacMini M4 with 64GB RAM. It is fast, it gets a lot of work done (e.g. it scans, extracts and classifies my emails, it watches the mailbox all the time and does work). I also use it as my private Hermes assistant ("when is the next Starship launch?", "who is playing today at the World Cup? Give me some trivia").

Next step I am planning is a RTX Pro 6000 Blackwell workstation I can put in my basement. I want to run qwen really fast, with multiple threads / prompts / agents at once. And MAYBE if the budget allows, a 2x RTX Pro 6000 setup in order to run DeepSeek v4 flash on it (to run research on it).

Comment by jnaina 2 hours ago

how are you connecting the 35B model to your mailbox, for email classification?

Comment by Barbing 16 hours ago

Did you get a Brave search API key or something for that “Hermes”?

Comment by nickthegreek 15 hours ago

I have my mine setup with a searxng instance I run in a docker. Works great and costs zero.

Comment by dghlsakjg 16 hours ago

Hermes is just an agent that can be setup for whatever you want (coding or more commonly personal assistant ala clawdbot). You can set it up with any of the standard tools and MCPs like brave or tavily for search.

Comment by iagooar 15 hours ago

Yes, Brave search is one of these services I highly recommend paying for, the search they provide (similar to Exa, Tavily) is what makes an "OK LLM" become super smart.

Comment by verdverm 10 hours ago

I'm using SearXNG, EXA, Tavily, and soon (tm) Cloudflare

They all give slightly different results, you can dedup / fusion with heuristics / another agent

Comment by zerd 15 hours ago

I'd love an RTX 6000 Pro, but how can you justify it when it costs 10 years worth of Claude Max?

Comment by iagooar 15 hours ago

10 years worth of Claude Max today. Also - Anthropic recently removed a model I relied on and isn't giving it back. As a non-US citizen, I would rather pay in advance but be sure, I will keep having access to inference on my own terms.

Also, it will just be faster - and more fun too.

Comment by girvo 10 hours ago

Because they're fun to play with :)

But also because it's likely to hold most of its value into the next few years, based on the looks of things, too

Comment by sieste 51 minutes ago

The "middle powers" (cf Carney) should invest in local models, rather than relying on US and China allowing them to rent their AI models. It takes a single executive order to cut the rest of the world off of American AI tools. "I'm happy to pay whatever to rent frontier models from hyperscalers" makes sense if you're citizen of a superpower, but it's risky, naive, bordering on irresponsible to adopt this mindset otherwise, especially when your business or career depend on the tool.

Comment by k__ 11 minutes ago

Training DeepSeek was magnitudes cheaper than training the SOTA models it relied upon.

In theory, other countries should be able to replicate that effort and improve it.

Comment by angry_octet 10 hours ago

Programmers are used to paying nothing for tools. A basic laptop (SSD, multi core, 16GB of RAM) is hugely powerful if you are building in C/C++/Rust, even python. But all of a sudden it's no good, and we're back to using someone else's computer, hiring our tools every day. Worse, we get a different model every day, and maybe we aren't allowed to borrow the good tools some days because some mafioso are shaking down the manufacturer.

Most other trades need to invest significantly in tools. If you want good tooling, you really want 64GB of GPU memory (e.g. 2x 5090) and 96GB of RAM. If I'm paying $200k for an expert engineer then $50k every other year for tooling seems pretty reasonable.

Comment by rsanek 9 hours ago

Who's paying the $50k? I don't see how it makes sense to pay that much for a home-grown setup when I could pay <$5k/year total for both of the two best frontier models at effectively unlimited usage.

Comment by fragmede 7 hours ago

> best frontier models at effectively unlimited usage.

It would've been easy to spend $5k on Fable in the short week it was available. If that's the direction things are going (we can assume GPT-6 to be if similar class) $5k's not going to get you "best frontier models at effectively unlimited usage".

Comment by sosodev 17 hours ago

I think this is overselling their capabilities. I've used Gemma 4 and Qwen 3.6 quite a bit on my strix halo home server. They're great models and the dense variants are significantly better, but they're still very far behind the frontier. If you boot up Gemma 4 MoE and OpenCode/Pi and expect to perform anything like Claude Code or Codex you're going to be very disappointed.

Comment by kristopolous 8 hours ago

You need to switch out the prompts and work with it differently.

I posted this yesterday https://github.com/day50-dev/petsitter

I use it with https://github.com/day50-dev/simple-llm-cli

And modify the "tricks" until my evals get to good numbers. It's a model by model basis.

This is what the larger firms are doing - they have custom prompts per model

Comment by sosodev 5 hours ago

Petsitter's default tricks doesn't seem to do much for Qwen3.6, right? JSON mode could be useful I suppose, but that's not really going to make it better at writing code. Do you have any other example tricks? I'm having a hard time understanding how I would apply them.

Comment by kristopolous 4 hours ago

thanks for the feedback ... i'll work on publishing them.

I haven't include more sophisticated ones because they are complicated and I wanted to avoid the friction

Comment by schmuhblaster 8 hours ago

I’ve been playing around with qwen3.6-35b-a3b and managed to boost it significantly by leveraging my own custom harness [0].

It is quite astonishing to see how far local models have progressed, and I think that if you enjoy tinkering a bit, you can save a good bit of money (if you happen to have the hardware lying around anyways). Overall it’s still hard to beat the the cost/convenience combination of a cloud based model provider though.

[0] https://deepclause.substack.com/p/how-to-make-small-models-p...

Comment by phunterlau 1 hour ago

Cool, so the determinstic harness can boost the agent pretty much!

Comment by edg5000 4 hours ago

Harness engineering is very interesting stuff. Thanks for sharing.

Comment by chrismarlow9 17 hours ago

You can use a frontier model to create a plan that's specific enough for a local model of a very small size to execute on. The more specific you are and compartmentalize tasks the "dumber" the local model can be.

Edit: Obviously you'll be using more tokens but this is the trade off for running a smaller model and running locally. Similar to time memory trade off but in token economics. Sorry I need more coffee

Comment by segmondy 16 hours ago

It's more than good. As of today, it's great. Those models listed in the blog are horrible compared to what you can run today, There's absolutely no reason to run those, you have Qwen3.6, Gemma4, and plenty other sized comparable models.

If you're resourceful, you can even run SOTA models. KimiK2.7, MiMo-V2.5/V2.5-Pro, MiniMax2.5/2.7/3, DeepSeekV3.1/v3.2/V4-Flash/V4Pro, GLM5.1, Step3.7-Flash, Qwen3.5-397B, Qwen3.5-122B, gpt-oss-120B

Comment by agile-gift0262 1 hour ago

> Qwen3.5-122B

do you find Qwen3.5-122B to be SOTA-level? I moved from it to Qwen3.6-27B (both Q8), and I prefer 3.6-27B, and it leaves me room to spare for other small models

Comment by 0xc0c0c0 17 hours ago

I have used local models (around 128 gb) and the big proprietary models, and while I do want local models to win, it's important we keep the expectations of local models realistic. There are many blog posts about how local models today can fully replace some of the proprietary models and in some cases its true for the much smaller proprietary models, its very clearly much more behind the larger models.

You can be far more ambiguous with your tasks with the larger proprietary models as opposed to the local models. You can achieve the similar results with local models but you need to be much more detailed in your prompt.

One of the biggest things about running these local models is that the harness matters almost just as much as the model too. Codex is optimized for GPT models, CC is optimized for Claude, Cursor has a great harness that works very well across these providers. It took me a couple of iterations of the different harnesses to find one that would work well with the smaller Qwen models to do local coding.

Comment by failbuffer 16 hours ago

So which harness did you end up choosing?

Comment by ngxson 16 hours ago

My 2c: I think the "cloud vs local" debate is (maybe) a false dichotomy. In my experience, I use a hybrid approach and I've seen a huge productivity boost from it.

The cloud-based models are fine for big and complex tasks, but the pricing is ridiculous for small stuff—like summarizing a discussion or fixing a small bug. And cloud and privacy have never been a good match.

As an example, this comment itself was written with the help of Qwen3.5-4B running locally with an extension on top of llama.cpp default web UI [1]. The extension injects my browser's context directly into the conversation, which allows me to summarize things and draft up comments quickly. Speed is pretty acceptable for the size: ~5s TTFT and ~100 t/s generation, all running on a Macbook M5.

And when I want to run bigger tasks, I don't just stick to one provider. Apart from well-known closed-weight providers like OpenAI or Anthropic, I also experiment with open-weight models like GLM-5.1, DeepSeek V4, and Qwen3.6-27B, which provide quite good results for the price.

I'd argue both have value, and I don't see why anyone needs to choose one exclusively. Anyone else doing this?

[1]: https://github.com/ngxson/llama-companion

Comment by phainopepla2 16 hours ago

Why not just use DS V4 Flash for the small stuff? Very fast and extremely cheap.

Comment by ngxson 16 hours ago

The dsv4 flash is 158B params in total. It is possible to run locally but will require all my system RAM.

Also, a lot of my day-to-day tasks perform the same on both small and bigger models: summarize a web page, draft a response, translations, quick web search, etc.

Comment by phainopepla2 15 hours ago

Sorry, I meant non-locally.

I'm assuming privacy is not a concern since you mentioned using Deepseek already. The cost of V4 Flash for small tasks is so minuscule as to be almost free, and you don't have to deal with a churning laptop (or even buying a high-end laptop, for someone who doesn't already have one).

I guess what I'm really asking is, what's the advantage of using these small local models if privacy isn't a concern?

Comment by ngxson 15 hours ago

I do use both DSv4 the "normal" and the flash variant, non-locally. It works well, not exceptionally. And while it's cheap, I'd say that the difference between $1 per month vs $5 per month is not a big concern to me. IMO pricing is pretty competitive among open-weight models: https://huggingface.co/inference/models

Depending on use cases, but for me I found 2 use cases where a local model is a must and not optional:

- Running offline without internet access: for example, I have this project that allow transcribe and summarize audio in real time. I already used it in some events where wifi is not available: https://github.com/ngxson/llama.cpp-realtime-audio-recap

- Handle private personal data, for example health records. This is the same category of "privacy" that you mentioned, but I just want to bring up the fact that people value their privacy differently.

Comment by coder543 12 hours ago

dsv4 flash has 284 billion parameters, not 158 billion.

Huggingface's little parameter count badge seems unreliable.

Comment by 10 hours ago

Comment by delis-thumbs-7e 12 hours ago

Nobody asked, but I don’t think any of us should be using SoA models to code or to do pretty much anything at all. Instead we should develop open models to work on specific tasks and learn to code, write, draw etc. using fingers made of bones and brains made of flesh. Big corporations and research facilities can run them to generate code or math or whatever, with a bunch of specialists to check the output to be correct. Then again, even that might not be worth the costs (e.g. OpenAI’s 36B$ net loss last year), when the open models are so close and the whole AI scheme is running out of scams to pull.

There’s a lot of things we could use even quite small models for, which would not need an insane amount of computing power and memory, but too few of us is really researching them.

Comment by Tharre 16 hours ago

I've been running Qwen3.6-35B-A3B (and 3.5 previously) locally and it's a great model for many small tasks, probably a significant chunk of what most normal people are using LLMs for right now.

But for coding in a harness? In my experience it's unusable even for small projects. It just gets hard stuck at every little problem, wasting hundreds of thousands of tokens trying to make a convoluted solution work instead of doing the obvious thing. Or it will spend hours trying to reason through a fairly simple code flow, incrementally adding debug print statements, only to get confused by the output and then editing completely unrelated code that it convinced itself is the problem.

I've tried instead giving Sonnet the problem description and code and have it come up with a detailed plan that Qwen should implement, but doing that actually consumes a significant amount of tokens compared to just telling it to implement everything, and the results are honestly not that much better. There are just too often subtle issues with the plan that Qwen doesn't recognize when implementing, but make the resulting solution it comes up with unusable.

Comment by androiddrew 9 hours ago

$2600 will buy you two AMD 9700 gpus with 32Gb ram per card running about 285 Watts per card. Less than a 5090 in both cost and power. A VLLM build patched for AITER and you can run Qwen3.6 27B FP8 at roughly 45-50TPS during real coding sessions with Opencode or PI with a full context window. I really hope more 30B dense models continue to be released, but Qwen3.6 should get you a lot of agentic mileage.

ROCm stack is not for people though who aren’t willing to dig in and patch things themselves.

Comment by jnaina 2 hours ago

Running Qwen3-30B-A3B-Instruct-2507-AWQ-4bit on an Olares One with NVIDIA GeForce RTX 5090 Mobile GPU (24GB GDDR7 VRAM) and an Intel Core Ultra 9 275HX processor.

Plenty fast for coding work and for sharing with my OpenClaw setup.

Currently in the process of adding another external GPU (RTX 4090 with pipeline parallelism) via thunderbolt 5 to the Olares One box, for higher quantization, possibly 8-bit, larger context, better concurrency, more kv cache.

Comment by _doctor_love 17 hours ago

"Just get a 64GB Mac with 1TB of storage!"

LOL - some of us have a budget

Comment by swatcoder 17 hours ago

Sure, but it's also not really out of scale with the cost of a shop tool in other trades.

If you're a professional that's confident in a positive return on the investment (optimal or not), or just a hobbyist with the luxury budget for a "shop" that cost is well within norms.

That's not everybody, of course, but it's not some inconceivable fantasy. A lot of people in the tech community here on HN, specifically, end up with pretty high discretionary budgets that they pour into stuff like this.

Comment by frollogaston 14 hours ago

But you can get that return from a paid service too, in fact it'll be better. So just comparing costs, what's the annualized ROI on the Mac Studio assuming it means you avoid paying $240/y for Claude? Cause I can always set aside the Mac's price in some investments and pay for Claude out of that.

Comment by swatcoder 13 hours ago

Same with many and their shop tools in other trades.

Most hobbyists and many professionals could end up far ahead financially by leveraging makerspaces, tool rentals, and co-op shops or even by hiring out a professional to prep certain intermediates for them, but they get psychological value -- as well as flexibility, reliability, and resale opportunity -- from having their own well-outfitted shop.

And they can afford that premium, so they do. At the scale of individuals and small shops, not everything that matters gets captured in financial models.

Comment by frollogaston 13 hours ago

Yeah but the local model doesn't have those advantages for the coding use cases, at least not yet. In theory you could post-train one on your codebase or something, but nobody cares to do that when any vanilla coding agent service can read and understand the whole thing better than a locally tuned free model. I was already being very generous towards the Mac in pretending it does the same thing as the paid service.

Aside, physical tools tend to be financially advantageous to own if you're going to use them a lot. Even if the owner were targeting 0 profit, they'd have to charge more to factor in the cost of dealing with customers and increased risk of wear/damage by users who don't care as much.

Comment by swatcoder 13 hours ago

The shifting sands of commercial models or pay-per-use managed models are just really not appealing to a lot of people.

Most come with huge privacy concerns, total costs and availability are impossible to forecast very far out, and the specific behavior of frontier models in particular is not something anybody can rely on as those are subscription products that are subject behavior on their publisher's whims (whether from changing system prompts, new "safeguards", retired models, forced "updates", new regulations, etc).

It's quite hard to put a price on all that, and as more people find local models productive enough or develop curiosity to explore models, training, or harness-crafting in their own ways, the marginal cost of buying some shop hardware just sort of disappears into the budget noise for plenty enough people.

Comment by Gigachad 8 hours ago

Hosted is still much cheaper and you get a better model. Some day I imagine the gap will close but it hasn't yet.

Comment by amalcon 17 hours ago

A Strix Halo with similar RAM is considerably cheaper. Still not cheap, mind, but performance is OK (not great) and it will run more or less the same models.

Comment by AbsurdCensor 17 hours ago

At least for me, it's been pretty great, but I bought my system when it was $1800, now looks like the same system is $2700 and out of stock. I still haven't quite been able to run 120B parameter models under Windows, but for Qwen Coder 30B, it works pretty darn well for my at home needs.

Comment by amalcon 17 hours ago

Yeah, they have gone up a lot since I bought mine too. I did get Qwen3.5-122b running on all-GPU (on a 128GB machine) under a minimal Arch Linux setup (I do my GUI work on a much cheaper box). It worked, but Qwen3.6-35b is performing almost as well and a lot faster.

Still cheaper than a new Mac. Maybe not cheaper than a used one.

Comment by AbsurdCensor 14 hours ago

I've certainly thought about just moving the box to Linux, but it took far to long personally to get everything running under AMD and it works 'well enough' that I don't want to make the switch. I tried playing with GAIA on it, felt a bit limited, and now have Hermes up and running, and that seems to work quite well. All the tools are changing so quickly, it's sometimes difficult to settle in on 'what's best', so I certainly can understand folks that just want to pay for a AI subscription and be done with it.

Comment by tjwebbnorfolk 17 hours ago

AI and budgets don't mix well at the moment

Comment by techscruggs 17 hours ago

He is using a 2022 M2, which you can get that for about $2k used. That is beyond reasonable.

Comment by Shekelphile 17 hours ago

She

Comment by psychoslave 17 hours ago

Global Affordability Estimate:

Top 10% of global earners (~800M people) can afford a $2,000 device without major financial strain.

Top 25% (~2B people) could afford it with some budget adjustments.

Bottom 50% (~4B people) would find it prohibitively expensive.

So for a SV top income, maybe that might look more like the weekly pet brushing budget, but for most people out there this is not that much of a no-brainer.

Comment by disgruntledphd2 17 hours ago

The maths changes if you're working for yourself. Because I live in Europe, I've ended up working as a contractor due to the lack of a legal entity in my country. While that mostly sucked for a bunch of reasons, I was able to get a 64Gb Mac M2 a few years back with approximately a 52% discount, which was kinda nice.

Comment by weego 17 hours ago

If you're working for yourself paying monthly is exactly the same as amortising an asset. Personally I'd rather my business just pay $100 a month than have to deal with additional hardware and software maintenance while using a depreciating asset that is break-even after 3-5 years depending on the spec.

Comment by 14 hours ago

Comment by frollogaston 14 hours ago

Bottom 50% aren't paying for Claude either, probably also don't own PCs or write code

Comment by richwater 17 hours ago

Yes, because the bottom 50%, mostly impoverished or near impoverished folks were spending money on Claude Code subscriptions instead /s

Comment by themythfable 17 hours ago

Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

Besides those with effectively unlimited budgets for their personal compute, local models are still a long ways off.

Though, that shouldn't be conflated with the value of open-source models, which can be used by cloud providers to significantly reduce cost of intelligence.

Comment by embedding-shape 17 hours ago

> Yeah, I never had a computer that cost north of $800 until recently. While that is far from the typical HN user's budget, my bet is that it is much closer to average.

There are segments, everything from "Average person in world" to "Average creative professional using computers for work" and more on HN, with a wide range of costs for the hardware. HN probably skews towards the latter rather than the former, probably sitting with enterprise hardware next to them basically for fun, hard to make wider conclusions from what people here have or not.

Comment by sublinear 17 hours ago

If we define "typical" as the median HN budget, it's probably about the same as yours. Maybe the answer would have been different 10 or 20 years ago, but the era of truly needing a big budget PC has been over for a while.

It's just for gaming and AI now. Maybe not even gaming as much anymore.

Consider the perspective of someone who has a practically unlimited budget for PCs, doesn't game much anymore, and doesn't need AI to do their job. It's just part of getting older, and there are plenty of people in their late 30s and older on here.

Comment by anarticle 17 hours ago

Pros buy their own tools. This is why working for yourself is better than working for a corpo, you get to choose your weapon.

Comment by p-e-w 17 hours ago

No need. You can run the Gemma 4 and Qwen3.5 MoE models with as little as 12 GB of VRAM at 30-40 tps (Q4/Q5), and they both blow GPT-4o and DeepSeek R1 out of the water.

Comment by dofm 17 hours ago

[dead]

Comment by minton 13 hours ago

I’m glad people are looking into this because I do think it’s the future. However, why would you not take advantage of the heavily subsidized frontier models while you can. It’s obvious that they’re gonna have to raise prices at which point it might make sense to consider local models, but not today.

Comment by fendy3002 13 hours ago

Curiosity or anticipation I think. I have tried it in the name of those 2 factors, because when the frontier model price increase happens and we don't know anything about local models, we're screwed

Comment by ios-contractor 2 hours ago

I subscribe to this guy on youtube for local model stuff if anyone is interested https://www.youtube.com/@AZisk. I'm not affiliated and I'm not even a paying subscriber. But I like all stuff local.

Comment by andwhatisthis 1 hour ago

I clicked and immediately subscribed, but then checked out his latest videos and was so put off by the stereotypical clickbait stuff (stupid faces on thumbnails, "I tried (...) and then THIS happened" etc) that I unsubscribed. I understand that it must be what one needs to do to maximize views and brown nose the recommendation algorithm but I just find it incredibly off putting

Comment by anubhav200 17 hours ago

I have been using qwen and glm based models from last 2 years, ended up buying mutiple machines for the same. Overall i feel 24vram is a must have to get get performance (speed wise) to match hosted soln. I have 2 machines a 12gb vram one and a 24gb one. On 12gb vram i get around 50tps generation and 500tps prompt processing and on 24gb one i get 180tps generation and 3500tps prompt processing. I have different configs for different scenarios and I also use llama cpp manager manage all my configs (https://github.com/anubhavgupta/llama-cpp-manager)

Comment by K0IN 10 hours ago

In a day to day base i host Qwen3.6:27b, but i *Really* want to host deepseekv4 flash, its such a "good" model for its size/speed/price.

I really wonder when companies will start hosting theire model for everday tasks on prem, cause its good enough (and realative cheap), instead of paying subscriptions for all devs.

Comment by dejawu 16 hours ago

If vibe-coding is hopping into a self-driving car and telling it to take you anywhere you can get a coffee, then I use coding agents more like a bicycle - they let me get further faster than if I'd walked, but I still have to decide where to go and how to get there, and I still have to pedal.

I don't vibe-code, but I do decide what to implement and what patterns to use (perhaps asking the model to analyze and give advice on this first), then I have it handle the nitty-gritty of the implementation itself. For this usage style, the latest local models are as good as having Claude at home.

I won't say it's been _easy_ (I ended up implementing my own harness to accommodate the idiosyncrasies of local models), but I will say that for the effort, having a coding agent that's essentially free to query as much as I want has been life-changing as a dev, especially when it comes to working on side projects. Knowing that my agent will never get worse in quality, suddenly cost more than it does now, or be suddenly made unavailable by external factors, was absolutely worth the trouble. And on top of all that, I can't believe it's as good as it is.

Comment by simonw 17 hours ago

I think gemma-4-26b-a4b and Qwen3.6-35B-A3B show that there's something very interesting about a local model that does mixture-of-experts (which helps a lot with performance) and has in the order of 30 billion parameters.

These models are very capable, and use around 20-30GB of RAM while they are running.

Provided you have 64GB of RAM that leaves space for running other applications at the same time.

Comment by chrisweekly 17 hours ago

Obtaining that 64GB RAM is a meaningful obstacle for many.

Comment by simonw 16 hours ago

I'm still amazed that you can run LLMs of this quality on a machine that costs less than $3,000.

I used to assume that anything GPT-4 equivalent or higher would need $30,000+ of server-class hardware.

That said... gemma-4-12b-qat is 7.15GB on disk so should run reasonably well in 16GB, that takes it down to MacBook Air territory https://lmstudio.ai/models/google/gemma-4-12b-qat

Comment by frollogaston 13 hours ago

Not just RAM, VRAM, right? Though they're one and the same on the Mac.

Comment by bayshark 12 hours ago

Hey everyone, made a local LLM, configured for Home Assistant called Selora AI.

Specs: qwen3_17b_base.Q6_K.gguf selora-v047-answer.f16.gguf selora-v047-automation.f16.gguf selora-v047-clarification.f16.gguf selora-v047-command.f16.gguf

The full base model and LoRA adapters are only 3.5GB

Capabilities include configuring for smart home setup to help with answers, clarifications, commands, and creating automations in Home Assistant. The models with the LoRA adapters were made with lean scripted data made specifically for Home Assistant. A lot of work was put into this, feel free to give it a try and happy for any feedback!

https://huggingface.co/selorahomes/Selora-AI

Comment by linuxhansl 7 hours ago

I soooo wish that to be true. Alas, in my experience it is not... Yet.

What is true is that it gets easier and faster to run local models. With QAT (quantization aware training), turboquant (or similar) K/V compression; what used to be impossible to run is now fairly easy.

I can run gemma4:26b-a4b-qat on my laptop with 20-30 tokens/s with a 256k context window. That was unthinkable just 6 months ago.

So the local models are "OK" for small'ish projects.

But it does not at all(!) compare to the frontier models. For a large project Claude's Opus 4.6+ just work, whereas local gemma tangles itself up, makes weird mistakes, and just can't handle it (for those cases it is faster if I do it myself).

If the trends continues, with 1.58bit QAT models, even better K/V compression, faster multi-token prediction et al, maybe soon it will be comparable.

Comment by infogulch 12 hours ago

Anybody used a tinybox? https://tinygrad.org/#tinybox

The most "affordable" option is red v2 with 64GB GPU ram and costs $12,000. This is only ("only") 1.5x-3x the price of a beefy desktop (https://pcpartpicker.com/builds/), and could crush inference work even on bigger models. It could support coding tasks for a small team of developers, or run an AI agent for every person in your household...

Comment by pornel 9 hours ago

64GB VRAM is too little to run good coding models IMHO. May be useful if you need voice models or run some slightly-smarter-regex batch processing or RAG workflows. Perhaps you're supposed to buy 4 or 8 of these and split inference across them.

If you have $12K to spend, you may be better off with DGX Spark or a Mac with 128GB VRAM. That can (barely) fit DeepSeek V4 Flash.

Comment by gregwebs 15 hours ago

All these conversations seem like they are missing talking about planning vs execution. I want the best possible frontier model to plan out my changes. I also have a 2nd agent that is a frontier model check the plan. Then at that point the implementation can be done by a lesser and possibly local model. The frontier model can still do a final code review on the implementation of the changes.

Claude code supports this by setting the model to "opusplan"- it will automatically use Opus for planning and sonnet for implementation. This was completely necessary with the fable release. I was able to do this with fable and it was necessary to avoid getting quickly rate limited. In settings.json:

"env": { "ANTHROPIC_DEFAULT_OPUS_MODEL": "claude-fable-5" },

Obviously have that set to "claude-opus-4-8" now.

Comment by noveltyaccount 13 hours ago

I do this with Codex 5.5 for planning (specs, technical design, and task list); and Qwen 3.5-35B for task by task build out. It requires more hand holding and makes more mistakes than using Codex for everything, but it helps me spread my $20 chatGPT subscription pretty far.

Comment by richbradshaw 17 hours ago

I’m keen to understand speed here etc etc. if I bought a Mac studio with 96GB - what can I realistically run, how’s it compare to fable/opus etc and how fast is it?

Currently maxing out two Claude code accounts every x hours when working on large code migrations or setting up new iOS apps etc - most of time it’s fine but occasionally it’s mega frustrating!

Comment by simonw 17 hours ago

I strongly recommend trying LM Studio - it's the lowest friction way to try out models, you can browse https://lmstudio.ai/models and click "Get" and then "Run in LM Studio" to download and run a model.

With 96GB I'd start with the Gemma 4 and Qwen 3.6 models. Any of those should work fine.

Comment by AbsurdCensor 17 hours ago

I think currently you can only get the M3 Ultra Studio with 96gb, and for coding tasks, say you rub Qwen Coder on it (which doesn't need that much ram), it's not the fastest, something like 30-40 tok/sec. Probably better with a MacBook Pro with the M5 chip. There is a website for comparing different configurations and models: https://llmcheck.net/benchmarks

Comment by pizza234 17 hours ago

[dead]

Comment by tpurves 6 hours ago

I do think local models are huge pending market opportunity for Apple. An M5 Ultra Mac Studio (if that exists) could be decent local AI machine, though so expensive as to stay niche. But by the M6/M7 generations and a recovery in DRAM affordability, the future could be interesting moment for them to deliver a compelling local AI platform that 'just works'. But I do think that a mini-pc that is easy to configure, can be always plugged-in, always on, higher power envelope than a laptop, but not obnoxiously loud and hot, is the right form-factor

Comment by BenRacicot 6 hours ago

Agreed, this is what caused me to build. This thesis exactly.

Comment by AgentMasterRace 1 hour ago

If you have an extra PC and enjoy 5 tokens a second... Sure

Comment by ptx 13 hours ago

> Security: I run every Pi session in a Docker container and give it permissions only to bash so that it can’t run Python code or do web browsing

How does that work? The script in the post references the file "docker-compose.sandbox.yml", but I don't anything about what that file does.

The post that this one links to, that it's based on, says that Pi doesn't do proper sandboxing.

Presumably bash can still execute other binaries, otherwise it would be fairly useless. What stops it from executing Python? Or opening a network connection and downloading Python?

Comment by huydotnet 16 hours ago

I love that local LLMs are being discussed more often on HN recently. But for the post, I find it strange that the author claimed they were working with local models from day 1, but wrote a post that still links to Qwen2.5 and Qwen3 in mid June 2026.

Comment by polotics 12 hours ago

So I've made this [me+vibe+tests]-coded Android alarm app called Promptly, and as Gemini-CLI on the Google Pro subscription is getting google-killed on June 18th, I set up two branches, one for Antigravity+Gemini3.5 and one for Pi-coding-agent with Qwen3-Coder-Next...

Running the same prompt on both with the same .md memory state...

Gemini3.5 is more "intelligent" but Antigravity gets it to decide to go on tangents that are quite time and token-consuming I think. Nice casino machine.

Pi+Qwen3 (~80GB, llama.cpp) is like vibecoding about 1.5 years ago, when you had to babysit, structure your program to have self-contained chunks, and keep an eye on all the cross-cutting concerns to not trip it up. When it works it works fine and when it fails it's my job to ensure it fails fast.

The code is about 10'000 lines of Kotlin in total so it already takes some effort to keep it simple for the AI. It's not a slopped quantity of code, i got solid feature creep :^)

https://play.google.com/store/apps/details?id=com.sixteenam.... ...hat tip to the recent copycat squatter btw it's an honor!

Comment by xbmcuser 3 hours ago

Running local models might be good but until the virtual hardware monopolies of tsmc and others is broken they will out of reach for most people.

Comment by ricardobayes 2 hours ago

They are good, and yesterday's release GLM 5.2 even benchmarks really close to Opus.

Comment by ltononro 16 hours ago

Good depends a lot. If you are in the token maxxing hype you will probably find these models very bad comparing to SOTA, unfortunately.

The good news might be: opensource models are now good (enough) for day2day usage. But is it really? I feel that companies will always naturally strive for the best and use the SOTA (as long it is not too expensive).

I see OSS models being a good backbone for companies in the future that have validated workflows and could use those for privacy or to spare costs.

IDK, might have gone a little bit off-topic here.

Comment by lanycrost 3 hours ago

I'm crazy for gemma and Qwen, really hope we will be able to run LLMS everywhere like a Doom

Comment by aquarious_ 15 hours ago

I support local models and enjoy playing around with them, but even for personally development it is just more viable for me to pay $200 a month to Anthropic for the latest models. It seems to me with the cost of hardware needed to run local models that, for now, it is pure hobbyist and exploratory (which is fun in its own right)

Comment by pjmlp 15 hours ago

Only if blessed with enough RAM and disk space,

> 64 GB RAM and 1TB storage

Ah ok, not something regular joe and jane happen to have lying around at home.

Additionally the whole configuration is still very much low level, bunch of CLI commands, and if the model doesn't fit for the task at hand, it starts allucinating, generating gibberish, whatever.

Comment by sparkling 13 hours ago

Even if i had such a machine, im not sure i would be willing to sacrifice 80% of my RAM and 50% of my disk to run a semi-okay model locally.

Comment by wxw 17 hours ago

> “if we are constrained by performance and price, what architectural tradeoffs do we need to make?” a question that so far has not really been asked in the mad token gold rush.

To be fair, I think the labs are also interested in this (e.g OpenAI parameter golf). But the incentives are tricky. When the subsidies and tokenmaxxing era ends, local models will be essential.

Comment by andix 15 hours ago

Because I've seen too many people spending a lot of money on expensive hardware, without really using it in the end:

Most of those models are also available via Openrouter and many other platforms. Dirt cheap, and much faster than on consumer GPUs. Perfect to try and compare the different options.

Comment by 13 hours ago

Comment by jlengrand 14 hours ago

Just wanna say it's always fun and nostalgic to see authors pass by here who I was reading back when I started my career. I was reading Vicki's blogs way back, even remember learning some email parsing in python from her over 10 years ago. TY!

Comment by hank808 9 hours ago

Local models are good? Or are we saying that open source/open weights models are good? What I'm asking is, are they good because they are "local" or are they good because you can install and run them yourself, wherever you want? Same node, different node, different cluster, way out in the ether/cloud...

Comment by abalashov 15 hours ago

And if you want to dial in a setting in between: I've switched to Kimi K2.6 (now K2.7) and DeepSeek through OpenRouter and Reasonix for pretty much everything, with no discernible loss of analytical quality or utility.

However, like many commenters, I don't really believe in vibe-coding, long-horizon agentic one-shot agentic coding, etc. and do not use LLMs for huge generation tasks that involve designing things end-to-end.

I also have an MBP with 128 GB of unified memory and do quite a bit of Qwen3.6-35B-A3B. No, it's not as smart as the aforementioned models, to say nothing of frontier, but many people seem pleasantly shocked by the number of banal tasks that do not require these.

Comment by noveltyaccount 10 hours ago

From the recent Nvidia & Microsoft announcement about new chips for consumers:

> “Our goal is to deliver unmetered intelligence to every home and every desk with Windows,” said Satya Nadella, chairman and CEO of Microsoft. “RTX Spark marks a real breakthrough towards that vision.”

Makes me optimistic that those two companies are going to keep investing in quality local models.

Comment by aliljet 17 hours ago

The problem here is always the cost-benefit. For $200/mo, you're receiving subsidized best of breed access. There's no model competing for that price anywhere. If a 27B param model is what you choose, show me your hardware! I would love to be wrong...

Comment by rsolva 16 hours ago

But for how long? The subsidized phase is probably short, and then what? I run Qwen 3.5 27 Dense om my old AMD RX7900XTX at about 45 t/s and barely use my Claude Code subscription anymore.

Comment by robertkarl 14 hours ago

You can trade off latency / accuracy / cost for any ML task. And with the local models.... the cost is free.

Having a local Qwen check another Qwen's work increases the accuracy quite a bit at the cost of more latency. You can't have your cake and eat it too.

In benchmarking local models, I'm having success increasing even a 9B qwen's score on terminal-bench adjacent problems, just by asking it to plan and handing the plan back to qwen with a fresh context. Try it with Qwen3.5, unsloth Q4+, and a thinking budget of around 1024 tokens.

Comment by aidenn0 5 hours ago

Can anybody recommend sub $10k hardware that can run the models mentioned in TFA at something faster than a snails-pace?

Comment by kristopolous 4 hours ago

the next thing that people are going to race for is strix/gorgon halo (coming out soon). Still kind of not known.

Also the R9700 rocm is 32gb, 1350, available now. It's like 1/3 the price of what 5090s go for and you can get the slimmer models for that price so you can pack more in.

If I had to build right this second I'd do small form factor strix halo with a Radeon card.

You can get all those parts in like 3 days, msrp, no hassles. the only thing you're paying out the nose for is the ran

Good news is mobo manufacturers are adding more slots so you don't have to get robbed paying for 32 or 64gb modules

Comment by valisvalis 16 hours ago

There are good use cases for them for sure, the Gemma 4 Good hackathon a while ago showed how local models can solve problems in health and education in areas with low connectivity or small infrastructure.

Comment by LolWolf 8 hours ago

what were your favorite projects?

Comment by cautiouscat 17 hours ago

> I have no concrete scientific evidence of this - my own personal vibe metric of “is a model good enough” is, “do I have to double-check it against an API model”, and GPT-OSS was the first one where I started doing that a lot less often.

The good old butt dyno!

I’ve been eyeing local models more and more with Anthropic squeezing more and more on the subscriptions. A few comments on HN had me waiting until they improved more but this article makes me wonder if I should reconsider that.

I’ve been doing some pretty niche development using a game and a script extender for said game. If these models can handle that, I’d feel good about switching.

Comment by 13 hours ago

Comment by cube00 17 hours ago

The challenge I have is getting a large enough context window so tool calls work reliably, the local models easily slip into hallucinated JSON tool responses and won't trigger the tools as a result.

Comment by glaslong 16 hours ago

Same here. I'm curious what others loving Qwen are doing differently, because it constantly hits this issue for me. It's been great for autofilling blocks, but difficult for me to use agentically.

Comment by b3ing 14 hours ago

They are ok for simple stuff, coding is weak, chat is alright, writing is ok. But I had many of them write stories for ideas and they kept using the same names regardless of what the story was about. I can’t complain, it’s free. Can’t wait till they get even better, but for local image generation they are good, slow but just create a bunch in the background while you do other things otherwise it’s like 14.4k modems

Comment by skittleson 9 hours ago

i've been running qwen 3.6 35B A3B with llama.cpp on a 3090ti. i have found it better then sonnet in many ways. Speed and iterations was key. here is the gist of my current configuration: https://gist.github.com/spencerkittleson/5e44b6895a17ca45161... I use this with tailscale so all my devices have full access to it. That machine get toasty....

Comment by zx8080 8 hours ago

> None of these are groundbreaking tasks (again, a lot of personalized Google/docs lookups)

Does it really needs a GPU at 300Watts to do all that tasks?

Comment by daniban 16 hours ago

With Apple silicon and now the RTX Spark there are real discussions whether local AI is the future. The only problem is Western open source models are so far behind. I genuinely feel there's a push to fix this. Gemma is getting more frequent releases and Nvdia is quietly creating very cool small models. I hope both the hardware and models catch up and local really does emerge.

Comment by jszymborski 15 hours ago

I run local models and they work fine for me, but specifically for use in coding harnesses, I'm having a hard time. Tools tend to end up in the same loop, trying to `ls` the same folder or `grep` the same file, over and over and eating up the whole context. Super hard to get it to do anything but that. Any tips?

Comment by jotato 16 hours ago

I currently have a desktop with a 4060 ti (16gb of vram). Most models I have tested that fit within that are not good enough for anything other then type completion (in regards to coding tasks)

I have been considering getting the 58gb Mac Mini but that is a decent amount of money to spend without confirmation on a) how fast is it and b) will it work for well-defined tasks.

Comment by throwarayes 16 hours ago

I am happy to pay OpenAI for a cheaper model a few generations behind. But they deprecate models aggressively. They push you to bigger and smarter models, when 95% of my work doesn’t need it.

I’d love it if model providers just let old models run and let us pay less, but the deprecation makes me want to look into local models.

Comment by fridder 16 hours ago

Is there a local harness designed around the local model use case that is claude code like? Opencode has been problematic at times, pi works for one off for me but not back and forth conversations with the LLM. Considering I only use Qwen or Gemma models I'm close to just writing my own at this point

Comment by sn0n 6 hours ago

Qwen 3? Qwen 2.5 coder?? Is this an llm article written on an outdated model?? LoL

Comment by MrKoby07 14 hours ago

I think a lot of people just don't have specs like that, making it still painful.

Comment by mohamedkoubaa 12 hours ago

I wonder when a cheaper consumer grade inference chip will hit the market. The general purpose GPUs have much more silicon and complex firmware than what's strictly needed for inference

Comment by anax32 17 hours ago

I've just made a milestone on my project, moving away from AWS (budget) to self-hosted and the local models are so much faster than in the past. Beyond LLMs, having embeddings, image, video, audio gen available is crazy.

Running locally is the bar; it's hard to make these things a service which scales.

Comment by k__ 15 hours ago

I tried some smaller Gemma4 and Qwen3.6 quants on my MBA with M5/16GB and had like 20-60 tokens per second. At 60 it felt pretty okay and that hardware is on the lower end.

I'd assume a Mac with 32-64GB memory would get some reasonable results.

Comment by ta-run 14 hours ago

Not related, but, I can't seem to get my copilot-cli (office is an MS shop) use qwen3.5:27b on ollama for some odd reason.

After the recent changes to usage, I've spent an annoyingly long number of hours trying to get this to work.

Comment by WASDx 15 hours ago

Looking at some benchmarks, the latest ~30B Gemma/Qwen score similar as Claude or GPT versions that were released just one year earlier. That's crazy progress. I can't imagine how it will be in a few years.

Comment by walmas 9 hours ago

Maybe the future isn't Data Centers, climate crisis, drought, and endless subscription and token fees.

Comment by henryoman 3 hours ago

Will there be a gemma4n

Comment by blobbers 14 hours ago

Have you tried optimizing for MLX? It seems like a waste to have neural cores and not use them.

I've often wondered why the hype around apple neural core when 99% of software doesn't use them.

Comment by genxy 8 hours ago

Yeah, first think I looked for on the post was MLX and it wasn't there.

https://github.com/ml-explore/mlx-lm

Having used half the systems that Vicki mentioned, mlx was the best balance between power and ease of use. Just a pip install away.

Comment by lthi747 12 hours ago

Maybe it is good but it is very difficult, or at least with regular computer. For users like me with 16GB laptop it is almost impossible task.

Comment by prlin 16 hours ago

If you wanted to do some research or learn about post training and agent harnesses, is that a good option with these local models? What hardware is recommended, or easiest to go with a Mac Studio with 64GB+ RAM?

Comment by wrxd 16 hours ago

I wonder how much local models hallucinate. I am getting almost daily an "Honest answers: I made that up." reply from Claude Opus when I challenge some silly thing it's trying to do.

Comment by stared 17 hours ago

I really recommend Qwen3.6 27B.

Make some tests, and its 8 bit version runs at 30tok/s when using llama.cpp with MTP and run on Macbook Max M5. I have 128 GB, but but 64 GB is well enough. https://github.com/stared/benching-local-llms-on-apple-silic...

When using benchmarks, it gives more-or-less the level of SotA mid-late 2025.

Comment by iagooar 16 hours ago

I run the exact same model, on the exact same hardware - amazing results. Pair it with good search skills (Tavily, Brave, Exa) and you have a near-SOTA model on your desk.

Comment by wizzledonker 17 hours ago

Did you mean 2025?

Comment by stared 17 hours ago

Yes, fixed

Comment by Patchistry 4 hours ago

do you run you local models along side some of your "paid" models?

Comment by malkosta 16 hours ago

The problem with QWEN is that it just can't edit files reliably, I had to hack Pi all over to reduce the pain, but still far from perfect...does Gemma 4 strugle on this?

Comment by nikagrawal121 13 hours ago

I tried for my legal AI application that I'm building and it was able to do majority of the tasks. I used gemma4:26B

Comment by ibizaman 17 hours ago

Tangential but reading on mobile, the font size in the code snippets are all over the place. I actually have the same issue on my blog. Anyone knows why?

Comment by bthornbury 15 hours ago

the qwopus 27b model is good for grunt work style tasks, even across multiple files. Piping a bunch of things through, small factoring changes, stuff that just takes time to type out.

I wouldn't rely on it for large stuff like codex though. I haven't tried out deepseek/kimi, if we could run those locally it would be great.

Comment by ridruejo 15 hours ago

Local models are one of the main drivers for our installer / Desktop app for OpenClaw https://holaclaw.ai (disclaimer I am one of the founders). The smaller models are really only suitable for the most basic tasks, but if you have 32gb-64gb you can get real work done (ie complex web workflows) without third party hosted models

Comment by xienze 17 hours ago

The big caveat here is that these local models require you to invest some time tweaking your harness, AGENTS.md, and skills in order to get things roughly to the level you'd expect. But something like Qwen3.6-27B with web search capabilities and a good set of skills really is impressive! Especially considering that you can go wild and not worry about token costs.

The other thing that people tend to gloss over is that you really do need to spend some $$$ on decent hardware. Yeah, you CAN run some 4-bit quant with heavily quantized cache on your 16GB card, but it's not going to be a great experience (I think this is where a lot of the "if you think it's gonna be any good, you're going to be disappointed" stuff comes from). Yes it's a lot of $$$ upfront but it's very much unknown when hardware prices are going to come back to reality. There's a lot of hopes and dreams that any minute now an H100 will be worth pennies because "that's how it's always been" w.r.t. computer hardware, but we are living in interesting times. So you can't just make the tired old assumptions that a Claude subscription over three years time will work out to be dramatically less than the value of some card three years from now. We STILL have basically anything with >=24GB VRAM appreciating in value, which is absolutely wild. What I'm saying is, the depreciation curve may very well be a lot less dramatic and fast than it used to be, going forward.

Comment by osigurdson 14 hours ago

Running AI on timesharing mainframes does seem like an odd final state for the world.

Comment by fl4regun 16 hours ago

In my experience, with a system of 32GB RAM and 24GB VRAM, no, they aren't that good.

Comment by 0xbadcafebee 13 hours ago

Local models have been good for a while. But this being the HN echo chamber, people here think that local models can only be used for coding, and are expecting Opus 4.8 on their iPhone. Turns out AI can be used for things other than just coding. Even tiny models (<4B parameters) can do tons of useful things on local devices. Search, index, summarization, troubleshooting, crafting documents/formatting, image analysis, transcription, object identification, robot navigation, text-to-speech, speech-to-text, browser/window control, MCP/tool calls, and much more.

Larger models just do more complex reasoning. But if you want them to be really good, you need a beefy Mac. They have the best combination of memory bandwidth and RAM to allow medium-sized models to run at speed. GPUs have less memory but more bandwidth, and AMD iGPUs have more memory but less bandwidth. The Mac is the best compromise on the market today.

Once you do have a beefy Mac, you want to run a dense model. This gives you the best possible result with the system you have. You can go MoE for faster results, use cutting-edge inference techniques, parameter tweaks, etc. But a basic dense model (at Q6 quant) on a big-ass mac will serve 90% of your coding needs.

Comment by wasimxyz 17 hours ago

https://canirun.ai

Comment by frollogaston 14 hours ago

"Good" refers to the speed and not the quality. There's so much hype about Macs being great for LLMs, but nobody seems to be seriously using them for that because the open models are unfortunately so far behind.

Comment by drchaim 16 hours ago

really want to try local models, but I don't have the hardware yet. Probably I'm the only one here still using a Mac Mini m1 8gb 2020. :/

Comment by tennfown 15 hours ago

I have some decent specs, but I’m stuck with AMD graphics card which I’ve been told is a non-starter

Comment by atulmy 14 hours ago

Exact reason I'm building csuite.so, do check it out and let me know if you need early access!

Comment by aleksandrm 11 hours ago

Clickbait title, because running local models is still not good now.

Comment by dakolli 7 hours ago

It doesn't make sense, if your small local model is 75% as effective as a frontier model and frontier models are still what.. 50% effective maybe slightly more, with tons of downsides.. Why would I spend 5k on hardware to run these mediocre models. I don't really see the point in the frontier model either.

Comment by dakolli 7 hours ago

Imagine spending $5k to run a 32B param llm locally.. You could run much more capable open source models through Openrouter for years running 24/7 at 50tps. This will never make sense to me.

Comment by matrix12 12 hours ago

gemma:12b at 75% of frontier? Yeah....

Comment by etoxin 1 hour ago

I think 75% is about right. It calls tools pretty well and has a good knowledge base. It's absolutely not 90% there, but 75% feels right.

Comment by Computer0 10 hours ago

I have 16GB VRAM and 96GB Ram on all my computers and I do enjoy local models. I would not use them for coding, though I have experimented with it, it is largely a waste of time on my hardware. I love local chat with different models however, when using the model in this way it is much easier to experiment with the largest models near the limit of your hardware, and I do find it useful on the airplane somewhat. I have also used local models for data classification tasks and let it run over the weekend etc and the results were acceptable.

Comment by ZionBoggan 16 hours ago

This is actually a really insightful post !

Comment by Mr_Eri_Atlov 12 hours ago

I think this is a pivotal moment for LLMs.

Gemma 4 and Qwen3.6 27B aren't perfect, yet they are such a step forward from the previous generation that it's both feasible to get stuff done locally with patience and very likely that future releases will subvert cloud capabilities entirely.

Plus, they have definite reliability advantages over cloud models that can be wiped out by a government order or lobotomized to handle traffic surges.

Comment by jmyeet 13 hours ago

It's not "good". A more accurate description would be "sometimes useful and not far from being good". The author is using pretty small models. There have been a lot of improvements that scale in any case (eg MTP) but ultimately this is still hardware limited by 3 factors:

1. Memory bandwidth

2. VRAM size, which limits the size of a model you can use effectively. Yes you can swap but then you're taking a performance hit;

3. Raw FLOPS, including quantization.

Apple here is interesting because they have a shared memory model and you can buy Macs currently with up to 128GB of RAM (previously 256/612GB on Mac Studios, both discontinued). New M5 Mac Studios are expected in Q3 but that's not guaranteed. It may take until next year

Depending on the chip, Macs top out at ~900GB/s. A 5090 or 6000 Pro has 1800GB/s. A B100 is at like 3.2TB/s. A 5090 has, depending on how you count, 5-7x the FLOPS of a M5 Pro so a 5090 is still better than any current Max... except for the 32GB limit.

NVidia aggressively segment the market by limiting VRAM. The RTX 6000 Pro is basically a 5090 with slightly more CUDA cores and 96GB of VRAM instead of 32GB for $10-11k instead of $3k.

So let's project this into the future a little. The M6 Ultra/Max may well be 1TB+/s memory bandwidth with much higher FLOPS and thus actually be competitive for larger models. A 6090 in the current market will probably still have 32GB of VRAM if I had to guess. Maybe it goes up to 48GB.

But anyway I think we're only 2-3 years away from sub-$5000 hardware that does 100-300+tok/s on models larger than 31B. And that's going to be a game changer.

Comment by jingw222 15 hours ago

open source must win

Comment by jauntywundrkind 13 hours ago

i'd love to get to a point where big models can launch subagents that are fast and local. there's a lot of focus on token rate, but just as much, the way cloud providers have other latencies & processing styles not optimized for latency (running large batches all at once), and i think local might have some real wins. Gemma 4 seems already on the right track. lfm2.5-8b-a1b (https://www.liquid.ai/blog/lfm2-5-8b-a1b) and DiffusionGemma seem to both be very high token rate. but getting that latency down, so that a series of tool calls can happen faster, would be a real win. I think especially with good prompting that becomes much more possible.

One caveat, I have absolutely no patience for a lot of subagent systems, like opencode, where the subagent is walled off and incommunicatable. My subagents really should be their own session, that i can deal with as I please, with some MessageChannel like offerings/tools available to them. Ideally with modes where messages auto-flow in and out, and modes where I can be a gate-monitor. https://developer.mozilla.org/en-US/docs/Web/API/MessageChan...

Not really super related but MCP has been working on Events for a while. That ability to respond fast would be great. https://github.com/modelcontextprotocol/experimental-ext-tri...

Asking local to be fast feels like an obvious folly, but given how much better small models have got, and seeing these models tune themselves for speed: I want to hope!

Comment by monegator 16 hours ago

I've been trying local models for the boring stuff you might be thinking about: writing small docs.

So i've tested a couple, and the speed is finally impressive. My colleague uses paid tiers of claude and GPT, and the speed is comparable. Maybe even slightly faster on my end.

The problem is: i'm running the model on my work laptop, a 12th gen i5 with 16GB of RAM (which, you know, i asked to upgrade to 64, but that was right at the time of the great RAM shortage of the '20s) so i'm pretty limited in what i can use. And this is running alongside the usual suspects: Web browser hugging 1.5GB, MPLABX hugging 3, windows taking at least 5 just to sit idle, thermal throttled to 1GHz ... And yet its speed is comparable to a paid service. A lunch's worth of tokens vs a few cents of power.

So, what i found, what i fount... What i found is that i need AT LEAST 16k of context window, otherwise they will halt when i pass a small C file for analysis. And coding models will shit the bed with 4k. But we all know that, context size is King.

I found out that Qwen will keep looping while thinking, but that's not a surprise to you, either. But give it enough time and you will get an useful answer. I was hoping to using it as a better warning system for some languages, but i fear i need muuuch more context size, because i tried to feed a file that had a function with an endless loop:

At 4k context it almost shit the bed if i gave it just the offending function, then told it where to look at. At 16k context, with the whole file, it needed some guidance to what the problem was, and after 10-15 minutes of thinking it found the issue. Problem is, it kept second guessing itself for another 20 minutes on the same unrelated thing before giving the output. For which the fix was wrong, but the semanthic was correct. Good enough. Maybe it will be faster if i don't ask for a fix (which i didn't i just asked to look for a specific issue)

Wish i had 3 times the RAM so i can see what happens with more context.

Then i gave it the task to analyze a C file to make an API document. It took half an hour, but then i had a good starting point, which i had to keep changing because it would confuse commands with IDs and things like that.

This was the Qwen 3.5 9B model.

I then tested Gemma 4, being impressed at the tokens per second it gives on my Pixel 8A. Same tasks: same issues with short context, with long context it gave absolutely useless answers when looking at code, but it took 1/3 the time of qwen.

In producing documentation, instead, it was much faster, and it never hallucinated data. Good. in 15 minutes i had everything done.

Not bad for stuff running on a business laptop, while doing actual work.

Tomorrow i will try Qwen 3.6, let's see how it goes..

Comment by holoduke 15 hours ago

Good? My Macbook m3 with 36gb locked up after it filled all memory with Gemma4. A bit useful yes. But it eats all resources. For local models to be useful we need at least 128gb of system memory and 512gb of video memory. Plus 8 times the compute of a single 5090/h200

Comment by jkwang 19 minutes ago

[flagged]

Comment by pcell 5 hours ago

[flagged]

Comment by hottrends 10 hours ago

[flagged]

Comment by Littice 9 hours ago

[flagged]

Comment by aplomb1026 14 hours ago

[flagged]

Comment by eugmai86 14 hours ago

[flagged]

Comment by kordlessagain 17 hours ago

[dead]

Comment by 14 hours ago

Comment by RishiByte 14 hours ago

[flagged]

Comment by Veer_Pratap08 16 hours ago

[flagged]

Comment by maxothex 16 hours ago

[flagged]

Comment by mrkn1 11 hours ago

[flagged]

Comment by azzzxcc123 15 hours ago

[dead]

Comment by huflungdung 15 hours ago

[dead]

Comment by Rekindle8090 15 hours ago

[dead]

Comment by Lapsa 13 hours ago

[dead]

Comment by iluvcommunism 17 hours ago

[dead]

Comment by zrg 1 hour ago

tldr it is not

Comment by fg137 16 hours ago

> I have a 2022 M2 Mac with 64 GB RAM

I closed the article after that.

The author has no idea what a privilege it is to have a machine like that for personal use, and how 99% of the population are not going to afford a setup like that.

Just some back-of-the-envelope maths will tell you that a $20/month Claude subscription makes much more sense financially.

Comment by orf 16 hours ago

99% of the population don’t code using models, local or remote. So that’s a useless metric.

What % of developers could afford an older MacBook model, second hand? Far, far more than 1%.

Comment by DiabloD3 3 hours ago

But why would developers _step down_ to a Mac?

Comment by fg137 11 hours ago

could or will?

I am pretty sure even among software engineers, much fewer than 1% are going to spend their money on that.