Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Posted by cloudking 1 day ago

Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g. tok/s).

Comments

Comment by Greenpants 1 day ago

I have! I care about data privacy and LLMs being free. I'm using the Pi coding harness but containerized and sandboxed, to make sure it's running completely offline. On my Mac Studio with 128GB RAM (or MacBook with 36GB RAM) I'm using Qwen3.6 35b, with only 3b active parameters so that it runs really fast. I've done a complete redesign for my website's homepage and blog with Django + Wagtail. The latter is interesting, because Wagtail is a bit less well-known, so the agent, without giving it internet access, doesn't always know how to develop for Wagtail. I've used Qwen3.5 122b for when things get more complex. At 10b active parameters, it's significantly slower though.

I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.

It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).

Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)

Comment by lambda 1 day ago

This is very similar to my setup. Pi in a container (I do let it have network access, just no access to creds or anything, only the one directory that I'm working on at the time and my ~/.pi directory), talking to llama.cpp in another container. I'm on a Strix Halo 128 GiB unified memory laptop.

I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.

And I'm still an AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.

But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.

For other chat tasks and translation, I'll frequently use Gemma 4 31B.

For audio, I'll use Gemma 4 12B.

I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.

Comment by chakspak 1 day ago

Hopefully this isn't off-topic, but your setup sounds just like mine, Strix Halo and (I'm assuming) llama.cpp on ROCm, and I'm finding that the Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?

Comment by lambda 1 day ago

I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch out and try them both out, and it's not a huge difference, but I've been mostly staying on Vulkan.

The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.

But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you had a long sequence of tool calls with interleaved thinking, as soon as you had your next turn in the chat, it would have to re-process all of that, as it would drop all of the reasoning.

Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, because you're not dropping the thinking every turn, but it re-uses the cache better, so you don't have to reprocess a whole turn each time.

In my models.ini, I have this for the Qwen3.6 models:

  chat-template-kwargs = {"preserve_thinking": true}

There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.

Comment by ndom91 1 day ago

+1 using llama.cpp Vulkan releases with the Qwen models - runs much better than the ROCm releases.

I'll have to give the preserve_thinking a shot.

Comment by jderekw 1 day ago

Thanks for sharing. I have been running ROCm primarily with Qwen 3.6 and Qwen Coder. On the "runs much better" statement: is that stability, performance, or some other capability you're experiencing?

Comment by thefroh 1 day ago

I'm a little surprised that preserve_thinking would matter here for cache purposes. for actual capabilities/intelligence, yes, I'd imagine it helps to have past reasoning traces in multi-turn setups.

but for caching, all you are doing is leaving off a fraction of the most recent assistant message generation, which will have little/no impact on cache hit rate.

Comment by stymaar 1 day ago

> all you are doing is leaving off a fraction of the most recent assistant message generation

True, but not a tiny fraction, qwen is very verbose in its thinking traces. And it basically means that for every (nonthinking) generated token you have to compute the KV twice (once as tg, the second one as pp).
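
A minimal sketch of that cost, assuming the server can only reuse the longest token prefix shared by the cache and the new prompt. The token lists are made up for illustration; a real turn would carry thousands of thinking tokens.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def tokens_to_prefill(cached: list[str], new_prompt: list[str]) -> int:
    """Tokens that fall after the shared prefix and must be processed again."""
    return len(new_prompt) - common_prefix_len(cached, new_prompt)


# Toy "tokens" standing in for a real conversation.
history = ["<sys>", "<user1>"]
thinking = ["<think>", "t1", "t2", "t3", "</think>"]
answer = ["call_tool", "tool_result", "final_answer"]
next_user = ["<user2>"]

# What sits in the KV cache once the assistant has finished generating.
cached = history + thinking + answer

# Next request when the template keeps the old thinking.
kept = history + thinking + answer + next_user
# Next request when the template strips the old thinking.
dropped = history + answer + next_user

print(tokens_to_prefill(cached, kept))     # only the new user message
print(tokens_to_prefill(cached, dropped))  # the non-thinking answer again, plus the new message
```

In the stripped case the cache diverges right where the thinking used to start, so every non-thinking token after it is computed once during generation and again during prompt processing.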

Comment by havfo 1 day ago

I was able to solve this for my setup, 7900XTX and llama.cpp on ROCm in the oh-my-pi fork of the pi.dev harness. I documented my setup on GitHub, check under my username/omp-config, but the important thing is making sure the context is strictly append-only, and starting llama.cpp with

  --chat-template-kwargs '{"preserve_thinking":true}'

Comment by anaisbetts 1 day ago

If you're hitting this you have a bug, this is not related to the model. Either your harness is editing the messages between turns incorrectly (i.e. it is not append-only), or sometimes this is because of llama.cpp bugs, but bet on the former. Setting up something like Tailscale's Aperture will let you capture the requests and then you can diff them.

Comment by dnautics 1 day ago

> Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?

Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?

Comment by lambda 1 day ago

So, one of the ways that this problem manifests is that most local models aren't trained on preserving the full reasoning between turns. Every turn, they skip passing the reasoning trace from previous turns to the LLM. So if on one turn you have a long interleaved chain of reasoning and tool calls, then it responds to you, and then you give a new prompt to fix something, it has to re-process all of those tool calls now with the reasoning stripped out.

Qwen 3.6 has finally been trained both with and without preserving thinking, so you can optionally enable preserving thinking. This will use up a bit more context, but it will avoid having to do this re-processing of long agentic turns, and also the preserved thinking can avoid having to re-do some of the same reasoning over again in later turns.

Besides that, modern LLMs don't only use full attention (apparently, attention is not all you need). Full attention is very expensive to compute and store (O(n^2)). But additionally, full attention is actually bad at certain kinds of reasoning; keeping track of some value that gets replaced over the course of time, for example. So most models these days use various forms of local attention which is fixed length and gets updated as you go; sliding window attention, Mamba-2 state space models, etc.

But one advantage of attention is that you can go back and reprocess by truncating the KV cache and starting over. You can't do that with other forms of local attention; you've lost the state earlier in the sequence.

So to allow you to go back without fully recomputing the cache all over again, your engine will save snapshots of the local attention state at various times, so if you need to go back to recompute the cache, you can start from the last snapshot. However, these snapshots can get large, you can't keep too many of these, so sometimes you need to go back quite far to get to one, or they're all past the point you need to go back to and you need to start over again from the beginning.
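
As a sketch of that lookup, assuming snapshots are recorded as sorted token positions (the function names and numbers are hypothetical, not llama.cpp internals): the engine can only resume from the latest snapshot at or before the point where the new prompt diverges from the cache.

```python
from bisect import bisect_right


def restart_point(snapshots: list[int], divergence: int) -> int:
    """Token position to resume from: the latest snapshot at or before the divergence, else 0."""
    i = bisect_right(snapshots, divergence)
    return snapshots[i - 1] if i > 0 else 0


def tokens_to_recompute(snapshots: list[int], divergence: int, prompt_len: int) -> int:
    """Tokens that must be processed again after rolling back to the restart point."""
    return prompt_len - restart_point(snapshots, divergence)


snapshots = [1000, 4000, 9000]  # hypothetical snapshot positions

# Divergence at token 5000 in a 12000-token prompt: resume from the snapshot at 4000.
print(tokens_to_recompute(snapshots, 5000, 12000))  # 8000

# Divergence before the first snapshot: start over from the beginning.
print(tokens_to_recompute(snapshots, 500, 12000))   # 12000
```

With full attention alone, the first case would only need the 7000 tokens after the divergence, since the KV cache can be truncated at any position.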

There have been particular bugs in llama.cpp that have caused this to be triggered more often than it should; for instance, it wouldn't take snapshots before turns that included images at one point, so if you had an image heavy agentic workflow, that issue plus the lack of preserving thinking would mean you would frequently have to go back and start over from scratch.

Some of these issues have been fixed, some are addressed by preserving thinking. There are still some issues sometimes; for instance, one that's hard to fix is that the tokens generated autoregressively don't always parse the same when doing prefill. For instance, you could generate something as two tokens "pre" and "fill", but it turns out that "prefill" is also a single token so the tokenizer will use that, so when you send that back again on the next turn, it will see a divergence and have to recompute from that point. It might be possible to ignore that and use the not fully greedy tokenization that's in the cache, but I've definitely seen llama.cpp have to do some cache recomputation due to that.
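
To make the "pre" + "fill" case concrete, here is a toy sketch. The vocabulary and the greedy longest-match tokenizer are invented for illustration and are much simpler than a real BPE tokenizer, but the mismatch arises the same way.

```python
# Toy vocabulary in which " prefill" exists both as one token and as " pre" + "fill".
VOCAB = {"do", " pre", "fill", " prefill", " now"}


def tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenizer over VOCAB, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# What the model happened to sample, one token at a time (this is what the cache holds).
generated = ["do", " pre", "fill", " now"]

# What the server sees when the same text comes back and is tokenized from scratch.
retokenized = tokenize("".join(generated))

# First position where the cached tokens and the re-tokenized prompt disagree.
mismatch = next(
    (k for k, (a, b) in enumerate(zip(generated, retokenized)) if a != b),
    None,
)

print(retokenized)  # ['do', ' prefill', ' now']
print(mismatch)     # 1
```

The text is identical, but the cache is only valid up to the first mismatching token, so everything from that position on is recomputed.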

Comment by carterschonwald 1 day ago

that's a harness issue not a model issue. e.g. i have my own reasoning harness that forces persisted cot

Comment by thefossguy69 1 day ago

Would you mind sharing your harness for reasoning?

Comment by lambda 16 hours ago

Not a harness issue. The harness (pi in my case) passes back the cot for all previous turns.

The jinja template is what renders the openai-format request sent by the harness, into the actual string of text that will be tokenized and fed to the model. For models without preserve thinking support, the jinja template drops the reasoning from all but the current turn.

Here is the default jinja for Gemma 4: https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_...

    {#- Render reasoning/reasoning_content as thinking channel -#}
    {%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
    {%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
        {{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
    {%- endif -%}

You see that it only preserves the thinking for indexes that are later than the last user message; thinking is only preserved for a single turn (which can include a lot of interleaved thinking and tool calls), once it goes back to the user and the user replies, it will replay the tool calls but not the thinking between them.

Here's Qwen 3.6 by comparison: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/chat_t...

        {%- if (preserve_thinking is defined and preserve_thinking is true) or (loop.index0 > ns.last_query_index) %}
            {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}

It additionally has a preserve_thinking flag that you can set. If that's set, it will include the thinking from all turns in the text passed to the model. But you do have to set that, it's not the default.

It's possible to modify the jinja file that you're using with a model. Some people do that with models that haven't been specifically trained for it, and report good results; but some report that because it wasn't trained for that, they get worse results if they include thinking from previous turns.

So for models like Gemma, you would have to modify the default jinja to enable this. For Qwen, you can just set the preserve_thinking flag to get this behavior; and apparently they have trained in this mode so you get better results than models that have not trained this way.
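
For anyone who finds the Jinja hard to read, here is a plain-Python paraphrase of the condition in the Qwen excerpt above. It is a simplified sketch, not the real template: the message format is reduced to role/content/reasoning_content, tool calls are left out, and the `<|im_end|>` terminator is my assumption about the surrounding template.

```python
def render(messages: list[dict], preserve_thinking: bool = False) -> str:
    """Render a chat to text, keeping reasoning only where the template condition allows it."""
    # Index of the last user message; reasoning after it belongs to the current turn.
    last_query_index = max(i for i, m in enumerate(messages) if m["role"] == "user")
    out = []
    for i, m in enumerate(messages):
        reasoning = m.get("reasoning_content")
        if reasoning and (preserve_thinking or i > last_query_index):
            out.append(
                f"<|im_start|>{m['role']}\n<think>\n{reasoning}\n</think>\n\n{m['content']}<|im_end|>\n"
            )
        else:
            out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    return "".join(out)


chat = [
    {"role": "user", "content": "Fix the failing test"},
    {
        "role": "assistant",
        "content": "Done, the off-by-one is fixed.",
        "reasoning_content": "The loop bound looks wrong.",
    },
    {"role": "user", "content": "Now add a regression test"},
]

# Text as it looked while the assistant was answering the first message.
during_first_turn = render(chat[:2])

# Default: the old reasoning disappears, so the new prompt no longer extends the old one.
print(render(chat).startswith(during_first_turn))                          # False
# With the flag: the new prompt is the old text plus the new user message.
print(render(chat, preserve_thinking=True).startswith(during_first_turn))  # True
```

The second check is the property that matters for caching: with the flag set, each new prompt is an extension of the previous one.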

Comment by dnautics 1 day ago

wait do sota models use mamba-like SSMs? this is the first im hearing this

Comment by nl 1 day ago

Qwen 3.5 and above use Gated DeltaNet which alternate attention and SSM layers:

https://sebastianraschka.com/llms-from-scratch/ch04/08_delta...

Comment by LoganDark 1 day ago

What harness are you using? Some of them (e.g. OpenCode) mutate the system prompt every turn, and therefore can't work with a KV cache.

I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)

Comment by mbitai 22 hours ago

I've also had good results with Pi, and I got used to the new workflows without subagents, MCP, etc.

Comment by verdverm 1 day ago

There is a bug in llama-cpp for qwen/gemma models, use vLLM instead

Comment by pdyc 1 day ago

what bug, and what does it affect?

Comment by verdverm 17 hours ago

it's a prompt cache invalidation bug that causes all input to be reprocessed instead of getting preloaded

There are other reasons to prefer vllm to llama-cpp as well

Comment by fjdjshsh 1 day ago

>I'm still a AI skeptic

What does this mean in June 2026 wrt coding?

To me it sounds like being a "rice cooker skeptic". Some people don't like using rice cookers, some do.

Comment by svantana 21 hours ago

I'm a housekeeper skeptic. While I concede that a professional housekeeper would probably do a better job than me on most domestic tasks, I still think everyone should clean their own home, cook their own dinner, and write their own code.

Comment by femto113 1 day ago

For me the distinction is that your rice only needs to be edible once, while your code may need to last for decades. Using AI to code anything I could comfortably throw away if needed is a lot less fraught than letting it make choices that I and anybody who inherits the code is gonna have to live with, especially if by outsourcing those choices I reduce my understanding of the implications of those choices.

Comment by deadeye 20 hours ago

I don't let the AI make any choices. I have a lot of instructions and sample code for it to follow. It is basically a glorified code generator at that point.

Comment by luipugs 1 day ago

Don't you read through all the output of the agent before committing it?

Comment by secult 1 day ago

That's not how the human brain works.

Comment by luipugs 22 hours ago

I'm not getting it. OP said they are wary of letting the agent make choices for them, and outsourcing those choices lessens their understanding of them. They could interrogate the agent on why those choices were made until they have sufficient understanding, and they can also change the solution if they want to.

Comment by incrudible 23 hours ago

I think the idea that code should last decades is now questionable, if not problematic. If we can now produce code at 10x the rate, that means we can have 10x more code (probably not desirable) or we can have 10x as many revisions. Whoever inherits the code can have it rewritten to their liking and understanding. Nothing helps better in understanding a system than to rebuild it, even if just by handholding an LLM.

Comment by bluGill 19 hours ago

Only for simple problems. As the problem becomes complex you can't remember all the requirements to prompt the AI with.

Comment by incrudible 17 hours ago

As the problem becomes complex, you can't remember all the requirements, period.

Comment by bluGill 17 hours ago

Exactly, but if I start from working code with a lot of tests I don't need to remember the requirements. I just need to know my current requirement and figure out the ones I'm changing with my new requirement. It doesn't catch everything, but in most cases if I break some other requirement I find out about it and can figure out just that one more requirement and not the millions of others that still work.

Comment by sfn42 22 hours ago

The thing about this is that you can choose how high level you go.

For example you can just tell it to make a website for a business with a webshop and it'll just generate thousands of lines of code and you have no control over anything. Or you can spend hours/days writing the specification and then have it generate it.

Or you can do what I do and work iteratively one feature at a time making sure everything is exactly the way you want it. I generally solve the problem myself then tell it what to do, or if I'm not sure what the best solution is I might discuss with the AI until we agree on a plan and then have it execute it. Often this leads to me learning useful things, like it will suggest a tool/feature that I didn't know about that's perfect for my usecase or it will identify a problem in my plan that I wouldn't have found until after spending hours on the implementation.

I've always been very detail oriented and I care a lot about code quality, I want my solutions to be clean, consistent and as simple as possible while solving the problem. To me, AI tools let me do that more quickly and better, it's not a compromise it's just flat out better in every dimension. It's about how you use it.

A lot of people seem to think that it's a binary choice, either hand craft a high quality bespoke solution or just vibe code a pile of trash. There's a whole spectrum in between those two, and I think there's a sweet spot where you still maintain control and understanding, it's just much faster and the result is actually better because it's not just you and the knowledge in your brain it's also the AI that practically knows everything - it will teach you things and suggest solutions you wouldn't have thought about, it makes you a better developer. It's a force multiplier and the smarter you are the better you will be at using it.

It's not a replacement it's an enhancement. It's like imagine a developer with Google vs one without, obviously the one with Google will be better because they have access to more information. The AI is like automatic google that just googles everything all the time, things you wouldn't have even thought to Google or things you couldn't possibly formulate a good search term for. With AI you can just show it a screenshot or describe an issue in detail and get a really solid answer a lot of the time. It's like having an expert on standby all the time, sure it's sometimes wrong but most of the time it's not and if you're smart you'll recognize when it isn't.

I'd say anyone who isn't using AI today isn't using their full potential. I don't see how anyone could possibly perform better without this tool than with it. I do see how someone who doesn't care could produce a lot of slop, but the people who refuse to use it aren't that guy. That guy has been using it to produce slop for years already. You can use it to produce top quality code if you choose to.

Comment by HWR_14 1 day ago

I assume it means they are not sure it gives them a speed up. Which, since I don't know what they are trying to do, may be reasonable.

Comment by Iolaum 1 day ago

Haven't used for actual coding but was testing locally - for example running some swebench instances - whether qwen-3.6-35b-a3b@Q8 was better than qwen-3.5-122b-a10b@Q4. With MTP the former runs at around 55t/s and the latter at around 30t/s meaning the latter is also usable. It looked like qwen-3.5-122b-a10b@Q4 performed a bit better.

Comment by mahadevank 1 day ago

Thanks a lot for your comment. I was using Qwen3 but wasn't aware of the A3B mixture-of-experts model. Works much better, thanks

Comment by adyavanapalli 1 day ago

For the edit tool, you should consider implementing a hash-based approach where each line of code is hashed and referenced by it when doing replacements. You can read up on the approach here: https://blog.can.ac/2026/02/12/the-harness-problem/
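
A minimal sketch of the idea as described in this comment, not the linked post's actual implementation: every line gets a short content hash, the model references a line by number plus hash, and the tool refuses the edit when the hash no longer matches the file. The function names and the 6-character tag length are my own choices.

```python
import hashlib


def line_tag(line: str) -> str:
    """Short content hash used as a handle for one line."""
    return hashlib.sha256(line.encode("utf-8")).hexdigest()[:6]


def tagged_view(source: str) -> str:
    """What the model would be shown: each line prefixed with its number and tag."""
    return "\n".join(
        f"{n}:{line_tag(line)}|{line}" for n, line in enumerate(source.splitlines(), 1)
    )


def replace_line(source: str, lineno: int, tag: str, new_line: str) -> str:
    """Replace one line, refusing the edit if the tag no longer matches the file."""
    lines = source.splitlines()
    if not 1 <= lineno <= len(lines):
        raise ValueError(f"line {lineno} out of range")
    if line_tag(lines[lineno - 1]) != tag:
        raise ValueError(f"stale tag for line {lineno}; re-read the file")
    lines[lineno - 1] = new_line
    return "\n".join(lines)


source = "def f(x):\n    return x + 1"
print(tagged_view(source))

# The model asks to replace line 2, quoting the tag it was shown.
edited = replace_line(source, 2, line_tag("    return x + 1"), "    return x + 2")
print(edited)
```

The model never has to reproduce the old text exactly, which is where small models tend to slip, and a stale or wrong reference fails loudly instead of editing the wrong place.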

I didn't do much benchmarking, but anecdotally, I found it to be making fewer edit errors. YMMV

Comment by pieterk 1 day ago

Yup, I used this for a while and IME it may get you a few percentages more of useful context initially, so quality feels a bit higher, but things start breaking down in funnier ways when you do run out of that quality for any reason later, so definitely caveat emptor.

Comment by ojr 1 day ago

I can use Gemini 3 Flash with the harness I built for around 8 years and still not exceed the cost of a Mac Studio with 128GB, the price for privacy is very high. Agentic flows that get stuck can be worked around but I prefer developer velocity.

Comment by ClikeX 23 hours ago

> the price for privacy is very high

Not sure if you intended this to be this philosophical, but this is basically the slogan for modern life now.

Comment by gwerbin 20 hours ago

Yeah but the price for, say, private email is a lot less.

Comment by disqard 1 day ago

Under-rated take, thanks for stating this!

Not everyone can plough $$$$ into hardware right now (more power to those who can), so choosing to rent is an A-Ok strategy.

Comment by tpm 1 day ago

It's ok if you can send your code and data to the provider. Some of us can't.

Comment by _zoltan_ 1 day ago

We're discussing home use.

You can. You just don't want to. Huge difference.

Comment by monooso 21 hours ago

> We're discussing home use.

You may be, but the topic of discussion is whether anyone is using a local model as their main coding tool.

Comment by _zoltan_ 20 hours ago

for corporate use it's a mistake not to use a frontier model.

Comment by tpm 18 hours ago

Well plenty of people work from home.

For corporate use, if the corporation would break the law sending anything to the open internet or to the US, then you can't use any model that's not hosted in house. And there are many such cases.

Comment by ihateolives 19 hours ago

Sure, but a Gemini subscription gives you just that, a Gemini subscription, while a new computer lets you do other stuff with it as well. When you're upgrading anyway for other reasons, it's not fair to compare the full Studio price to just one subscription.

Comment by danans 1 day ago

> I can use Gemini 3 Flash with the harness I built for around 8 years and still not exceed the cost of a Mac Studio with 128GB

And sounds like you haven't factored in the cost of electricity to run that Mac Studio as an LLM machine. Probably get a few more years.

Comment by electronsoup 1 day ago

> It gets into loops quite often, and surprisingly often gets the edit tool call wrong

I find that running a better quantization, like Q8, tends to prevent this. Even though it's a bit slower to run, it saves time overall with less churn

Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off

Comment by girvo 1 day ago

Right. Tokens/s decode isn't the most important thing to me: wall clock time for task completion is. And tracking all of that, on my GB10-based Asus box, Step 3.7 Flash at IQ4_XS beats Qwen 3.6 27B despite the latter having MTP, on all of my actual coding task evaluations in real codebases.
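
A back-of-the-envelope sketch of why decode speed alone can mislead. The formula is just prompt processing time plus generation time, and every number below is hypothetical, not a measurement of any model mentioned here.

```python
def wall_clock_seconds(prompt_tokens: int, generated_tokens: int,
                       prefill_tps: float, decode_tps: float) -> float:
    """Estimated time for one request: prompt processing plus token generation."""
    return prompt_tokens / prefill_tps + generated_tokens / decode_tps


# Hypothetical model A: decodes fast but spends many tokens thinking and looping.
fast_but_verbose = wall_clock_seconds(20_000, 6_000, prefill_tps=1_000, decode_tps=60)

# Hypothetical model B: decodes at half the speed but needs a third of the tokens.
slow_but_terse = wall_clock_seconds(20_000, 2_000, prefill_tps=1_000, decode_tps=30)

print(f"A: {fast_but_verbose:.1f}s, B: {slow_but_terse:.1f}s")
```

Under these made-up inputs the model with half the tok/s finishes first, which is why tracking the time to a completed task is the more useful number.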

Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but that's literally not what I use these things for!

One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.

There's also a regression in llama.cpp wrt. Step Flash, where quantisation is getting worse KLD and Perplexity than it otherwise was previously, for the exact same quants. Very odd, but it's being looked into at least!

Comment by gwerbin 20 hours ago

Do you think the choice of quantization matters that much for other models? I've seen a lot of discussion about different quantization and FP formats but I feel totally unequipped to make an informed decision about what to try.

What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.

Comment by girvo 10 hours ago

>What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.

That is quite literally what I have setup :)

I have a few codebases I've written over the years that I attempt a suite of specific tasks: code analysis/bug finding, bug fixing, adding features, that kind of thing. I keep track of the results, including wall clock time

>Do you think the choice of quantization matters that much for other models

It hugely matters. Lots more than r/LocalLlama would have you believe, sadly. Some model architectures can handle more aggressive quantisation than others, and it's hard to know ahead of time.

Step handles it surprisingly well (sparse MoE models seem to generally, when the particular layers are chosen to be quantised carefully). Qwen 3.6 27B handles it okay, but FP8 was better... except annoyingly Qwen's official FP8 has worse KLD/perplexity numbers/accuracy than it otherwise should. RedHat's one was better in my testing, though not by a huge amount.

Comment by rhdunn 18 hours ago

I use promptfoo for evaluation. I'm experimenting with tests for my workflow/use cases.

I have a custom assert for loop/repeat detection that works well:

    from typing import Any


    def count_repeats(text: str, length: int) -> int:
        """Count back-to-back repeats of the last `length` characters at the end of text."""
        n = len(text)
        pattern = text[n - length : n]
        count = 1  # The end of the string itself counts as the first occurrence.

        text = text[:-length]
        while text.endswith(pattern):
            text = text[:-length]
            count = count + 1

        return count


    def repeats(output: str, context: dict[str, Any]) -> dict[str, Any]:
        """Pass when the output ends in some pattern repeated `threshold` or more times."""
        threshold = context.get('config', {}).get('threshold', 3)
        count = 0
        length = 0

        # Try every pattern length and keep the one that repeats most often.
        for n in range(1, (len(output) // 2) + 1):
            n_count = count_repeats(output, n)
            if n_count > count:
                count = n_count
                length = n

        if count >= threshold:
            return { 'pass': True, 'score': 1.0, 'reason': f'Output repeats {count} times with length {length}.' }
        else:
            return { 'pass': False, 'score': 0.0, 'reason': f'Output doesn\'t repeat {threshold} or more times.' }


    def no_repeats(output: str, context: dict[str, Any]) -> dict[str, Any]:
        """Inverse of `repeats`: pass when the output does not end in a loop."""
        result = repeats(output, context)
        result['pass'] = not result['pass']
        result['score'] = 1.0 - result['score']
        return result

Just add it to your promptfooconfig.yaml:

    defaultTest:
      assert:
        - # ----- The output doesn't repeat/get stuck in a loop.
          type: python
          value: file://asserts/repeat.py:no_repeats

Comment by ttoinou 23 hours ago

I tried Step 3.7 Flash on my mac 128GB and it seemed very dumb. antirez ds4 flash is much better !

Comment by girvo 20 hours ago

It isn’t though, I’ve run both through a bunch of coding evals. You nearly certainly didn’t have the right sampling parameters or quantised the KV cache?

Ds4 is impressive for what it is, but it loops and over thinks even more, burning massive wall clock time to not even get great outcomes. It’s also limited to a slow speed on my Spark

Comment by ttoinou 20 hours ago

I tried a bunch of stuff with step 3.5 and step 3.7, maybe not as much as you. Could you tell me what parameters and launch setup you’re using? Antirez ds4 flash q2-q4 works almost out of the box for me

Comment by girvo 10 hours ago

To be fair: if you're happy with ds4 then IMO stick with it!

Step 3.7 is notably better than 3.5

1. Use the official StepFun GGUF, IQ4_XS - theirs is better tuned in my experience than the other quants

2. Temp 1.0 top_p 0.95 sampling parameters for reasoning/agentic coding

3. It's really quite important that you don't quantise the KV cache: it made a surprising amount of difference to the looping and over thinking I found, at least for the quantised version of the model. I'm using the full F16 for K, and Q8 for V

4. Note that it now supports `reasoning_effort: low|medium|high` in your chat_template_kwargs; this is super useful :)
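
The settings in points 2 and 4 could be sent per request roughly like this. It is a sketch that only builds the JSON body for an OpenAI-style chat completions endpoint and does not send it; the model alias is hypothetical, and it assumes your server accepts a `chat_template_kwargs` object in the request and passes it on to the chat template.

```python
import json


def build_request(prompt: str, reasoning_effort: str = "medium") -> dict:
    """Assemble a chat completion body with the sampling settings listed above."""
    if reasoning_effort not in {"low", "medium", "high"}:
        raise ValueError("reasoning_effort must be low, medium or high")
    return {
        "model": "step-3.7-flash",  # hypothetical alias; use whatever your server exposes
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        # Assumption: the server forwards this object to the chat template.
        "chat_template_kwargs": {"reasoning_effort": reasoning_effort},
    }


body = build_request("Find the bug in utils.py", reasoning_effort="low")
print(json.dumps(body, indent=2))
```

The KV cache types from point 3 are a server launch setting rather than a request field; in llama.cpp that would be the cache type options for K and V, if I have the flag names right, so check your build's help output.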

Comment by kristopolous 1 day ago

I've got a tool that sits in between the harness and inference engine called petsitter. It is a middleman validator to avoid just these kinds of issues. You can stack the fixes as needed (they're called tricks in the petsitter parlance)

It's what I use. Fixes the problem

https://github.com/day50-dev/petsitter

Comment by stared 23 hours ago

While I do like Qwen 3.6 35B A3B, I have found that the improvement with Qwen 3.6 27B is massive. Sure, it is 3x slower (https://github.com/stared/benching-local-llms-on-apple-silic...), but in terms of total development time, 27B still felt faster at reaching the goal.

Is it different in your case?

Comment by ltononro 1 day ago

What kind of coding do you do? Do you keep track of frontier models to vibe check the differences and re-evaluate constantly or are you ok with having a nerfed model forever? (not being judgmental, just really want to know your framework here)

Comment by Greenpants 1 day ago

Some of the work I do, I do for an (EU) organisation that doesn't have clear rules or guidelines on the use of AI yet. Though I have seen colleague-developers blatantly putting source code into external Claude-like models, I stay true to my principles and don't. I know for certain that everything that I run through my local, offline Pi Container Sandbox cannot leave the machine, and thus can't result in a data breach. I do this for the peace of mind.

I do (unscientifically) experiment whenever a new capable local LLM (<=130b) releases with a license that permits commercial use. As for knowing my models require more work than Opus, I don't mind still having to puzzle on getting the architecture right. In any case, it forces me to stay in the loop of what's being built, which is a good thing.

Comment by ltononro 19 hours ago

So you don't really trust the data policy (non-retention) of the big companies like Anthropic/OpenAI + regulations in EU. This is very interesting. I myself have been blindly trusting these organizations with my data and still not sure if I am trading code/trajectories for productivity.

Another POV is that most of the code written in most of my codebases was generated by Codex/Claude, so they would be "stealing data from themselves" in a sense.

I worked on Transformers/LLM training in 2018-2021 and again more recently. Things are far different now. I think they would be more interested in "how" you got your code to be satisfactory with your guidance than in the actual code generated. But mostly I personally trust that they are not really using my trajectories for that (unless I explicitly allow it in the configs)

Comment by kordlessagain 1 day ago

I'm adding Pi to Nemesis8 right now because I saw your comment, so thank you!

https://github.com/DeepBlueDynamics/nemesis8

Comment by dumbfounder 8 hours ago

It’s just a SaaS service like any other. They all want to use your data, but there are terms to make sure they don’t.

Comment by psychoslave 1 day ago

Could you give more details on how to make such a set up?

I'm not familiar with Pi, and not sure which kind of container you are referring to. Something mainstream like docker, or more classic like a BSD jail?

I started to experiment with local LLMs, through ollama and Lemonade. Enough to throw simple prompts with code excerpts and get small-scope code refactors. Though I still struggle to make them work with external tools, like my IDE, so they can be leveraged at an agentic level with access to a full repository.

That's mainly for work, as they push for using LLMs, though with the new Copilot license they provide it doesn't take me even a week to burn the whole token credit.

The tool can be useful, but in my experience not without heavy guard rails and loops over tests. I suspect recent models also burn many tokens down rabbit holes of nonsense hypotheses, instead of doing the straightforward, correct implementation you would expect from any entity that has consumed such huge cumulative resources and has such an experimental playground to draw on. Maybe incentives don't push model providers to minimize the tokens they sell; maybe the beast is just so hard to tame that all these bright minds with virtually infinite resources are not good enough.

Anyway, sorry for the digression, but I would be extremely interested in a step-by-step tutorial on making a local LLM work at an agentic level, including which kind of hardware is required to make it work properly.

Comment by geophile 1 day ago

My experience is almost identical. I have found that I need to be very careful with planning, breaking things down into small isolated steps (I can have qwen do this); and also (me) writing a very clear design. Relying on qwen to fill in a lot of those precise details results in those about-to-write loops.

Yeah, that edit inability is weird. I’ve updated AGENTS.md to limit editing (as opposed to rewriting) and that helps a little.

Comment by westoque 1 day ago

> Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture.

that's why i use the frontier models, because it's a senior co-worker vs a junior. if you use the junior for the sake of privacy i think you're missing out on the best insights for a specific task.

Comment by physix 1 day ago

The dilemma I am facing is cost.

Consumer-grade subscriptions of the frontier models give you superb capabilities per dollar, them being heavily subsidized. But if you're working in an enterprise setting, that won't work. You need to upgrade, and that gets significantly more expensive.

Furthermore, basing the SDLC on leveraging the bargain subscriptions risks falling apart in the future, both from a cost perspective as well as the question of availability (e.g. Mythos).

So from a strategic perspective, going local on the LLM and still achieving great results with the right approach is very relevant.

Comment by willisrocks 1 day ago

Or you can get the best of both worlds--use frontier models to build a spec/plan, and use cheap models (open source or not) for implementation. Your max or team plan can go a lot further this way without giving up much for quality. Play with something like Superpowers to make this really approachable.

Comment by bxk76 1 day ago

Best insights can be overrated due to the bandwidth limitation of the brain. Even if Einstein is sitting next to you the whole day and helping out, the theory of bounded rationality applies.

Comment by robertlagrant 23 hours ago

How are you sandboxing your Pi coding harness? Directly only mounting certain folders, using capabilities to kill the network and not giving it all your shell env vars, that sort of thing? Or do you use a tool?

Comment by throw10920 18 hours ago

And, is the sandboxing for security (avoid RCE on the host) or merely guardrails for the models?

I've wanted the latter quite a bit for Pi, because weaker models like Deepseek V4 have extreme issues with obeying prompts (e.g. I'll instruct it to find a bug but not fix it, and it'll "helpfully" try to fix it anyway), so having a "read-only mode" actually backed by the OS would be very useful.
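A minimal sketch of what an OS-backed read-only mode could look like, assuming Docker is the container runtime. The image name `pi-sandbox:latest` and the `pi` entry command are hypothetical placeholders, and nobody in this thread has confirmed their setup looks like this. The Docker flags themselves (`--network none`, `--cap-drop ALL`, a `:ro` bind mount) are standard. Host environment variables are not passed into the container unless `-e` is given explicitly.

```python
import shlex
from pathlib import Path


def sandbox_command(project_dir, image, read_only=False,
                    network="none", agent_cmd=("pi",)):
    """Build a `docker run` argv that exposes only one project directory.

    `image` and `agent_cmd` are hypothetical: substitute whatever image
    and entry command your own setup uses.
    """
    # ":ro" makes the bind mount read-only at the kernel level, so the
    # agent cannot edit files regardless of what the prompt says.
    mount = f"{Path(project_dir).resolve()}:/work"
    if read_only:
        mount += ":ro"
    return [
        "docker", "run", "--rm", "-it",
        "--network", network,   # "none" removes all network interfaces
        "--cap-drop", "ALL",    # drop every Linux capability
        "-v", mount,            # the only host path the agent can see
        "-w", "/work",
        image, *agent_cmd,
    ]


def as_shell(argv):
    """Render the argv as a copy-pasteable shell line."""
    return shlex.join(argv)
```

Note that with `--network none` the agent cannot reach an inference server over TCP, so a real setup would either run the model inside the same container or pass a different `network` value that only reaches the model server.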

Comment by SeriousM 12 hours ago

Haha, yes! Last time I asked it for options how to tackle a task and only do the research without touching any code. With xhigh reasoning, it echoed the options that many times until it was convinced that option A is the better choice and started implementing it.

Comment by pieterk 1 day ago

Yup, it's fantastically useful.

Maybe even more useful than Opus when I have all the constraints to an issue. There is less "knowledge" in the model (I get by with 48GB of RAM allocated to an 8b quant), so it has fewer things to hallucinate about.

I've been getting to know its limits pretty well over the last few weeks and would say it's an excellent code search/replacement/generation* engine.

It's got the "in-context script generation" flow down as well, so it will easily help automate tasks that you describe with text and perhaps example commands, or tools, or skills* that you provide.

*Think of it + Pi as an NLP abstraction layer over grep, or a shell, rather than a jack of all trades + world knowledge all-in-one.

Comment by gwerbin 1 day ago

I've noticed the same about the edit tool, in both Gemma and Qwen. Maybe I'm not running them with the right sampler settings, but I'm happy to hear I'm not the only one. Lots of mismatched whitespace and stuff, the model ends up doing hex dumps and maybe 5 or 6 attempts at editing a 5-line function into a 250-line Python file.

All of these models also seem to get stuck in long thinking loops, sometimes tripling the tokens of a frontier closed model which is really painful when inference is already on the slow side (on my Macbook).
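One way a harness could soften the whitespace mismatches described above is to match the search block line by line while ignoring whitespace differences. The sketch below is an illustration only, not how Pi or any other harness implements its edit tool; `tolerant_replace` and `normalize` are hypothetical names.

```python
def normalize(line: str) -> str:
    """Collapse runs of whitespace and drop leading/trailing whitespace."""
    return " ".join(line.split())


def tolerant_replace(text: str, old: str, new: str) -> str:
    """Replace the block `old` with `new`, comparing lines modulo whitespace.

    Raises ValueError unless exactly one location matches, so the caller
    can report a failed or ambiguous edit back to the model.
    """
    lines = text.splitlines()
    old_lines = [normalize(l) for l in old.splitlines()]
    if not old_lines:
        raise ValueError("empty search block")
    # Slide a window over the file and compare normalized lines.
    hits = [
        i for i in range(len(lines) - len(old_lines) + 1)
        if [normalize(l) for l in lines[i:i + len(old_lines)]] == old_lines
    ]
    if len(hits) != 1:
        raise ValueError(f"expected exactly 1 match, found {len(hits)}")
    start = hits[0]
    # The replacement is inserted verbatim, indentation included.
    lines[start:start + len(old_lines)] = new.splitlines()
    return "\n".join(lines) + ("\n" if text.endswith("\n") else "")
```

Because the replacement is inserted verbatim, the model still has to get the indentation of the new block right, which matters in Python. Ignoring indentation when matching is a trade-off: it rescues more edits but can match a block at the wrong nesting level, which is why the function refuses ambiguous matches.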

Comment by 0xbadcafebee 1 day ago

The harness and the LLM parameters are pretty essential to getting better results and reducing loops. Tweak the parameters and you can mostly eliminate loops without negatively affecting performance (it's a bit complex but ask a SOTA AI to guide you and it's not hard). The harness should also react more intelligently to failures; it can do things like return additional context or hints as it tracks error rates and avg duration of calls. Pi can be easily extended, and it's suggested by the author you modify it to perform better for your use case.
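As a rough illustration of a harness that tracks error rates and call durations and feeds hints back, here is a small sketch. It is not Pi's extension API; `ToolMonitor`, its thresholds, and the hint wording are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToolStats:
    calls: int = 0
    failures: int = 0
    total_seconds: float = 0.0

    @property
    def error_rate(self) -> float:
        return self.failures / self.calls if self.calls else 0.0

    @property
    def avg_seconds(self) -> float:
        return self.total_seconds / self.calls if self.calls else 0.0


class ToolMonitor:
    """Track per-tool outcomes and suggest a hint when a tool keeps failing."""

    def __init__(self, min_calls: int = 3, max_error_rate: float = 0.5):
        self.min_calls = min_calls            # hypothetical threshold
        self.max_error_rate = max_error_rate  # hypothetical threshold
        self.stats = defaultdict(ToolStats)

    def record(self, tool: str, ok: bool, seconds: float) -> None:
        s = self.stats[tool]
        s.calls += 1
        s.failures += 0 if ok else 1
        s.total_seconds += seconds

    def hint(self, tool: str):
        """Return extra context for the model, or None if the tool looks healthy."""
        s = self.stats[tool]
        if s.calls >= self.min_calls and s.error_rate > self.max_error_rate:
            return (
                f"The `{tool}` tool has failed {s.failures} of {s.calls} calls. "
                "Re-read the target file and retry with an exact copy of the "
                "lines you want to replace."
            )
        return None
```

The harness would call `record` after every tool call and append the result of `hint` to the tool's error message when it is not `None`, so the model gets a nudge to retry instead of spending tokens on a long diagnosis.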

Comment by hparadiz 1 day ago

I am right there with you. Mind-boggling. It's indistinguishable-from-magic technology!! I tried running some basic tasks through Qwen with Opencode on a 10 year old dual Xeon server for shits and giggles. I gave it a simple task like "use ffprobe first but convert this webm to mp4" and it was able to complete the task with zero network calls outside my network. On 10 year old hardware. It took about 3 minutes to complete the task. Now you may be saying 3 minutes? pfft. But I dare you to do it yourself. You're gonna be googling the CLI switches for at least 10 minutes and setting up your command. I had it actually optimize all the switches on the fly for me based on an initial ffprobe to see what is optimal.
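For readers who want to see roughly what such a probe-then-convert task involves, here is a sketch. The switches are generic, widely used defaults (H.264 video, AAC audio), not the ones the model actually chose in the run described above, and the function names are made up for illustration.

```python
import json
import subprocess


def probe_cmd(src: str) -> list:
    """ffprobe invocation that prints container and stream info as JSON."""
    return ["ffprobe", "-v", "error", "-show_format", "-show_streams",
            "-of", "json", src]


def convert_cmd(src: str, dst: str, crf: int = 23) -> list:
    """ffmpeg invocation re-encoding to H.264 video and AAC audio."""
    return ["ffmpeg", "-i", src,
            "-c:v", "libx264", "-crf", str(crf),
            "-c:a", "aac",
            "-movflags", "+faststart",  # move the index to the front for streaming
            dst]


def convert(src: str, dst: str) -> None:
    """Probe the source first, report its codecs, then convert it."""
    result = subprocess.run(probe_cmd(src), check=True,
                            capture_output=True, text=True)
    info = json.loads(result.stdout)
    codecs = [s.get("codec_name") for s in info.get("streams", [])]
    print("source codecs:", codecs)
    subprocess.run(convert_cmd(src, dst), check=True)
```

Calling `convert("input.webm", "output.mp4")` requires `ffprobe` and `ffmpeg` on the PATH. An agent doing this for you would typically adjust the encoder settings based on what the probe step reports.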

Comment by bluerooibos 1 day ago

> 10 year old dual Xeon server...On 10 year old hardware.

Hold on, what are the specs of your rig? How much RAM?

I've been considering getting an old refurbished 2018 Mac Mini with 64GB of DDR4 RAM but everything I've read suggests this will be way slower than my 16GB M1 Pro Macbook.

Comment by linzhangrun 1 hour ago

No need to touch the Macintosh from the X86 era

Comment by hparadiz 1 day ago

I inherited a box with dual Xeons and 256 GB of DDR4. I then ran several tests and benchmarks of the hardware with several models.

I've been meaning to write a blog post but well whatever here's the md.

https://gist.github.com/hparadiz/f3596d00a62d8ebb2dadcc46ee5...

Qwen3.5 9B performed best.

You can absolutely still use this to do some basic stuff like tell opencode to convert a video file from one format to another. But frankly you're better off getting two AMD GPUs. Say a dual 7900XT would get way better performance.

Comment by bandrami 1 day ago

> You're gonna be googling the CLI switches for at least 10 minutes

So there's this really amazing program called "man"

Comment by hparadiz 23 hours ago

Yea there's something called a phone book too.

Comment by bandrami 23 hours ago

And that would be a much better source for a phone number than Googling. Similarly, the docs that ship with software are a better source for command line switches for that software than a search engine or LLM.

Comment by hparadiz 23 hours ago

My lived experience right now is a lot of super talented people around me using these tools all day every day to build awesome things and then there's the randos like you on HN who think they know better. Protip: You don't know squat.

Comment by cruffle_duffle 19 hours ago

The docs that ship with it are a great source for the LLM who will be running the command and monitoring its output, fixing or adjusting whatever in order to complete my goal. Why on earth would I be calling it by hand?

Comment by bandrami 7 hours ago

> Why on earth would I be calling it by hand?

So that it's done correctly instead of wrongly

Comment by hparadiz 4 hours ago

pictured

> man plows field by hand

Comment by gmac 23 hours ago

Which is generally slower than Googling, because it's paged content in a terminal which can search only for literal strings?

Comment by bandrami 7 hours ago

This was true a decade ago; unfortunately Google has become more or less useless for most purposes at this point

Comment by ololobus 19 hours ago

You are right, but I think you miss the whole point of the agentic workflows that are being discussed in this post's comments.

Yes, you surely can read man, docs, whatever, then DIY. The point is that in many areas, like ffmpeg CLI arguments, people don't really want to become an expert, they just want the work to be done. Above is an example of an agent being able to do it locally, and I think it's great

Comment by MoonWalk 16 hours ago

This is good info, thanks. I want to do something similar, but know very little about how to set the components of LLMs up.

I've read a bit on what the various components are. What I don't see in your comment is what you're using to run your model locally. Ollama?

Comment by dotancohen 1 day ago

  > you really need to know what you're asking, and be precise

Any chance that you could share some recent prompts to give other HNers a head start on how to approach Qwen? If you are uncomfortable posting them here, my Gmail username is the same as my HN username.

Thank you.

Comment by Greenpants 1 day ago

I'm glad you're asking. I already started writing a blog post on how to best make use of local models. I'll share it as soon as I have a complete enough list. If anyone else reading this would like to chime in with their tips & tricks, let us know!

For the time being, off the top of my head, I'd say:

- Prompt Engineering tips & tricks apply here (like being complete in the relevant context you provide in your question, and the specific task(s) the agent should do like reasoning, modifying one file, or trying to fix a complex task all at once (not recommended)).

- If you already know which files the agent should look into, mention them to save time and potentially context.

- In my personal workflow, I write down lots of atomic TODOs needed to solve a problem. As I write it down, I'll notice assumptions I'm making, or the fact that the TODO could still be decomposed further into (atomic) subtasks.

- It's best to get a feeling yourself for how Qwen handles your repository. I noticed if I don't specify an architecture for development, it'll make quick & dirty fixes. If I don't tell it to remove debug statements, it won't. This is what was meant with "be precise" – Claude Opus might think for you and act in your best interest. Smaller Qwen models will just do what you ask them to, and no more. They have design knowledge, but you have to explicitly ask them to "activate" that part of their knowledge.

Comment by thefossguy69 17 hours ago

Is there a way to be notified of your blog post on this?

Comment by dotancohen 21 hours ago

Thank you, that was extraordinarily helpful.

I look forward to that blog post!

Comment by tsss 19 hours ago

But if you have to write everything down in such detail, isn't it faster to just do the task yourself?

Comment by jmuguy 1 day ago

Given your knowledge on this - do you think we'll see an open source model with Opus levels of capability? IMO if/when this happens - I would 100% stop using Anthropic.

Comment by Greenpants 1 day ago

Let me put it like this. I started with local LLMs when ChatGPT still used GPT-3.5. I was amazed how my MacBook with 8GB RAM could run openhermes2.5-mistral: a 7b parameter model that could generate short stories that sort of made sense. Incredible!

Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then.

I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :)

Comment by jmuguy 1 day ago

Haha well I ask because I don't really want/need anything beyond Opus most of the time. And I'm paranoid that Anthropic is going to be forced to charge the true cost of all this before too long.

Comment by Greenpants 1 day ago

The other upside of running local LLMs is that there's no cloud provider to suddenly charge more for the same, or even less, model use.

It's personal, but I prefer CapEx over OpEx for this. If you can purchase a device upfront that runs a decent local LLM, you get the peace of mind that your setup won't suddenly change over time and can only get better.

Comment by lambda 1 day ago

If you believe the benchmarks, Qwen 3.6 35B-A3B already outperforms Claude 4 Opus.

Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.

It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable, you're probably not going to want to leave Anthropic.

But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.

Comment by MrScruff 1 day ago

You really need to take the benchmarks with a massive pinch of salt. I’ve been testing local LLMs since the original llama and there’s nothing I’ve tried that is in the same category as Opus.

Comment by lambda 1 day ago

Which Opus? They certainly outperform Claude 3 Opus.

Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks.

Comment by mapontosevenths 1 day ago

There's a guy on Youtube named Bijan Bowen who tests all the models (open and frontier) on a series of one/few shot programming exercises and has been for a long while now. You can pretty much watch him compare the results for any two models you're likely to be interested in.

I'm not affiliated, I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life.

Comment by lambda 1 day ago

OK, it looks like he did a browser OS test with both Claude 4 Opus and Qwen 3.6 35B-A3B.

Claude 4 Opus: https://youtu.be/J7omabtqnBM?t=193

Qwen 3.6 35B A3B: https://youtu.be/gVU-DQeqkI0?t=215

Qwen 3.6 produced far more working functionality than Claude 4 Opus did.

Obviously, just one test of a single one-shot prompt of a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.

Comment by MrScruff 1 day ago

I’m normally comparing frontier open/cheap models against frontier closed source. I use deepseek/glm regularly, they’re fine and you can get real work done with them but it’s super obvious when you switch back to opus or even sonnet. A 3B active param MoE model is not comparable.

Comment by lambda 1 day ago

Yeah. I was pointing out that local 3b active models outperform frontier models from a year ago.

Will this trend continue? Who knows. Both the frontier and local model will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago.

However, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system less than $10,000 with today's RAM and GPU prices"). The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run.

It's likely that there will always be a gap between frontier and local if you're comparing models at the same time, you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are.

Comment by shimman 1 day ago

Agreed, but at their current prices Deepseek + GLM are clear winners in my book. This weekend I spent $5 between the two where as I'd probably have to pay $20-30 to Anthropic (and that's still with the massive VC subsidies).

For web development (or anything else with an extreme amount of training data) it's number one for sure. You can't beat it at its costs. US companies will not be able to compete on a competitive market, which is why they rely on so much US government protection + corporate welfare.

Comment by make3 13 hours ago

There is no Claude 4 Opus model... It's a series of models, of which the strongest is Opus 4.8, and Qwen 3.6 35B-A3B gets 51.5% on SWE-bench Pro to Opus 4.8's 69.2%

Comment by lambda 9 hours ago

Huh? There is a Claude 4 Opus. It was released about a year ago. It is retired by now, in fact, just retired yesterday: https://platform.claude.com/docs/en/about-claude/model-depre...

But it is still available on Google Vertex according to OpenRouter (though it's possible that info is just out of date, it's currently quoting 3tps which is unusably slow): https://openrouter.ai/anthropic/claude-opus-4

Comment by zozbot234 1 day ago

People can't seem to agree on what "Opus class" even means (the latest Opus is apparently pretty weak) but DeepSeek Pro, Kimi and GLM all are quite capable.

Comment by computerex 1 day ago

Nothing compares to Opus when it comes to "taste" in web design in my experience. Nothing compares to opus in very difficult HPC/model inference development. I worked on this with opus: https://github.com/computerex/dlgo

OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective.

Comment by lambda 1 day ago

Which Opus?

Anthropic has been releasing models named Opus since 2024 with Claude 3 Opus.

Opus has gotten vastly more capable since then.

Local models far surpass Opus 3. They even surpass Opus 4 on most benchmarks.

Sure, if you compare to the latest Opus 4.8 or even 4.6, they're not there yet. But there's a huge difference in performance between 4 and 4.8.

Comment by jkells 1 day ago

Can't speak for anyone else but there was a step change in frontier models last November. Opus 4.5 and GPT 5.2 I think.

When I colloquially say Opus level I really mean Opus 4.5 or later

Comment by lambda 1 day ago

Right. Local models haven't quite hit that level yet. The biggest open models, which you need tens of thousands of dollars of hardware to run at reasonable speed, have pretty much hit that level of capability, but most models you can reasonably run at home aren't quite there yet. But given the gap, if local models keep improving, you'd expect to maybe see that level by this November.

Comment by zozbot234 1 day ago

My understanding is that we could in fact run the largest models on "reasonable" home hardware by focusing on throughput rather than raw speed and having them do unattended inference in large batches. The big proprietary suppliers have no interest in this because their own incentive is to fill all the physical space available with top-performing hardware and doing huge amounts of inference as quickly as possible. A home user with limited hardware investment has very different constraints.

Comment by rvnx 1 day ago

To me totally yes, even further, if they keep their existing route, over time people will stop using Anthropic.

More and more specialized and ultra-performant chips are going to flood the consumer market. Especially once new hardware foundries start producing (well, if we don't die from WW3 in the meantime).

In 10 years from now, when even basic computers will have 128 GB of memory, and phones will have super optimized tuned models, then what will be the point of Anthropic ?

Just use Gemma/Gemini/Siri or whatever.

Pornography and uncensored models are also pushing toward local models.

It's not like people's needs grow exponentially; they follow an asymptote instead (they are capped).

The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.

For programmers, now, what Anthropic offers is like 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.

It's ok but not like revolutionary (Fable was better but it was unusable, easy 20 minutes per one prompt due to overthinking).

Comment by spullara 1 day ago

This is the only setup that I think is reasonable to use locally right now. I had an agent set it up for me from this guy's recipe:

https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...

One thing I did change was the context length to 256k rather than 64k.

Comment by nicman23 1 day ago

about the edit tool it is almost always trailing white spaces. if you give it a skill with a sed 's/( )*$//g' or something like that it speeds up things
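A Python equivalent of that idea, as a minimal sketch: strip trailing spaces and tabs from every line before the model edits the file, so its search text has one less thing to mismatch on. (On the command line, `sed -E 's/[[:space:]]+$//'` does the same job; the function name below is made up for illustration.)

```python
import re

# Spaces or tabs immediately before the end of any line.
_TRAILING_WS = re.compile(r"[ \t]+$", flags=re.MULTILINE)


def strip_trailing_whitespace(text: str) -> str:
    """Remove trailing spaces and tabs from every line, keeping newlines."""
    return _TRAILING_WS.sub("", text)
```

Running this over a file once, before the agent starts editing, means the on-disk text no longer carries invisible trailing characters that the model tends to omit when quoting lines back.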

Comment by awllau 1 day ago

Based on your explanation, it doesn't sound feasible for me, a complete non-engineer, to switch to fully offline? I do a lot of back and forth discussion with LLMs as someone who reads and writes 0 code.

Comment by Greenpants 1 day ago

I'm afraid I'd have to agree. That is, unless you have 512GB+ RAM sitting on a shelf and run the much larger SOTA-comparable local models.

Comment by motbus3 1 day ago

Try deepseek V4 flash

Comment by agnelnieves 8 hours ago

there goes the rest of my night

Comment by nyxtom 1 day ago

Have you found that being much more spec driven helps guide it better?

Comment by timmit 1 day ago

I got a 48GB RAM MacBook, yet somehow I cannot even run a 20b model. I was surprised that you can run a 35b model locally.

Comment by klardotsh 1 day ago

4-5 bit quants would probably fit pretty well on your rig. Check HuggingFace for Qwen3.6-35B-A3B-MTP-GGUF [1]. They've also got a cool UI thing these days to help indicate which quants of a model will run on your hardware.

Full octane isn't gonna fit on much of anything south of a 128GB machine once adding KV cache.

[1]: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
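A back-of-the-envelope way to see why the quant matters: the weights alone take roughly (parameter count × bits per weight ÷ 8) bytes. The sketch below assumes an average of 4.5 bits per weight for a "4-5 bit" quant, which is an illustrative figure, not the measured size of any particular GGUF file. KV cache and runtime overhead come on top.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 10**9 bytes)."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("~4.5-bit", 4.5)]:
        print(f"35B at {label}: {weight_gb(35, bits):.1f} GB")
```

Under these assumptions a 35B model needs about 70 GB at 16 bits but under 20 GB at roughly 4.5 bits, which is why a 4-5 bit quant is plausible on a 48GB machine while the full-precision weights are not.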

Comment by GardenLetter27 1 day ago

Could the harness not check for a failed tool call and pass it to a small model for correction without clogging up the main context?

Comment by lambda 1 day ago

The thing is, to do a proper fix it would really need all of the context (maybe the tool call that failed was for an edit to a file that was last touched way at the beginning of the context), so you'd need to either keep that smaller model running doing prompt processing all the time, or have a very long wait while it does prompt processing on your whole session.

And then also, sometimes the tool call errors are because of something like a file was changed out from under it; the larger model is probably going to do a better job of figuring that out and fixing it up.

Finally, in Pi, you can always just use the /tree command to skip back to before a series of failed tool calls, with a summary if you want to let the model know what happened. The Pi /tree command is pretty powerful in managing your context

Comment by everforward 1 day ago

An illustrative example I've seen a lot is creating Jira tickets in projects with custom fields marked as mandatory. It tries to create the ticket without the field and the tool call fails. The LLM needs access to the full context so that it can generate text to put in the "Why couldn't this meeting be an email?" field.

Comment by Greenpants 1 day ago

I'm actually quite sure that directly retrying the tool call would often fix the edit-call already. But these models have been trained to "think" for a while for any problem solving, so they'll presume the problem of the edit is more fundamental and spend unnecessary tokens filling up the context.

I'll experiment more with the effectiveness of AGENTS.md rules for local Pi agents. I feel like smaller (local) LLMs just lack in attentiveness to elements in the context window, like precise instructions, compared to e.g. Claude models.

Comment by underdeserver 22 hours ago

Nit - it is not completely free.

You are paying for the extra power draw.

Comment by amelius 1 day ago

Sounds super cool, don't get me wrong, but I suppose for most people the bar is higher than HTML/CSS.

Comment by nozzlegear 1 day ago

I use local LLMs on my Mac Studio to write and pass unit test suites in F#, among other boring project chores I don't want to do myself.

Comment by q3k 1 day ago

I love to warm up a whole rack of servers just so that some shitass buggy TUI can generate a line of bash that comments out my test runner.

We truly live in the dumbest timeline.

Comment by rjblackman 1 day ago

it might be worth trying oh-my-pi in your case as it claims to improve the edit calls by using a unique patching format.

Comment by yieldcrv 1 day ago

> It gets into loops quite often

matches my experience and a deal breaker

also the context window sizes are too low. I can't operate in 65,000 windows any more because even just reading the code's file structure overruns it and gets me nowhere. Definitely its own art form.

200k context windows and above for me now

I saw a paper last night that should help this a lot though

Comment by Greenpants 1 day ago

I get that it's a deal breaker to some; it definitely requires patience.

In Pi, /new is my best friend and most-used command for sure. For simple tasks (I decompose complex ones anyway since I don't trust small local LLMs to do this for me), the model doesn't need much context, given that I'm proficient in my codebase myself: "I'd like Feature X. Look into files 1, 2 and 3 to make your edits."

Comment by kennywinker 1 day ago

Qwen3.6-35b handles 256k context fine if you’ve got room for it. I’m running it with 128k context with just 16gb vram.

Comment by p0w3n3d 1 day ago

which coding agent are you using?

Comment by krainboltgreene 1 day ago

> is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture

I don't want to be rude, but your linkedin has a sum total (generous) of like 8 months of programming as a profession (job title is AI Engineer). The rest is at best programming adjacent. How would you know what either of these situations are really like?

Comment by SoftTalker 1 day ago

I haven't logged in to LinkedIn or looked at it since a former employer demanded that everyone create a profile. So mine is now about 20 years out of date.

Comment by krainboltgreene 1 day ago

His is very up to date. Not everyone is you.

Comment by animanoir 1 day ago

[dead]

Comment by nobody_r_knows 1 day ago

[dead]

Comment by horsawlarway 1 day ago

For personal use, yes.

I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.

I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.

To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.

For my personal needs, free beats $100/m.

I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).

Some example projects

- Replacement launcher for android tvs (with usage monitoring and tracking for kids)

- Custom admin portals for my k8s cluster services

- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)

- Grocery list management and meal planning (mostly via openclaw)

- some custom workflows for 3d asset generation in comfyui.

---

Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.

Comment by rootlocus 1 day ago

2x RTX3090 are around $4400. Without any electricity costs or other parts, that's about 3.7 years (44 months) of $100/m claude.
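The break-even arithmetic can be made explicit, including the electricity that the figure above leaves out. The $4400 and $100/month numbers come from this comment; the wattage, hours per day, and price per kWh in the second call are hypothetical placeholders to swap for your own.

```python
def breakeven_months(hardware_cost: float, monthly_fee: float,
                     watts: float = 0.0, hours_per_day: float = 0.0,
                     price_per_kwh: float = 0.0) -> float:
    """Months until local hardware costs less than the subscription it replaces."""
    # Electricity cost per 30-day month for the given load.
    power_monthly = watts / 1000 * hours_per_day * 30 * price_per_kwh
    saving = monthly_fee - power_monthly
    if saving <= 0:
        return float("inf")  # the subscription is cheaper than the power bill
    return hardware_cost / saving


if __name__ == "__main__":
    base = breakeven_months(4400, 100)
    print(f"no electricity: {base:.0f} months ({base / 12:.2f} years)")
    # Hypothetical load: 700 W for 4 hours a day at $0.30 per kWh.
    powered = breakeven_months(4400, 100, watts=700,
                               hours_per_day=4, price_per_kwh=0.30)
    print(f"with electricity: {powered:.0f} months")
```

Ignoring power, the payback is 44 months. Any realistic electricity cost pushes it further out, and a large enough power bill means the hardware never pays for itself against the subscription.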

Comment by overgard 1 day ago

Assuming the $100/m claude subscription is still around in three years.

Comment by booi 1 day ago

we will be lucky if it's still around in 3 months..

Comment by reddalo 1 day ago

[dead]

Comment by oofbey 1 day ago

I think there’s a reasonable argument that a burst bubble will cause prices to drop. Prices are very high because they’re trying to justify these trillion dollar valuations on IP alone. If that fantasy goes away then prices will fall down to just silicon and electricity, which looks more like Chinese model prices. Hard to say how it will play out but the direction isn’t obvious to me.

Comment by horsawlarway 1 day ago

Yes, today is not a great time to purchase hardware.

When I bought, I paid $850 a piece. And I needed one anyways for the gaming I was going to do.

My guess is the next good time to buy is going to be 24-36 months from now, depending on how the AI bubble goes.

---

I'll add to this, I personally don't like Apple hardware (not so much related to the hardware as their company philosophy) but their machines with unified memory (or AMD's latest unified memory offerings) get pretty equivalent speeds to my 3090s, and are probably a much better modern entrypoint to local llms.

There's a reason the joke is that Silicon Valley software devs bought up all the Mac minis for OpenClaw.

You can get a 48gb unified RAM M4 pro mac mini for ~2k. If you're not going to do much else with the machine, it's what I'd pick as my budget inference device right now. Spend a year of claude now, get ~150tok/s for the next decade (plus) for ~free.

If you want more capable and are willing to spend a little more, go with the newer Ryzen AI Max+ 395 machines.

You'll spend less on power too.

The last thing I'd suggest at this point is buying an RTX3090. You can do a lot better for a lot cheaper.

Comment by tracker1 1 day ago

If you're willing to go the AMD route, the AMD Radeon Pro R9700 definitely looks interesting for the price compared to NVidia.

Comment by felooboolooomba 1 day ago

Can we also run LLMs on Radeon?

Comment by lloyd-christmas 1 day ago

I run qwen 27B:Q4 @ 130k context at 50 t/s on a single R9700, and have a 7900XT that runs mellum 12B:Q8 as its subagent. R9700s do really well at low wattage and underclocking as well. It's designed to run at 300W, mine is throttled at 210W, and only had an 8% slowdown. If I had somewhere else to put my desktop in my house, I'd bump it up to 240W and there would be zero perf degradation.
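
To put those throttling numbers in perspective, here is a small sketch of the efficiency gain. It assumes the 8% slowdown applies to tokens per second and that the card actually draws its full power limit under load; both are assumptions rather than measurements, and the function name is hypothetical.

```python
def relative_perf_per_watt(power_limit_w: float, stock_power_w: float, slowdown: float) -> float:
    """Throughput per watt relative to stock settings (1.0 means unchanged)."""
    relative_speed = 1.0 - slowdown                # e.g. 0.92 of stock speed
    relative_power = power_limit_w / stock_power_w  # e.g. 0.70 of stock power
    return relative_speed / relative_power


gain = relative_perf_per_watt(power_limit_w=210, stock_power_w=300, slowdown=0.08)
print(f"{gain:.2f}x tokens per joule vs. stock")  # 1.31x
```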

Comment by freetonik 1 day ago

That's also years of top tier PC gaming, if you're into that.

Comment by augusto-moura 1 day ago

2x RTX3090 is extremely overkill for gaming, you can run any released game on earth on ultra for much less

Comment by drnick1 1 day ago

1x RTX3090 is absolutely not overkill for gaming however. Nowadays it's barely enough to get 60FPS in 4K in some recently released games. But the shocking part is that my 3090 is still probably worth as much as when I bought it about 4 years ago.

Comment by arcanemachiner 1 day ago

It's probably worth more now.

Comment by overgard 1 day ago

Having a second card doesn't really work well for gaming.

Comment by davkan 1 day ago

There is currently no gpu in production that can max out the largest and fastest displays in graphically demanding games. We have monitors that are the equivalent of two 4k monitors side by side and run at 240hz. I have a 5080 and have to turn down settings to get 60fps in cyberpunk.

Comment by justaj 23 hours ago

What if you do integer downscaling to 1080p on those 4k displays?

Comment by yurishimo 20 hours ago

That's how most raytracing is done these days anyway. The game is rendered at a much lower resolution, the raytracing math is applied, and then it is upsampled to the target resolution.

If you set the target resolution to 1080p, not much changes in the render pipeline except that final upscaling step. To get better quality, the lower resolution is bumped up so there is more data to work with for the upsampling, but the scaling performance can be very hit or miss depending on the game, as the engine itself can often play a huge role in rendering performance.

As far as rendering the 1080p image at 4k, yea it works fine, but there will always be little artefacts that remain for those looking for them. 1440p seems to be the sweet spot for gamers today, but 4k is really nice for when you're not gaming as most online video is now made for dual use on televisions.

Comment by lowbloodsugar 1 day ago

I can’t run 4k HDR cyberpunk 2077 at 240hz with path tracing. I’m managing ~120fps. I’ve got a Blackwell 6000. I didn’t buy it for games, but there are still games and setups where the GPU is the bottleneck. I don’t even have an 8k TV.

Comment by googletron 1 day ago

what?

Comment by kakacik 1 day ago

AFAIK nvidia cards don't work in tandem (aka SLI in the past) very well these days. So that ain't true.

Also, 2 gens old means bad performance at ray tracing, abysmal path tracing if at all. Pretty sure it can't run smoothly CP2077 in native 4k without dlss upscalers with all on ultra.

Comment by himata4113 1 day ago

You can have the 2nd card as an offload for upscaling, frame generation and whatnot.

Comment by irishcoffee 1 day ago

When I'm not running models I use the 2nd one in a pass-thru configuration to a windows vm for various things, usually gaming.

Comment by matheusmoreira 1 day ago

Those GPUs can also play video games or mine cryptocurrency. They can also be sold later.

We should own things, not rent them. We should all do what we can to keep the fabled 2030 agenda at bay.

Comment by driverdan 1 day ago

If you pay $2200 for a 3090 you're a sucker. They're not worth anything close to that.

Comment by jmuguy 1 day ago

Or a really excellent experience playing Satisfactory with the settings cranked up, which is priceless.

Comment by tripleee 1 day ago

Christ GPU prices have gotten crazy

How do AMD cards perform with LLMs? A 9070 is sold for ~$600 and has 16GB VRAM

Comment by overgard 1 day ago

In my personal experience, I wouldn't bother with 16GB cards for coding -- the useful models are _slightly_ too large to work at any reasonable speed

Comment by toyg 1 day ago

That's not my experience, and the trajectory is good anyway - what doesn't work perfectly today will be just fine in a few months.

In a quickly moving field, it's amazing how much money one can save by overcoming FOMO and not living on the bleeding edge. It's like waiting for Steam sales, the games will be just as good.

Comment by overgard 16 hours ago

Curious what model you're using that works well on a 16GB card? I very much want to use my 5080 for inference, but everything I've tried so far has either just not been good enough or painfully slow.

Comment by toyg 15 hours ago

Qwen 3.5-9b-Q4_K_M.

I have a 5080 too! For me, the key has been dropping Ollama for Llama.cpp, which is not particularly scary to configure anymore and just skyrocketed performance. I download the models with LM Studio, then run them with llama.cpp.

Comment by lambda 1 day ago

That should do pretty well. Memory bandwidth is the biggest bottleneck for token generation, and at 644 GB/s you should be able to do pretty well on a 9070, while prompt processing is more compute bound and Nvidia tends to have the edge there.

16 GiB won't fit you much, so you'd probably want at least 2x, and preferably 3x of those, and then you need a motherboard, power, etc. that can handle that.
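
A back-of-the-envelope sketch of why bandwidth dominates generation speed: each new token requires reading roughly all active weights once, so bandwidth divided by the size of those weights gives a ceiling on tokens per second. The 644 GB/s figure is from above; the parameter counts and the ~0.5 bytes per parameter for a 4-bit quant are illustrative assumptions, and real throughput will be lower because of KV cache reads and other overhead.

```python
def max_tokens_per_second(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    weights_gb = active_params_b * bytes_per_param  # billions of params * bytes each = GB
    return bandwidth_gb_s / weights_gb


# Hypothetical dense 9B model at roughly 4 bits per weight
print(round(max_tokens_per_second(644, 9, 0.5)))
# Hypothetical MoE model with 3B active parameters at the same quantization
print(round(max_tokens_per_second(644, 3, 0.5)))
```

The second call illustrates why sparse models feel so much faster: fewer active weights are read per token, even though the whole model still has to fit in memory.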

Comment by tracker1 1 day ago

You can get an R9700 with 32gb vram for ~$1200-1400 depending on where you live, which is probably a better option for AI use than 2x 9070(xt)

Comment by lambda 1 day ago

Yeah, definitely.

Comment by picofarad 20 hours ago

My 3090 was $700 shipped. Shop around. I'd expect to pay "thousands" for the modded 3090 with 48+ gigs of vram.

Comment by throwawayffffas 19 hours ago

I cheaped out and got 2 7900XT; I get about 80 tps on qwen3.6 35b a3b. The cost when I got them before the memory crunch was $1400ish. In retrospect I should have forked over a couple hundred more and gotten the 7900XTX for the extra VRAM.

Comment by nyrikki 1 day ago

You can get 60tps with three 1080tis and the sparse model, and I bet two 16gb 5060tis would do the same for ~1200. One 3090 is enough for a useful system, even on an old am4 host.

Comment by jononor 15 hours ago

Dual 5060ti 16gb does over 100 tok/s on 35B A3B. Even with PCIE Gen 4 x4, which quite a lot of motherboards can do. Though Gen 4 x8 or Gen 5 x4 is slightly faster. Misc working notes on this hardware combo here, https://github.com/jonnor/embeddedml/tree/master/handson/mic...

Comment by fluoridation 1 day ago

Look in the used market, not new. There must be some that can be had for much, much less than that.

Comment by flowerthoughts 1 day ago

In 3.6 years, chances are they are still worth $3k. Unless some new chip fab pops up that can spam the chip market. Even if the AI bubble bursts, I doubt we'll see high-RAM GPUs sell off.

Comment by sieabahlpark 1 day ago

[dead]

Comment by kpw94 1 day ago

> gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models

Since you're running quantized (at UD-Q4_K_XL) , check out the "qat" models (unsloth/gemma-4-26B-A4B-it-qat-GGUF) !

- https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF (With "Jun 9 Update: Added MTP support.")

- https://blog.google/innovation-and-ai/technology/developers-...

Comment by me_bx 1 day ago

TIL:

> Quantization-Aware Training (QAT) [...] allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model

Comment by SubiculumCode 1 day ago

How are the QAT models at coding? I looked for opinions since the release and haven't found much.

Comment by twothreeone 1 day ago

> unsloth/Qwen3.6-35B-A3B-MTP-GGUF

I've actually tried this exact same model locally as well.. albeit on just a single 3090 at 128k context and I got around 40-60tok/s with Q4_K quantization.

The thing that bugged me the most was really the quality of the output on moderately complex real-world coding tasks. Having to switch between "prompt/vibe" and "manually implement" is such a big context switch burden, because you really have to ask yourself every few minutes if you're "holding it wrong" or the model is just too stupid.

It also doesn't really seem to handle transitions from "low-level implementation detail" to "high-level design" well, e.g., it wouldn't easily render tables and such. With Claude I don't have this issue.. so I think for now my verdict would be that it's not really a viable replacement. I really hope it will be in a few months time.

Oh and I used "aider" to replace claude CLI, which maybe that's also sub-optimal.. I'm not sure. The MCP marketplaces are useful of course, though arguably you could just manually replace them over time.

Comment by horsawlarway 1 day ago

With this model I don't generally switch to implementing things myself, although there are definitely times where I stop it and correct it mid-task.

It's prone to thinking longer and more repetitively, again - it's definitely not opus 4.7/4.8.

I've been using pi.dev as my harness for it, and been pleasantly surprised by how nice it feels (I have used aider, but only very briefly and quite a while back - so I can't realistically compare).

I would say it's roughly where I felt claude was a year back - Most of the sessions need to be more "pair programming" and less "I let it run for hours".

I'm a big fan of frequent "human in the loop" style workflows even when I'm on something like opus at work, though. I have opinions about lots of things, and re-inforcing that the model should stop and ask frequently seems to get me considerably better output, without having to "re-roll" if you will.

I've done a good bit of management, and I think it's roughly producing what a junior dev might produce in a day every 5 minutes. And just like a junior dev, you need to be steering it back on track fairly often.

Opus feels more like a mid-level at this point. I can hand it a chunk of work and "leave" but I still get better output if I'm checked-in and watching/steering.

Comment by unethical_ban 1 day ago

I'm so out of the loop on this stuff, it's the first time in my IT career I feel really behind on things.

I've used Claude Opus to quickly and effectively pound out some 100-200 line scripts that integrate with a vendor's API, and it one-shotted them both almost perfectly.

I wonder if for a lot of these local models, the scope of the AI assistance should simply be smaller: You architect the tools and the function definitions, and then tell AI to implement one at a time? Does anyone do that rigorously?

Comment by cruffle_duffle 8 hours ago

100-200 line scripts are tiny, especially for something easy to scope like a vendor API. Give Opus a much, much larger challenge and see what you get back. You really don’t need to see the code much at all anymore except for some steering now and then.

Comment by NamlchakKhandro 23 hours ago

aider sucks tbh... you should invest time in learning how to customise pi. every other harness is crap and hype.

Comment by twothreeone 15 hours ago

Thanks! yeah, maybe I'll try to look at that next..

Comment by Applejinx 21 hours ago

See, that makes it sound better in my world: I'm doing all 'manually implement' all the time, and have no interest in becoming the ten millionth manager to hit the software dev landscape. It boggles my mind that people think this is a win.

I regularly use a pocket calculator: either a physical one, or Apple's Calculator app if I want more decimal places. 'Too stupid' isn't a thing for me if it can cough up some math that would be inconvenient for me to work out by hand. tok/s also isn't a concern because I'm not expecting more than I can read. My ideal scenario would be occasional diversions into querying a 'coding calculator' that can give examples along the lines I want, my way.

I'll make a mental note that Qwen shows signs of being the kind of calculator I'd use for a specific task. Context switching isn't a burden if you're not looking to switch over to vibe/manage and stay there.

Comment by twothreeone 15 hours ago

The thing is: it does _feel_ like you're moving faster when Claude is in the zone and does what it's supposed to. You're essentially flying a plane on autopilot, occasionally telling it to slightly adjust course. Only now you can fly 10 planes in parallel, all to different destinations.

Is it _objectively_ more productive? I doubt there's a clear-cut answer in the long run (my main suspicion is that since you're essentially creating 10x unnecessary complexity, you'll likely never recover from all the cruft and maintenance kills you in the end - maybe people will find solutions for that though).

Comment by gonzalohm 1 day ago

Did you double the tokens per second by adding a second GPU or was the increase significantly less?

Comment by horsawlarway 1 day ago

No real change in inference speed. It basically just allows me to slot in more context or a bigger model.

A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.

Sometimes that matters, a lot of times it doesn't.

On the speed front - MOE models are great. Biggest perf difference in modern models is the move to MOE architectures.

I get very similar quality from both the Gemma-4 31B dense model and the Gemma-4 26B MOE model (both at Q4 quant), but the MOE version runs at ~3 times the speed (150tok/s vs 46tok/s).

Comment by mirekrusin 1 day ago

You’re adding extra gpu for more vram, not speed.

Comment by l332mn 1 day ago

Mind sharing your setup? I also have dual 3090s, but getting nowhere close to 300k context limits with 4 bit quantized models at that size (using vllm).

Comment by agup792 1 day ago

That sounds amazing. If I had some GPUs sitting around, I would totally do it. Sounds expensive to do it otherwise though.

Comment by anhtqweb 1 day ago

Grocery list management and meal planning sounds interesting. Would you mind sharing a little bit more on your use case please?

Comment by codinhood 1 day ago

I don't think you're going to get many "true" answers to this. The opportunity cost of not using the latest and best models is just too much right now.

Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus is just not worth it right now. If it were, it would be disruptive enough to be in the news.

Not that I'm discounting the possibility that someone has already solved this; I'm just trying to Occam's-razor my way out of diving too deep down rabbit holes.

Comment by pyeri 1 day ago

At some point, there will come a saturation point for that "Opportunity cost FOMO train ride", and I think we are already past that point. Mythos class models are a whole different beast and cutting edge on reasoning, but not much use for the problem domains most developers are trying to solve.

The present Sonnet/Opus versions (~4.8) will likely be what everyone in the enterprise ends up using eventually. And even though local models aren't there yet, there are budget alternatives from the families of DeepSeek, Kimi, GPT, MiniMax, etc. available through the APIs of NVIDIA, OpenRouter, Groq, etc. which are very much Sonnet grade.

Comment by codinhood 1 day ago

Yeah this is exactly what I'm waiting for.

Personally, I don't think we're at that point yet. While I do think model improvement is starting to plateau (reaching a local ceiling), I'm not convinced local models are as good as sonnet/opus yet. The gap is still too much. But I'm excited for those models to reach those levels.

Comment by bob1029 1 day ago

I've got a machine in a corner collecting dust that cost me $12k to build 2 years ago. It runs fine but it's wildly impractical to use as a daily driver (loud/hot). I keep it as a reminder to not do this again.

At my current pace it would take me until sometime late 2030 to spend the same amount in gpt5.5 tokens.

Comment by anonzzzies 23 hours ago

You forget that, especially on HN, many people are scaremongering that prices will soon skyrocket. Then it will be another story... I easily run $4k+/mo on my claude sub; if I would have to pay that, I definitely would spend 12k on hardware instead and accept a dumber helper.

Comment by jonfw 21 hours ago

You are not stuck between public API pricing for frontier models via Claude and self hosted.

Remember that there are other LLM providers, open models, and previous gen models that are way cheaper than frontier Claude and still way better than what can realistically run locally.

Comment by mark_l_watson 1 day ago

Sounds like a correct conclusion to me also. I am trying to transition to a layered system: local, then OpenCode with commercial vendor APIs for models like DeepSeek v4 flash, then DeepSeek v4 Pro.

With a layered approach we can slowly shift to running more locally and still get required work done. Really, my local setup is so much better than it was 2 months ago, and extremely better than 6 months ago - on the same hardware.

Comment by sakopov 1 day ago

This seems to be the answer. Building a rig with a decent graphics card will cost $2k+ and will produce sub-par results. Might as well milk the $100/m Claude sub until open-source alternatives reach parity with today's frontier models.

Comment by phyzix5761 1 day ago

The opportunity cost to who? It's getting super expensive for businesses and engineers across the board to pay for frontier models.

Comment by Gigachad 1 day ago

The cost of the hardware to run local models is still massively more expensive than the subscriptions while offering worse models.

Eventually I think it will even out but right now the hosted stuff is very subsidised.

Comment by kristopolous 1 day ago

that's super contextually dependent. I use them just as essentially a decompress of what I already know that I'm doing. I legitimately use 4B models just fine. I've got a large number of tools that make this entirely feasible and a daily driver for me (like https://github.com/day50-dev/llm-manpage-tool) ...

It's not really a bitter lesson here, I can scale those 4B models easier than someone can scale their 1000B models.

Comment by gunapologist99 1 day ago

Rather than Occam, consider Pareto?

If you truly believe that it WILL get there within the next couple of years, then you might as well start playing with it now (and, yes, you will be very surprised, especially for shorter/smaller projects or nicely modularized larger projects)

Comment by MadrasThorn 1 day ago

It's great at accelerating hardware innovation however.

Comment by jrm4 1 day ago

But you're pretty much measuring opportunity cost in tokens per second, no?

I think it strongly remains to be seen whether e.g. tokens per second (multiplied or whatever by perceived quality of private model) actually means "better or more useful output."

I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)

Comment by codinhood 1 day ago

If you’re arguing that model metrics don’t necessarily translate into useful output, I agree. That’s not how I measure the success of a model and not really the point I'm trying to make. I try to set things up and test them on my actual projects.

What I’m saying is that if local models were actually comparable to Claude Code in practice, we wouldn’t be having threads like this. It would be obvious to the people using them, and it would be massively disruptive. Why would individuals and companies pay hundreds or thousands for Claude Code if they could run something locally and consistently get similar results?

Every month I revisit the local ecosystem hoping the answer has changed. So far, my experience has been that it hasn’t.

Comment by jrm4 1 day ago

Having, e.g. seen Microsoft maintain a monopoly for well over a decade, there's nothing in my experience that suggests that "quality always beats hype" is remotely true.

It's entirely possible Claude is just winning the hype game.

Comment by robertlagrant 23 hours ago

Microsoft have not maintained a monopoly on search, mobile, or maps, and they seem to mostly maintain their large market segments based on familiarity, not hype.

Comment by jrm4 18 hours ago

? I was speaking historically, not now

Comment by robertlagrant 17 hours ago

I don't know why you're surprised; you didn't specify you meant historically.

Comment by jrm4 9 hours ago

That was the point of the "e.g."

Comment by Rastonbury 1 day ago

I think they are referring to the opportunity cost of time saved on doing things a local model cannot do, or fixing its mistakes, against the cost of a subscription.

Comment by NamlchakKhandro 22 hours ago

Thinking Claude is leading edge... I really think you need to re-evaluate what research you think you're doing.

Claude Code is not Claude Opus/Sonnet/Haiku.

Comment by reassess_blind 20 hours ago

What is leading edge then?

Comment by bluejay2387 1 day ago

About 90% of my coding is on Qwen 3.6 27b and Open Code with some custom skills and Semble. It is NOT as smart as CC or Codex but it's enough to get most of my work done. I didn't set out to replace CC and Codex (I have an RTX 6000 so the TPS is faster than I care about, but the RTX 6000 was originally for other work). I only tried this as an experiment to see how close you could get to a frontier model for coding, but it was good enough that I stuck with it. I still fall back to Codex for really complicated stuff and to polish UIs, as that seems to be the weakest element of working in Qwen. This isn't a recommendation because I don't think most people have an RTX 6000 lying around and the cost would be many years of MAX CC or Codex subscriptions, but at least this seems possible. Maybe in a few more years it will even be practical.

Other Notes: I have had to set the compact target to 75% on a 256k context window, as once the conversation length goes above 100k I start seeing a drop in quality and speed. This becomes very problematic after about 150k. I tried Qwen 3.5 122b too but it actually seems much worse at coding than 3.6 27b even though it's much larger. Maybe because I am using a 4bit quant or maybe I just don't have it configured correctly? I know 3.6 is newer but I didn't expect it to outperform a model that is much larger from the prior generation. Gemma 4 31b is a good model for other tasks but at least my personal experience is that Qwen outperforms it in coding. Nemotron Super 120b is great at a lot of stuff but it also seems to be not as good at coding as Qwen. This was very surprising to me.

Comment by heipei 1 day ago

Same here, I use Qwen 3.6 27b (Q6 quant) with llama.cpp on an RTX 5090 using the pi agent exclusively now. The fact that it's local means that I never have to think about token pricing, quotas, time of day, or data sensitivity. I have limited the GPU from 600W to 450W which means the system stays whisper quiet during inference.

I have become so "lazy" (in a good way), so far that I've started using the model for lots of daily mundane things on top of just coding:

  * "commit this on a branch, push, create a PR and assign $nickname for review"
  * "Use the Stripe CLI to download all open and overdue invoices and reconcile them with this CSV export from our bank account."
  * "Use these Elasticsearch credentials to summarise what kind of operations are causing load at the moment."
  * "Tell me if our codebase already supports X and where it's implemented."

Comment by amarshall 1 day ago

What context length and kv cache quant (if any) are you using? And MTP?

Comment by heipei 1 day ago

No KV cache quant, context length 50% of original, MTP absolutely. These are the relevant cmdline attributes. Getting around 100t/s with this setup, even when watt-limited to 450W.

  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 0 --metrics --jinja --chat-template-file chat_template.jinja --chat-template-kwargs '{"preserve_thinking": true}' --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-p-min 0.75 -ngl 99 -c 131072 -fa on -np 1 -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q6_K
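
Assuming these flags are passed to llama.cpp's llama-server, which serves an OpenAI-compatible HTTP API, a client could talk to it roughly like this. The address and port (127.0.0.1:8080) and the example prompt are assumptions, and the helper names are hypothetical; the sampling values simply mirror the flags above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumed address of the local server


def build_request(prompt: str) -> urllib.request.Request:
    """Build a chat completion request for an OpenAI-compatible endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,  # mirrors --temp 0.6
        "top_p": 0.95,       # mirrors --top-p 0.95
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the text of the first choice."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires a running server, so it is left commented out here:
# print(ask("Tell me if our codebase already supports X."))
```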

Comment by lloyd-christmas 1 day ago

Not the person you asked, but I have a 9700 which has the same VRAM, and running Q6 on it with unquantized kv gives me 50k context. Putting -ctv q8_0 ups that to 70k. I normally run Q4 with unquantized kv @ 130k at 50 t/s (mtp 3), with the disclaimer that I'm running PCIe gen4x8, so I'm slightly slowed. I've found that quantizing k leads to broken json on tool calls, which is fairly unrecoverable, but YMMV.
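
The jump from 50k to 70k is roughly what the storage formats predict. A sketch, assuming an f16 cache uses 2 bytes per element, q8_0 uses 34 bytes per block of 32 elements, K and V are the same size, and the memory budget for the cache stays fixed (all other buffers are ignored, and the function name is hypothetical):

```python
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values plus one f16 scale per block


def scaled_context(base_ctx: int, k_bytes: float, v_bytes: float) -> int:
    """Context that fits in the same KV memory as base_ctx with f16 K and V."""
    base_cost = F16_BYTES + F16_BYTES  # bytes per K/V element pair, unquantized
    new_cost = k_bytes + v_bytes       # bytes per pair with the chosen cache types
    return int(base_ctx * base_cost / new_cost)


# Only V quantized to q8_0, K left at f16
print(scaled_context(50_000, F16_BYTES, Q8_0_BYTES))
```

This estimate lands at about 65k, in the same ballpark as the reported 70k; the remaining difference would come from buffers this sketch leaves out.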

Comment by bo1024 1 day ago

Qwen3.5-122B is actually Qwen3.5-122B-A10B. The A10B means that this is a "mixture of experts" model where only 10B parameters are activated at a given time. Whereas Qwen3.6-27B is a "dense" model where all 27B parameters are activated all the time. So for many tasks, you'd expect the 27B dense model to be better than the 122B-A10B model.

Comment by user43928 1 day ago

I am forced to use Qwen 3.6 27b at work and found it next to useless. I might as well do all the work manually rather than having it implement another mess or get the debugging entirely wrong.

It feels like anything less than Sonnet is just a waste of time, apart from use as a smarter search function.

It also strikes me as strange that you would mention Codex for UI polish, as it's notoriously bad at UI, and far behind Claude Opus. Altman specifically posted that they are working to improve this for the next model release.

Comment by sejje 1 day ago

It might be good at analysis & review, writing documentation, git commits, etc--even if it's not good at coding.

All the drudgery.

Comment by user43928 1 day ago

Bad AI written documentation and commits are not great, particularly when you work in a team.

I almost find it offensive when colleagues open a MR with an obvious slop description that's frequently inaccurate.

That said, I find AI useful for a lot of drudgery like resolving merge conflicts or splitting changes out into separate MRs.

Particularly with the latter I had issues with small models, they butchered the changes I wanted moved. Not even on the second attempt did GPT 5.4 mini manage to move 10-20 lines to another file without modifying them in the process.

Comment by htrp 1 day ago

why 27b vs 35b? Is MoE that much worse for coding?

Comment by amarshall 1 day ago

Can take the geometric mean of total and active parameters of MoE to get approximate equivalent quality to dense model params. So sqrt(35*10)≈18.7.

The trade-off of MoE is that it is worse but faster for the same total size.
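
A minimal sketch of that rule of thumb. It is a heuristic, not a benchmark, and the function name is hypothetical. The first call uses the figures above; the other two use the 35B-A3B and 122B-A10B configurations mentioned elsewhere in the thread.

```python
import math


def dense_equivalent_b(total_params_b: float, active_params_b: float) -> float:
    """Geometric mean of total and active parameters, in billions."""
    return math.sqrt(total_params_b * active_params_b)


print(round(dense_equivalent_b(35, 10), 1))   # 18.7, the figures used above
print(round(dense_equivalent_b(35, 3), 1))    # 10.2, 35B total with 3B active
print(round(dense_equivalent_b(122, 10), 1))  # 34.9, 122B total with 10B active
```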

Comment by electronsoup 1 day ago

Yeah MoE is a little worse for the same size, but you can often run bigger MoEs at respectable speeds even on cpu ram offload. The dense models really need to be 100% vram

Comment by pierotofy 1 day ago

Yes. Llama.cpp + Qwen3.6-35b (MTP) + OpenCode is quite capable and runs on a single RTX 3090 and is faster than most cloud models. Quality is like running edge models from 8-12 months ago. Setup details at https://github.com/pierotofy/LocalCodingLLM/

Comment by jacobgold 1 day ago

"Quality is like running edge models from 8-12 months ago."

That sounds great for hobbyists but IMHO it wasn't until Opus 4.6 was released six months ago (Dec 25, 2025) that we had a model good enough for professionals to use as a primary driver of their coding agents. That seems to be the threshold worth aiming for.

Comment by sbrother 1 day ago

I strongly agree on that being the release where these tools got good enough to substantially speed up my professional work. I have to admit I was super skeptical of AI coding until then.

Comment by deaux 1 day ago

Your skepticism led you to underrate the usefulness until then. Those who have been using agentic coding for the last 2 years can tell you Opus 4.6 was not a step change in quality, it was mostly a step change in the Overton Window and narrative.

Comment by dnautics 1 day ago

for me (might be because of the language im using) i had a substantial bump around september and a huge bump around January.

in my stuff now i use an OT library that claude put finishing touches on in September.

Comment by storus 1 day ago

You can already get Opus 4.6 level of performance on subtasks with some local models. So you need to pick a proper code writer, plan writer, code tester etc. model that matches your target expectations and use a coding tool that allows calling different LLMs for different subtasks. For example, people use StepFun 3.x or DeepSeek4-Flash for planning, Qwen3.6-27B for coding.

Comment by kelnos 1 day ago

Not sure what you mean by "primary driver", but I was finding even Sonnet quite useful for coding tasks, even about 12-14 months ago (I was too cheap to pay more than $20/month back then, and Opus hit my limits too quickly).

Certainly I get a ton more value out of Opus today, but I could absolutely see someone deciding to limit themselves to 8-to-12-months-ago Opus performance for privacy (or other) reasons.

Comment by alexandra_au 1 day ago

You have your dates and models wrong, it was Opus 4.5 released in November 2025, that changed everything, Opus 4.6 was released in February 2026.

Comment by jacobgold 1 day ago

You're right. December is when things felt different but Opus 4.5 was actually released November 24, 2025.

https://www.anthropic.com/news/claude-opus-4-5

Comment by Projectiboga 1 day ago

So then it might be 6-8 months to get to usable on a local open model? Of course state of the art will be a year ahead, a generation at the current pace.

Comment by pierotofy 1 day ago

I use it for work.

Comment by jacobgold 1 day ago

That's cool if you prefer it, but it is hard to imagine it being a strictly rational choice when much better quality is available at a price that is small relative to the cost of an employee. Or is there something specific about your use-case?

Comment by vector_spaces 1 day ago

Not all work requires every facet to be so sharply optimized, and there may be other constraints that are completely invisible to you. Some that were easy for me to imagine: the parent works in a heavily regulated industry, their IT team is slow-moving and paranoid and this is a safe, under-the-radar workaround, the output is "good enough" for their purposes and they find tinkering with it to be fun.

Regardless I don't think it's fruitful to be so condescending with such little insight into this person's situation. Even if you had total insight -- let people be and withhold your judgement, or at least keep it to yourself. Making people feel stupid is a great way to turn people off to pretty much anything else you have to say

Comment by pierotofy 1 day ago

To me, what's not rational is believing you must rent the tools of your trade while exposing all of your employer's intellectual property to a third party. Difference of opinion.

Comment by jacobgold 1 day ago

It's not my opinion that you "must" rent tools but it certainly is the pragmatic choice in 2026. I would be as happy as anyone for this situation to change and I expect it to at some point.

Comment by lokar 1 day ago

Won’t it depend on what you use it for? A less capable system might be fine for boilerplate, moderate re-factoring, etc. Not everyone is building whole features in one go.

Comment by epolanski 1 day ago

Why don't you people bother to try instead of chasing the latest shiny thing?

You must be the type of crowd that writes websites with React and Tailwind and pretend to be engineers and have an opinion on everything.

Comment by trueno 1 day ago

i have a 128gb m4 max macbook pro i've been wanting to tinker with this stuff but genuinely never find the time. any mac users in here running similar to the above that can share their experience?

i always see great debates with local stuff but the space is constantly moving goalposts and all the vernacular is pretty unfamiliar to me. i'd love to understand what people with objective experience feel they've traded away (or gained) when going local so i can determine for myself if these things are a good fit.

Comment by brycesub 1 day ago

If you have a 128GB Mac you really ought to try out: https://github.com/antirez/ds4 by the creator of redis. This is probably as close as it gets to state-of-the-art local LLM + agentic coding.

Comment by __mharrison__ 1 day ago

Using this just this morning on my DGX Spark. A little slower than frontier models but my $200/mo weekly usage exhausted with 3 days left on the week...

(Shouldn't have done that refactoring job in high mode)

Comment by trueno 1 day ago

well this is supremely interesting thanks for putting it on my radar

Comment by lostlogin 1 day ago

Thank you.

Comment by htrp 1 day ago

Use your ClaudeCode sub and tell it to set it up for you

Comment by dirkolbrich 1 day ago

I have the same machine. You might look into https://omlx.ai/ a „macOS-native MLX server“. pi.dev for the agent with MCP, web-search and sub-agents extension.

Comment by atomicnumber3 1 day ago

Same. I have no desire to use Claude at all anymore.

Comment by pierotofy 1 day ago

Yep. Screw Anthropic, CloseAI and all other rent seekers in this space.

Comment by akulbe 1 day ago

I have an M2 Max MBP with 96GB of RAM. What models and setup would you use for this kind of configuration?

Comment by monirmamoun 1 day ago

download LM Studio to play with, and it will let you search for models... try Qwen3.6-35B-A3B at 4, 5 or 6 bits (6 bit XL is near perfect) and use pi coder or another harness to access it... you can also try Unsloth studio and try same model to start. LM Studio is slightly easier to use, Unsloth probably better quality. Neither one is super great quality by the way (meaning: they crash or act weirdly too often to be full production solutions, but can work for local coding). ONCE YOU DOWNLOAD EITHER APP... it will let you search huggingface for the models. Just type qwen to start looking and ... start messing around. And you connect the pi coder harness using the http interface that LM Studio and Unsloth offer to the engine API, so make sure you figure out that url and turn it on... something like 127.0.0.1:1234/api would be a typical IP (localhost) and port (1234 is used by LM Studio)
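A quick way to confirm the URL before wiring up a harness is to build a request by hand. This is a minimal sketch, assuming the local server speaks the OpenAI-style chat completions shape (the same `/v1/chat/completions` path that appears in a llama.cpp config elsewhere in this thread). The base URL, path and model id below are assumptions, so check what your server actually reports.

```python
import json
import urllib.request

# ASSUMPTIONS (not from the comment): base URL, path and model id.
# Read the real values off your local server's UI or docs.
BASE_URL = "http://127.0.0.1:1234/v1"
MODEL_ID = "qwen3.6-35b-a3b"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = build_chat_request("Reply with the single word: ok")
    print(req.full_url)
    # Uncomment once the local server is running and the URL is confirmed:
    # print(send(req))
```

If `send` comes back with a reply, the same base URL is what the harness needs.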

Comment by daveidol 1 day ago

Do you do your dev work on the windows machine (referenced in the docs), or do you remotely access it from a separate machine? I ask because I have a RTX 3090 kicking around in a gaming desktop, but I don't use it for any dev work (I use a Macbook Pro).

Comment by snake_n_my_boot 1 day ago

I have a similar set up and have been using it to learn and tinker with open models. I run Ollama on the gaming desktop and point OpenCode to it from my MacBook. Works nicely for me so far.

Comment by NamlchakKhandro 22 hours ago

don't waste your time with windows.

Comment by lelandbatey 1 day ago

I use it, it's good, I get work done, but know that they really mean it when they say

> "Quality is like running edge models from 8-12 months ago"

Don't expect Opus, expect more like Haiku. If you micromanage it, you'll get great results. If you want it to be a human in a box, it'll flounder.

Comment by dheera 1 day ago

Am I doing something wrong or has ollama become shittified?

I'm looking at https://ollama.com/search and the top few models like kimi-k2.7-code say "cloud" and I can't seem to ollama pull them.

I thought the whole POINT of ollama was not-cloud?

Comment by hoherd 1 day ago

I experienced the same situation a month or two ago. One of my friends sent me this article that was illuminating. https://sleepingrobots.com/dreams/stop-using-ollama/

Comment by satvikpendem 1 day ago

Ollama is not recommended to be used. Use llama.cpp.

Comment by jmorgan 1 day ago

The larger models are available on Ollama's cloud as most folks don't have the hardware to run 500B-1T parameter models.

Comment by jubilanti 1 day ago

> I thought the whole POINT of ollama was not-cloud?

It was at first, then the developers realized they had a massive userbase they could monetize. A tale as old as open source...

Comment by toyg 1 day ago

Yes, you've nailed it. Ollama are desperately trying to pull a Cursor - like 3791 other projects in this space.

Comment by dominotw 1 day ago

how much does the setup cost if i want to buy all the hardware now and increased power costs?

Comment by sosodev 1 day ago

The problem with this question is that it encompasses a huge spectrum of capabilities and expectations. If you can only run an 8B model and expect it to be good at vibe coding / one shotting things you're going to have a bad time.

If you're able to run a model on the scale of ~30B, you can find that with a reasonably scoped and well defined task they do very well. I've found both Gemma4-31B and Qwen3.6-27B to be the best in this range at the moment. You can swap in the MoE models for faster inference, but they are noticeably worse at most tasks. They can one-shot / vibe code tasks with small scope, but still do much better with guidance.

If you really want frontier-like capabilities, you'll probably need at least 128GB of memory and either huge compute or a lot of patience. Most people just don't have either the money or the patience to make these local models work.

The patience required for local model usage goes far beyond just waiting for tokens though. It takes a lot of effort to get things configured and working properly for your workflow and hardware.

Comment by argee 1 day ago

I use Gemma 4 26B A4B on my Macbook (M4 Pro, 48 GB RAM) to study Rust (and ask other myriad questions). I don't trust it to do a good job in an IDE/harness to one-shot anything but the most trivial of changes. Still, it's fast and good enough that it could handle being a "co-pilot" on small to medium context tasks where you've got your hands on the wheel and your eyes on the road — and are driving under the speed limit. That's remarkable given where we were a couple of years ago.

I don't think I'd be using AI to code at all if this weren't the case. (I don't want to feel stunted or stuck just from losing my internet connection.)

Comment by user43928 1 day ago

My experience with smaller models, in this case specifically GPT 5.4 Mini, is that they cannot two-shot moving a 10-20 line code change to another file without modifying it and introducing bugs.

I did not expect perfect reliability, but I thought they could at least get it right on the second attempt once you point out the difference. No such luck, it confidently tells you that now the code is the same, with yet another subtle bug added in the difference.

I don't know what work one would need to do where these garbage-class models would be adequate. Maybe they can masquerade as competent for a few minutes, but in the end the results simply are not right. At best they are suitable for a smarter search or autocomplete, in my opinion.

Comment by Applejinx 20 hours ago

Rather than 'smarter search or autocomplete', maybe the better analogy is 'flexible information lookup about coding that's more responsive to search terms'? I see it not as an intelligence but as a wildly, spectacularly compressed knowledge base. You're trying to get search results that encompass almost any possible thing you could ask for, but rather than drawing from some textbook you're drawing from a distilled combination of ALL textbooks and everything web-scrapable since before the dotcom days.

Of course this doesn't produce a useful person who always makes right choices, but isn't it interesting that you can compress that heavily and draw results out in such a casual way? Seems this remains relevant.

Comment by what 1 day ago

Is it not faster to just do that move yourself instead of asking the clanker to do it?

Comment by user43928 1 day ago

Typing "Create a branch for X and open a separate MR" is faster than me manually creating a branch, selecting and copying changes there, and then opening a MR.

Comment by Kostic 1 day ago

For personal needs I connected VSCode with llama.cpp running Qwen 3.6 27B or Gemma 4 31B and it's good enough to cancel my cloud subscription.

Qwen running on my 1st GPU at q4@176k context from 70 to 50 tok/s with MTP, pretty good for coding.

Gemma on the other hand is using both GPUs, running q8@64k context, doing document sentiment analysis, summarization, proofreading and translating, at consistent 25 tok/s. Somewhat slow but usable for batched workflows. Might get some more once llama.cpp starts supporting MTP with tensor split mode.

Still using frontier LLMs at dayjob since I'm not paying for it and those are obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.

EDIT: Prompt processing starts from 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k of tokens and require from 60 to 90 seconds to be processed. Not great but acceptable.

Comment by fitzn 1 day ago

What extension do you use in vscode to connect it to local llama.cpp? Or do you auth with github copilot and then point to localhost? Or something else?

Comment by Kostic 1 day ago

Auth with Github Copilot and then point it to localhost[0]. Hopefully the auth to Copilot requirement will be dropped for local models at some point. Would love to use a fully open stack (VSCodium and everything) in the future. My config:

```
[
  {
    "name": "http://127.0.0.1:8888/v1",
    "vendor": "customendpoint",
    "apiKey": "llama.cpp",
    "models": [
      {
        "id": "gemma4-31b",
        "name": "Gemma 4 31B",
        "url": "http://127.0.0.1:8888/v1/chat/completions",
        "toolCalling": true,
        "vision": true,
        "maxInputTokens": 65536,
        "maxOutputTokens": 8192
      },
      {
        "id": "qwen3.6-27b",
        "name": "Qwen 3.6 27B",
        "url": "http://127.0.0.1:8888/v1/chat/completions",
        "toolCalling": true,
        "vision": true,
        "maxInputTokens": 180224,
        "maxOutputTokens": 8192
      }
    ]
  }
]
```

[0] https://code.visualstudio.com/blogs/2025/10/22/bring-your-ow...

Comment by khimaros 1 day ago

i made this specifically for use with vscode/llama.cpp: https://github.com/khimaros/mortar

Comment by stymaar 1 day ago

Yes, Qwen3.6-35B-A3B on a Strix Halo 128GB (Bosgame M5).

I have way too much VRAM for such a model but Qwen never released the 122B version of Qwen3.6, which is the best class of model for my hardware. But at the same time my electricity bill is negligible, this is originally a laptop chip and it shows, it consumes almost nothing while idle and a little above 120W during prompt processing.

And Qwen3.6 has been surprisingly effective for me, I still use Claude occasionally but only for like 10% of my needs which allows me to stay well under the quota even with the cheapest plan.

Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).

Comment by manmal 1 day ago

Have you tried the 27B dense version? It’s way better for coding.

Comment by anana_ 1 day ago

Unfortunately on Strix Halo or any similar unified memory set up, dense models are gonna be dirt slow due to the tiny memory bandwidth... But I agree, 27B is superior.

Comment by stymaar 1 day ago

Exactly. That's why I'm disappointed there wasn't a 122B version, it's 27B but for Strix Halo users.

Comment by arjie 1 day ago

Not “local” and not interactive coding but sharing since it might be helpful. I have 2x RTX Pro 6000 Blackwell running DeepSeek V4 Flash. I get 160 tok/s raw but it’s a reasoning model. For my use case, I have it auto-write code and another system auto-review the code.

I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.

Comment by akersten 1 day ago

> I have 2x RTX Pro 6000 Blackwell

Where did you find/order these? All the sites I can find are either out of stock, only sell to businesses, or are otherwise sketchy...

Comment by arjie 1 day ago

I run a small business (https://technologybrother.com) that runs a few small SaaS so I ordered the GPUs through corporate sales. If the barrier is getting an LLC, those are relatively cheap. The nice thing is that if you've got a legitimate business with use for GPUs you can get into the Nvidia Inception Program which has a pretty solid discount.

Comment by zackify 1 day ago

Microcenter is the easiest place but almost any vendor will sell to you after you email them and if you have an LLC

Comment by CamperBob2 1 day ago

Central Computer is a good source in my experience: https://www.centralcomputer.com/all-products/ai-components/a...

No affiliation, I've just ordered from them a few times.

Comment by alexellisuk 18 hours ago

What quant?

Comment by arjie 15 hours ago

fp8

Comment by leptons 1 day ago

Have you measured your electricity consumption for this rig? I have to wonder how much it would cost you per month.

Comment by ux266478 1 day ago

Not nearly as much as you might think. 1.2 kW where I live translates to about $0.12/hr, and that's when running full clip. If you have a decent solar hookup, it's a small fraction on a sunny day.

The expensive part is the upfront hardware cost and the electrical system upgrade you'll need to give your house.
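For anyone wanting to plug in their own numbers, the arithmetic behind that figure is short. A rough sketch: the 1.2 kW and $0.12/hr come from the comment above, while the 30-day month and the 8 h/day schedule are assumptions added for illustration.

```python
def implied_rate_per_kwh(cost_per_hour: float, draw_kw: float) -> float:
    """Electricity rate implied by an hourly cost at a given power draw."""
    return cost_per_hour / draw_kw


def monthly_cost(draw_kw: float, rate_per_kwh: float,
                 hours_per_day: float = 24.0, days: int = 30) -> float:
    """Cost of holding a constant draw for the given schedule."""
    return draw_kw * rate_per_kwh * hours_per_day * days


if __name__ == "__main__":
    # 1.2 kW and $0.12/hr are from the comment; the schedules are assumptions.
    rate = implied_rate_per_kwh(cost_per_hour=0.12, draw_kw=1.2)
    print(f"implied rate: ${rate:.2f}/kWh")
    print(f"24/7 for 30 days: ${monthly_cost(1.2, rate):.2f}")
    print(f"8 h/day for 30 days: ${monthly_cost(1.2, rate, hours_per_day=8):.2f}")
```

At the implied $0.10/kWh, running flat out around the clock comes to roughly $86 for a 30-day month, and real usage with idle time would sit well below that.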

Comment by leptons 1 day ago

I'm paying about $0.19/hr and using half that power just for a large spinning RAID, running some VMs and security cameras. And I'm reconsidering my digital extravagance because of the electric bill. You probably make way more money than I do.

Comment by mtone 1 day ago

Here's a DeepSeek-V4-Flash benchmark on 2X RTX Pro 6000:

  - Prefill: ~10K tok/s
  - Decode: 190 | 375 | 980 tok/s (for 1 | 4 | 16 concurrent requests)
  - GPU power draw during benchmark: Average: 585W | Max: 849W | Limit: 1200W with undervolt. Idle PC is 125W.

I've asked it to calculate the following considering a realistic blend of cached prompts and decode for agentic dev scenario.

Electricity-only (@ USD $0.08/kWh)

  Usage          | IN price  | OUT price | Monthly cost
  Concurrency=1  | $0.040/M  | $0.080/M  | $8.65 to $38.88 (5% to 100% active)
  Concurrency=4  | $0.024/M  | $0.044/M  | up to $48.67 (cheaper per token but higher power draw)

Total cost of ownership over 3 years is electricity + USD $20K (pre-hike pricing). In a production scenario, how much would I have to charge my users to break even, aiming for 4 concurrent requests 24/7?

A) Breakeven API pricing (est. 2B IN + 1B OUT throughput/month):

                        IN price    OUT price
  Self-hosted           $0.121/M    $0.363/M
  OpenRouter (budget)   $0.098/M    $0.196/M
  OpenRouter (DeepSeek) $0.140/M    $0.280/M

B) Breakeven subscription (users active ~1.5h/day):

    1 user: $563/mo (oh, hai)
    25 users: $23/mo
    100 users: $6/mo
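The breakeven figures can be approximately reconstructed from the inputs given above. This is a sketch of one way to get there, not necessarily how the numbers were originally produced. Two things are assumptions: the 3:1 OUT-to-IN price ratio (inferred from $0.363 / $0.121), and the use of the low-activity electricity figure for the subscription case.

```python
HARDWARE_USD = 20_000        # from the comment
MONTHS = 36                  # 3-year horizon, from the comment
ELECTRICITY_C4_USD = 48.67   # monthly electricity at concurrency 4, from the table
ELECTRICITY_LOW_USD = 8.65   # monthly electricity at 5% active, from the table
IN_TOKENS_M = 2_000          # 2B input tokens per month, in millions
OUT_TOKENS_M = 1_000         # 1B output tokens per month, in millions
OUT_TO_IN_RATIO = 3.0        # ASSUMPTION: inferred from 0.363 / 0.121


def monthly_total(electricity_usd: float) -> float:
    """Amortized hardware plus electricity for one month."""
    return HARDWARE_USD / MONTHS + electricity_usd


def breakeven_api_prices(total_usd: float) -> tuple:
    """Per-million-token IN and OUT prices that recover total_usd."""
    weighted_tokens_m = IN_TOKENS_M + OUT_TO_IN_RATIO * OUT_TOKENS_M
    in_price = total_usd / weighted_tokens_m
    return in_price, in_price * OUT_TO_IN_RATIO


def breakeven_subscription(total_usd: float, users: int) -> float:
    """Flat monthly fee per user that recovers total_usd."""
    return total_usd / users


if __name__ == "__main__":
    in_price, out_price = breakeven_api_prices(monthly_total(ELECTRICITY_C4_USD))
    print(f"API breakeven: ${in_price:.3f}/M in, ${out_price:.3f}/M out")
    for users in (1, 25, 100):
        fee = breakeven_subscription(monthly_total(ELECTRICITY_LOW_USD), users)
        print(f"{users} users: ${fee:.2f}/mo")
```

Under those assumptions the API prices round to the quoted $0.121/M and $0.363/M, and the subscription fees land within about a dollar of the quoted ones, so the hardware amortization ($20K over 36 months) dominates everything else.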

Comment by antirez 20 hours ago

Interestingly if we assume 16 concurrent users, prefill drops to 600 t/s and generation to 61 t/s, and this starts to be dangerously near to M5 Max 35 t/s generation and 400 t/s prefill you get with DwarfStar in your own laptop (that you use for many other things) that costs ~6500 usd/eur.

Comment by zozbot234 19 hours ago

DwarfStar and other end-user inference engines should also support batched/concurrent inference IMHO. Not so much for the overly naïve "serving multiple users" case (the local hardware cannot really compete with ordinary datacenter gear, much less with the big proprietary suppliers; the compute headroom is too small to begin with once the model is in RAM) but rather to improve SSD streamed decode in the unattended inference scenario, where the goal is to meaningfully raise aggregate tok/s whilst facing an overly tight constraint on disk bandwidth, and CPU/GPU compute have a lot of slack.

Of course this requires wide enough batches to have at least some reuse of fetched experts across a batch, but that seems feasible in the "unattended" case, where firing off multiple inferences to be processed together seems quite natural. (We may also have some benefit from better use of the resident experts cache and/or of SSD transfer bandwidth.)

https://github.com/antirez/ds4/issues/275 seems to provide intriguing rough results while https://github.com/antirez/ds4/issues/314 is a valuable contrast where one commonly suggested solution ("just run multiple instances of the engine in parallel") ran into real issues. Neither of these discuss the combined use of batching and SSD streaming yet, so there's room for experimentation.

Comment by arjie 1 day ago

Vouched your comment. Very cool. What are you running on to get 190 tok/s? I get 400 tok/s at c=4 but c=1 is slower than you.

Comment by mtone 1 day ago

I am using the `voipmonitor/vllm:lucifer` docker from the RTX6K discord community discussed at the same link the other commenter posted. It is based around this PR https://github.com/vllm-project/vllm/pull/43477

Comment by arjie 23 hours ago

Ah I’m on the same PR just behind. Thank you.

Comment by CamperBob2 1 day ago

Not OP, but I am seeing up to 260 tokens/second output at c=1 with the recipe at https://github.com/local-inference-lab/rtx6kpro/blob/master/... using 4x 6k cards. Average is more like 200.

There may be a way to get the 2-bit quantized version running even faster on a pair of them.

Comment by arjie 23 hours ago

Thank you. Useful to know. Clipped on top by reduce, I assume.

Comment by CamperBob2 16 hours ago

I think so. The machine I'm using runs at Gen4 x8, while the cards can take advantage of Gen5 x16.

Comment by garethsprice 1 day ago

Using OpenCode + OhMyOpenCode + Qwen 3.6 35B-A3B Q_4_KM on an Ada 4000 (20GB VRAM) at 55 tok/sec for generation (slower than it sounds as OpenCode has a bunch of context it adds). Meaning to check out pi when I get a minute as I hear that one mentioned a lot lately.

I am using Opus to generate plans that the local agent then follows, then validated by Opus. So I'm not at 100% local but these models are increasingly part of my production workflow. Probably not worth doing - yet - unless you are a hobbyist who likes spending time and money tinkering.

This setup is certainly not as "good" as Opus or other frontier models but they are "good enough" for an increasing number of rote tasks. You don't need to drive a Rolls Royce to the supermarket, when a used Corolla gets you there just fine.

It also enables new workflows that would be cost-prohibitive with frontier LLMs (especially as token costs rise) - eg. overnight I use the Chrome devtools MCP and have the above setup fuzz-test as a user for a number of hours and see if it can break things. Even got it working with multi-modal so it can check screenshots, which blows my mind (and not my wallet, as Claude+screenshots burns $$$).

The "12-18 months behind frontier" sounds about right, it's about where I was with gpt-4o and basic harnesses back then. In another 12-18 months my bet is we have Opus-level models that can be run locally for <$5k... but the frontier models will be even further forward (unless governments have blocked them). Fun times.

Comment by Roark66 19 hours ago

No. I've tried all the OS models up to Qwen 480B and Kimi (the biggest models). None come even close to Claude.

I do mostly scripting, devops, data processing and systems stuff (ansible playbooks, managing network devices, deploying new software for various things that involves reading docs, writing helm charts, modifying existing ones etc).

All other models (Gemini, ChatGPT, Grok and all OS models) don't come even close. I'd rather use Sonnet than Qwen.

It's a sad reality. I was thinking about implementing maybe some sort of "sanity checking" by running every prompt twice on two different models doing sanity checking of the first on the second.

Elaborate knowledge systems help a little, but personally I think Anthropic must be doing something "clever" with its models (processing via multiple models etc). Nothing else in my mind explains the discrepancy.
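The "sanity checking" idea mentioned above (one model answers, a second model reviews) could look roughly like this. It is only a sketch: `worker` and `checker` are hypothetical callables that take a prompt and return text, and the PASS/FAIL protocol is an invented convention, not something any particular model guarantees to follow.

```python
from typing import Callable, Tuple

Model = Callable[[str], str]  # hypothetical: prompt text in, reply text out


def cross_check(prompt: str, worker: Model, checker: Model,
                max_revisions: int = 1) -> Tuple[str, bool]:
    """Have `checker` review `worker`'s answer, allowing a few revisions.

    Returns (answer, passed). The PASS/FAIL wording is an assumed convention.
    """
    answer = worker(prompt)
    for attempt in range(max_revisions + 1):
        verdict = checker(
            f"Task:\n{prompt}\n\nProposed answer:\n{answer}\n\n"
            "Reply with PASS, or FAIL followed by the reason."
        )
        if verdict.strip().upper().startswith("PASS"):
            return answer, True
        if attempt < max_revisions:
            # Feed the reviewer's objection back to the first model.
            answer = worker(
                f"{prompt}\n\nA reviewer rejected the last attempt:\n"
                f"{verdict}\nRevise the answer."
            )
    return answer, False


if __name__ == "__main__":
    # Stub models standing in for two different local models.
    def stub_worker(p: str) -> str:
        return "fixed" if "rejected" in p else "draft"

    def stub_checker(p: str) -> str:
        answer_part = p.split("Proposed answer:")[1]
        return "PASS" if "fixed" in answer_part else "FAIL: still a draft"

    print(cross_check("write a helm values file", stub_worker, stub_checker))
```

The cost is at least twice the tokens per prompt, which is easier to justify on local hardware than on a metered API.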

Comment by ac29 18 hours ago

> I'd rather use Sonnet than Qwen

I get this, though the pace of Chinese releases is relentless. Qwen3.7 Plus/Max (closed variants) feel notably better than Qwen3.6, and Minimax M3 is a big jump from 2.7 in capability as well. Both of these families had their previous major release less than 90 days ago.

Anthropic must have Sonnet 5 either waiting or cooking though, they said smaller and larger models than Opus were coming and we already briefly had the larger model.

Comment by hamsterhooey91 17 hours ago

GPT5.5 with Codex is definitely on-par or better than Opus 4.8, GPT5.4 isn't far behind either (source: our dev team uses opus and gpt interchangeably).

I've also used Composer2.5 on hobby projects and it is definitely on-par with Opus 4.8 (thinking mode: medium), but much faster.

Do you think you're getting better results with Claude because your agent stack (skills, MCPs, etc.) are configured for it and not for the others?

Comment by goranmoomin 1 day ago

I'm not using my models locally, but the majority (80% or more) of my coding agent sessions run on open source models, i.e. DeepSeek v4 Pro and Kimi K2.6 with thinking.

A point that I haven't seen come up a lot, but is very valuable to me is that for open source models, I can select the inference provider myself (even if it's not a local GPU), which means that I can enjoy superb speed (i.e. 300 tok/s) while still spending much less than the big providers.

My experience is that if you were fine with the coding models of yesterday (i.e. Claude Opus from Jan/Feb of 2026), you will be fine with either Kimi K2.6 or DeepSeek v4 Pro. Kimi is a bit more smart but has only 256K context and the performance deteriorates (and sometimes just gets stuck) when it fills up the context window. DeepSeek v4 has a 1M context and performs just as well with much less issues. And they both generate very idiomatic code, gives the same vibe of Opus a few months ago.

Since it's also fast (and does not fixate on trying to fix impossible problems, unlike the recent Opus/GPT 5.5 models), a big benefit is that you still control and steer the coding agent and you won't be losing focus like the major models. They are smart, but they don't fixate as much on trying to do stupid things, and since it's fast, you can just interject. It's a much more pleasant experience than the latest models.

I still use the latest models from time to time when I expect the agent to fixate all of the problems and figure out everything themselves, but for me open source models are like 80~90% of all of my sessions.

Comment by jborak 1 day ago

I'm using 4x RTX 5070's and first-gen AMD threadripper (1950X) to run Qwen3.6 27B (MTP) Q6_K with llama.cpp and it works great as a daily driver with Pi. Around 50-60 toks/sec. I also connect a few other applications to it such as OpenWeb UI and recently set up Bifrost, an LLM gateway, to be the primary access point for the models I serve.

I've tried other models such as Qwen3.6 35B A3B and I've found that 27B works better for me when it comes to coding. It's slower being a dense model but the quality seems much better. Inference on my system for Qwen3.6 35B A3B is around 130-140 toks/sec, non-MTP, which is insanely fast!

You don't need 4x 5070's to run Qwen3.6 27B, three or maybe even two will work. However, I use MTP (multi-token prediction) to speed up 27B and that eats up more memory because the draft model requires its own context.

Another thing to keep in mind is that the tools you're using have their system prompts that are loaded into the model for each conversation. When I fire up Pi, working with the model is very snappy at start. When I interact with the LLM via Hermes CLI, it's much slower. That's because each prompt with Hermes is loading so much stuff (skills, tools, etc.) into the context and then it's there forever until the conversation ends.

I like running models at home for privacy, but I also like how there are no quotas, usage isn't a worry. If the future is "loop engineering" then you will be burning through tokens and $$$ using a cloud models.

My system idles around 200W and is around 350-450W when inference load is high. Decoding (token generation) isn't all that efficient, and your GPUs sit idle more than you think during inference. Advancements like diffusion may 1) speed up decoding and 2) let you utilize more of your idle GPU.

Comment by zakisaad 1 day ago

This is interesting to me - why'd you go with the 5070 for your 4x build?

At first thought, they are quite skewed toward compute (vs VRAM), which is great for gamers but not so great for running LLMs.

(I run a 5070 in my desktop)

Comment by jborak 17 hours ago

I already had 2x 5070's that I had purchased a year or more ago, so getting an additional two to fill up the PCIe slots on the motherboard seemed reasonable.

I did some math/shopping as well. To get 48GB of VRAM you can get 2x 3090s but that is $3k. A single 5090 is $4k but has 32GB, great for running models like Qwen 27B but maybe nothing else depending on your model settings. Already having 2x 5070, where each card is around $600, it made sense for me to get two more which was $1200 and the memory speeds aligned.

The best value option if you're building from scratch is go with 5060 ti (16GB VRAM). Each of those cards are $570/each on Amazon, cheaper than 4x 5070's. Only downside is memory speed is slightly slower, but you wind up with 64GB of VRAM and you can run big models and small models alongside each other comfortably.

In my setup I ran Qwen3.5 9B for fast inference on simple things and Qwen3.6 27B Q6 for coding work. But I ran into stability issues, so I use llama-swap to dynamically swap models. But with 64GB of VRAM, you wouldn't have that issue. There is overhead to loading LLMs into VRAM that isn't clear, so having extra VRAM is a helpful buffer.
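The value comparison above reduces to dollars per GB of VRAM. A quick sketch using the prices quoted in this comment. The 12GB per RTX 5070 is an assumption taken from the card's public spec rather than from the comment, and memory bandwidth and PCIe slot count are ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    name: str
    cards: int
    vram_gb_per_card: int
    usd_per_card: float

    @property
    def total_vram_gb(self) -> int:
        return self.cards * self.vram_gb_per_card

    @property
    def total_usd(self) -> float:
        return self.cards * self.usd_per_card

    @property
    def usd_per_gb(self) -> float:
        return self.total_usd / self.total_vram_gb


BUILDS = [
    Build("2x RTX 3090", 2, 24, 1500.0),    # $3k for 48GB, per the comment
    Build("1x RTX 5090", 1, 32, 4000.0),    # $4k for 32GB, per the comment
    Build("4x RTX 5070", 4, 12, 600.0),     # 12GB/card is an ASSUMPTION
    Build("4x RTX 5060 Ti", 4, 16, 570.0),  # 16GB/card, per the comment
]


def rank_by_usd_per_gb(builds):
    """Cheapest VRAM first."""
    return sorted(builds, key=lambda b: b.usd_per_gb)


if __name__ == "__main__":
    for b in rank_by_usd_per_gb(BUILDS):
        print(f"{b.name:15} {b.total_vram_gb:3d} GB  "
              f"${b.total_usd:7.0f}  ${b.usd_per_gb:6.2f}/GB")
```

On these numbers the 5060 Ti build comes out cheapest per GB, which matches the recommendation above.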

Comment by wsintra2022 1 day ago

Reading through these comments, I can't tell any more which are bots posting on behalf of the AI providers trying to dissuade, and which are people who have just had negative experiences with local AI models. IMO, Qwen 3.6 27B 8k quants running on a Mac Studio with 64GB RAM: incredible? No, it is not frontier general super shit, it's just good. That's it, it's good. It's free and private and can take an experienced engineer from being lazy to being really lazy, and that's magic right there. I use llama.cpp and opencode and have great moments of planning some code changes, and letting it run. Walk away. Chill in the hammock, clean the dishes, have a wank, whatever. Use tmux and ssh in and check in on it. THIS is where the incredible comes in. Anyone telling you otherwise, well, check their motives. I have no skin in the game. I just have an easy lazy time.

Comment by epolanski 1 day ago

The software "engineering" field is filled with MIT Leetcode ninjas writing React+Tailwind memory leaking unusable slop, the bar is extremely low.

Comment by cuttysnark 1 day ago

I've had some success with local models by chaining "agents" together in a workflow. Each agent has a different prompt and uses a different ollama model based on what their role is. The project manager, schema agent (qwen3:14b), etc. don't use the same model as the coding agent (qwen2.5-coder:7b). Between each step is an orchestrator with a Playwright task which attempts to surface errors to the agent who introduced the previous code block. Only error-free blocks are forwarded to the next workflow step.

Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent to produce only a manifest based on the task, and to pass that off to the next agent.

In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.
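The workflow described above has a simple skeleton: each step runs an agent, a gate checks the output, errors go back to the agent that produced them, and only clean output moves on. A minimal sketch under stated assumptions: the `Step` type and `run_workflow` are hypothetical names, the agents are plain callables, and a Python syntax check stands in for the Playwright task.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    run: Callable[[str], str]              # agent: input text -> output text
    check: Callable[[str], Optional[str]]  # gate: output -> error text or None


def python_syntax_gate(code: str) -> Optional[str]:
    """Stand-in gate (an assumption): reject output that fails to compile."""
    try:
        compile(code, "<agent-output>", "exec")
        return None
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg}"


def run_workflow(steps: List[Step], task: str, max_retries: int = 2) -> dict:
    """Run steps in order; forward only outputs that pass their gate."""
    artifact = task
    for step in steps:
        error = None
        for _ in range(max_retries + 1):
            prompt = artifact if error is None else (
                f"{artifact}\n\nPrevious error:\n{error}")
            output = step.run(prompt)
            error = step.check(output)
            if error is None:
                break
        else:
            # Retries exhausted: stop here so a human can intervene.
            return {"ok": False, "failed_step": step.name,
                    "error": error, "artifact": artifact}
        artifact = output
    return {"ok": True, "failed_step": None, "error": None,
            "artifact": artifact}
```

On failure the result carries the last good artifact and the name of the step that gave up, which is the natural place to step in when a run is only partly successful.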

Comment by pianopatrick 1 day ago

I wish someone would do a benchmark and competition for this kind of work flow so we could figure out what works well.

Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."

Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.

Like "The Local AI challenge"

Comment by sowbug 1 day ago

Have you (or anyone else) tried letting agents compete? For example, give the same coding task to two models, or to the same model with a different seed, and have the reviewer choose the better result.

Some think the human brain works similarly: thousands of mini-brain cortical columns, each with a slightly different take on the situation, voting in a majority-rules system.
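The competition idea could be sketched like this: every contestant gets the same task and a reviewer scores the results. The names here are hypothetical. `contestants` maps a label to any callable that returns text, and `score` stands in for a reviewer model or a test suite.

```python
from typing import Callable, Dict, Tuple


def compete(task: str,
            contestants: Dict[str, Callable[[str], str]],
            score: Callable[[str, str], float]) -> Tuple[str, str]:
    """Give every contestant the same task and return (winner, output)."""
    results = {name: generate(task) for name, generate in contestants.items()}
    winner = max(results, key=lambda name: score(task, results[name]))
    return winner, results[winner]


if __name__ == "__main__":
    # Stubs standing in for two models (or one model with two seeds).
    contestants = {
        "model_a": lambda task: "def add(a, b): return a + b",
        "model_b": lambda task: "def add(a, b): return a - b",
    }

    def passes_test(task: str, code: str) -> float:
        """Toy reviewer (an assumption): 1.0 if add(2, 3) == 5, else 0.0."""
        namespace = {}
        exec(code, namespace)
        return 1.0 if namespace["add"](2, 3) == 5 else 0.0

    print(compete("write add(a, b)", contestants, passes_test))
```

Scoring with executable tests rather than a second model's opinion keeps the reviewer from sharing the contestants' blind spots, where the task allows it.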

Comment by grmnygrmny2 1 day ago

Just sharing my $0.02 here - I have ethical objections to using OpenAI or Anthropic products so I was a reluctant adopter of LLMs at all. Local models address most, though not all, my moral objections so I’ve been using them for work and personal projects for about a month.

The hardware I have (32gb Macs and a gaming PC with 10gb 3080) can only get me to Qwen3.6-35B-A3B at various quants but that’s enough (200-400 PP, 20-30 TG).

It’s taken some time to learn how to best utilize it - some things take a bit of babysitting or direction - but it’s quite useful. Not having ever used CC I can’t compare but it’s been a great assistant or pair programmer for everything from embedded C++ to Vue. I wish I could run 27B as there have been moments when this model feels like it just can’t quite figure something out but those moments are quite rare. For a lot of tasks it’s a huge time saver and has proved super capable at digging into and fixing bugs given pretty vague instructions.

I’m using Pi as my harness.

Comment by mgsram 1 day ago

I have been using local LLMs for about a year and I have settled now on Qwen3.6 27b dense model in GGUF on Mac Studio with 512G of RAM with open code as the harness and llmster(LM Studio). I have also used the Qwen 3.6 35B-A3B but the dense model's accuracy is next level with the tradeoff being tokens/sec. With the Qwen3.6 27b, I usually get anywhere from 25-40 tokens/second. Initially I used them for simple tools but for the past 3-4 months, I have been actually doing production grade coding in C/C++ (Automotive Software stack) and Python (Tools) with Qwen3.6 27b.

The tokens/sec may be less but that kind of helps me in going at the right pace. The workflow I use for green field development / rewrites is to pair with Sonnet for design/architecture, reasoning and a detailed execution plan. I then feed this piece by piece with precise prompting and that does the job. For brown field, it is often a judgement call. There are occasions when I have found Local models to be limited in their reach and I resort to Claude Code

Some of my recent work using Qwen 3.6: 1. Complete rewrite of Power management Service in C using the existing C++ code as reference 2. Tool to parse contents from really complex specifications in Excel format 3. Tool to translate CJK contents to english for feeding into KG

Comment by russelg 1 day ago

Since you have 512GB, might be worth looking into running deepseek4: https://github.com/antirez/ds4

Comment by mgsram 1 day ago

I have tried plenty of other models with full FP32 as well. However, in terms of balance between accuracy and speed, I found the Qwen 3.6 27B to be the sweet spot.

Comment by jodoherty 1 day ago

I use pi with an RTX Pro 6000 Blackwell to run Gemma 4 31b to do all my agentic coding.

I find it useful.

This side project highlights a similar approach to how I scope and tackle projects at work now:

https://git.theodohertyfamily.com/wg-wrap.git/tree/README.md

https://git.theodohertyfamily.com/wg-wrap.git/tree/CASE_STUD...

You have to apply a lot of careful architecture and TDD to your approach. Eliminate technical risk by tackling hard things early and wrapping them up in a simple, easy to use interface.

I find I can get some projects done 2-3 times faster than if I wrote them by hand. It can also save about 5-10x time on mundane or broadly scoped projects by helping me consolidate and try out ideas very quickly.

Setup-wise, I switch between vLLM using nvidia/Gemma-4-31B-IT-NVFP4 and llama.cpp using unsloth/gemma-4-31B-it-qat-GGUF with MTP. I throttle the GPU power usage to 400W.

My current llama.cpp setup gets token generation rates between 60-150 t/s depending on MTP draft acceptance rates. Prefill is between 1500-4000 t/s depending on context length/depth.

Comment by yesb 11 hours ago

Tried out the wg-wrap tool, might come in handy from time to time. Neat that it was made with a local model.

Some issues:

1. `wg-wrap healthcheck` was all green even though unprivileged user namespaces was restricted via AppArmor (Debian). That check doesn't seem to work

2. DNS doesn't work (no domains resolve) if the config lists multiple servers e.g. DNS = 1.1.1.1, 1.0.0.1

3. Peer endpoints don't support domain names, only IP addresses

4. Minor: the tool doesn't add an implied /32 cidr prefix for single ip configs (common from some VPN providers).
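Issues 2 and 4 above are both config-normalisation problems. A minimal sketch of the handling they imply, written in Python purely for illustration: this is not wg-wrap's code, and the helper names `parse_dns` and `normalize_address` are hypothetical.

```python
import ipaddress


def parse_dns(value: str) -> list[str]:
    """Split a 'DNS = a, b' style value into validated IP address strings."""
    servers = []
    for item in value.split(","):
        item = item.strip()
        if item:
            # ip_address() raises ValueError for anything that is not an IP
            # (e.g. a search domain); a real tool would handle that separately.
            servers.append(str(ipaddress.ip_address(item)))
    return servers


def normalize_address(value: str) -> str:
    """Add the implied /32 (IPv4) or /128 (IPv6) when no prefix is given."""
    return str(ipaddress.ip_interface(value.strip()))


print(parse_dns("1.1.1.1, 1.0.0.1"))     # ['1.1.1.1', '1.0.0.1']
print(normalize_address("10.2.0.2"))      # 10.2.0.2/32
print(normalize_address("10.2.0.2/24"))   # 10.2.0.2/24
```

Issue 3 (domain names as peer endpoints) would additionally need a resolver lookup before the tunnel comes up, which is left out here.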

Comment by jodoherty 10 hours ago

Hey, thanks for the feedback!

If I get some time to circle back, I'll be sure to incorporate these into some new tests and address them.

I want to set up a qemu-system emulator based testing approach so I can incorporate things like AppArmor and SELinux into end to end tests that include different environment configurations.

Part of that will be setting up software defined networking so I can have things like DNS and wireguard VPN servers in a box and then test and evaluate the wg-wrap behavior at the packet level.

Comment by HappySweeney 1 day ago

I have an optane and lots of ram, so I tried full-fat models for writing some function overnight, as I get about 0.7 t/s. My current go-to test is to update a scalar function that transposes a bit-matrix to one using avx512. The cloud models all play with that like it's nothing. Kimi 2.6 and GLM 5.1 both failed miserably.

Comment by macwhisperer 15 hours ago

I code with like a slew of 20+ custom baked models of all sizes, in various fully custom multi-model harnesses that use different bindings...

the harnesses themselves are just as important as the models...different harnesses give different responses with the same prompt, same model...

if you have the 20/mnth claude sub or codex, you really should be using that to build a good local harness for yourself... claude won't be 20$ forever

build the stack first! when you get that new comp with massive ram, youre already set, just run a larger model!

big cloud models are incredibly good at building and teaching about local ai!

have fun in the rabbit hole!

if you are memory constrained like me, check out my custom models https://huggingface.co/macwhisperer

Comment by henrixd 1 day ago

I have been heavily relying on Qwen3.6-27B-UD-Q4_K_XL.gguf -model and Pi agent (https://pi.dev/) for local tasks and coding. I have used llama-cpp-turboquant fork with some custom cherrypicked MTP patches from another fork.

I'm running this on V100 32GB (~900GB/s memory bandwidth) with 200,000 context window, --spec-type mpt --spec-draft-n-max 3 --spec-draft-n-min 0 --cache-type-k turbo3 --cache-type-v turbo3 to mention most relevant parts.

I usually get somewhere 45-60 t/s. I believe that speed could be improved slightly by switching to ik_llama.cpp fork and Qwen3.6-27B-IQ4_NL.gguf -model but there's no turboquant support and it's with some other tradeoffs too.

Comment by GodelNumbering 1 day ago

As someone that spends all day every day talking to LLMs, I'd say the OSS frontier models + a good harness is already a sufficient combo. For local deployments, we are missing one or two hardware generations (and may not get that soon since hardware companies are heavily favoring datacenter segment) to fully move to a local setup.

Comment by blurbleblurble 1 day ago

My experience is that it's not the models themselves that are limiting right now, it's the clunky alternative harnesses with weird missing features making for bad ergonomics around stuff like queue management, interruption, subagents, goals, etc.

Comment by coder543 1 day ago

I agree completely.

It's also annoying that OpenCode doesn't even try to support local LLMs properly.

Getting OpenCode to work is possible, but extremely manual and clunky to configure. I have written a script to automate converting my llama-server configs into an OpenCode config, and that helps, but it's not ideal.
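For readers wondering what such a conversion script might look like, here is a rough sketch. It is not the script described above. The input dictionary is a hypothetical stand-in for whatever your llama-server setup lists, and the output layout (`provider`, `npm`, `options.baseURL`, `models`) reflects my understanding of OpenCode's custom-provider config, so treat the key names as assumptions and check them against the OpenCode docs.

```python
import json


def opencode_config(models: dict[str, dict], base_url: str) -> dict:
    """Build an OpenCode-style provider entry for one OpenAI-compatible server."""
    return {
        "$schema": "https://opencode.ai/config.json",
        "provider": {
            "llama-server": {  # provider id is arbitrary (assumption)
                "npm": "@ai-sdk/openai-compatible",
                "name": "llama-server (local)",
                "options": {"baseURL": base_url},
                "models": {
                    model_id: {"name": info.get("name", model_id)}
                    for model_id, info in models.items()
                },
            }
        },
    }


# Hypothetical model list; the URL assumes llama-server on its default port.
local_models = {"qwen3.6-27b": {"name": "Qwen3.6 27B (local)"}}
config = opencode_config(local_models, "http://127.0.0.1:8080/v1")
print(json.dumps(config, indent=2))
```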

I have seriously considered writing Yet Another Coding Harness in my free time. I have some ideas for what would make it nice.

Comment by wsintra2022 1 day ago

Not my experience at all. Mac Studio 64g, running Qwen2.7b 8K. Took ten minutes to get up and running, just read some documentation, Unsloth literally walks you through it. For OpenCode just edit one file and it's good to go. Have not had any issues (besides the occasional LLM related one). Not extremely manual and clunky at all.

Comment by zackify 1 day ago

You have to try pi.dev. You can already make it do anything you want. I use opus to customize and tweak parts of it. It's the best harness due to the entire thing being api driven for customization.

Comment by horsawlarway 1 day ago

Pi is decent.

I've used the cli agents for claude, cursor, and pi, plus several custom harnesses I've written myself from time to time as experiments (and I guess technically gastown, if we're calling that a harness).

Pi is... just fine.

It does what I need it to, has a decent selection of tooling out of the box, integrates nicely with other tools, and generally gets out of my way enough that I don't think about it much anymore.

If you can run ~30b models at decent speeds, I think most folks would be pleasantly surprised at how capable they are with pi.

Tack on some of the extensions (ex https://pi.dev/packages/pi-mcp-adapter?name=mcp and https://pi.dev/packages/pi-web-access?name=search) and I get web tooling (ex - perplexity search), access to mcps to do things like drive chrome (https://browsermcp.io/) or firefox (https://github.com/mozilla/firefox-devtools-mcp)

It's fine. Is it as good as a subsidized top tier model? Nope. Is it free and still very capable? Yup.

And personally, I've been having a LOT of fun with the pi sdk (https://pi.dev/docs/latest/sdk)

Which is something that all the other providers charge you api access rates for (ex - thousands a month).

Comment by Insanity 1 day ago

Heard good things about pi.dev but haven’t tried it. It might take care of some of those missing features you mentioned.

Comment by bityard 1 day ago

pi.dev is more like an agent developer kit. It's basically a substrate upon which you spend hours/days/weeks building your own agents or coding framework. It's pretty much the neovim to claude's vscode.

Comment by horsawlarway 1 day ago

I mean - the base experience is just fine, with perfectly reasonable built in tools for file access and editing, plus bash.

But yes - it expands a lot if you're willing to play with it.

I'd actually say the vscode comparison is wrong, because vscode is very much "bring your own extension" in the same way that Pi is. While Claude is much more "visual studio" vibes. It's thick, it's opinionated, and it's absolutely not something you can really customize, but it can feel slick for supported workflows.

Comment by pianopatrick 1 day ago

I wish someone would do a benchmark and competition for this kind of work flow so we could figure out what works well.

Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."

Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.

Like "The Local AI challenge"
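A minimal sketch of how the three proposed metrics could be computed per contestant. The `Run` fields and the one-hour default are taken from the description above; everything else (names, the decision to reject over-time runs, reporting the metrics separately instead of as one weighted score) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Run:
    total_questions: int
    answered: int
    correct: int
    seconds: float


def score(run: Run, time_limit_s: float = 3600.0) -> dict[str, float]:
    """Return the three proposed metrics; no weighting between them is assumed."""
    if run.seconds > time_limit_s:
        raise ValueError("run exceeded the time limit")
    return {
        "pct_answered": 100.0 * run.answered / run.total_questions,
        "pct_correct": 100.0 * run.correct / run.total_questions,
        "minutes": run.seconds / 60.0,
    }


print(score(Run(total_questions=50, answered=40, correct=30, seconds=2700)))
# {'pct_answered': 80.0, 'pct_correct': 60.0, 'minutes': 45.0}
```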

Comment by cheekygeeky 1 day ago

Our software dev (smartest guy I ever met) is using OpenCode and Tmux with Open Source models. He says DeepSeek is his model of choice for coding (he calls it "pretty GOOD"). He's running two 3090s on an i9 with 128GB RAM. https://www.msn.com/en-us/news/technology/china-s-open-deeps...

Comment by _bobm 1 day ago

But, guys, when you say Claude/ GPT models, do you stop to think what are these "models"?

One day I thought about how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself. Just think about it.

As a matter of fact, think about these operations, api endpoints, observe their output.

These so called SOTA models are not what meets the eye, and are not at all comparable in the infra department to local models. There is crazy orchestration going on due to the scale of these operations. But also these hard constraints lead to innovation. Innovation nobody speaks about.

I wouldn't say we cannot catchup, but serving our local models through llama, vllm is just the A, B, C of it all. In reality I think what is needed is a replication of said orchestration which I hinted at above.

The SOTA models are a deep orchestration of multiple models operating together it isn't a single model. As such no single model ever will catchup to them until it replicates through training first and then maybe through model architecture this orchestration.

Finally, I would wager that the SOTA "models", as one of these models in this orchestration setup, as served for general consumption, are not so much more capable than qwen 3.6.

I am sure that if you change your perspective you will start noticing the scale of the "magic".

Comment by XCSme 1 day ago

> The SOTA models are a deep orchestration of multiple models operating together it isn't a single mode

I don't understand, why does it make you think this is the case?

> how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself

Can you give an example?

Comment by _bobm 1 day ago

> Can you give an example?

Sure, connect opencode to an openai/chatgpt endpoint and use it. You will notice multiple "thinking" parts per "turn".

I put all of these in quotation marks because... they are part of the orchestration game. For example, it is not known if the thinking parts of a particular turn are chain of thought thinking summaries or just a plain response which is masqueraded and thus orchestrated into appearing as thinking.

Further notice the cadence, word choice and sentence formation. Notice sentence construction. Notice "thinking part" construction and sequencing.

There is pretty heavy orchestration.

> I don't understand, why does it make you think this is the case?

Because not all tokens are equal. And if you waste expensive tokens on mundane tasks you will go out of business. This is the reason.

As I said, if you observe the output from these api endpoints you will notice it.

Comment by XCSme 1 day ago

> You will notice multiple "thinking" parts per "turn"

I thought that was the code harness simply minifying the outputs. Many models now no longer return the entire chain-of-thought (to avoid distillation attacks). So yes, we don't get the raw LLM output, but I think it's just the thinking summarized, not a complex orchestration or different models.

I do agree though that now cloud models are kind of a black box, that's not only obfuscated but also changes over time. Companies seem to be changing model capabilities without notifying users, or even hiddenly serving completely different models. This is even worse via OpenRouter, with providers serving open-source models, some of them serve heavily quantized versions or even completely different models.

Comment by _bobm 1 day ago

idk what is "minifying outputs" in the context of what we are talking about. Opencode is opensource, you can find out what it is doing.

Last time I checked, OpenAI even sends (in the response) the summary of the thinking part already in markdown, so opencode has to remove the formatting to format it to their liking.

> Many models now no longer return the entire chain-of-thought (to avoid distillation attacks).

This is what they say: to avoid distillation attacks. And to some large extent this is true. I am saying there is a side- effect and this side- effect (depending on how tin-foilly you want to go) may be either a nice thing to have or it may be the "main reason" for all of this.

The side effect is splicing the inference, brokering requests, and what not, which brings huge benefits at scale.

This was my original point: openweights model to a sota model may be apples to oranges. So when will a local model catchup with its single cot run which is not even shaped properly: well never.

It is apples to oranges.

Comment by XCSme 1 day ago

So, are you saying that local models are maybe better than we give them credit? Because with some extra orchestration/processing we could improve the results?

Comment by _bobm 1 day ago

Yes, local models have already all that is needed, they have all the prerequisites.

But what they do not have is the correct shape, the correct approach. This is missing and it shows on multiple scales: it shows in the COT, it shows in the output itself, it shows in the infra to serve the models, it shows in the model orchestration.

This is what anthropic said one year ago:

> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.

Comment by JSR_FDED 1 day ago

This all sounds very mysterious

Comment by _bobm 1 day ago

Yes, but it isn't.

Comment by acc_297 1 day ago

I've been wondering lately if it would help to take a medium sized model and either in cloud or some local setup actually do Reinforcement Learning from Human Feedback (RLHF) on every prompt as a chore - I don't know if trying to manually finetune a model to your use habits would ruin it or help - ideally if you were diligent you could get rid of some of the tics that make models for the general public difficult to work with e.g. overly sycophantic, overly verbose, annoying tendency to explain via analogies

but perhaps one individuals prompt feedback just isn't going to ever be enough I'm not sure how much you need (I know people working at big companies that have purchased in-house agents fine-tuned on internal documents etc.. and apparently these end up with bizarre behaviours not necessarily more helpful than the standard models)

I'd like to be able to essentially edit every response given by an agent and then finetune on the difference between what it produced and how I edited the text. Personally I would just remove a lot of the adjectives and try to distill the responses to core responses but I worry based on some of the work done by Owain Evans and other alignment researchers that this can sometimes push agents into tricky-to-predict tendencies.
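One way to capture "what it produced vs. how I edited it" is to log each edit as a preference pair. The sketch below assumes the `prompt` / `chosen` / `rejected` record shape commonly used for preference tuning; the function name, the `min_change` threshold, and the choice to drop near-identical edits are all hypothetical. It only builds the dataset and says nothing about whether one person's edits are enough to train on.

```python
import difflib
import json


def preference_record(prompt, model_output, edited_output, min_change=0.02):
    """Return a prompt/chosen/rejected dict, or None if the edit is trivial."""
    similarity = difflib.SequenceMatcher(None, model_output, edited_output).ratio()
    if 1.0 - similarity < min_change:
        return None  # nothing meaningful to learn from this pair
    return {"prompt": prompt, "chosen": edited_output, "rejected": model_output}


record = preference_record(
    "Explain what this function does.",
    "Great question! This truly elegant function is like a chef sorting spices.",
    "This function sorts the list in place.",
)
if record is not None:
    print(json.dumps(record))  # one JSONL line per edited response
```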

Comment by htrp 1 day ago

Cursor is doing that (i think with Fireworks as their provider)

https://cursor.com/blog/real-time-rl-for-composer

Comment by rolisz 1 day ago

I'm interested in trying something similar. I was thinking to do this for my OpenClaw agent.

About Owain Evans work: I think he did SFT. On Twitter someone was saying that RL is not as susceptible to what he showed. I'd like to try that

Comment by bravetraveler 1 day ago

I'm largely 'all natural', any of my little LLM usage is local. 128G Strix system, a not-super-dense Qwen or Gemma variant will get 50-80 tok/s output. Not subscribing to Anthropic/OpenAI/etc even in the unlikely event these are the last local models released; simply not needed. Entirely fine without and in-model tool usage covers my currency concerns.

Comment by K0balt 1 day ago

Pretty good results with qwen 3.6 27b dense. I’d say it’s about equal to (Claude) haiku 4.5 maybe sonnet depending on the task.

Comment by kadoban 1 day ago

What tool do you use to drive things for you, out of curiosity?

Comment by K0balt 1 day ago

I use Claude code. You can use it with any model you want to.

Comment by kandros 1 day ago

I’d rather ask my butcher than Haiku for coding tasks

Comment by K0balt 1 day ago

I’d say when qwen works it works like sonnet, when it fails it fails like haiku. So it’s less consistent but works pretty well, I guess? It’s still overall pretty useful for a lot of stuff, and I can run it directly on my MacBook. Once you get an idea of what it can and can’t bite off, it’s pretty easy to break things into chunks it will handle reliably with grace. But I still like to have access to SOTA models for review. Also you can have a SOTA model write a development plan that is basically a bunch of prompts to generate each part, then have the local model follow the plan.

I should mention not to run it at less than q6, I prefer q8.

Comment by papichulo4 1 day ago

Agreed on this. Anthropic has now changed the verbiage on the definitions of the models under `/model` to say that Opus is for everyday usage, and Sonnet is for routine tasks.

There's apparently a reason Sonnet and Haiku have been left in previous version #s.

Still encouraging, though, that things are catching up. We can't expect $20k local setups to match $20bn compute clusters.

Comment by nfrankel 1 day ago

I tried. It works in theory: https://blog.frankel.ch/tokensparsamkeit-coding-assistants/#...

Results depend on the model, of course, and your computer is the limit. Mine wasn't up to the task, unfortunately.

Comment by big-chungus4 1 day ago

I can run Qwen3.6-35B-A3B at 20 TPS on my laptop with RTX 5070 Ti, with partial offloading to RAM. But the most I do is mess with it when I'm bored. I do coding by hand, but I often run autoresearch loops using free models, right now it's MiMo code. Autoresearch often requires my GPU, so it wouldn't be feasible to do when all of my GPU is used up by a local model. For mundane tasks like extracting and formatting specific structured text, I use Gemini in Google search

Comment by neuropacabra 21 hours ago

I went for this one https://huggingface.co/yuxinlu1/gemma-4-12B-coder-fable5-com... and it seems very fast and reasonable... don't expect a 100% replacement, but a lot of things can be done with local LLMs today.

Comment by rsolva 18 hours ago

We have set up two DGX Sparks at work and are self sufficient for our AI needs. It is not SOTA, but it works really well for our needs. No matter what happens around cloud-hosted AI in the future, we will have decent in-house AI without further investments or expenses. We are a company of 24 people.

Comment by vfalbor 21 hours ago

I have tried it and I use it. I think it's going to become the standard way of operating, especially when they start charging us an API fee, which is supposedly the real cost. But of course, with how much they charge for the token and depending on the model, there are so many factors that I think the future is heading towards local models. I believe there are good models out there, and the key is the concept of "pruning," where you select the layers that interest you most and try to reduce the hardware cost of these types of models. The Qwen and Gemma models have been discussed here, but Kimi, which is a fairly powerful model with an efficient pruning system, could be your perfect free co-pilot in terms of coding, and could coexist with the more powerful Opus or Gemini models. The key concept is skills that make this process transparent.

Comment by mitchell_h 1 day ago

Tried. The context windows just weren't big enough.

Comment by coder543 1 day ago

Qwen3.6-27B supports a 1 million token context window.

Of course, you have to have the right hardware to be able to run with a context window like that, as it takes about 100GB of memory on my DGX Spark to do that with full f16 KV cache on the q4_k_xl model.
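For a rough feel of where that memory goes, the usual KV-cache estimate is 2 (K and V) × layers that keep a cache × KV heads × head dimension × bytes per element × tokens. The sketch below uses made-up architecture numbers purely for illustration; they are not Qwen3.6-27B's real configuration, and the result excludes model weights and compute buffers.

```python
def kv_cache_bytes(n_tokens, n_kv_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Estimate KV cache size: one K and one V vector per layer, head and token."""
    return 2 * n_kv_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# Hypothetical architecture values (assumptions, not taken from any model card).
size = kv_cache_bytes(
    n_tokens=1_000_000,
    n_kv_layers=16,    # layers that actually store K/V
    n_kv_heads=4,
    head_dim=256,
    bytes_per_elem=2,  # f16
)
print(f"{size / 2**30:.1f} GiB")  # 61.0 GiB
```

Halving `bytes_per_elem` (an 8-bit cache is roughly that) halves the estimate, which is why KV quantization comes up so often in this thread.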

Comment by lysace 1 day ago

Got a similar result (my RTX 4070 only has 12 GB). I'm curious about whether 24/32 GB meaningfully improves this enough to make it useful.

Comment by tobyhinloopen 1 day ago

Try it on RAM and CPU.

It’s slower but you can run them.

Comment by lysace 1 day ago

Good idea for evaluating the models, thanks.

Comment by deadbabe 1 day ago

Prompt more directly instead of open ended.

Comment by ljosifov 1 day ago

Not replaced but supplemented. For off-line coding current setup is pi + ds4-server + DeepSeek-V4-Flash REAP25 (on M2 Max 96gb). For simpler programming related (e.g. text2sql) as well as synthetic data generation, current best for me is llama.cpp + Gemma-4-26B-A4B (on gpu 7900xtx 24gb; sometimes nemotron-cascade-2-30b-a3b for 1M context). That and (dabbling now) auto-research uses lots of tokens. Used to get paused running out of token quotas all the time. The 1st local model I found somewhat useful to me was glm-4.7-flash, and it's gotten way better since. Recently between OpenCode Go choice of models at many price points, and DeepSeek-V4 dropping the IQ/$$$ by multiples, have become less reliant on local llms for this auxiliary work. Claude I use but with Zai GLM-5.2 subscription. And maintain GPT subscription for quality models.

Comment by moezd 1 day ago

Not yet. Without pure Apple game or decent GPUs, even with a lot of RAM and threads, all you get is about 30-50 tokens/second, and that's with thinking turned off. Without these optimizations your model will have a field day with your MCPs, skills and agent descriptions and you will watch the paint dry before seeing the first output token. Local model serving means you have to fight for every token in your context window, which is quite the opposite of what Claude/GPT/Copilot are pushing the industry towards.

Comment by amarshall 1 day ago

Thinking doesn’t change output speed. Anthropic’s models are ~ 40–60 t/s median output speed.

Comment by moezd 13 hours ago

Do you have access to Anthropic model weights to run them locally?

Comment by amarshall 7 hours ago

No, and having that is not required to know output speed nor the effect of thinking, so I don’t see the point in such a superfluous, indirect question.

As for the question you’re likely asking: benchmarks that include speed across many models and providers available at various places e.g. https://artificialanalysis.ai/leaderboards/models

Comment by zftnb666 1 day ago

I replaced Claude with DeepSeek V4 Flash via API. Not local, but 95% the quality at 5% the price. Close enough.

Comment by xmstan 23 hours ago

Yes, we use Qwen 3.6 27B Q6_K. We use it on Radeon R9700 32GB and it delivers 50tps with MTP. We compare it to Sonnet from 4-6 months ago when it comes to output. Totally usable for daily coding.

Comment by 3abiton 1 day ago

I think nearly everyone mentioned Qwen, so my turn I guess. Qwen 3.6 35B Q8 (MTP), on a Strix Halo, with llama.cpp. Around 40-50 t/s. Really great performance, I always get surprised by its capability. I use it with forge-code directly in zsh. For long context (150k+) it starts degrading and forgetting.

Comment by bijowo1676 1 day ago

One of the interesting setups I saw is using expensive frontier models to write and update markdown for your app: specs, product requirements, architecture, etc

but then use cheap/local model to implement the specs.

Markdown is more effective at compressing information and fits the context window easier, than hundreds of source code files

but this requires second and third passes, to smooth out the rough edges

has anyone tried that?
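A minimal sketch of the "frontier model writes the spec, local model implements it" loop. It assumes the plan is Markdown with one `## ` heading per step and that the local server speaks the OpenAI-style chat format (as llama-server's `/v1/chat/completions` does). The function names, the system prompt, and the model id are hypothetical, and sending the request is left out.

```python
def split_plan(markdown: str) -> list[str]:
    """Split a Markdown plan into steps, one per '## ' heading."""
    steps, current = [], []
    for line in markdown.splitlines():
        if line.startswith("## ") and current:
            steps.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        steps.append("\n".join(current).strip())
    return [s for s in steps if s]


def build_payload(step: str, model: str = "local-model") -> dict:
    """Build one chat-completions request body for a single plan step."""
    return {
        "model": model,  # placeholder id; use whatever your server reports
        "messages": [
            {"role": "system", "content": "Implement exactly this step. Do not redesign."},
            {"role": "user", "content": step},
        ],
    }


plan = "## Step 1\nAdd the Article model\n\n## Step 2\nAdd the list view\n"
for step in split_plan(plan):
    print(build_payload(step)["messages"][1]["content"].splitlines()[0])
# ## Step 1
# ## Step 2
```

Any text before the first heading becomes its own step here; the second and third review passes mentioned above would sit on top of this loop.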

Comment by CuriousRose 1 day ago

An equally important issue with local AI use (not coding specific) is ensuring that the harness has fast and up to date data if recency is important in your queries (new package features, docs, etc). Hosted models do web search incredibly well and I think this is a huge part of output quality.

I don't use locally hosted models anymore due to hardware constraints, but I do have some degree of search anonymisation attached to my OpenCode and OpenRouter connected open models.

On my Macbook I run OrbStack that has the following docker containers set to route through a Mullvad based gluetun.

- Firecrawl - fast web scraping

- SearxNG - metasearch

- CloakBrowser - Turnstile-bypassing Playwright alternative

If you wanted to get fancy with the proxy rotation, you could setup numerous instances of Playwright each with their own Mullvad wireguard key in different locations.

Comment by heisenbit 1 day ago

I think it is work to set up but I'm also learning a lot setting it up. Mainly using qwen/qwen3.6-35b-a3b mlx with my 48GB M4 MBP which leaves me just enough headroom for docker dev-container and other basics. I use LM Studio to run it and am using it via VSCode. What made a big difference was a system prompt that improved the tool integration (I asked GPT for guidance on that). Before that it was not making edits but regenerating code, which often messed things up more than it helped.

I mostly run my MBP on low power even when it is plugged in to avoid the noise and heat. Full power maybe doubles speed but more than doubles power.

What can it do: Simple restructuring of pages. Where did it and other models fail: Splitting up Pinia store which GPT-5.4 did without fail. I think with more tuning, guidance for tool use and maybe some support tooling around it performance can increase further.

Comment by SupLockDef 1 day ago

Local isn't new for me. I am still coding my stuff, but Qwen3-coder:30b on my old rig with a gtx 1070 16gb RAM does wonders for me.

I mostly use it as a google search if I forget a thing, or doing the boilerplates.

I am using a mix of a non harness chat for the reply speed, and opencode / vim-ai for my boilerplates.

$0.00 / month. That's the budget.

Comment by jboss10 1 day ago

Have you tried qwen3.6 or pi?

Comment by SupLockDef 1 day ago

3.6 is too slow on my old rig for some reasons, so I went back to qwen3-coder.

I did try 3.6 on my main desktop. It was good, but I didn't see much differences than coder, so I am still using my old rig.

Comment by ryandrake 1 day ago

Always a bit disappointed in the details in these kinds of threads. When you do get answers, they're never specific enough to try out on your own. It'll be something like "I use Qwen 3.5 and get great results!" OK but what quantization are you using? What llama parameters? What context size? What GPU are you running it on, and how much VRAM does it have? Are you hosting it on a separate box, or running it locally on your dev machine? What coding agent tool are you using, and how is it configured / hooked up to the model?

Comment by riazrizvi 1 day ago

All you get here is some market signal from 1 or 2 posts if you already know how to do it. Most of these responses are garbage.

Comment by porkloin 1 day ago

I have good results with this setup:

Hardware:

- GPU: AMD 7900xtx, 24gb vram

- CPU: AMD 5950x, AM4

- RAM: 64gb DDR4 3600

Software:

- OS: Bazzite (atomic fedora - this machine is running Steam "big picture" mode on my TV when not in use for LLM tasks)

- Virtualization: Podman Quadlets, which allows me to run container images as managed systemd units

- Network: tailscale

- Inference: llama.cpp vulkan (better performance than ROCM, though I'm keeping an eye on it in the future)

- LLM API surface: llama-swap (running as a podman quadlet exposed via tailscale svc) allows running multiple models on a single endpoint.

- Web/Chat Access: open-webui (running as podman quadlet exposed via tailscale svc) allows me to access any of the models I'm using for coding harness access for chat/general purpose queries via web browser. I also have the "conduit" app for my iPhone that allows me to hit the same models from my phone.

Models:

- Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf - Unsloth Q4 quant of the qwen 3.6 27B model weights, with MTP enabled. MTP is important as it improves the speed the model can run at.

- Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf - Unsloth Q4 quant of 35B-A3B. Not MTP right now because I was having some issues with it?

- gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf - Gemma 4, which I use sometimes via open-webui instead of Qwen, but I generally think Qwen does a better job

Flags (specific for Qwen 27b, since that's primary model):

- `-ngl 99` offload all layers to GPU

- `-c 80000` 80K context window. I'd like this to be higher, but since my GPU also has to run the desktop session for the machine, I need to leave some VRAM overhead to keep the desktop from OOM-ing

- `-np 1` single slot (no parallel request handling)

- `--no-context-shift` error instead of silently sliding the context window when full

- `--cache-reuse 256` reuse cached prefix in chunks of 256 tokens (prompt cache)

- `-b 2048` logical batch size (tokens per submission)

- `-ub 1024` physical micro-batch (per GPU pass)

- `--cache-type-k q8_0 --cache-type-v q8_0` symmetric 8-bit K/V cache. Q8 is as low as I've been able to go without getting some issues with tool calling

- `-fa on` flash attention

- `--spec-type draft-mtp` use the model's built-in MTP as the draft model

- `--spec-draft-n-max 3` propose up to 3 draft tokens per step

- `--spec-draft-n-min 0` allow zero drafts if confidence is low

- `--spec-draft-type-k q8_0 --spec-draft-type-v q8_0` KV quant for the draft path

- `--reasoning-format deepseek` parse <think> blocks in proper format

- `--chat-template-kwargs '{"enable_thinking": true}'` turns Qwen's thinking mode on by default (clients can override)

- `--jinja` use the GGUF's Jinja chat template

- `--temp 0.6` moderate randomness (Qwen recommended value for coding)

- `--top-p 0.95` nucleus sampling (Qwen recommended value for coding)

- `--top-k 20` top-20 candidates (Qwen recommended value for coding)

- `--min-p 0.0` disabled (Qwen recommended value for coding)
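Put together, the flags above form one launch command. The sketch below just assembles them so quoting (especially the JSON argument) comes out right. The `llama-server` binary name and the `-m` model flag are assumptions not stated in the list, the model path is a placeholder, and the flags are copied as listed without being checked against any particular llama.cpp build.

```python
import shlex

# File name from the model list above; the directory it lives in is up to you.
MODEL = "Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf"

args = [
    "llama-server",
    "-m", MODEL,
    "-ngl", "99",
    "-c", "80000",
    "-np", "1",
    "--no-context-shift",
    "--cache-reuse", "256",
    "-b", "2048",
    "-ub", "1024",
    "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
    "-fa", "on",
    "--spec-type", "draft-mtp",
    "--spec-draft-n-max", "3",
    "--spec-draft-n-min", "0",
    "--spec-draft-type-k", "q8_0", "--spec-draft-type-v", "q8_0",
    "--reasoning-format", "deepseek",
    "--chat-template-kwargs", '{"enable_thinking": true}',
    "--jinja",
    "--temp", "0.6", "--top-p", "0.95", "--top-k", "20", "--min-p", "0.0",
]

command = shlex.join(args)  # shell-safe string; the JSON gets single-quoted
print(command)
```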

Performance (27b, primary model):

- ~65t/s for token generation

- ~600 t/s for prompt processing.

- If these numbers don't mean much to you, perceptually this feels about on-par with cloud model speed, maybe slightly faster.

- ~30s cold start when swapping from a different model or starting up session from idle via llama-swap.
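To translate those two rates into wall-clock time per turn, a back-of-the-envelope estimate is prompt tokens divided by the prompt-processing rate plus output tokens divided by the generation rate. The token counts below are made-up examples, and the estimate ignores prompt caching, cold starts, and variation in MTP acceptance.

```python
def turn_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Rough time for one turn: prefill time plus generation time."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# Hypothetical turn: 20k tokens of fresh context, 500 tokens of output,
# at the ~600 t/s prefill and ~65 t/s generation rates quoted above.
print(f"{turn_seconds(20_000, 500, pp_tps=600, tg_tps=65):.1f} s")  # 41.0 s
```

In this example most of the time is prefill, which is why prompt caching and a small system prompt matter so much for agentic use.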

I have llama-swap set up to unload the model after 10 min of idle, because I sometimes use this machine for gaming as well. A little annoying, but a small price to pay to be able to use the machine for other stuff (gaming) when I'm not using it with coding tasks.

CLI/Harness:

- Crush harness (https://github.com/charmbracelet/crush) less feature rich than Claude Code, but with a smaller system prompt and better built-in LSP support. I point it at the tailnet DNS (https://llama.<tailnet>:<port>)

- Headroom (https://github.com/chopratejas/headroom) to maximize the 80k context window

- Exa MCP for web search (https://exa.ai/) this alone makes the model far more useable. It's shocking how often the official claude code or codex harness get botblocked on web fetches, and the results of a good web fetch can be the difference between a good turn and a bad turn.

A lot of people get hung up on whether Qwen 3.x models are "as smart as" some parallel Anthropic model. Most people seem to agree it's somewhere between Haiku 4.5 and Sonnet 4.5. Personally, I think the biggest thing that makes the Qwen 3.x series of models _feel_ good to use for coding workflows is that it's the first time that tool calling actually works consistently on local models. If tool calling is busted even 5% of the time, it can totally ruin the flow. I think that's also why people tend to say the "harness is more important than the model" or whatever. I have a few other models set up but 27B with MTP is the best compromise of speed and quality that I've found.
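The 5% point is easy to underestimate, so here is a small worked example. It assumes tool calls fail independently, which is a simplification, and the call counts are made up.

```python
def clean_turn_probability(per_call_success, calls):
    """Chance that every one of `calls` independent tool calls succeeds."""
    return per_call_success ** calls


for calls in (5, 20, 50):
    print(calls, round(clean_turn_probability(0.95, calls), 3))
```

Under that assumption, a task needing 20 tool calls at 95% per-call reliability completes without a single botched call only about 36% of the time.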

This setup works well enough for me that I dropped my personal Claude Code subscription. At work I'm still using frontier models, but personally I don't feel like I need that much power for anything I work on in my personal life. I'm "lucky" that I made the random financially unwise choice to buy a 7900XTX in late 2022 for $1k as a gaming card. I had no clue it would actually be a pretty decent LLM card 3-4 years later.

Edit: sorry for the horrible formatting, I always forget that HN doesn't actually do markdown :(

Comment by ryandrake 1 day ago

Now that's what I'm talking about! Very cool, thank you for the detailed response.

Comment by nake89 1 day ago

I have an RTX 4060 12gb vram. Qwen3.6 35b. I stopped paying for Github Copilot. But I wouldn't say I replaced frontier models with a local one. I still have some dollars in my openrouter when I need to. Also to get interactive agentic coding speeds I need a high tps. So my quant is very small. And I would say a coding harness that is fully extensible is a must to create fully custom workflows tailored for low specs. I use pi (not perfect, still found some hard coded, non-extensible parts)

Comment by anubhav200 1 day ago

Yes, llama.cpp, qwen27b, 35b, claude code. Llama-cpp-manager for managing llama.cpp configs (https://github.com/anubhavgupta/llama-cpp-manager)

Comment by anubhavgupta 1 day ago

Machine: CPU: intel 275hx GPU: Nvidia 5090 Mobile (24GB) RAM: 64GB

Comment by anubhavgupta 1 day ago

One more thing, I also use it along with Whisper-NPU, a speech-to-text utility that runs on the NPU of the Intel 275hx and doesn't consume any GPU resources.

Comment by anubhavgupta 1 day ago

Whisper-NPU (https://github.com/anubhavgupta/whisper-npu)

Comment by milchek 1 day ago

I’ve tried in a 36GB MacBook Pro and haven’t had much success beyond very basic work. Issue for me was the context runs out quick even with smaller models and it’s slower. To get some half decent performance I’d imagine you want 128gb memory and are spending a lot more on hardware. At that point it becomes a question on whether you’d rather have frontier models at a subscription or sink that money into your own equipment. Of course, for those with privacy in mind your only option is forking out the cash for the higher end machines.

Comment by BiraIgnacio 1 day ago

I tried for a bit, with llama.cpp + Qwen + Mac Pro but the results were very poor (both quality and speed).

I considered investing in better hardware but doing the math, it is cheaper for me to pay for DeepSeek (yeah, I know not everyone can do that).

Comment by wuschel 1 day ago

I would like to know whether someone was able to use lower tier models for activities other than coding e.g. a limited version of a personal note manager - and what the hardware requirements in RAM for these models were.

Comment by boringg 1 day ago

Will the AI labs always make sure there is at least a year's worth of differential? I guess the underlying business premise is that each new release has a step function change that prevents this kind of behaviour.

Comment by snoman 1 day ago

If the government is going to gate access to frontier models from here on out, even if new releases are a step function change… which they’re not… then it may be even more comparable to what’s available with a subscription.

Comment by NetOpWibby 1 day ago

I'm looking forward to having Claude Fable at home. THAT is when I'll THINK about replacing Claude (who knows what their next models will be capable of, Fable was damn good for the three days I had it).

Comment by trueno 1 day ago

we keep moving the goalposts on when we're gonna be happy with local. first it was sonnet at home as the good enough, then opus, now it's the mysterious leading model that runs on infrastructure we can't feasibly have at home

Comment by NetOpWibby 17 hours ago

I don't know about "we" but for me, I've never been happy with any of the models to bother with learning how to run one at home. I've got a Turing Pi and a bunch of other gadgets just sitting in a box (well the former I've got running after owning for several years of non-use).

Comment by zaptheimpaler 1 day ago

I tried gemma-4-26B-A4B just to see if it could help me read/sort my emails on a relatively under-powered setup (16GB VRAM + 32GB RAM) and it's not going well.. the model burns 24K tokens just on searching for the right tool and then dumps the email contents into context - i tried to get it to use code-mode to save context but the code-mode implementation can't save files so it was useless and im going to try to switch to "ssh-mode" into my devbox container. Still relatively new to this, so I'm probably doing something wrong

Comment by anana_ 1 day ago

Perhaps try a different model? Just from anecdotal experience, I find that the Gemma models smaller than 31B do not tool call as often as they should.

Some of the benchmarks appear to back this up [0]

Of course, a lot depends how you are using it (inference parameters, harness, prompting, etc.), but the model is quite important too.

[0]: https://artificialanalysis.ai/models/open-source/small?model...

Comment by Rzor 1 day ago

So there was a problem with gemma 4 when it comes to tool calling that Google apparently fixed like 2 or 3 days ago. I remember reading something about this.

Comment by etoxin 1 day ago

I have not. We use openspec with our projects at work. To try and simulate a local rig without spending big cash. I use the hosted models and pay for them with the latest popular local model.

Most small local models don't get tool calling right; however, the larger models are now doing this correctly.

One thing local has not accounted for, is most productive engineers are running multiple cli chats at a time with git worktrees. I normally hover around 3 worktrees + cli-chats.
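For anyone who has not used that workflow, a rough sketch of the setup: one `git worktree add` per task, each on its own branch in a sibling directory, so every CLI chat gets an isolated checkout. The repository and task names are hypothetical, and the commands are only printed, not run.

```python
import shlex


def worktree_commands(repo_name, task_names):
    """Return one `git worktree add` command per task (run from the repo root)."""
    commands = []
    for name in task_names:
        path = f"../{repo_name}-{name}"  # sibling directory, one per task
        commands.append(["git", "worktree", "add", "-b", name, path])
    return commands


commands = worktree_commands("myrepo", ["fix-login", "add-tests", "refactor-db"])
for command in commands:
    print(shlex.join(command))
```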

Comment by dabinat 1 day ago

There’s evidence that combining models can achieve frontier-level performance (e.g. OpenRouter Fusion). I’m wondering if that’s the more realistic option: combine Opus with a local model to save on token costs.

Comment by rvnx 1 day ago

I start to believe that adding more and more and more and more and more thinking tokens is the hack that works (this is what gave birth to Fable)

Comment by utopiah 1 day ago

Why would you not think that?

It seems pretty intuitive that pouring more resources into a problem (more GPU, bigger GPUs with more VRAM, bigger datasets, better curated datasets, more efficient ways to train, more efficient way to run inference, etc) then running the result for a longer time, with more layers of verification (running in VMs, model fusion comparing multiple models, having harnesses with testing) will at least lead to marginally better results.

Whether it is worth it, and at what pace it will keep improving, are different questions, but I have little doubt that if the industry keeps pouring in resources, sure, more "works".

Comment by yalogin 19 hours ago

This needs at least a 30b model or higher, so for most folks it means purchasing a new machine. Given RAM costs this may become prohibitive, and a monthly subscription may feel like better ROI.

Comment by shironnnn_ 1 day ago

I use SpecKit to create a very detailed plan with a high amount of specificity using paid Claude plan.

Then I give it to a local LLM (e.g. Qwen / Gemma 4) via CLI. This is possible by using llm-mlx on Mac (or ollama on any machine, given sufficient hardware), which serve OpenAI-compatible endpoints that Aider (CLI) or Visual Studio Code can use to vibe along with the agentic coding assistant.

The paid products have an advantage but are not necessary if you don't mind being more involved with the process and have low expectations.

Comment by michaelhoney 1 day ago

don't think she has posted here, but Vicki Boykis blogged about this today:

https://vickiboykis.com/2026/06/15/running-local-models-is-g...

Comment by tumetab1 1 day ago

Not yet, tried Gemma 4 on an Apple M4 but the tok/s is significantly lower than the cloud offering.

Also, the lack of enterprise tooling to help select an appropriate model, and of tooling to run a local LLM, does not help.

Comment by adam_patarino 20 hours ago

Yes! And we are using it to build Rig AI to make it easier for anyone to do it too!

We are post training qwen 3.6 and combining it with a custom inference engine and harness to get the most out of a smaller model.

Comment by cloudengineer94 22 hours ago

I have tried in both my Mac and my desktop (Rtx 5090) with Gemma 4 and Qwen and so far nothing is quite replacing Claude Code or Kiro for spec driven architecture & development.

I do think we are slowly getting there; Gemma 4 was a big jump.

Comment by ndom91 1 day ago

Not 100%, I still fall back to Claude for most day-job stuff. But I've been trying to use Qwen 3.6 and Gemma 4 on my framework desktop mainboard (Strix Halo) as much as possible.

I've been working on an ops style tool for local LLM inference. Proxying, api keys, request logging, model rewriting and much much more.

https://github.com/ndom91/llama-dash

Comment by anonymousiam 1 day ago

This was posted shortly after your Ask HN post:

My Homelab AI Dev Platform

https://news.ycombinator.com/item?id=48542433

Comment by jderekw 1 day ago

Running AMD Lemonade as the daily rig. Started with Ollama, then moved over to LMStudio, and have now standardized on AMD Lemonade, which has been helpful for monitoring cRAM, CPU, GPU and gRAM. The multi-model support in Lemonade makes it straightforward to run a stack for LLM, Voice to Text, NPU, and Image Generation. The platform also works with Nvidia, Apple, Intel and AMD chipsets.

Comment by trilogic 22 hours ago

https://hugston.com/models/anthropics-fable-qwen36-35biq4-nl

Comment by bArray 1 day ago

I'm in the middle of building my own based on LiquidAI/LFM2.5-1.2B-Instruct [1]. I run it on the CPU locally and get reasonable performance. I'm currently using it to solve small problems - but expanding it daily.

[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct

Comment by patates 1 day ago

I have a mac with loads of ram but I cannot even justify the electricity cost when deepseek is so much better than anything I can run locally (including heavy quantizations of deepseek itself) and costs pennies. It's crazy how cheap it is!

Comment by mv4 1 day ago

I've been using MiniMax M2.7 with vllm on my dual Nvidia Spark cluster. Slow (<20 tps) but functional for most of my use cases.

Comment by cmrdporcupine 1 day ago

I was just looking and it should be possible to run this one on 3bit quant on my single Spark? Maybe? Depending on context size? Assuming 3-bit doesn't totally lobotomize it.

Comment by daniban 17 hours ago

I haven't but I'm on the path to attempt this. I want to get a DGX Spark and will be trying Qwen and Kimi.

Comment by xhinker2 1 day ago

Yes, I have.

1. Two RTX 3090s in Linux 22.04

2. Running Qwen3.6-27B Q6_K_XL GGUF

3. Using my own harness AZPal, which I built myself; I also wired it up with Hermes Agent, works fine

4. Many times it solves problems that Codex can't solve

https://medium.com/p/f237d575e861

Comment by Departed7405 1 day ago

I tried but OpenCode doesn't have great local model integration. It's just a pain in the ass to set-up.

Plus, you now have zero-data retention models, so the privacy argument has kind of faded.

Comment by sj_tech 1 day ago

I use Qwen 3.6 35B A3B for agentic coding using the GitHub Copilot Extension for VSCode. Mac Mini 128GB as the hardware. Seems reasonable for that model size, but I notice looping issues when the problem becomes too big to solve. You can use it to do something that you know how to do (saves time).

Comment by thesuperbigfrog 1 day ago

Here is a nice setup that works well:

https://discourse.ubuntu.com/t/use-workshop-to-run-opencode-...

Comment by whartung 1 day ago

Will the inevitable M5 releases from Apple change this equation in any meaningful way?

I'm waiting to swap out my last gen Intel iMac with a new M5 mini of some kind, with the eye to hopefully be able to run some models locally. I envision a mini (heh) arms race to simply swapping out an M(X-1) for an M(X) annually as this field shakes out.

Comment by fransje26 11 hours ago

> Will the inevitable M5 releases from Apple change this equation in any meaningful way?

No. Apple is also running out of RAM, so you will not have the RAM you need.

Comment by ozten 1 day ago

Yes, for client projects where privacy and security are important, but there is no enterprise contract:

Open code against Infomaniak hosted OSS models: Qwen3.5-122B-A10B-FP8, Kimi-K2.6.

I use API keys for billing. It performs like Dec 2025 in terms of my productivity back then.

Comment by Lwerewolf 1 day ago

mbp16 m5 max 128gb, antirez/ds4, deepseekv4-flash. Works well for relatively dense (let's say <20k LoC per project) C codebases that are essentially a bunch of custom specialized stores, http servers, network infra, media transformers, etc.

Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cdw and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e the inference mac) - i.e. netns exec into bwrap.
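A rough sketch of what such a sandbox invocation could look like, built as an argument list in Python. This is not the commenter's actual script: the namespace name, paths and agent command are placeholders, the network namespace (with its single allowed IP and port) is assumed to exist already, and `ip netns exec` normally needs elevated privileges.

```python
import os
import shlex


def sandbox_argv(netns, workdir, agent_cmd, home=None):
    """Build an `ip netns exec ... bwrap ...` command line (not executed here)."""
    home = home or os.path.expanduser("~")
    pi_dir = os.path.join(home, ".pi")
    bwrap = [
        "bwrap",
        "--ro-bind", "/", "/",            # whole filesystem read-only by default
        "--dev", "/dev",
        "--proc", "/proc",
        "--tmpfs", "/tmp",                # separate scratch space
        "--bind", pi_dir, pi_dir,         # harness state stays writable
        "--bind", workdir, workdir,       # the project directory stays writable
        "--unshare-all", "--share-net",   # new namespaces for everything but network
        "--die-with-parent",
        "--chdir", workdir,
    ]
    # The restricted network namespace is assumed to be set up elsewhere;
    # bwrap keeps whatever network namespace it is started in.
    return ["ip", "netns", "exec", netns] + bwrap + list(agent_cmd)


argv = sandbox_argv("llm-only", "/home/me/project", ["pi"], home="/home/me")
print(shlex.join(argv))
```

Building the list first and printing it makes it easy to review exactly what is writable before anything is executed.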

Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.

Performance as described in the tables in the readme here: https://github.com/antirez/ds4 ...with a bit less than half that at "low power" (30w). Both are usable.

Comment by derekered 1 day ago

I'm using Qwen 3.6 on my MacBook Pro M5 Pro with 48GB RAM for any work that I am particularly privacy conscious about, like any work with my journaling. It's been working great! I don't have any direct comparisons, but I've been satisfied with the results.

Comment by russelg 1 day ago

I've got the same spec, are you running the 27B or the 35B-A3B? I found the 27B was unusably slow (like 10-15t/s not to mention the prefill times)

Comment by jrflo 1 day ago

I would love to do this if it didn't require such a huge amount of RAM. And the difference in quality is worth it to pay $20-$100/mo if data retention doesn't matter to you.

Comment by cahaya 23 hours ago

Asking for feedback:

Sorry for hijacking the convo, but you (with local models) are my target audience in terms of hardware.

Is anybody willing to test my new app https://document.bot? It is like Cursor IDE but custom harness for knowledge work (PDF's, MS Office files etc).

You can connect your existing offline LLM models through LMStudio, Ollama, or app managed LLM models (Qwen3.5, Gemma 4, etc)

Might have to make a new Ask HN post for this, but again, you are users with good hardware setups.

Comment by 627467 1 day ago

So, everyone has different context, but how free is free running these local models? Like having a power hungry machine always on in the cupboard?

How much does this wear out the hardware?

Also, if privacy is the main reason for running local models, why not use venice.ai and equivalent?

Comment by jmichaelson 1 day ago

I am working on exactly this issue right now. My approach is that a highly optimized harness (pi.dev) with the right backing knowledgebase (a custom, self-updating wiki with lots of QC layers) can get close to most of my usage patterns for my Claude Max 20x subscription. I use Gemma 4 26B QAT served by a custom fork of llama.cpp, with 4-8 slots of 256k context at Q8. It's a very good model when the harness keeps it on rails. In an age of 1M context windows, 256k may seem small but it's been plenty for my work (scientific programming). A $20/month subscription to Ollama-cloud gets me good coverage of consults out to frontier models for difficult plans or debugging (again this is all woven into my highly customized pi install).

I'm still optimizing it (with claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.

Comment by maelito 22 hours ago

Well, not local, but using Mistral Vibe CLI for a fixed 17€/month with unlimited usage is incredible value for money.

Comment by drnick1 1 day ago

- What would you say is the best model for coding at the moment that can run on a high end consumer GPU? (Assume an RTX 3090/4090 is available.)

- What "stack" do you recommend? Llama.cpp + OpenCode?

Comment by ecshafer 1 day ago

I work with a few models on servers, so not local, but self hosted with ollama. gemma-4, glm 4.7 flash, and qwen 3.6. glm is the best at coding agentically. But I still don't think any of them reach the levels of gpt 5.5 or opus 4.8.

Comment by v3ss0n 1 day ago

Yes, qwen 3.5 122b + dgx is working wonders and I am no longer subscribed to any cloud api now. I will post a project which I accomplished in 9 days of long-horizon running.

Comment by carlossouza 1 day ago

This should be a recurrent question posted every month

Comment by deepvibrations 22 hours ago

The TLDR is that the best setup is probably Mac Studio (128GB RAM) / MacBook (36GB) with Qwen 3.6 35B (3B active params), or Qwen 3.5 122B model (this one is slow though).

These models are still very capable with good hardware, but they do lack the deep reasoning of major models and require more precise prompting.

So unless you really need the privacy, or have a lot of excess cash, it is not recommended, as considering the price of major models, it's just extremely cost inefficient!

Comment by jmward01 1 day ago

Has anyone been storing their cc sessions for future training data on their own models? I'd love to build a system that fine-tunes on cc sessions and a good first step is capturing my own sessions well.

Comment by abidlabs 1 day ago

Yes! https://huggingface.co/changelog/agent-trace-viewer

Comment by jmward01 1 day ago

Didn't realize they did this. I have avoided pushing data to huggingface. This is all -deeply- private info and I haven't really reviewed their privacy policies and the like. I'll give them a look.

Comment by overgard 1 day ago

I haven't yet, but I just bought a 128GB M5 Max 40 core which I'm hoping can do it (if not, it's a good laptop regardless, I actually need that amount of RAM for non-LLM stuff)

Comment by qu0b 1 day ago

I'm using deepseek V4 on two rtx 6000 pros and it's working great. Opus is so slow that I get deepseek to do most of the work and Opus is only used to validate and help plan.

Comment by fortyseven 1 day ago

I use Pi and Qwen 3.6 27b locally on a 4090 for all my personal projects. I still use Claude for day job work since they pay for it, and my employer expects me to use it. I rarely touch it otherwise.

Comment by kristianpaul 1 day ago

Qwen3.6 35B on gigabyte aitop (spark clone), but be very specific about what you ask and how it should be solved

Nemotron super 3 110B works well for 1M context long vibecoding sessions

I also use Pi harness with no extension

Comment by c16 23 hours ago

My experience so far has been Qwen3.5:32b-a3b-coder via Claude Code on a MBP 64gb M4, and a MBP 32gb M5. Just found out about qwen3.6, so downloading that currently.

on 64GB M4 I find it's able to do things fairly well. The few times I run out of tokens, I hop over to that and I'm mostly unimpeded. I compare it to the Haiku models, where you have to go in and be surgical about your changes, or like others have said, guide a junior.

on 32GB M5, I find that it works, but around the 30% ctx threshold it slows down quite substantially, so more need to be surgical in your requests. I'll often just have my IDE open and Claude. But maybe I've been too comfortable talking to Sonnet/Opus and so forget I need to be more deliberate in my requests.

My finding here is that the harness is a big part of the problem. CC seems to be very good with Qwen in my experience. Better than OpenCode.

I also run DeepSeek for some other non-structured data tasks and to generate a to-do out of that. That's not coding, so won't go into that, other than to say it's very competent as a small model left to run in the background and automate small parts of my life and process.

tl;dr it's totally doable on a 32gb mbp using ollama, but be precise in your requests and guidance.

Comment by mark_l_watson 1 day ago

I would like to say I run 100% local, but I use Opus + Gemini Pro cumulatively for 3 or 4 hours a week. I also like to use DeepSeek v4 flash with OpenCode for small quick tasks.

I did just publish a free to read online book "The Rise of Local Coding Agents" [1] where I document my setup that I enjoy using. I use little-coder (built on pi) and have good results for small Python and TypeScript applications. I struggle getting good results with Common Lisp and Clojure.

For me, the problem with all local LLM-based coding agents is slow runtime.

[1] https://leanpub.com/read/local-coding-agents

Comment by _davide_ 1 day ago

i used to mix remote and local minimax 2.7(q3) on my strix halo, it ran at 30 tokens/s tg and 220 tokens/s pp... it was a bit painfully slow, but it was a good feeling that i could stay offline. unfortunately m3 which is at opus .8 levels is 460b parameters and doesn't even fit in 128gb of memory, let alone a big context. strix halo feels like a toy for ai purposes. https://kyuz0.github.io/amd-strix-halo-toolboxes/

Comment by sosodev 1 day ago

My strix halo board is feeling more useful and less toylike with the recent performance gains combined from MTP, better quantization, and generalized performance improvements across the stack. For example, I can run Unsloth's Gemma4-31B 4-bit QAT model with around 30tg and 200pp. I don't find that to be too slow at all. Particularly because it's nearly full accuracy and good enough for a lot of different stuff I throw at it.

I think it also helps that I'm using my machine to do home server stuff. It excels at all of the traditional workloads. Then I can lean on the AI to help with automation here and there. I find it deeply satisfying.

Comment by _davide_ 1 day ago

you can absolutely use it for some workloads, but as soon as you have some extra complexity for a big repo it'll take forever, and the economics are silly to the point that the electricity bill would be comparable to a subscription. I love having the possibility of running things locally if some random dude decides to pull the plug, and the fact that i can have 100% private inference gives me solace, but as the main driver during the day? shoot me

Comment by sosodev 5 hours ago

Meh. My server can run these models for negligible power draw (like ~130W fully maxed out). That's with ~30 tok/s which isn't that bad. I do agree that they're still nowhere near as good as the frontier models though. I do lean on those when I need to get something done with better quality or at a faster speed.
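To put the ~130W figure in money terms, a quick worked example. The hours per day and the price per kWh are made-up inputs; plug in your own tariff.

```python
def monthly_power_cost(watts, hours_per_day, price_per_kwh, days=30):
    """Electricity cost of running a machine at a steady draw."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


# 130 W from the comment; 8 h/day and 0.30 per kWh are hypothetical.
print(round(monthly_power_cost(130, 8, 0.30), 2))  # prints 9.36
```

At those assumed inputs the machine uses 31.2 kWh a month, so whether the bill approaches a subscription price depends heavily on the local tariff and on how many hours it actually runs flat out.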

I've also been using Deepseek V4 pro/flash for some work stuff and I do find them to be much closer to frontier capability. I may try running flash at home soon for very patient edits. :)

Comment by hegdeezy 1 day ago

I have tried locally but I find that the implicit breakeven is somewhere around 1 year of use given the high power costs where I live. Not really worth it but maybe if I move some day!

Comment by redox99 1 day ago

Models that you can run at home (like Qwen 35B) aren't remotely close to Opus or GPT 5.5. Not even close. The only open models that are in that neighborhood are around 1T params, so forget about running them at home.

It's kind of like driving a shitbox. It can often drive you from A to B, and some people will try to convince you it's fine. It's not.

There's no logical reason other than absolutely requiring the privacy, doing it for fun, or niche use cases like airplanes and so on. If you can't spend the insanely subsidized $20 for codex, you can use an API for chinese models which will run circles around these tiny models.

Comment by pbasista 1 day ago

> Models that you can run at home (Like Qwen 35B) aren't remotely close to Opus or GPT 5.5.

Is that characterization based on some objective facts or benchmarks?

Comment by kube-system 1 day ago

Yes, there aren't any 35B models that are beating frontier models at just about anything generalized

Comment by redox99 1 day ago

Based on private test prompts I've run through OpenRouter.

Comment by xgulfie 1 day ago

I don't need a Ferrari to get to work

Comment by orangeisthe 1 day ago

But you need the best tools to do the job

Comment by cayley_graph 1 day ago

You need tools sufficient to do the job in an economical way, optimizing for both cost and quality. That is what 'best' means. We don't give every engineer all the resources under the sun, only what is appropriate.

I suspect many will realize millions more dollars are being spent than needed to achieve the highest marginal productivity gains, and reallocate accordingly. Who wants more of their money going to developer tooling, rather than bonuses?

Comment by orangeisthe 1 day ago

Of course. I have a $20/mo Codex subscription that has been serving me very well. Occasionally when I run out of quota, I switch to another one of my backup $20/mo subscription.

That's way more economical and produces far better result than any self hosted models today.

Comment by pdyc 1 day ago

yes

harness - pi+custom extension for subagents

model - qwen3.6 35ba3b q4km

hardware - intel arrow lake with 32gb ram

server - llama.cpp vulkan

performance - 15-18t/s generation 50-150t/s pp

planning and task creation is still using claude/gpt but they dont touch the code. All coding is done using this setup.

Example of a project made using this setup: easyanalytica.com. It's of medium complexity.

Comment by anuramat 1 day ago

I wonder what languages people are using; I imagine smaller models would be decent at bash/python but significantly worse at something like rust

Comment by julianlam 1 day ago

Of course.

Qwen 3.6 35B-A3B on a Framework 13 with 32GB of memory.

Running llama.cpp, 15 tokens per second. Outputs code and text faster than I can parse.

Comment by bagol 1 day ago

I wish I could. But, the hardware requirements are just too expensive for me.

Comment by sukuva 1 day ago

[dead]

Comment by AH4oFVbPT4f8 1 day ago

Ollama + Hermes on M5 Max 128GB using .NET using Qwen 3.6:35b-a3b as the primary model to do the work. I might use 27b to plan what to do.

Comment by xeonax 1 day ago

Whats .NET doing in between?

Comment by AH4oFVbPT4f8 1 day ago

Sorry, I meant to say I was writing .NET C# with the setup

Comment by SkitterKherpi 1 day ago

It has so far been the kind of thing that always feels like the next version of the local models would be the one that is just good enough.

Comment by sermakarevich 1 day ago

yes.

- smarter models to create tasks

- local qwen3.6:36B for tasks execution

here is how in details https://news.ycombinator.com/item?id=48520757

Comment by ElenaDaibunny 22 hours ago

we've been building local agents with vision models, works great for gui automation but coding tasks still need cloud models for reliability

Comment by catapart 1 day ago

tough ask, but since we're here: has anyone done this with 16GB of VRAM? I've been getting projects finished with LM Studio, but it definitely could stand to be more efficient. lots of time wasted with trying to get models to understand a problem with so few tokens.

Comment by Rzor 1 day ago

RX 9060 XT 16GB here on google/gemma-4-26b-a4b-qat using LM Studio. Context 65k, 23 layers on the GPU, 7 on the CPU, model in memory, mmapped. I'm getting 23-33 tks. Started experimenting 3 days ago (with gemma-4-e4b), don't know what half those settings mean, but 26B, even quantized, feels significantly better at a few small projects I asked it to create ("create a image converter using ffmpeg in bash", "create a canvas animation with real physics, no libraries"[1]).

It's faster than I can read, but it feels slow as hell. I think 40-50 tks is probably much more comfortable and I hope I can reach that when trying this on llamacpp soon enough.

[0] - https://pastes.io/9gaARxE8

[1] - https://jsfiddle.net/pou4nbh9/1/

Model: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gg...

Comment by agentbc9000 1 day ago

Kimi K2.7 is very good - i have been testing it and it's very very good, Fable 5 level of goodness.

Comment by bentt 1 day ago

Say more!

Comment by codelion 1 day ago

Using qwen3.6 27b locally with Claude code, it works well for simple coding tasks

Comment by jwr 1 day ago

I tried many, many times and I keep trying. But I just don't see this happening: those tiny models that we can run on our machines (I have an M4 Max Mac, so I can reasonably run qwen3.6-35b-a3b or gemma-4-26b-a4b-qat at this time) are NOWHERE near as smart as the huge monsters like Opus/Fable. Nowhere. I can see a lot of people deluding themselves.

Sure, you can get the local models to generate plausible-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.

Comment by euroderf 1 day ago

Is anyone managing to do this on a Mac with a measly 8GB ? Asking for a friend.

Comment by alimbada 20 hours ago

You could try running the smaller QAT Gemma 4 models but I doubt they'll be very good for software engineering work.

Comment by euroderf 18 hours ago

Thanks for the reply. What I'm getting from numerous HN discussions is that 8GB is a hopeless case (and the money I saved on RAM should be spent on non-local coding assist).

Comment by SugarReflex 1 day ago

Is anyone using Aider? Are there any decent CLI alternatives to it?

Comment by chungus 1 day ago

Yup, although technically not replaced because I never used either of those products because I don't like sending my code to their black box. I have 2x24GB AMD gpu's, gotten from gamers on my local marketplace, one is connected with a 40cm riser cable. Running Qwen 27B and am very happy with its performance. Q8 with 135k context (arbitrary number, I could push it to 256). I like to use qwen 35B3A for mapping out entire code paths through our relatively complicated codebase/infra at work.

I think it's so good that I now scour the local marketplaces for good buys on 24GB cards that don't seem run through by miners and the like, to build an even bigger rig for parallel execution.

Power usage is also totally not an issue, AI workload is very different from gaming.

tldr llama.cpp-vulkan with opencode on total 48GB VRAM AMD cards on arch btw.

Comment by system2 1 day ago

Until I can buy an 80GB VRAM GPU, I won't attempt to do it. A local LLM is always missing something that needs a bigger model.

Comment by ColonelPhantom 1 day ago

Which model class requires an 80 GB VRAM GPU? From my perspective, popular models are either in the ~30B range (Qwen3.6, Gemma 4) or, for the larger models (MiniMax, MiMo, StepFun, Deepseek), in the multiple hundreds of billions of parameters, for which 80 GB is simply too small.

You can just about reach the lower end of the latter category with a 128GB machine like a DGX Spark, Framework Desktop, or M5 Max, though those are usually not super fast. For the former category, you can easily run them fast with something like a 3090 or 5090, hell, probably even a 5060 Ti.
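
As a rough back-of-envelope for why these classes land where they do, here is a minimal Python sketch that estimates memory for the weights alone. The parameter counts and the 4.5 bits per weight (roughly a 4-bit quantization plus scales) are illustrative assumptions, and KV cache and runtime overhead are ignored, so real requirements are higher.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float, vram_gib: float) -> bool:
    """True if the weights alone fit; real usage needs extra room for KV cache."""
    return weight_gib(params_billion, bits_per_weight) <= vram_gib


if __name__ == "__main__":
    # Hypothetical sizes: a ~30B model and a ~300B model, both at ~4.5 bits/weight.
    for size in (30, 300):
        print(f"{size}B @ 4.5 bpw: {weight_gib(size, 4.5):.1f} GiB")
```

Under these assumptions a ~30B model fits comfortably on a 24 GB card, while a ~300B model does not fit in 80 GB.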

Comment by system2 1 day ago

Video models.

Comment by CamperBob2 1 day ago

This is true. There's not much point in buying only one RTX 6000. You need at least two to run anything interesting that you couldn't run on a 5090. And you can imagine where it goes from there.

Comment by Razengan 1 day ago

Related: Are there any viable distributed AI models?

Like how we've had SETI at Home, Folding at Home, BitTorrent etc. People are clearly willing to donate their computer resources to distributed projects.

Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.

Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.

Comment by joshuamoyers 1 day ago

I think it'd be very hard to achieve viable tokens/s or get arithmetic intensity to be high enough in general, since many cases in existing training and inference are memory bandwidth limited. Definitely seems possible to conceptually have a slow pipeline that is distributed though.
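
To make the bandwidth point concrete, here is a minimal sketch of the usual upper-bound estimate for decoding speed, assuming every active weight is read from memory once per generated token. The function name and the numbers in the example are hypothetical, and real throughput will be lower.

```python
def max_decode_tok_s(active_params_billion: float,
                     bits_per_weight: float,
                     bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s when decoding is limited by memory bandwidth."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Hypothetical: 3B active parameters at 8 bits/weight on a 300 GB/s memory bus.
    print(f"{max_decode_tok_s(3, 8, 300):.0f} tok/s at best")
```

Swapping in a much smaller bandwidth figure for the slowest link in a distributed pipeline shows how quickly the bound collapses.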

Comment by SimianSci 1 day ago

This is unlikely to happen in any meaningful fashion for quite some time.

(TLDR; Distributed compute for models will require hardware at a level only really possible with data-centers at the moment.)

Token generation demands so much from a single GPU that it will often saturate the bandwidth of consumer-grade interconnects like PCIe. That implies distributing a model's compute across vast distances is too much of a challenge without significant infrastructure.

To give an example: when we split a model's compute between two separate cards on a single workstation, we don't end up with 2x the compute bandwidth for the model. Instead the increase is something small like 20%, depending on the model, because the interconnects (PCIe on consumer hardware) quickly become so saturated with data being copied between the two GPUs that they become a bottleneck. And remember that this happens locally over PCIe, which caps out at around 20-35 GB/s depending on the generation of motherboard.

Model performance is very much tied to having the fastest, highest-bandwidth single card available, so as to keep data transfer operations to a minimum, because the sheer volume of data needed for the model to run is immense. I simply can't imagine how slow and unusable a model would be if the copy operations its compute requires had to be performed over the internet, where speeds vary widely across the globe and the unreliable links would add further overhead for data verification.
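
A minimal sketch of that comparison, modelling one transfer as fixed latency plus payload over bandwidth. The 1 MiB payload, the 25 GB/s PCIe figure (within the range quoted above), the 100 Mbit/s home link, and both latencies are assumptions picked for illustration, not measurements.

```python
def transfer_seconds(payload_bytes: float,
                     bandwidth_bytes_per_s: float,
                     latency_s: float = 0.0) -> float:
    """Time for one transfer: fixed latency plus payload divided by bandwidth."""
    return latency_s + payload_bytes / bandwidth_bytes_per_s


PAYLOAD = 1 * 2**20  # hypothetical 1 MiB exchanged per generation step


def pcie_step() -> float:
    # Assumed local link: 25 GB/s with a few microseconds of latency.
    return transfer_seconds(PAYLOAD, 25e9, latency_s=5e-6)


def internet_step() -> float:
    # Assumed home link: 100 Mbit/s with a 50 ms round trip.
    return transfer_seconds(PAYLOAD, 100e6 / 8, latency_s=0.05)


if __name__ == "__main__":
    print(f"PCIe:     {pcie_step() * 1e6:.1f} microseconds per step")
    print(f"Internet: {internet_step() * 1e3:.1f} milliseconds per step")
```

Even with a zero-byte payload, the assumed 50 ms of latency alone would cap such a pipeline at 20 sequential steps per second.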

The dream of distributed AI is a ways off.

Comment by anubhav200 1 day ago

Yes: llama.cpp, qwen 27b and 35b, and llama-cpp-manager for managing model configs (https://github.com/anubhavgupta/llama-cpp-manager).

Comment by nynrathod 23 hours ago

I tried, but honestly, every attempt ended with a missing tool, a configuration problem, or a hardware limitation. None of them worked for me. In the end only the paid APIs gave me productivity; the free local route ended with me investing time and working less efficiently.

Comment by wmedrano 1 day ago

No, but I use GLM5.1 instead of Claude/GPT.

Comment by drnick1 1 day ago

Do you recommend Ollama or bare llama.cpp?

Comment by jboss10 1 day ago

llama.cpp. It's faster and more open source. Ollama has some mixed history. I use llama-swap to emulate the Ollama experience.

Comment by shironnnn_ 1 day ago

If on macOS, I recommend llm-mlx, which currently renders tokens 10%-15% faster than llama.cpp.

Comment by devin 1 day ago

Anyone here running a tinygrad?

Comment by christkv 1 day ago

Waiting for this https://github.com/antirez/ds4 to stabilize for strix halo.

Comment by w10-1 1 day ago

I run many models (but mainly Gemma-4) using oMLX (for caching) on a 32GB M1 max using (gasp) Xcode. For tok/sec response times, I'd say it responds faster than I could read the prompt aloud in many cases (and I'm not constantly polling the Claude status page).

For months I spent time curating the AI+harness+skills+MCP servers, but now mainly just code with it. I find myself not bothering to use Claude (but keep paying "just in case").

That's feasible in part because my prompts have very specific objectives, constraints, and suggested staging, because I want the code to be exactly as I would write it, and I want to weigh in at specific moments. I would say the speed-up is 2-4X instead of the 10X of vibe-coding greenfield projects. The problem is not the coding speed, but building something complicated that's also correct and flexible (i.e., a directional accuracy). E.g., the agents help with abandoning a less-fruitful API shape instead of sticking with what works in a local maximum.

One flaw there is that I'm still writing code that feels clean to humans, which now is probably a waste. LLMs might be happier with 10+ parameters on one API instead of a plethora of configuration objects and convenience wrappers.

Comment by sometimelurker 1 day ago

yeah I use one of the small MTP qwens and pi

Comment by major505 1 day ago

Yes. I use Qwen on my MacBook M1 (16GB) daily, running inside Ollama. Works well. It's not particularly fast, and I needed to create a custom image that sets the model's temperature to zero from the start, so it doesn't get over-creative with its bullshit, but it works reasonably well.
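
For anyone who wants the same effect without building a custom model, here is a minimal sketch that passes a zero temperature per request to Ollama's local HTTP API. The model name `qwen-example` is a placeholder, not the poster's actual model, and the endpoint assumes Ollama's default port. Baking the setting into a custom model is typically done with a Modelfile line such as `PARAMETER temperature 0`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes Ollama's default port


def build_request(prompt: str, model: str = "qwen-example") -> urllib.request.Request:
    """Build a non-streaming generate request with temperature pinned to 0."""
    payload = {
        "model": model,  # placeholder name; use whatever `ollama list` shows
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen-example") -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Write a function that reverses a list")` needs a running Ollama server with the named model already pulled.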

Comment by Der_Einzige 1 day ago

Secretly the problems many people have with agentic coding are related to poor choice of sampling settings, but the world will wait several more years before this is understood well. top_p and top_k are garbage, but they are intentionally kept because subsequent methods enable coherent high-temperature sampling, which is an absolute no-go for alignment/safety reasons.

The secret to actually good agentic outputs even with small models? llama.cpp has support for a little-known sampler called "top-n sigma". You should use that, set it to 1 and set temperature to literally whatever you want (it could be infinity) and your model will just magically work to your maximum context window. That's because long context generation is a sampling problem.
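
For readers who haven't met it, here is a minimal sketch of the idea as I understand it: keep only tokens whose logit lies within n standard deviations of the maximum logit, then sample from the survivors at whatever temperature you like. This is an illustrative reimplementation, not llama.cpp's code; the function names and toy logits are made up, and it assumes a finite temperature greater than zero.

```python
import math
import random
import statistics


def top_n_sigma_filter(logits, n=1.0):
    """Return indices of logits within n standard deviations of the max logit."""
    threshold = max(logits) - n * statistics.pstdev(logits)
    return [i for i, x in enumerate(logits) if x >= threshold]


def top_n_sigma_sample(logits, n=1.0, temperature=1.0, rng=random):
    """Filter on raw logits first, then apply temperature to the survivors only."""
    kept = top_n_sigma_filter(logits, n)
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)  # subtract the max for a numerically stable softmax
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [10.0, 9.5, 0.0, -1.0, -2.0]  # made-up values for illustration
    print(top_n_sigma_filter(toy_logits, n=1.0))
    print(top_n_sigma_sample(toy_logits, n=1.0, temperature=100.0))
```

Because the filter runs before temperature is applied, raising the temperature only flattens the choice among the surviving tokens; with the toy logits above, only the first two indices can ever be sampled at n=1.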

Comment by devmor 1 day ago

I’d be surprised if this was useful for much. Claude is already almost too slow to do anything serious I’d consider using it for outside of grunt work without parallelizing.

The only reason it’s economical is because it’s massively discounted if you’re not paying API rates.

Comment by hacker_homie 1 day ago

I run qwen3.6 on an AMD AI Max laptop, getting about 6-10 tok/s; it’s slow enough that I can follow along. It has issues with design and large piles of code. Otherwise it’s a good programming buddy.

Comment by lowbloodsugar 1 day ago

If you want to try it out before dropping $$$ on a GPU, just run something that would fit on your target GPU but online.

Comment by platevoltage 1 day ago

I run very small models locally for code completion and writing boilerplate. I still use Claude in a web browser on occasion since it's free, but the second that goes away, I'll be done with it. They get none of my money.

Comment by epolanski 1 day ago

Not with a local one, but I moved to DeepSeek v4.

That said, I plan to move to local ones when I get my hands on a 256+ GB MacBook.

Local inference is good enough to help me with my daily job, and doesn't turn me into an assistant to the LLM.

Comment by jay_kyburz 1 day ago

Can anybody let me know how just chatting with Qwen3.6 on a Strix Halo 128GB goes?

If I give it a page of context, can it write a linked list or identify a bad line of CSS?

Is there anywhere online I can chat with a model I could be running at home to see how good it is?

Comment by thrownaway561 1 day ago

I just use DeepSeekV4 Fast... It's cheap as hell. Currently my monthly usage has been

67M Output, 51M Input

Total: $0.83.

I honestly don't understand why people just don't use DeepSeek.

Comment by ThomasGlanzmann 1 day ago

I do the same: deepseekv4 fast for 90% of the tasks; if it can't lift it, I use deepseekv4 pro. I use crush as my coding agent but removed the blocked commands because I also do a lot of system administration. Love it. I've spent 8 USD in 7 weeks and use it quite extensively for all sorts of things: programming, system administration, google search replacement, investments, you name it.

Comment by codemk8 1 day ago

You mean deepseek-v4-flash, right? Same here. I use it for my Hermes agent. It's so cheap that I sometimes feel "guilty". I even put in more money than I needed just to make sure they do not go out of business.

Comment by ThomasGlanzmann 1 day ago

Yes, I do mean deepseek-v4-flash.

Comment by slvnx 20 hours ago

Do you use deepcode, or which CLI and/or coding agent do you use it with?

Comment by thrownaway561 12 hours ago

No... I just use the github copilot chat

Comment by jeffrallen 1 day ago

I use Qwen 3.6 on a remote GPU that my work offers. Works fine. Slow and steady, works hard, gets the job done. Probably better at diagnosing than making new code, but whatever.

Comment by gigatexal 1 day ago

I tried to. I just couldn't get over how loud it made my otherwise whisper-quiet M3 Max MacBook Pro 14 for the performance. The sweet spot has been adapting Claude Code to use the Chinese models. Deepseek V4 Pro is very, very good. But I am such a casual local user of AI that my $20/month Claude subscription is enough and I find myself using that more and more.

Comment by dude250711 1 day ago

Yes, running a local model on a natural wetware substrate here.

Recommended setup: plenty of nutrients, some caffeine and a quiet environment.

Performance - not currently measured in tokens: roughly average.

Comment by jasongill 1 day ago

I have been running this stack since well before Claude Code became popular. It works OK but I've found it to be very slow; and despite having a big context window, it seems to lose track of what it's working on and goes down a rabbit hole (or just wastes tokens trying to use the web browser) for hours and is hard to get back on track. I even tried spinning up two sub-agents but even after years of trying to prompt them, they are almost useless in terms of coding ability, so that is looking to be a waste of spending at least so far but maybe the model will improve as time goes on.

Comment by bananadonkey 1 day ago

My sub agent has been looping for almost 10 years at this point and has so far written 0 lines of code. Definitely won't be investing in another...

Comment by HPsquared 1 day ago

I personally get about 50 tokens per hour.

Comment by deployementeng 1 day ago

partially yes.

Comment by queeshonda 19 hours ago

Yes, your mom

Comment by DetroitThrow 19 hours ago

No.

Comment by syngrog66 1 day ago

pre-replaced it with combo of my brain, vim, an assortment of other CLI/TUI tools, etc

Comment by salutonmundo 1 day ago

it's called your damn brain.

Comment by cyanydeez 1 day ago

never started. using either qwen3-coder-next or qwen3.6 35b

if you're shopping for a new pc, very easy to justify 128gb vram

Comment by lasky 1 day ago

for crying out loud... why would you deprive yourself?

Comment by frabcus 23 hours ago

Long term, getting locked into proprietary software development tools is a bad idea. And these models are extremely proprietary. The ability of the US Government to cancel them at any time is one real recent example of one category of problem.

Back in the 1990s the good C++ compilers were proprietary, eventually GCC and LLVM caught up, and now dominate. The pattern repeats in software development, and there's no reason to believe it won't continue.

Yes, right now it makes sense to use Opus 4.8, but it is good that a significant number of people are using other options, and making sure they work and are ready for when you need them.

Plus it is extremely fun and connecting and hackerish to do local coding with a local model. Try it.

Comment by tyingq 1 day ago

Anyone doing it with a "rent a GPU over the network" path? Is that at all cost effective for any use case?

Comment by dada216 1 day ago

Local? No. Via an opencode Go subscription, mainly using GLM? Yes. I still use Gemini/Claude/GPT via API from OpenRouter for adjacent tasks; I would say $20 per month max in API token costs.

Disclaimer: I am a Linux infra/k8s guy, I write production code but it's mainly glue code and mainly in golang.

Addendum: most value we get is from "document intelligence" and that's all Gemma and Qwen on H100/H200

Comment by kertoip_1 1 day ago

Just attach OpenRouter to your coding agent tool and try it yourself. All relevant open-weight models are there. Everyone has different needs and expectations.