Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?
Posted by cloudking 1 day ago
Has anyone here fully swapped Claude/GPT for a local model as their main coding tool, not just for side experiments? If so, please share your setup and performance (e.g. tok/s).
Comments
Comment by Greenpants 1 day ago
I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.
It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).
Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)
Comment by lambda 1 day ago
I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.
And I'm still an AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.
But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.
For other chat tasks and translation, I'll frequently use Gemma 4 31B.
For audio, I'll use Gemma 4 12B.
I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.
Comment by chakspak 1 day ago
Comment by lambda 1 day ago
The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.
But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you had a long sequence of tool calls with interleaved thinking, as soon as you had your next turn in the chat, it would have to re-process all of that, as it would drop all of the reasoning.
Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, because you're not dropping the thinking every turn, but it re-uses the cache better, so you don't have to reprocess a whole turn each time.
In my models.ini, I have this for the Qwen3.6 models:
chat-template-kwargs = {"preserve_thinking": true}
There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.
Comment by ndom91 1 day ago
I'll have to give the preserve_thinking a shot.
Comment by jderekw 1 day ago
Comment by thefroh 1 day ago
but for caching, all you are doing is leaving off a fraction of the most recent assistant message generation, which will have little/no impact on cache hit rate.
Comment by stymaar 1 day ago
True, but not a tiny fraction, qwen is very verbose in its thinking traces. And it basically means that for every (nonthinking) generated token you have to compute the KV twice (once as tg, the second one as pp).
Comment by havfo 1 day ago
--chat-template-kwargs '{"preserve_thinking":true}'
Comment by anaisbetts 1 day ago
Comment by dnautics 1 day ago
Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?
Comment by lambda 1 day ago
Qwen 3.6 has finally been trained both with and without preserving thinking, so you can optionally enable preserving thinking. This will use up a bit more context, but it will avoid having to do this re-processing of long agentic turns, and also the preserved thinking can avoid having to re-do some of the same reasoning over again in later turns.
Besides that, modern LLMs don't only use full attention (apparently, attention is not all you need). Full attention is very expensive to compute and store (O(n^2)). But additionally, full attention is actually bad at certain kinds of reasoning; keeping track of some value that gets replaced over the course of time, for example. So most models these days use various forms of local attention which is fixed length and gets updated as you go; sliding window attention, Mamba-2 state space models, etc.
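To put rough numbers on the full-versus-local attention point, here is a minimal Python sketch. It assumes a simplified cost model (counting causal query/key pairs and cached positions per layer), and the function names are made up for illustration. Real models mix layer types, so treat it as an illustration of the scaling only.

```python
from typing import Optional


def full_attention_pairs(n_tokens: int) -> int:
    """Causal full attention: token i attends to tokens 0..i, so n(n+1)/2 pairs."""
    return n_tokens * (n_tokens + 1) // 2


def sliding_window_pairs(n_tokens: int, window: int) -> int:
    """Causal sliding window: each token attends to at most `window` recent tokens (itself included)."""
    return sum(min(i + 1, window) for i in range(n_tokens))


def kv_positions(n_tokens: int, window: Optional[int] = None) -> int:
    """Cached positions per layer: every token for full attention, only the window otherwise."""
    return n_tokens if window is None else min(n_tokens, window)


if __name__ == "__main__":
    # Hypothetical window size; quadratic vs roughly linear growth is the point.
    for n in (1_000, 10_000, 100_000):
        print(n, full_attention_pairs(n), sliding_window_pairs(n, 4096), kv_positions(n, 4096))
```

The fixed-size window is what keeps memory bounded, and it is also why the state cannot simply be truncated back to an earlier position.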
But one advantage of attention is that you can go back and reprocess by truncating the KV cache and starting over. You can't do that with other forms of local attention; you've lost the state earlier in the sequence.
So to allow you to go back without fully recomputing the cache all over again, your engine will save snapshots of the local attention state at various times, so if you need to go back to recompute the cache, you can start from the last snapshot. However, these snapshots can get large, you can't keep too many of these, so sometimes you need to go back quite far to get to one, or they're all past the point you need to go back to and you need to start over again from the beginning.
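The snapshot behaviour described above can be sketched roughly like this. It is a toy model, not llama.cpp's actual bookkeeping: snapshots are represented only by the token position they were taken at, and `restart_position` and `tokens_to_reprocess` are hypothetical names.

```python
import bisect


def restart_position(snapshots: list[int], divergence: int) -> int:
    """Latest snapshot position at or before `divergence`, or 0 if none qualifies.

    `snapshots` must be sorted ascending.
    """
    idx = bisect.bisect_right(snapshots, divergence)
    return snapshots[idx - 1] if idx > 0 else 0


def tokens_to_reprocess(snapshots: list[int], divergence: int, prompt_len: int) -> int:
    """How many prompt tokens must be run through prefill again."""
    return prompt_len - restart_position(sorted(snapshots), divergence)


if __name__ == "__main__":
    # Snapshot exists before the divergence point: resume from it.
    print(tokens_to_reprocess([1000, 5000, 9000], divergence=6000, prompt_len=12000))
    # All snapshots are past the divergence point: start over from the beginning.
    print(tokens_to_reprocess([5000, 9000], divergence=3000, prompt_len=12000))
```

With few snapshots kept, the gap between the divergence point and the nearest earlier snapshot is what determines how painful a cache miss is.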
There have been particular bugs in llama.cpp that have caused this to be triggered more often than it should; for instance, it wouldn't take snapshots before turns that included images at one point, so if you had an image heavy agentic workflow, that issue plus the lack of preserving thinking would mean you would frequently have to go back and start over from scratch.
Some of these issues have been fixed, some are addressed by preserving thinking. There are still some issues sometimes; for instance, one that's hard to fix is that the tokens generated autoregressively don't always parse the same when doing prefill. For instance, you could generate something as two tokens "pre" and "fill", but it turns out that "prefill" is also a single token so the tokenizer will use that, so when you send that back again on the next turn, it will see a divergence and have to recompute from that point. It might be possible to ignore that and use the not fully greedy tokenization that's in the cache, but I've definitely seen llama.cpp have to do some cache recomputation due to that.
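The "pre" + "fill" versus "prefill" divergence can be reproduced with a toy example. This is only a sketch: the vocabulary is invented, and real tokenizers use BPE merges rather than the plain longest-match rule assumed here.

```python
# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"pre", "fill", "prefill", "run", "the", " "}


def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Longest-match tokenizer: at each position take the longest vocab entry that fits."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens


def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens shared by both sequences (the reusable part of the cache)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


if __name__ == "__main__":
    generated = ["run", " ", "the", " ", "pre", "fill"]   # what the model sampled
    retokenized = greedy_tokenize("".join(generated), VOCAB)  # what prefill sees next turn
    print(retokenized, common_prefix_len(generated, retokenized))
```

The text is identical in both cases; only the token boundaries differ, and that is enough for the cached sequence to stop matching at that point.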
Comment by carterschonwald 1 day ago
Comment by thefossguy69 1 day ago
Comment by lambda 16 hours ago
The jinja template is what renders the openai-format request sent by the harness, into the actual string of text that will be tokenized and fed to the model. For models without preserve thinking support, the jinja template drops the reasoning from all but the current turn.
Here is the default jinja for Gemma 4: https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_...
{#- Render reasoning/reasoning_content as thinking channel -#}
{%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
{%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
{{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
{%- endif -%}
You see that it only preserves the thinking for indexes that are later than the last user message; thinking is only preserved for a single turn (which can include a lot of interleaved thinking and tool calls), once it goes back to the user and the user replies, it will replay the tool calls but not the thinking between them.
Here's Qwen 3.6 by comparison: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/chat_t...
{%- if (preserve_thinking is defined and preserve_thinking is true) or (loop.index0 > ns.last_query_index) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
It additionally has a preserve_thinking flag that you can set. If that's set, it will include all turns' thinking in the text passed to the model. But you do have to set that, it's not the default.
It's possible to modify the jinja file that you're using with a model. Some people do that with models that haven't been specifically trained for it, and report good results; but some report that because it wasn't trained for that, they get worse results if they include thinking from previous turns.
So for models like Gemma, you would have to modify the default jinja to enable this. For Qwen, you can just set the preserve_thinking flag to get this behavior; and apparently they have trained in this mode so you get better results than models that have not trained this way.
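To see why the flag matters for cache reuse, here is a plain-Python approximation of the Qwen branch quoted above. It is a simplified sketch, not the real template: tool calls and the generation prompt are left out, the `<|im_end|>` terminator is assumed from the usual ChatML layout, and `render` is a made-up helper name.

```python
def render(messages: list[dict], preserve_thinking: bool = False) -> str:
    """Render messages to one prompt string, mimicking the preserve_thinking branch."""
    # Index of the last user message; thinking after it is always kept.
    last_query_index = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"), default=-1
    )
    out = []
    for i, m in enumerate(messages):
        body = m["content"]
        reasoning = m.get("reasoning_content", "")
        keep = preserve_thinking or i > last_query_index
        if m["role"] == "assistant" and reasoning and keep:
            body = "<think>\n" + reasoning + "\n</think>\n\n" + body
        out.append("<|im_start|>" + m["role"] + "\n" + body + "<|im_end|>\n")
    return "".join(out)


if __name__ == "__main__":
    turn1 = [
        {"role": "user", "content": "Fix the bug"},
        {"role": "assistant", "reasoning_content": "Look at foo()", "content": "Done"},
    ]
    turn2 = turn1 + [{"role": "user", "content": "Now add a test"}]
    for flag in (False, True):
        cached = render(turn1, flag)   # roughly what sits in the KV cache after turn 1
        nxt = render(turn2, flag)      # the prompt sent for turn 2
        print(flag, nxt.startswith(cached))
```

When the new prompt is not a pure extension of what was cached, everything from the first differing character onward has to be processed again; keeping the thinking keeps the prompt a pure extension.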
Comment by dnautics 1 day ago
Comment by nl 1 day ago
https://sebastianraschka.com/llms-from-scratch/ch04/08_delta...
Comment by LoganDark 1 day ago
I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)
Comment by verdverm 1 day ago
Comment by fjdjshsh 1 day ago
What does this mean in June 2026 wrt coding?
To me it sounds like being a "rice cooker skeptic". Some people don't like using rice cookers, some do.
Comment by svantana 21 hours ago
Comment by femto113 1 day ago
Comment by deadeye 20 hours ago
Comment by luipugs 1 day ago
Comment by secult 1 day ago
Comment by luipugs 22 hours ago
Comment by incrudible 23 hours ago
Comment by bluGill 19 hours ago
Comment by incrudible 17 hours ago
Comment by bluGill 17 hours ago
Comment by sfn42 22 hours ago
For example you can just tell it to make a website for a business with a webshop and it'll just generate thousands of lines of code and you have no control over anything. Or you can spend hours/days writing the specification and then have it generate it.
Or you can do what I do and work iteratively one feature at a time making sure everything is exactly the way you want it. I generally solve the problem myself then tell it what to do, or if I'm not sure what the best solution is I might discuss with the AI until we agree on a plan and then have it execute it. Often this leads to me learning useful things, like it will suggest a tool/feature that I didn't know about that's perfect for my usecase or it will identify a problem in my plan that I wouldn't have found until after spending hours on the implementation.
I've always been very detail oriented and I care a lot about code quality, I want my solutions to be clean, consistent and as simple as possible while solving the problem. To me, AI tools let me do that more quickly and better, it's not a compromise it's just flat out better in every dimension. It's about how you use it.
A lot of people seem to think that it's a binary choice, either hand craft a high quality bespoke solution or just vibe code a pile of trash. There's a whole spectrum in between those two, and I think there's a sweet spot where you still maintain control and understanding, it's just much faster and the result is actually better because it's not just you and the knowledge in your brain it's also the AI that practically knows everything - it will teach you things and suggest solutions you wouldn't have thought about, it makes you a better developer. It's a force multiplier and the smarter you are the better you will be at using it.
It's not a replacement it's an enhancement. It's like imagine a developer with Google vs one without, obviously the one with Google will be better because they have access to more information. The AI is like automatic google that just googles everything all the time, things you wouldn't have even thought to Google or things you couldn't possibly formulate a good search term for. With AI you can just show it a screenshot or describe an issue in detail and get a really solid answer a lot of the time. It's like having an expert on standby all the time, sure it's sometimes wrong but most of the time it's not and if you're smart you'll recognize when it isn't.
I'd say anyone who isn't using AI today isn't using their full potential. I don't see how anyone could possibly perform better without this tool than with it. I do see how someone who doesn't care could produce a lot of slop, but the people who refuse to use it aren't that guy. That guy has been using it to produce slop for years already. You can use it to produce top quality code if you choose to.
Comment by HWR_14 1 day ago
Comment by Iolaum 1 day ago
Comment by mahadevank 1 day ago
Comment by adyavanapalli 1 day ago
I didn't do much benchmarking, but anecdotally, I found it to be making fewer edit errors. YMMV
Comment by pieterk 1 day ago
Comment by ojr 1 day ago
Comment by ClikeX 23 hours ago
Not sure if you intended this to be this philosophical, but this is basically the slogan for modern life now.
Comment by gwerbin 20 hours ago
Comment by disqard 1 day ago
Not everyone can plough $$$$ into hardware right now (more power to those who can), so choosing to rent is an A-Ok strategy.
Comment by tpm 1 day ago
Comment by _zoltan_ 1 day ago
You can. You just don't want to. Huge difference.
Comment by monooso 21 hours ago
You may be, but the topic of discussion is whether anyone is using a local model as their main coding tool.
Comment by _zoltan_ 20 hours ago
Comment by tpm 18 hours ago
For corporate use, if the corporation would break the law sending anything to the open internet or to the US, then you can't use any model that's not hosted in house. And there are many such cases.
Comment by ihateolives 19 hours ago
Comment by danans 1 day ago
And sounds like you haven't factored in the cost of electricity to run that Mac Studio as an LLM machine. Probably get a few more years.
Comment by electronsoup 1 day ago
I find that running better quantization, like Q8, tends to prevent this. Even though it's a bit slower to run, it saves time overall with less churn.
Using 3.6-27b is even slower again than 3.6-35b, but I find the accuracy really pays off
Comment by girvo 1 day ago
Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but that's literally not what I use these things for!
One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.
There's also a regression in llama.cpp wrt. Step Flash, where quantisation is getting worse KLD and Perplexity than it otherwise was previously, for the exact same quants. Very odd, but it's being looked into at least!
Comment by gwerbin 20 hours ago
What's your evaluation setup like? It sounds like maybe the best thing to do is have a realistic evaluation that resembles your actual intended workload and workflow, and then just try everything.
Comment by girvo 10 hours ago
That is quite literally what I have setup :)
I have a few codebases I've written over the years that I attempt a suite of specific tasks: code analysis/bug finding, bug fixing, adding features, that kind of thing. I keep track of the results, including wall clock time
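A setup like that could be tracked with something as small as the following sketch. The structure is an assumption for illustration, not the commenter's actual harness: `attempt` stands in for whatever runs the agent on a task and checks the outcome.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskResult:
    model: str
    task: str
    passed: bool
    seconds: float  # wall clock time for the attempt


def run_task(model: str, task: str, attempt: Callable[[str, str], bool]) -> TaskResult:
    """Run one task with one model and record pass/fail plus wall clock time."""
    start = time.perf_counter()
    passed = attempt(model, task)
    return TaskResult(model, task, passed, time.perf_counter() - start)


def summarize(results: list[TaskResult]) -> dict[str, tuple[float, float]]:
    """Map each model to (pass rate, total wall clock seconds)."""
    summary = {}
    for model in sorted({r.model for r in results}):
        rows = [r for r in results if r.model == model]
        pass_rate = sum(r.passed for r in rows) / len(rows)
        summary[model] = (pass_rate, sum(r.seconds for r in rows))
    return summary
```

Recording wall clock time next to pass/fail is what surfaces models that eventually succeed but burn a long time looping on the way.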
>Do you think the choice of quantization matters that much for other models
It hugely matters. Lots more than r/LocalLlama would have you believe, sadly. Some model architectures can handle more aggressive quantisation than others, and it's hard to know ahead of time.
Step handles it surprisingly well (sparse MoE models seem to generally, when the particular layers are chosen to be quantised carefully). Qwen 3.6 27B handles it okay, but FP8 was better... except annoyingly Qwen's official FP8 has worse KLD/perplexity numbers/accuracy than it otherwise should. RedHat's one was better in my testing, though not by a huge amount.
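For anyone unfamiliar with the two metrics mentioned here, this is a minimal sketch of how they are usually defined. It assumes you already have next-token probability distributions from a reference model and a quantised model, and it is not tied to any particular tool's output format.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for one next-token distribution.

    p: reference (e.g. full precision), q: quantised.
    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-probability assigned to the actual tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    reference = [0.7, 0.2, 0.1]
    quantised = [0.6, 0.25, 0.15]
    print(kl_divergence(reference, quantised))
    print(perplexity([math.log(0.25)] * 4))
```

KLD measures how far the quantised model's distribution drifts from the reference at each position, so it can catch degradation that an averaged perplexity number hides.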
Comment by rhdunn 18 hours ago
I have a custom assert for loop/repeat detection that works well:
from typing import Any

def count_repeats(text: str, length: int) -> int:
    # Count how many times the final `length` characters repeat back-to-back at the end of `text`.
    n = len(text)
    pattern = text[n - length : n]
    count = 1 # Include the end of the string as matching the substring.
    text = text[: -length]
    while text.endswith(pattern):
        text = text[: -length]
        count = count + 1
    return count

def repeats(output: str, context: dict[str, Any]) -> dict[str, Any]:
    # Passes when some trailing substring repeats at least `threshold` times.
    threshold = context.get('config', {}).get('threshold', 3)
    count = 0
    length = 0
    # Try every candidate pattern length and keep the one that repeats the most.
    for n in range(1, (len(output) // 2) + 1):
        n_count = count_repeats(output, n)
        if n_count > count:
            count = n_count
            length = n
    if count >= threshold:
        return { 'pass': True, 'score': 1.0, 'reason': f'Output repeats {count} times with length {length}.' }
    else:
        return { 'pass': False, 'score': 0.0, 'reason': f'Output doesn\'t repeat {threshold} or more times.' }

def no_repeats(output: str, context: dict[str, Any]) -> dict[str, Any]:
    # Inverse of `repeats`: passes when the output does not end in a loop.
    result = repeats(output, context)
    result['pass'] = not result['pass']
    result['score'] = 1.0 - result['score']
    return result
Just add it to your promptfooconfig.yaml:
defaultTest:
  assert:
    - # ----- The output doesn't repeat/get stuck in a loop.
      type: python
      value: file://asserts/repeat.py:no_repeats
Comment by ttoinou 23 hours ago
Comment by girvo 20 hours ago
Ds4 is impressive for what it is, but it loops and over thinks even more, burning massive wall clock time to not even get great outcomes. It’s also limited to a slow speed on my Spark
Comment by ttoinou 20 hours ago
Comment by girvo 10 hours ago
Step 3.7 is notably better than 3.5
1. Use the official StepFun GGUF, IQ4_XS - theirs is better tuned in my experience than the other quants
2. Temp 1.0 top_p 0.95 sampling parameters for reasoning/agentic coding
3. It's really quite important that you don't quantise the KV cache: it made a surprising amount of difference to the looping and over thinking I found, at least for the quantised version of the model. I'm using the full F16 for K, and Q8 for V
4. Note that it now supports `reasoning_effort: low|medium|high` in your chat_template_kwargs; this is super useful :)
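The settings in this list could be collected into a llama-server command line roughly as follows. This is a sketch under assumptions: the flag names are the llama.cpp ones as I understand them (check `llama-server --help` for your build), the model path is a placeholder, and `llama_server_args` is a made-up helper.

```python
import json


def llama_server_args(model_path: str, reasoning_effort: str = "medium") -> list[str]:
    """Build an argv list for llama-server using the sampling and KV cache settings above."""
    if reasoning_effort not in {"low", "medium", "high"}:
        raise ValueError(f"unsupported reasoning_effort: {reasoning_effort!r}")
    return [
        "llama-server",
        "-m", model_path,                 # placeholder path to the GGUF file
        "--temp", "1.0",
        "--top-p", "0.95",
        "--cache-type-k", "f16",          # keep K unquantised
        "--cache-type-v", "q8_0",         # Q8 for V
        "--chat-template-kwargs", json.dumps({"reasoning_effort": reasoning_effort}),
    ]


if __name__ == "__main__":
    # Print the command instead of launching it; pass the list to subprocess.run to start the server.
    print(" ".join(llama_server_args("step-3.7-flash-IQ4_XS.gguf", "high")))
```

Building the arguments as a list avoids shell quoting problems with the JSON passed to `--chat-template-kwargs`.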
Comment by kristopolous 1 day ago
It's what I use. Fixes the problem
Comment by stared 23 hours ago
Is it that in your case is it different?
Comment by ltononro 1 day ago
Comment by Greenpants 1 day ago
I do (unscientifically) experiment whenever a new capable local LLM (<=130b) releases with a license that permits commercial use. As for knowing my models require more work than Opus, I don't mind still having to puzzle on getting the architecture right. In any case, it forces me to stay in the loop of what's being built, which is a good thing.
Comment by ltononro 19 hours ago
Another POV is that most of the code written in most of my codebases was generated by Codex/Claude, so they would be "stealing data from themselves" in a sense.
I've been working with Transformers/LLM training in 2018-2021 and then now, more recently again. Things are far different. I think they would be more interested in the "how" you got your code to be satisfactory with your guidance than the actual code generated. But mostly I personally trust that they are not really using my trajectories for that (unless I explicitly allow it in the configs)
Comment by kordlessagain 1 day ago
Comment by dumbfounder 8 hours ago
Comment by psychoslave 1 day ago
I'm not familiar with Pi, and not sure which kind of container you are referring to. Something mainstream like docker, or more classic like a BSD jail?
I started to experiment with local LLMs, through ollama and Lemonade. Enough to throw simple prompts with code excerpts and get small-scope code refactors. Though I still struggle to make them work with external tools, like my IDE, so that they can be used at an agentic level with access to a full repository.
That's mainly for work, as they push for using LLMs, though with the new Copilot license they provide, it doesn't even take me a week to burn the whole token credit.
The tool can be useful, but in my experience not without heavy guard rails and loops over tests. I suspect the latest models also burn many tokens down rabbit holes of nonsense hypotheses, instead of doing a straightforward correct implementation as you would expect from any entity that has consumed such huge cumulative resources and has such an experimental playground to draw on. Maybe incentives don't push model providers to minimize tokens sold; maybe the beast is just so hard to tame that all these bright minds with virtually infinite resources are not good enough.
Anyway, sorry for the digression, but I would be extremely interested in a step-by-step tutorial on making a local LLM work at an agentic level, including which kind of hardware is required to make it work properly.
Comment by geophile 1 day ago
Yeah, that edit inability is weird. I’ve updated AGENTS.md to limit editing (as opposed to rewriting) and that helps a little.
Comment by westoque 1 day ago
that's why i use the frontier models, because it's a senior co-worker vs a junior. if you use the junior for the sake of privacy i think you're missing out on the best insights for a specific task.
Comment by physix 1 day ago
Consumer-grade subscriptions of the frontier models give you superb capabilities per dollar, them being heavily subsidized. But if you're working in an enterprise setting, that won't work. You need to upgrade, and that gets significantly more expensive.
Furthermore, basing the SDLC on leveraging the bargain subscriptions risks falling apart in the future, both from a cost perspective as well as the question of availability (e.g. Mythos).
So from a strategic perspective, going local on the LLM and still achieving great results with the right approach is very relevant.
Comment by willisrocks 1 day ago
Comment by bxk76 1 day ago
Comment by robertlagrant 23 hours ago
Comment by throw10920 18 hours ago
I've wanted the latter quite a bit for Pi, because weaker models like Deepseek V4 have extreme issues with obeying prompts (e.g. I'll instruct it to find a bug but not fix it, and it'll "helpfully" try to fix it anyway), so having a "read-only mode" actually backed by the OS would be very useful.
Comment by SeriousM 12 hours ago
Comment by pieterk 1 day ago
Maybe even more useful than Opus when I have all the constraints to an issue. There is less "knowledge" in the model (I get by with 48GB of RAM allocated to an 8b quant), so it has fewer things to hallucinate about.
I've been getting to know its limits pretty well over the last few weeks and would say it's an excellent code search/replacement/generation* engine.
It's got the "in-context script generation" flow down as well, so it will easily help automate tasks that you describe with text and perhaps example commands, or tools, or skills* that you provide.
*Think of it + Pi as an NLP abstraction layer over grep, or a shell, rather than a jack of all trades + world knowledge all-in-one.
Comment by gwerbin 1 day ago
All of these models also seem to get stuck in long thinking loops, sometimes tripling the tokens of a frontier closed model which is really painful when inference is already on the slow side (on my Macbook).
Comment by 0xbadcafebee 1 day ago
Comment by hparadiz 1 day ago
Comment by bluerooibos 1 day ago
Hold on, what are the specs of your rig? How much RAM?
I've been considering getting an old refurbished 2018 Mac Mini with 64GB of DDR4 RAM but everything I've read suggests this will be way slower than my 16GB M1 Pro MacBook.
Comment by linzhangrun 1 hour ago
Comment by hparadiz 1 day ago
I've been meaning to write a blog post but well whatever here's the md.
https://gist.github.com/hparadiz/f3596d00a62d8ebb2dadcc46ee5...
Qwen3.5 9B performed best.
You can absolutely still use this to do some basic stuff like tell opencode to convert a video file from one format to another. But frankly you're better off getting two AMD GPUs. Say a dual 7900XT would get way better performance.
Comment by bandrami 1 day ago
So there's this really amazing program called "man"
Comment by hparadiz 23 hours ago
Comment by bandrami 23 hours ago
Comment by hparadiz 23 hours ago
Comment by cruffle_duffle 19 hours ago
Comment by gmac 23 hours ago
Comment by bandrami 7 hours ago
Comment by ololobus 19 hours ago
Yes, you surely can read man, docs, whatever, then DIY. The point is that in many areas people don’t really want to become an expert, like in ffmpeg cli arguments, they just want the work to be done. Above is an example of agent being able to do it locally, and I think it’s great
Comment by MoonWalk 16 hours ago
I've read a bit on what the various components are. What I don't see in your comment is what you're using to run your model locally. Ollama?
Comment by dotancohen 1 day ago
> you really need to know what you're asking, and be precise
Any chance that you could share some recent prompts to give other HNers a head start on how to approach Qwen? If you are uncomfortable posting them here, my Gmail username is the same as my HN username.
Thank you.
Comment by Greenpants 1 day ago
For the time being, off the top of my head, I'd say:
- Prompt Engineering tips & tricks apply here (like being complete in the relevant context you provide in your question, and the specific task(s) the agent should do like reasoning, modifying one file, or trying to fix a complex task all at once (not recommended)).
- If you already know which files the agent should look into, mention them to save time and potentially context.
- In my personal workflow, I write down lots of atomic TODOs needed to solve a problem. As I write it down, I'll notice assumptions I'm making, or the fact that the TODO could still be decomposed further into (atomic) subtasks.
- It's best to get a feeling yourself for how Qwen handles your repository. I noticed if I don't specify an architecture for development, it'll make quick & dirty fixes. If I don't tell it to remove debug statements, it won't. This is what was meant with "be precise" – Claude Opus might think for you and act in your best interest. Smaller Qwen models will just do what you ask them to, and no more. They have design knowledge, but you have to explicitly ask them to "activate" that part of their knowledge.
Comment by thefossguy69 17 hours ago
Comment by dotancohen 21 hours ago
I look forward to that blog post!
Comment by tsss 19 hours ago
Comment by jmuguy 1 day ago
Comment by Greenpants 1 day ago
Two years later, and I'm running Qwen3.6 35b agentically to develop the start of a repository and automatically run tests to then improve on itself. I never thought we'd get here so quickly with LLMs back then.
I'm pretty sure in two years we'll have current Opus-like quality in the 30-100b parameter model range. But at that point, Opus 6.3 will reason along for us so much better still, that we'll still look at those models in awe. It's great to look ahead, but let's not forget to appreciate how effective the current local models already are :)
Comment by jmuguy 1 day ago
Comment by Greenpants 1 day ago
It's personal, but I prefer CapEx over OpEx for this. If you can purchase a device upfront that runs a decent local LLM, you get the peace of mind that your setup won't suddenly change over time and can only get better.
Comment by lambda 1 day ago
Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.
It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable, you're probably not going to want to leave Anthropic.
But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.
Comment by MrScruff 1 day ago
Comment by lambda 1 day ago
Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks.
Comment by mapontosevenths 1 day ago
I'm not affiliated, I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life.
Comment by lambda 1 day ago
Claude 4 Opus: https://youtu.be/J7omabtqnBM?t=193
Qwen 3.6 35B A3B: https://youtu.be/gVU-DQeqkI0?t=215
Qwen 3.6 produced far more working functionality than Claude 4 Opus did.
Obviously, just one test of a single one-shot prompt of a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.
Comment by MrScruff 1 day ago
Comment by lambda 1 day ago
Will this trend continue? Who knows. Both the frontier and local model will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago.
However, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system less than $10,000 with today's RAM and GPU prices"). The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run.
It's likely that there will always be a gap between frontier and local if you're comparing models at the same time, you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are.
Comment by shimman 1 day ago
For web development (or anything else with an extreme amount of training data) it's number one for sure. You can't beat it at its cost. US companies will not be able to compete in a competitive market, which is why they rely on so much US government protection + corporate welfare.
Comment by make3 13 hours ago
Comment by lambda 9 hours ago
But it is still available on Google Vertex according to OpenRouter (though it's possible that info is just out of date, it's currently quoting 3tps which is unusably slow): https://openrouter.ai/anthropic/claude-opus-4
Comment by zozbot234 1 day ago
Comment by computerex 1 day ago
OpenAI was offering 2x usage at one point and I still used opus just because it's so much more effective.
Comment by lambda 1 day ago
Anthropic has been releasing models named Opus since 2024 with Claude 3 Opus.
Opus has gotten vastly more capable since then.
Local models far surpass Opus 3. They even surpass Opus 4 on most benchmarks.
Sure, if you compare to the latest Opus 4.8 or even 4.6, they're not there yet. But there's a huge difference in performance between 4 and 4.8.
Comment by jkells 1 day ago
When I colloquially say Opus level I really mean Opus 4.5 or later
Comment by lambda 1 day ago
Comment by zozbot234 1 day ago
Comment by rvnx 1 day ago
More and more specialized and ultra-performant chips are going to flood the consumer market. Especially once new hardware foundries will start producing (well if we don't die from WW3 in the interval).
Ten years from now, when even basic computers have 128 GB of memory and phones have super-optimized tuned models, what will be the point of Anthropic?
Just use Gemma/Gemini/Siri or whatever.
Pornography and uncensored models is also pushing toward local models.
It's not like needs of people grows exponentially, the needs follow an asymptote instead (they are capped).
The real revolution is offline robots and self-driving cars, but LLMs are already quite maxed.
For programmers, now, what Anthropic offers is like 3% improvement on a known test (like this pelican riding a bicycle), or on questions leaked from benchmark insiders.
It's ok but not like revolutionary (Fable was better but it was unusable, easy 20 minutes per one prompt due to overthinking).
Comment by spullara 1 day ago
https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...
One thing I did change was the context length to 256k rather than 64k.
Comment by nicman23 1 day ago
Comment by awllau 1 day ago
Comment by Greenpants 1 day ago
Comment by motbus3 1 day ago
Comment by agnelnieves 8 hours ago
Comment by nyxtom 1 day ago
Comment by timmit 1 day ago
Comment by klardotsh 1 day ago
Full octane isn't gonna fit on much of anything south of a 128GB machine once adding KV cache.
[1]: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
Comment by GardenLetter27 1 day ago
Comment by lambda 1 day ago
And then also, sometimes the tool call errors are because of something like a file was changed out from under it; the larger model is probably going to do a better job of figuring that out and fixing it up.
Finally, in Pi, you can always just use the /tree command to skip back to before a series of failed tool calls, with a summary if you want to let the model know what happened. The Pi /tree command is pretty powerful in managing your context
Comment by everforward 1 day ago
Comment by Greenpants 1 day ago
I'll experiment more with the effectiveness of AGENTS.md rules for local Pi agents. I feel like smaller (local) LLMs just lack in attentiveness to elements in the context window, like precise instructions, compared to e.g. Claude models.
Comment by underdeserver 22 hours ago
You are paying for the extra power draw.
Comment by amelius 1 day ago
Comment by nozzlegear 1 day ago
Comment by q3k 1 day ago
We truly live in the dumbest timeline.
Comment by rjblackman 1 day ago
Comment by yieldcrv 1 day ago
matches my experience and a deal breaker
also the context window sizes are too low. I can't operate in 65,000 windows any more because even just reading the code's file structure overruns it and gets me nowhere. Definitely its own art form.
200k context windows and above for me now
I saw a paper last night that should help this a lot though
Comment by Greenpants 1 day ago
In Pi, /new is my best friend and most-used command for sure. For simple tasks (I decompose complex ones anyway since I don't trust small local LLMs to do this for me), the model doesn't need much context, given that I'm proficient in my codebase myself: "I'd like Feature X. Look into files 1, 2 and 3 to make your edits."
Comment by kennywinker 1 day ago
Comment by p0w3n3d 1 day ago
Comment by krainboltgreene 1 day ago
I don't want to be rude, but your linkedin has a sumtotal (generous) of like 8 months of programming as a profession (job title is AI Engineer). The rest is at best programming adjacent. How would you know what either of these situations are really like?
Comment by SoftTalker 1 day ago
Comment by krainboltgreene 1 day ago
Comment by animanoir 1 day ago
Comment by nobody_r_knows 1 day ago
Comment by horsawlarway 1 day ago
I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.
I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.
To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.
For my personal needs, free beats $100/m.
I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).
Some example projects
- Replacement launcher for android tvs (with usage monitoring and tracking for kids)
- Custom admin portals for my k8s cluster services
- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)
- Grocery list management and meal planning (mostly via openclaw)
- some custom workflows for 3d asset generation in comfyui.
---
Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.
Comment by rootlocus 1 day ago
Comment by overgard 1 day ago
Comment by booi 1 day ago
Comment by reddalo 1 day ago
Comment by oofbey 1 day ago
Comment by horsawlarway 1 day ago
When I bought, I paid $850 a piece. And I needed one anyways for the gaming I was going to do.
My guess is the next good time to buy is going to be 24-36 months from now, depending on how the AI bubble goes.
---
I'll add to this, I personally don't like Apple hardware (not so much related to the hardware as their company philosophy) but their machines with unified memory (or AMDs latest unified memory offerings) get pretty equivalent speeds to my 3090s, and are probably a much better modern entrypoint to local llms.
There's a reason the joke is that Silicon Valley software devs bought up all the Mac minis for OpenClaw.
You can get a 48gb unified RAM M4 pro mac mini for ~2k. If you're not going to do much else with the machine, it's what I'd pick as my budget inference device right now. Spend a year of claude now, get ~150tok/s for the next decade (plus) for ~free.
If you want more capable and are willing to spend a little more, go with the newer Ryzen AI Max+ 395 machines.
You'll spend less on power too.
My last suggestion would be to go buy an RTX3090 at this point. You can do a lot better for a lot cheaper.
Comment by tracker1 1 day ago
Comment by felooboolooomba 1 day ago
Comment by lloyd-christmas 1 day ago
Comment by freetonik 1 day ago
Comment by augusto-moura 1 day ago
Comment by drnick1 1 day ago
Comment by arcanemachiner 1 day ago
Comment by overgard 1 day ago
Comment by davkan 1 day ago
Comment by justaj 23 hours ago
Comment by yurishimo 20 hours ago
If you set the target resolution to 1080p, not much changes in the render pipeline except that final upscaling step. To get better quality, the lower resolution is bumped up so there is more data to work with for the upsampling, but the scaling performance can be very hit or miss depending on the game, as the engine itself often plays a huge role in rendering performance.
As far as rendering the 1080p image at 4k, yea it works fine, but there will always be little artefacts that remain for those looking for them. 1440p seems to be the sweet spot for gamers today, but 4k is really nice for when you're not gaming as most online video is now made for dual use on televisions.
Comment by lowbloodsugar 1 day ago
Comment by googletron 1 day ago
Comment by kakacik 1 day ago
Also, 2 gens old means bad performance at ray tracing, abysmal path tracing if at all. Pretty sure it can't run smoothly CP2077 in native 4k without dlss upscalers with all on ultra.
Comment by himata4113 1 day ago
Comment by irishcoffee 1 day ago
Comment by matheusmoreira 1 day ago
We should own things, not rent them. We should all do what we can to keep the fabled 2030 agenda at bay.
Comment by driverdan 1 day ago
Comment by jmuguy 1 day ago
Comment by tripleee 1 day ago
How do AMD cards perform with LLMs? A 9070 is sold for ~$600 and has 16GB VRAM
Comment by overgard 1 day ago
Comment by toyg 1 day ago
In a quickly moving field, it's amazing how much money one can save by overcoming FOMO and not living on the bleeding edge. It's like waiting for Steam sales, the games will be just as good.
Comment by overgard 16 hours ago
Comment by toyg 15 hours ago
I have a 5080 too! For me, the key has been dropping Ollama for Llama.cpp, which is not particularly scary to configure anymore and just skyrocketed performance. I download the models with LM Studio, then run them with llama.cpp.
Comment by lambda 1 day ago
16 GiB won't fit you much, so you'd probably want at least 2x, and preferably 3x of those, and then you need a motherboard, power, etc. that can handle that.
Comment by picofarad 20 hours ago
Comment by throwawayffffas 19 hours ago
Comment by nyrikki 1 day ago
Comment by jononor 15 hours ago
Comment by fluoridation 1 day ago
Comment by flowerthoughts 1 day ago
Comment by sieabahlpark 1 day ago
Comment by kpw94 1 day ago
Since you're running quantized (at UD-Q4_K_XL) , check out the "qat" models (unsloth/gemma-4-26B-A4B-it-qat-GGUF) !
- https://huggingface.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF (With "Jun 9 Update: Added MTP support.")
- https://blog.google/innovation-and-ai/technology/developers-...
Comment by me_bx 1 day ago
> Quantization-Aware Training (QAT) [...] allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model
Comment by SubiculumCode 1 day ago
Comment by twothreeone 1 day ago
I've actually tried this exact same model locally as well.. albeit on just a single 3090 at 128k context and I got around 40-60tok/s with Q4_K quantization.
The thing that bugged me the most was really the quality of the output on moderately complex real-world coding tasks. Having to switch between "prompt/vibe" and "manually implement" is such a big context switch burden, because you really have to ask yourself every few minutes if you're "holding it wrong" or the model is just too stupid.
It also doesn't really seem to handle transitions from "low-level implementation detail" to "high-level design" well, e.g., it wouldn't easily render tables and such. With Claude I don't have this issue.. so I think for now my verdict would be that it's not really a viable replacement. I really hope it will be in a few months time.
Oh and I used "aider" to replace claude CLI, which maybe that's also sub-optimal.. I'm not sure. The MCP marketplaces are useful of course, though arguably you could just manually replace them over time.
Comment by horsawlarway 1 day ago
It's prone to thinking longer and more repetitively, again - it's definitely not opus 4.7/4.8.
I've been using pi.dev as my harness for it, and been pleasantly surprised by how nice it feels (I have used aider, but only very briefly and quite a while back - so I can't realistically compare).
I would say it's roughly where I felt claude was a year back - Most of the sessions need to be more "pair programming" and less "I let it run for hours".
I'm a big fan of frequent "human in the loop" style workflows even when I'm on something like opus at work, though. I have opinions about lots of things, and re-inforcing that the model should stop and ask frequently seems to get me considerably better output, without having to "re-roll" if you will.
I've done a good bit of management, and I think it's roughly producing what a junior dev might produce in a day every 5 minutes. And just like a junior dev, you need to be steering it back on track fairly often.
Opus feels more like a mid-level at this point. I can hand it a chunk of work and "leave" but I still get better output if I'm checked-in and watching/steering.
Comment by unethical_ban 1 day ago
I've used Claude Opus to quickly and effectively pound out some 100-200 line scripts that integrate with a vendor's API, and it one-shotted them both almost perfectly.
I wonder if for a lot of these local models, the scope of the AI assistance should simply be smaller: You architect the tools and the function definitions, and then tell AI to implement one at a time? Does anyone do that rigorously?
Comment by cruffle_duffle 8 hours ago
Comment by NamlchakKhandro 23 hours ago
Comment by twothreeone 15 hours ago
Comment by Applejinx 21 hours ago
I regularly use a pocket calculator: either a physical one, or Apple's Calculator app if I want more decimal places. 'Too stupid' isn't a thing for me if it can cough up some math that would be inconvenient for me to work out by hand. tok/s also isn't a concern because I'm not expecting more than I can read. My ideal scenario would be occasional diversions into querying a 'coding calculator' that can give examples along the lines I want, my way.
I'll make a mental note that Qwen shows signs of being the kind of calculator I'd use for a specific task. Context switching isn't a burden if you're not looking to switch over to vibe/manage and stay there.
Comment by twothreeone 15 hours ago
Is it _objectively_ more productive? I doubt there's a clear-cut answer in the long run (my main suspicion is that since you're essentially creating 10x unnecessary complexity, you'll likely never recover from all the cruft and maintenance kills you in the end - maybe people will find solutions for that though).
Comment by gonzalohm 1 day ago
Comment by horsawlarway 1 day ago
A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.
Sometimes that matters, a lot of times it doesn't.
On the speed front - MOE models are great. Biggest perf difference in modern models is the move to MOE architectures.
I get very similar quality from the both the Gemma-4 31B dense model, and the Gemma-4 26B MOE model (both at Q4 quant) but the MOE version runs at ~3 times the speed (150tok/s vs 46tok/s).
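A rough way to see why the MOE version decodes faster: single-stream decoding is usually limited by how many weight bytes have to be read per token, and an MOE model only reads its active parameters. The sketch below is a back-of-envelope ceiling, not a benchmark. The 4B active parameter count comes from the "A4B" in the model name; the 4.5 bits per weight for a Q4-style quant and the 936 GB/s memory bandwidth (the RTX 3090 spec sheet figure) are assumptions plugged in for illustration, and the function name is made up for this example.

```python
def decode_ceiling_tok_s(active_params_billions, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if every active weight is read once per token.

    This ignores KV-cache reads, compute time, and multi-GPU overhead,
    so real throughput will land below this number.
    """
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8.0
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    BITS_PER_WEIGHT = 4.5   # assumption: typical for a Q4-style quant
    BANDWIDTH_GB_S = 936.0  # assumption: single RTX 3090 memory bandwidth

    dense = decode_ceiling_tok_s(31.0, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
    moe = decode_ceiling_tok_s(4.0, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

    print(f"dense 31B ceiling: {dense:.0f} tok/s (observed in the comment: 46)")
    print(f"MOE, 4B active ceiling: {moe:.0f} tok/s (observed in the comment: 150)")
    print(f"ceiling ratio: {moe / dense:.2f}x, observed ratio: {150 / 46:.2f}x")
```

The ceiling ratio is just the ratio of active parameters, so it overstates the gain; the observed ~3x is what is left after the overheads this estimate ignores.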
Comment by mirekrusin 1 day ago
Comment by l332mn 1 day ago
Comment by agup792 1 day ago
Comment by anhtqweb 1 day ago
Comment by codinhood 1 day ago
Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus is just not worth it right now. If it were, it would be disruptive enough to be in the news.
Not that I'm ruling out that someone has already solved this; I'm just trying to Occam's-razor my way out of diving too deep down rabbit holes.
Comment by pyeri 1 day ago
The present Sonnet/Opus versions (~4.8) will likely be what everyone in the enterprise might end up using eventually. And even though local models aren't there yet, there are budget alternatives from the families of DeepSeek, Kimi, GPT, MiniMax, etc. available through APIs of NVIDIA, OpenRouter, Groq, etc., which are very much Sonnet grade.
Comment by codinhood 1 day ago
Personally, I don't think we're at that point yet. While I do think model improvement is starting to plateau (reaching a local ceiling), I'm not convinced local models are as good as sonnet/opus yet. The gap is still too much. But I'm excited for those models to reach those levels.
Comment by bob1029 1 day ago
At my current pace it would take me until sometime late 2030 to spend the same amount in gpt5.5 tokens.
Comment by anonzzzies 23 hours ago
Comment by jonfw 21 hours ago
Remember that there are other LLM providers, open models, and previous-gen models that are way cheaper than frontier Claude and still way better than what can realistically run locally.
Comment by mark_l_watson 1 day ago
With a layered approach we can slowly shift to running more locally and still get required work done. Really, my local setup is so much better than it was 2 months ago, and extremely better than 6 months ago - on the same hardware.
Comment by sakopov 1 day ago
Comment by phyzix5761 1 day ago
Comment by Gigachad 1 day ago
Eventually I think it will even out but right now the hosted stuff is very subsidised.
Comment by kristopolous 1 day ago
It's not really a bitter lesson here, I can scale those 4B models easier than someone can scale their 1000B models.
Comment by gunapologist99 1 day ago
If you truly believe that it WILL get there within the next couple of years, then you might as well start playing with it now (and, yes, you will be very surprised, especially for shorter/smaller projects or nicely modularized larger projects)
Comment by MadrasThorn 1 day ago
Comment by jrm4 1 day ago
I think it strongly remains to be seen whether e.g. tokens per second (multiplied or whatever by perceived quality of private model) actually means "better or more useful output."
I strongly suspect it does not. (though I also strongly suspect this will be very difficult to measure because the incentive to lie about metrics here will be so strong.)
Comment by codinhood 1 day ago
What I’m saying is that if local models were actually comparable to Claude Code in practice, we wouldn’t be having threads like this. It would be obvious to the people using them, and it would be massively disruptive. Why would individuals and companies pay hundreds or thousands for Claude Code if they could run something locally and consistently get similar results?
Every month I revisit the local ecosystem hoping the answer has changed. So far, my experience has been that it hasn’t.
Comment by jrm4 1 day ago
It's entirely possible Claude is just winning the hype game.
Comment by robertlagrant 23 hours ago
Comment by jrm4 18 hours ago
Comment by robertlagrant 17 hours ago
Comment by jrm4 9 hours ago
Comment by Rastonbury 1 day ago
Comment by NamlchakKhandro 22 hours ago
Claude Code is not Claude Opus/Sonnet/Haiku.
Comment by reassess_blind 20 hours ago
Comment by bluejay2387 1 day ago
Other Notes: I have had to set the compact target to 75% on a 256k context window as once the conversation length goes about 100k I start seeing a drop in the quality and speed. This becomes very problematic after about 150k. I tried Qwen 3.5 122b too but it actually seems much worse at coding than 3.6 27b even though its much larger. Maybe because I am using a 4bit quant or maybe I just don't have it configured correctly? I know 3.6 is newer but I didn't expect it to out perform a model that is much larger from the prior generation. Gemma 4 31b is a good model for other tasks but at least my personal experience is that Qwen outperforms in coding. Nemotron Super 120b is great at a lot of stuff but it also seems to be not as good at coding as Qwen. This was very surprising to me.
Comment by heipei 1 day ago
I have become so "lazy" (in a good way), so far that I've started using the model for lots of daily mundane things on top of just coding:
* "commit this on a branch, push, create a PR and assign $nickname for review"
* "Use the Stripe CLI to download all open and overdue invoices and reconcile them with this CSV export from our bank account."
* "Use these Elasticsearch credentials to summarise what kind of operations are causing load at the moment."
* "Tell me if our codebase already supports X and where it's implemented."
Comment by amarshall 1 day ago
Comment by heipei 1 day ago
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 0 --metrics --jinja --chat-template-file chat_template.jinja --chat-template-kwargs '{"preserve_thinking": true}' --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-p-min 0.75 -ngl 99 -c 131072 -fa on -np 1 -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q6_K
Comment by lloyd-christmas 1 day ago
Comment by bo1024 1 day ago
Comment by user43928 1 day ago
It feels like anything less than Sonnet is just a waste of time, apart from use as a smarter search function.
It also strikes me as strange that you would mention Codex for UI polish, as it's notoriously bad at UI, and far behind Claude Opus. Altman specifically posted that they are working to improve this for the next model release.
Comment by sejje 1 day ago
All the drudgery.
Comment by user43928 1 day ago
I almost find it offensive when colleagues open a MR with an obvious slop description that's frequently inaccurate.
That said, I find AI useful for a lot of drudgery like resolving merge conflicts or splitting changes out into separate MRs.
Particularly with the latter I had issues with small models, they butchered the changes I wanted moved. Not even on the second attempt did GPT 5.4 mini manage to move 10-20 lines to another file without modifying them in the process.
Comment by htrp 1 day ago
Comment by amarshall 1 day ago
The trade-off of MoE is that it is worse but faster for the same total size.
Comment by electronsoup 1 day ago
Comment by pierotofy 1 day ago
Comment by jacobgold 1 day ago
That sounds great for hobbyists but IMHO it wasn't until Opus 4.6 was released six months ago (Dec 25, 2025) that we had a model good enough for professionals to use as a primary driver of their coding agents. That seems to be the threshold worth aiming for.
Comment by sbrother 1 day ago
Comment by deaux 1 day ago
Comment by dnautics 1 day ago
in my stuff now i use an OT library that claude put finishing touches on in September.
Comment by storus 1 day ago
Comment by kelnos 1 day ago
Certainly I get a ton more value out of Opus today, but I could absolutely see someone deciding to limit themselves to 8-to-12-months-ago Opus performance for privacy (or other) reasons.
Comment by alexandra_au 1 day ago
Comment by jacobgold 1 day ago
Comment by Projectiboga 1 day ago
Comment by pierotofy 1 day ago
Comment by jacobgold 1 day ago
Comment by vector_spaces 1 day ago
Regardless I don't think it's fruitful to be so condescending with such little insight into this person's situation. Even if you had total insight -- let people be and withhold your judgement, or at least keep it to yourself. Making people feel stupid is a great way to turn people off to pretty much anything else you have to say
Comment by pierotofy 1 day ago
Comment by jacobgold 1 day ago
Comment by lokar 1 day ago
Comment by epolanski 1 day ago
You must be the type of crowd that writes websites with React and Tailwind and pretend to be engineers and have an opinion on everything.
Comment by trueno 1 day ago
i always see great debates with local stuff but the space is constantly moving goalposts and all the vernacular is pretty unfamiliar to me. i'd love to understand what people with objective experience feel they've traded away (or gained) when going local so i can determine for myself if these things are a good fit.
Comment by brycesub 1 day ago
Comment by __mharrison__ 1 day ago
(Shouldn't have done that refactoring job in high mode)
Comment by trueno 1 day ago
Comment by lostlogin 1 day ago
Comment by htrp 1 day ago
Comment by dirkolbrich 1 day ago
Comment by atomicnumber3 1 day ago
Comment by pierotofy 1 day ago
Comment by akulbe 1 day ago
Comment by monirmamoun 1 day ago
Comment by daveidol 1 day ago
Comment by snake_n_my_boot 1 day ago
Comment by NamlchakKhandro 22 hours ago
Comment by lelandbatey 1 day ago
> "Quality is like running edge models from 8-12 months ago"
Don't expect Opus, expect more like Haiku. If you micromanage it, you'll get great results. If you want it to be a human in a box, it'll flounder.
Comment by dheera 1 day ago
I'm looking at https://ollama.com/search and the top few models like kimi-k2.7-code say "cloud" and I can't seem to ollama pull them.
I thought the whole POINT of ollama was not-cloud?
Comment by hoherd 1 day ago
Comment by satvikpendem 1 day ago
Comment by jmorgan 1 day ago
Comment by jubilanti 1 day ago
It was at first, then the developers realized they had a massive userbase they could monetize. A tale as old as open source...
Comment by toyg 1 day ago
Comment by dominotw 1 day ago
Comment by sosodev 1 day ago
If you're able to run a model on the scale of ~30B, you can find that with a reasonably scoped and well defined task they do very well. I've found both Gemma4-31B and Qwen3.6-27B to be the best in this range at the moment. You can swap in the MoE models for faster inference, but they are noticeably worse at most tasks. They can one-shot / vibe code tasks with small scope, but still do much better with guidance.
If you really want frontier-like capabilities, you'll probably need at least 128GB of memory and either huge compute or a lot of patience. Most people just don't have either the money or the patience to make these local models work.
The patience required for local model usage goes far beyond just waiting for tokens though. It takes a lot of effort to get things configured and working properly for your workflow and hardware.
Comment by argee 1 day ago
I don't think I'd be using AI to code at all if this weren't the case. (I don't want to feel stunted or stuck just from losing my internet connection.)
Comment by user43928 1 day ago
I did not expect perfect reliability, but I thought they could at least get it right on the second attempt once you point out the difference. No such luck, it confidently tells you that now the code is the same, with yet another subtle bug added in the difference.
I don't know what work one would need to do where these garbage-class models would be adequate. Maybe they can masquerade as competent for a few minutes, but in the end the results simply are not right. At best they are suitable for a smarter search or autocomplete, in my opinion.
Comment by Applejinx 20 hours ago
Of course this doesn't produce a useful person who always makes right choices, but isn't it interesting that you can compress that heavily and draw results out in such a casual way? Seems this remains relevant.
Comment by what 1 day ago
Comment by user43928 1 day ago
Comment by Kostic 1 day ago
Qwen running on my 1st GPU at q4@176k context from 70 to 50 tok/s with MTP, pretty good for coding.
Gemma on the other hand is using both GPUs, running q8@64k context, doing document sentiment analysis, summarization, proofreading and translating, at consistent 25 tok/s. Somewhat slow but usable for batched workflows. Might get some more once llama.cpp starts supporting MTP with tensor split mode.
Still using frontier LLMs at dayjob since I'm not paying it and those are obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.
EDIT: Prompt processing starts from 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k of tokens and require from 60 to 90 seconds to be processed. Not great but acceptable.
Comment by fitzn 1 day ago
Comment by Kostic 1 day ago
```
[
  {
    "name": "http://127.0.0.1:8888/v1",
    "vendor": "customendpoint",
    "apiKey": "llama.cpp",
    "models": [
      {
        "id": "gemma4-31b",
        "name": "Gemma 4 31B",
        "url": "http://127.0.0.1:8888/v1/chat/completions",
        "toolCalling": true,
        "vision": true,
        "maxInputTokens": 65536,
        "maxOutputTokens": 8192
      },
      {
        "id": "qwen3.6-27b",
        "name": "Qwen 3.6 27B",
        "url": "http://127.0.0.1:8888/v1/chat/completions",
        "toolCalling": true,
        "vision": true,
        "maxInputTokens": 180224,
        "maxOutputTokens": 8192
      }
    ]
  }
]
```
[0] https://code.visualstudio.com/blogs/2025/10/22/bring-your-ow...
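For anyone who wants to poke the same endpoint outside the editor, here is a minimal sketch using only the Python standard library. It assumes the server at the URL from the config above speaks the OpenAI-style chat completions format (request with `model` and `messages`, reply under `choices[0].message.content`); the helper names are made up for this example, and the bearer token is just the placeholder `apiKey` from the config, which may or may not be checked depending on how the server was started.

```python
import json
import urllib.request

# Assumed from the config above: a local server listening on this URL.
BASE_URL = "http://127.0.0.1:8888/v1/chat/completions"


def build_chat_request(model_id, prompt, max_tokens=8192):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key taken from the config; an assumption, not a secret.
            "Authorization": "Bearer llama.cpp",
        },
        method="POST",
    )


def send(request, timeout=120.0):
    """Send the request and return the assistant text.

    Assumes the reply follows the OpenAI chat completions schema.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, something like `print(send(build_chat_request("qwen3.6-27b", "Say hi")))` should return a reply; the model id has to match whatever the server was started with.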
Comment by khimaros 1 day ago
Comment by stymaar 1 day ago
I have way too much VRAM for such a model, but Qwen never released the 122B version of Qwen3.6, which is the best class of model for my hardware. But at the same time my electricity bill is negligible: this is originally a laptop chip and it shows, it consumes almost nothing while idle and a little above 120W during prompt processing.
And Qwen3.6 has been surprisingly effective for me. I still use Claude occasionally, but only for like 10% of my needs, which allows me to stay well under the quota even with the cheapest plan.
Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).
Comment by manmal 1 day ago
Comment by anana_ 1 day ago
Comment by stymaar 1 day ago
Comment by arjie 1 day ago
I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.
Comment by akersten 1 day ago
Where did you find/order these? All the sites I can find are either out of stock, only sell to businesses, or are otherwise sketchy...
Comment by arjie 1 day ago
Comment by zackify 1 day ago
Comment by CamperBob2 1 day ago
No affiliation, I've just ordered from them a few times.
Comment by leptons 1 day ago
Comment by ux266478 1 day ago
The expensive part is the upfront hardware cost and the electrical system upgrade you'll need to give your house.
Comment by leptons 1 day ago
Comment by mtone 1 day ago
- Prefill: ~10K tok/s
- Decode: 190 | 375 | 980 tok/s (for 1 | 4 | 16 concurrent requests)
- GPU power draw during benchmark: Average: 585W | Max: 849W | Limit: 1200W with undervolt. Idle PC is 125W.
I've asked it to calculate the following considering a realistic blend of cached prompts and decode for agentic dev scenario.
Electricity-only (@ USD $0.08/kWh)
Usage | IN price | OUT price | Monthly cost
Concurrency=1 | $0.040/M | $0.080/M | $8.65 to $38.88 (5% to 100% active)
Concurrency=4 | $0.024/M | $0.044/M | up to $48.67 (cheaper per token but higher power draw)
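The arithmetic behind an electricity-only figure like this can be sketched in a few lines. This is a simplified model, not a reproduction of the table: it plugs in the 585W average draw, 125W idle, 190 tok/s single-request decode, and $0.08/kWh quoted above, mixes GPU and whole-PC wattage rather crudely, and ignores the blend of cached prompts and prefill that the table's numbers depend on. Function names are made up for this example.

```python
def electricity_cost_per_million_tokens(power_watts, tokens_per_second, usd_per_kwh):
    """Electricity cost in USD to generate one million tokens at a steady rate."""
    usd_per_hour = power_watts / 1000.0 * usd_per_kwh
    tokens_per_hour = tokens_per_second * 3600.0
    return usd_per_hour / tokens_per_hour * 1_000_000


def monthly_electricity_cost(active_watts, idle_watts, active_fraction,
                             usd_per_kwh, hours_per_month=720.0):
    """Monthly electricity cost for a box that is busy only part of the time."""
    average_watts = active_fraction * active_watts + (1.0 - active_fraction) * idle_watts
    return average_watts / 1000.0 * hours_per_month * usd_per_kwh


if __name__ == "__main__":
    USD_PER_KWH = 0.08    # from the comment
    ACTIVE_WATTS = 585.0  # average draw during the benchmark, from the comment
    IDLE_WATTS = 125.0    # idle PC, from the comment

    per_million = electricity_cost_per_million_tokens(ACTIVE_WATTS, 190.0, USD_PER_KWH)
    print(f"decode, 1 request: ${per_million:.3f} per million tokens")

    for fraction in (0.05, 1.0):
        cost = monthly_electricity_cost(ACTIVE_WATTS, IDLE_WATTS, fraction, USD_PER_KWH)
        print(f"{fraction:.0%} active: ${cost:.2f} per month")
```

The results land in the same ballpark as the table but will not match it exactly, since the table's inputs are not fully spelled out in the comment.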
Total cost of ownership over 3 years is electricity + USD $20K (pre-hike pricing). In a production scenario, how much would I have to charge my users to break even, aiming for 4 concurrent requests 24/7?
A) Breakeven API pricing (est. 2B IN + 1B OUT throughput/month):
Provider | IN price | OUT price
Self-hosted | $0.121/M | $0.363/M
OpenRouter (budget) | $0.098/M | $0.196/M
OpenRouter (DeepSeek) | $0.140/M | $0.280/M
B) Breakeven subscription (users active ~1.5h/day):
1 user: $563/mo (oh, hai)
25 users: $23/mo
100 users: $6/mo
Comment by antirez 20 hours ago
Comment by zozbot234 19 hours ago
Of course this requires wide enough batches to have at least some reuse of fetched experts across a batch, but that seems feasible in the "unattended" case, where firing off multiple inferences to be processed together seems quite natural. (We may also have some benefit from better use of the resident experts cache and/or of SSD transfer bandwidth.)
https://github.com/antirez/ds4/issues/275 seems to provide intriguing rough results while https://github.com/antirez/ds4/issues/314 is a valuable contrast where one commonly suggested solution ("just run multiple instances of the engine in parallel") ran into real issues. Neither of these discuss the combined use of batching and SSD streaming yet, so there's room for experimentation.
Comment by arjie 1 day ago
Comment by mtone 1 day ago
Comment by arjie 23 hours ago
Comment by CamperBob2 1 day ago
There may be a way to get the 2-bit quantized version running even faster on a pair of them.
Comment by arjie 23 hours ago
Comment by CamperBob2 16 hours ago
Comment by garethsprice 1 day ago
I am using Opus to generate plans that the local agent then follows, then validated by Opus. So I'm not at 100% local but these models are increasingly part of my production workflow. Probably not worth doing - yet - unless you are a hobbyist who likes spending time and money tinkering.
This setup is certainly not as "good" as Opus or other frontier models but they are "good enough" for an increasing number of rote tasks. You don't need to drive a Rolls Royce to the supermarket, when a used Corolla gets you there just fine.
It also enables new workflows that would be cost-prohibitive with frontier LLMs (especially as token costs rise) - eg. overnight I use the Chrome devtools MCP and have the above setup fuzz-test as a user for a number of hours and see if it can break things. Even got it working with multi-modal so it can check screenshots, which blows my mind (and not my wallet, as Claude+screenshots burns $$$).
The "12-18 months behind frontier" sounds about right, it's about where I was with gpt-4o and basic harnesses back then. In another 12-18 months my bet is we have Opus-level models that can be run locally for <$5k... but the frontier models will be even further forward (unless governments have blocked them). Fun times.
Comment by Roark66 19 hours ago
I do mostly scripting, devops, data processing and systems stuff (ansible playbooks, managing network devices, deploying new software for various things that involves reading docs, writing helm charts, modifying existing ones etc).
All other models (Gemini, ChatGPT, Grok and all OS models) don't even come close. I'd rather use Sonnet than Qwen.
It's a sad reality. I was thinking about implementing some sort of "sanity checking": running every prompt through two different models, with the second model checking the output of the first.
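A minimal sketch of that two-model idea, assuming nothing about the actual setup: `generate` and `review` are hypothetical callables wrapping whichever two models are used, and the "reply with OK" convention is an illustrative choice, not an established protocol.

```python
from typing import Callable


def cross_check(prompt: str,
                generate: Callable[[str], str],
                review: Callable[[str], str]) -> dict:
    """Run the prompt on one model, then ask a second model to review the answer."""
    answer = generate(prompt)
    review_prompt = (
        "Review the following answer for errors. "
        "Reply with OK if it is sound, otherwise list the problems.\n\n"
        f"Task:\n{prompt}\n\nAnswer:\n{answer}"
    )
    verdict = review(review_prompt)
    return {
        "answer": answer,
        "verdict": verdict,
        # Accepted only if the reviewer's reply starts with OK (hypothetical convention).
        "accepted": verdict.strip().upper().startswith("OK"),
    }


# Stand-in callables so the sketch runs without any model server.
def fake_generator(prompt: str) -> str:
    return "helm upgrade --install my-release ./chart"


def fake_reviewer(prompt: str) -> str:
    return "OK"


result = cross_check("Write the helm command to deploy ./chart",
                     fake_generator, fake_reviewer)
print(result["accepted"])
```

In practice the two callables would hit two different endpoints; a rejected answer could be retried or escalated to a stronger model.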
Elaborate knowledge systems help a little, but personally I think Anthropic must be doing something "clever" with its models (processing via multiple models etc). Nothing else in my mind explains the discrepancy.
Comment by ac29 18 hours ago
I get this, though the pace of Chinese releases is relentless. Qwen3.7 Plus/Max (closed variants) feel notably better than Qwen3.6, and Minimax M3 is a big jump from 2.7 in capability as well. Both of these families had their previous major release less than 90 days ago.
Anthropic must have Sonnet 5 either waiting or cooking though, they said smaller and larger models than Opus were coming and we already briefly had the larger model.
Comment by hamsterhooey91 17 hours ago
I've also used Composer2.5 on hobby projects and it is definitely on-par with Opus 4.8 (thinking mode: medium), but much faster.
Do you think you're getting better results with Claude because your agent stack (skills, MCPs, etc.) is configured for it and not for the others?
Comment by goranmoomin 1 day ago
A point that I haven't seen come up a lot, but is very valuable to me is that for open source models, I can select the inference provider myself (even if it's not a local GPU), which means that I can enjoy superb speed (i.e. 300 tok/s) while still spending much less than the big providers.
My experience is that if you were fine with the coding models of yesterday (i.e. Claude Opus from Jan/Feb of 2026), you will be fine with either Kimi K2.6 or DeepSeek v4 Pro. Kimi is a bit smarter but has only 256K context, and its performance deteriorates (and it sometimes just gets stuck) when it fills up the context window. DeepSeek v4 has a 1M context and performs just as well with far fewer issues. And they both generate very idiomatic code, giving the same vibe as Opus a few months ago.
Since it's also fast (and does not fixate on trying to fix impossible problems, unlike the recent Opus/GPT 5.5 models), a big benefit is that you still control and steer the coding agent and you won't be losing focus like the major models. They are smart, but they don't fixate as much on trying to do stupid things, and since it's fast, you can just interject. It's a much more pleasant experience than the latest models.
I still use the latest models from time to time when I expect the agent to fixate on all of the problems and figure out everything itself, but for me open source models are like 80~90% of all of my sessions.
Comment by jborak 1 day ago
I've tried other models such as Qwen3.6 35B A3B and I've found that 27B works better for me when it comes to coding. It's slower being a dense model but the quality seems much better. Inference on my system for Qwen3.6 35B A3B is around 130-140 toks/sec, non-MTP, which is insanely fast!
You don't need 4x 5070's to run Qwen3.6 27B, three or maybe even two will work. However, I use MTP (multi-token prediction) to speed up 27B and that eats up more memory because the draft model requires its own context.
Another thing to keep in mind is that the tools you're using have their system prompts that are loaded into the model for each conversation. When I fire up Pi, working with the model is very snappy at start. When I interact with the LLM via Hermes CLI, it's much slower. That's because each prompt with Hermes is loading so much stuff (skills, tools, etc.) into the context and then it's there forever until the conversation ends.
I like running models at home for privacy, but I also like how there are no quotas and usage isn't a worry. If the future is "loop engineering" then you will be burning through tokens and $$$ using cloud models.
My system idles around 200W and is around 350-450W when inference load is high. Decoding (token generation) isn't all that efficient, and your GPUs sit idle more than you think during inference. Advancements like diffusion may 1) speed up decoding and 2) let you utilize more of your idle GPU.
Comment by zakisaad 1 day ago
At first thought, they are quite skewed toward compute (vs VRAM), which is great for gamers but not so great for running LLMs.
(I run a 5070 in my desktop)
Comment by jborak 17 hours ago
I did some math/shopping as well. To get 48GB of VRAM you can get 2x 3090s but that is $3k. A single 5090 is $4k but has 32GB, great for running models like Qwen 27B but maybe nothing else depending on your model settings. Already having 2x 5070, where each card is around $600, it made sense for me to get two more which was $1200 and the memory speeds aligned.
The best value option if you're building from scratch is to go with the 5060 Ti (16GB VRAM). Those cards are $570 each on Amazon, so four of them come out cheaper than 4x 5070s. The only downside is that memory speed is slightly slower, but you wind up with 64GB of VRAM and you can run big models and small models alongside each other comfortably.
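The shopping math above can be laid out as dollars per GB of VRAM. This is a rough sketch using the prices quoted in this comment; the 12 GB per 5070 figure is not stated in the comment and is taken here as an assumption from the card's published spec.

```python
# (total price in USD, total VRAM in GB) for each option mentioned above.
options = {
    "2x 3090": (3000, 48),
    "1x 5090": (4000, 32),
    "4x 5070": (4 * 600, 4 * 12),     # assumption: 12 GB per card
    "4x 5060 Ti": (4 * 570, 4 * 16),
}

# Dollars per GB of VRAM for each build.
cost_per_gb = {name: usd / gb for name, (usd, gb) in options.items()}

# Print the cheapest-per-GB option first.
for name in sorted(cost_per_gb, key=cost_per_gb.get):
    usd, gb = options[name]
    print(f"{name:<11} ${usd:>4} total  {gb:>2} GB  ${cost_per_gb[name]:.2f}/GB")
```

Price per GB ignores memory bandwidth and compute, which the comment notes are slightly lower on the 5060 Ti.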
In my setup I ran Qwen3.5 9B for fast inference on simple things and Qwen3.6 27B Q6 for coding work. But I ran into stability issues, so I use llama-swap to dynamically swap models. But with 64GB of VRAM, you wouldn't have that issue. There is overhead to loading LLMs into VRAM that isn't clear, so having extra VRAM is a helpful buffer.
Comment by wsintra2022 1 day ago
Comment by epolanski 1 day ago
Comment by cuttysnark 1 day ago
Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent to produce only a manifest based on the task, and to pass that off to the next agent.
In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.
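One way to picture that kind of staged workflow is below. It is only a sketch: the `Stage` structure, the stage names, and the stand-in lambdas are all hypothetical, and real stages would call agents rather than plain functions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]      # does one narrowly scoped job
    check: Callable[[Any], bool]   # validates the output before hand-off


def run_workflow(stages: list, payload: Any) -> dict:
    """Run stages in order; stop at the first failed check so a human can step in."""
    completed = []
    for stage in stages:
        result = stage.run(payload)
        if not stage.check(result):
            # Return the last good payload so work can resume from this point.
            return {"ok": False, "failed_at": stage.name,
                    "completed": completed, "payload": payload}
        payload = result
        completed.append(stage.name)
    return {"ok": True, "failed_at": None,
            "completed": completed, "payload": payload}


stages = [
    # The schema stage may only produce a manifest.
    Stage("schema",
          lambda task: {"task": task, "files": ["user.py"]},
          lambda manifest: bool(manifest.get("files"))),
    # The implement stage here produces no patches, so its check fails.
    Stage("implement",
          lambda manifest: {**manifest, "patches": []},
          lambda out: len(out["patches"]) > 0),
]

outcome = run_workflow(stages, "add a user model")
print(outcome["failed_at"], outcome["completed"])
```

Stopping at the failed stage with the last valid payload is what creates the intervention point for a partially successful run.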
Comment by pianopatrick 1 day ago
Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."
Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.
Like "The Local AI challenge"
Comment by sowbug 1 day ago
Some think the human brain works similarly: thousands of mini-brain cortical columns, each with a slightly different take on the situation, voting in a majority-rules system.
Comment by grmnygrmny2 1 day ago
The hardware I have (32gb Macs and a gaming PC with 10gb 3080) can only get me to Qwen3.6-35B-A3B at various quants but that’s enough (200-400 PP, 20-30 TG).
It’s taken some time to learn how to best utilize it - some things take a bit of babysitting or direction - but it’s quite useful. Not having ever used CC I can’t compare but it’s been a great assistant or pair programmer for everything from embedded C++ to Vue. I wish I could run 27B as there have been moments when this model feels like it just can’t quite figure something out but those moments are quite rare. For a lot of tasks it’s a huge time saver and has proved super capable at digging into and fixing bugs given pretty vague instructions.
I’m using Pi as my harness.
Comment by mgsram 1 day ago
The tokens/sec may be less but that kind of helps me in going at the right pace. The workflow I use for green field development / rewrites is to pair with Sonnet for design/architecture, reasoning and a detailed execution plan. I then feed this piece by piece with precise prompting and that does the job. For brown field, it is often a judgement call. There are occasions when I have found Local models to be limited in their reach and I resort to Claude Code
Some of my recent work using Qwen 3.6:
1. Complete rewrite of Power management Service in C using the existing C++ code as reference
2. Tool to parse contents from really complex specifications in Excel format
3. Tool to translate CJK contents to English for feeding into KG
Comment by russelg 1 day ago
Comment by mgsram 1 day ago
Comment by jodoherty 1 day ago
I find it useful.
This side project highlights a similar approach to how I scope and tackle projects at work now:
https://git.theodohertyfamily.com/wg-wrap.git/tree/README.md
https://git.theodohertyfamily.com/wg-wrap.git/tree/CASE_STUD...
You have to apply a lot of careful architecture and TDD to your approach. Eliminate technical risk by tackling hard things early and wrapping them up in a simple, easy to use interface.
I find I can get some projects done 2-3 times faster than if I wrote them by hand. It can also save about 5-10x time on mundane or broadly scoped projects by helping me consolidate and try out ideas very quickly.
Setup-wise, I switch between vLLM using nvidia/Gemma-4-31B-IT-NVFP4 and llama.cpp using unsloth/gemma-4-31B-it-qat-GGUF with MTP. I throttle the GPU power usage to 400W.
My current llama.cpp setup gets token generation rates between 60-150 t/s depending on MTP draft acceptance rates. Prefill is between 1500-4000 t/s depending on context length/depth.
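Why generation speed swings so much with draft acceptance can be seen from a simplified model of speculative decoding. The sketch assumes each draft token is accepted independently with probability `p` (an idealization), that `n` draft tokens are proposed per step (3 below is a hypothetical value, not this setup's setting), and that draft overhead is ignored. Under those assumptions the expected tokens emitted per verification pass is (1 - p^(n+1)) / (1 - p).

```python
def expected_tokens_per_step(p: float, n: int) -> float:
    """Expected tokens emitted per verification pass with n draft tokens.

    Counts the accepted draft tokens plus the one token the main model
    always contributes: sum of p**k for k = 0..n.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p == 1.0:
        return float(n + 1)
    return (1.0 - p ** (n + 1)) / (1.0 - p)


for p in (0.3, 0.6, 0.9):
    print(f"acceptance {p:.1f}: {expected_tokens_per_step(p, 3):.2f} tokens/step")
```

Since this ignores the cost of producing the drafts, it is an upper bound on the speedup rather than a prediction of measured t/s.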
Comment by yesb 11 hours ago
Some issues:
1. `wg-wrap healthcheck` was all green even though unprivileged user namespaces were restricted via AppArmor (Debian). That check doesn't seem to work
2. DNS doesn't work (no domains resolve) if the config lists multiple servers e.g. DNS = 1.1.1.1, 1.0.0.1
3. Peer endpoints don't support domain names, only IP addresses
4. Minor: the tool doesn't add an implied /32 cidr prefix for single ip configs (common from some VPN providers).
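For items 2 and 4, the expected parsing behaviour could look roughly like the sketch below. It is not wg-wrap's code; the function names are hypothetical, and it assumes every `DNS` entry is an IP address.

```python
import ipaddress


def normalize_address(value: str) -> str:
    """Return the address in CIDR form, adding /32 (IPv4) or /128 (IPv6) if no prefix is given."""
    return str(ipaddress.ip_interface(value.strip()))


def parse_dns(value: str) -> list[str]:
    """Split a comma-separated DNS line into validated IP addresses."""
    return [str(ipaddress.ip_address(part.strip()))
            for part in value.split(",") if part.strip()]


print(normalize_address("10.64.0.2"))
print(parse_dns("1.1.1.1, 1.0.0.1"))
```

`ipaddress.ip_interface` keeps an explicit prefix such as `/24` unchanged and raises `ValueError` on malformed input, which gives a natural place to report config errors.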
Comment by jodoherty 10 hours ago
If I get some time to circle back, I'll be sure to incorporate these into some new tests and address them.
I want to set up a qemu-system emulator based testing approach so I can incorporate things like AppArmor and SELinux into end to end tests that include different environment configurations.
Part of that will be setting up software defined networking so I can have things like DNS and wireguard VPN servers in a box and then test and evaluate the wg-wrap behavior at the packet level.
Comment by HappySweeney 1 day ago
Comment by macwhisperer 15 hours ago
the harnesses themselves are just as important as the models...different harnesses give different responses with the same prompt, same model...
if you have the 20/mnth claude sub or codex, you really should be using that to build a good local harness for yourself... claude won't be 20$ forever
build the stack first! when you get that new comp with massive ram, youre already set, just run a larger model!
big cloud models are incredibly good at building and teaching about local ai!
have fun in the rabbit hole!
if you are memory constrained like me, check out my custom models https://huggingface.co/macwhisperer
Comment by henrixd 1 day ago
I'm running this on V100 32GB (~900GB/s memory bandwidth) with 200,000 context window, --spec-type mpt --spec-draft-n-max 3 --spec-draft-n-min 0 --cache-type-k turbo3 --cache-type-v turbo3 to mention most relevant parts.
I usually get somewhere around 45-60 t/s. I believe that speed could be improved slightly by switching to the ik_llama.cpp fork and the Qwen3.6-27B-IQ4_NL.gguf model, but there's no turboquant support and it comes with some other tradeoffs too.
Comment by GodelNumbering 1 day ago
Comment by blurbleblurble 1 day ago
Comment by coder543 1 day ago
It's also annoying that OpenCode doesn't even try to support local LLMs properly.
Getting OpenCode to work is possible, but extremely manual and clunky to configure. I have written a script to automate converting my llama-server configs into an OpenCode config, and that helps, but it's not ideal.
I have seriously considered writing Yet Another Coding Harness in my free time. I have some ideas for what would make it nice.
Comment by wsintra2022 1 day ago
Comment by zackify 1 day ago
Comment by horsawlarway 1 day ago
I've used the cli agents for claude, cursor, and pi, plus several custom harnesses I've written myself from time to time as experiments (and I guess technically gastown, if we're calling that a harness).
Pi is... just fine.
It does what I need it to, has a decent selection of tooling out of the box, integrates nicely with other tools, and generally gets out of my way enough that I don't think about it much anymore.
If you can run ~30b models at decent speeds, I think most folks would be pleasantly surprised at how capable they are with pi.
Tack on some of the extensions (ex https://pi.dev/packages/pi-mcp-adapter?name=mcp and https://pi.dev/packages/pi-web-access?name=search) and I get web tooling (ex - perplexity search), access to mcps to do things like drive chrome (https://browsermcp.io/) or firefox (https://github.com/mozilla/firefox-devtools-mcp)
It's fine. Is it as good as a subsidized top tier model? Nope. Is it free and still very capable? Yup.
And personally, I've been having a LOT of fun with the pi sdk (https://pi.dev/docs/latest/sdk)
Which is something that all the other providers charge you api access rates for (ex - thousands a month).
Comment by Insanity 1 day ago
Comment by bityard 1 day ago
Comment by horsawlarway 1 day ago
But yes - it expands a lot if you're willing to play with it.
I'd actually say the vscode comparison is wrong, because vscode is very much "bring your own extension" in the same way that Pi is. While Claude is much more "visual studio" vibes. It's thick, it's opinionated, and it's absolutely not something you can really customize, but it can feel slick for supported workflows.
Comment by cheekygeeky 1 day ago
Comment by _bobm 1 day ago
One day I thought about how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself. Just think about it.
As a matter of fact, think about these operations, api endpoints, observe their output.
These so called SOTA models are not what meets the eye, and are not at all comparable in the infra department to local models. There is crazy orchestration going on due to the scale of these operations. But also these hard constraints lead to innovation. Innovation nobody speaks about.
I wouldn't say we cannot catch up, but serving our local models through llama, vllm is just the A, B, C of it all. In reality I think what is needed is a replication of said orchestration which I hinted at above.
The SOTA models are a deep orchestration of multiple models operating together; it isn't a single model. As such, no single model will ever catch up to them until it replicates this orchestration, first through training and then maybe through model architecture.
Finally, I would wager that the SOTA "models", as one of these models in this orchestration setup, as served for general consumption, are not so much more capable than qwen 3.6.
I am sure that if you change your perspective you will start noticing the scale of the "magic".
Comment by XCSme 1 day ago
I don't understand, why does it make you think this is the case?
> how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself
Can you give an example?
Comment by _bobm 1 day ago
Sure, connect opencode to an openai/chatgpt endpoint and use it. You will notice multiple "thinking" parts per "turn".
I put all of these in quotation marks because... they are part of the orchestration game. For example, it is not known if the thinking parts of a particular turn are chain of thought thinking summaries or just a plain response which is masqueraded and thus orchestrated into appearing as thinking.
Further notice the cadence, word choice and sentence formation. Notice sentence construction. Notice "thinking part" construction and sequencing.
There is pretty heavy orchestration.
> I don't understand, why does it make you think this is the case?
Because not all tokens are equal. And if you waste expensive tokens on mundane tasks you will go out of business. This is the reason.
As I said, if you observe the output from these api endpoints you will notice it.
Comment by XCSme 1 day ago
I thought that was the code harness simply minifying the outputs. Many models now no longer return the entire chain-of-thought (to avoid distillation attacks). So yes, we don't get the raw LLM output, but I think it's just the thinking summarized, not a complex orchestration or different models.
I do agree though that cloud models are now kind of a black box that's not only obfuscated but also changes over time. Companies seem to be changing model capabilities without notifying users, or even quietly serving completely different models. This is even worse via OpenRouter: of the providers serving open-source models, some serve heavily quantized versions or even completely different models.
Comment by _bobm 1 day ago
Last time I checked, OpenAI even sends (in the response) the summary of the thinking part already in markdown, so opencode has to remove the formatting to format it to their liking.
> Many models now no longer return the entire chain-of-thought (to avoid distillation attacks).
This is what they say: to avoid distillation attacks. And to some large extent this is true. I am saying there is a side effect and this side effect (depending on how tin-foilly you want to go) may be either a nice thing to have or it may be the "main reason" for all of this.
The side effect is splicing the inference, brokering requests, and what not, which brings huge benefits at scale.
This was my original point: comparing an open-weights model to a SOTA model may be apples to oranges. So when will a local model catch up with its single CoT run, which is not even shaped properly? Well, never.
It is apples to oranges.
Comment by XCSme 1 day ago
Comment by _bobm 1 day ago
But what they do not have is the correct shape, the correct approach. This is missing and it shows on multiple scales: it shows in the COT, it shows in the output itself, it shows in the infra to serve the models, it shows in the model orchestration.
This is what anthropic said one year ago:
> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.
Comment by acc_297 1 day ago
but perhaps one individual's prompt feedback just isn't ever going to be enough. I'm not sure how much you need (I know people working at big companies that have purchased in-house agents fine-tuned on internal documents etc., and apparently these end up with bizarre behaviours not necessarily more helpful than the standard models)
I'd like to be able to essentially edit every response given by an agent and then finetune on the difference between what it produced and how I edited the text. Personally I would just remove a lot of the adjectives and try to distill the responses to core responses, but I worry based on some of the work done by Owain Evans and other alignment researchers that this can sometimes push agents into tricky-to-predict tendencies.
Comment by htrp 1 day ago
Comment by rolisz 1 day ago
About Owain Evans's work: I think he did SFT. On Twitter someone was saying that RL is not as susceptible to what he showed. I'd like to try that
Comment by bravetraveler 1 day ago
Comment by K0balt 1 day ago
Comment by kadoban 1 day ago
Comment by K0balt 1 day ago
Comment by kandros 1 day ago
Comment by K0balt 1 day ago
I should mention not to run it at less than q6, I prefer q8.
Comment by papichulo4 1 day ago
There's apparently a reason Sonnet and Haiku have been left in previous version #s.
Still encouraging, though, that things are catching up. We can't expect $20k local setups to match $20bn compute clusters.
Comment by nfrankel 1 day ago
Results depend on the model, of course, and your computer is the limit. Mine wasn't up to the task, unfortunately.
Comment by big-chungus4 1 day ago
Comment by neuropacabra 21 hours ago
Comment by rsolva 18 hours ago
Comment by vfalbor 21 hours ago
Comment by mitchell_h 1 day ago
Comment by coder543 1 day ago
Of course, you have to have the right hardware to be able to run with a context window like that, as it takes about 100GB of memory on my DGX Spark to do that with full f16 KV cache on the q4_k_xl model.
Comment by lysace 1 day ago
Comment by tobyhinloopen 1 day ago
It’s slower but you can run them.
Comment by lysace 1 day ago
Comment by deadbabe 1 day ago
Comment by ljosifov 1 day ago
Comment by moezd 1 day ago
Comment by amarshall 1 day ago
Comment by moezd 13 hours ago
Comment by amarshall 7 hours ago
As for the question you’re likely asking: benchmarks that include speed across many models and providers available at various places e.g. https://artificialanalysis.ai/leaderboards/models
Comment by zftnb666 1 day ago
Comment by xmstan 23 hours ago
Comment by 3abiton 1 day ago
Comment by bijowo1676 1 day ago
but then use cheap/local model to implement the specs.
Markdown is more effective at compressing information and fits the context window more easily than hundreds of source code files
but this requires second and third passes, to smooth out the rough edges
has anyone tried that?
Comment by CuriousRose 1 day ago
I don't use locally hosted models anymore due to hardware constraints, but I do have some degree of search anonymisation attached to my OpenCode and OpenRouter connected open models.
On my Macbook I run OrbStack that has the following docker containers set to route through a Mullvad based gluetun.
- Firecrawl - fast web scraping
- SearxNG - metasearch
- CloakBrowser - Turnstile-bypassing Playwright alternative
If you wanted to get fancy with the proxy rotation, you could set up numerous instances of Playwright each with their own Mullvad wireguard key in different locations.
Comment by heisenbit 1 day ago
I mostly run my MBP on low power even when it is plugged in to avoid the noise and heat. Full power maybe doubles speed but more than doubles power.
What can it do: Simple restructuring of pages. Where did it and other models fail: Splitting up Pinia store which GPT-5.4 did without fail. I think with more tuning, guidance for tool use and maybe some support tooling around it performance can increase further.
Comment by SupLockDef 1 day ago
I mostly use it as a google search if I forget a thing, or doing the boilerplates.
I am using a mix of a non harness chat for the reply speed, and opencode / vim-ai for my boilerplates.
$0.00 / month. That's the budget.
Comment by jboss10 1 day ago
Comment by SupLockDef 1 day ago
I did try 3.6 on my main desktop. It was good, but I didn't see much difference from coder, so I am still using my old rig.
Comment by ryandrake 1 day ago
Comment by riazrizvi 1 day ago
Comment by porkloin 1 day ago
Hardware:
- GPU: AMD 7900xtx, 24gb vram
- CPU: AMD 5950x, AM4
- RAM: 64gb DDR4 3600
Software:
- OS: Bazzite (atomic fedora - this machine is running Steam "big picture" mode on my TV when not in use for LLM tasks)
- Virtualization: Podman Quadlets, which allows me to run container images as managed systemd units
- Network: tailscale
- Inference: llama.cpp vulkan (better performance than ROCM, though I'm keeping an eye on it in the future)
- LLM API surface: llama-swap (running as a podman quadlet exposed via tailscale svc) allows running multiple models on a single endpoint.
- Web/Chat Access: open-webui (running as podman quadlet exposed via tailscale svc) allows me to access any of the models I'm using for coding harness access for chat/general purpose queries via web browser. I also have the "conduit" app for my iPhone that allows me to hit the same models from my phone.
Models:
- Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf - Unsloth Q4 quant of the qwen 3.6 27B model weights, with MTP enabled. MTP is important as it improves the speed the model can run at.
- Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf - Unsloth Q4 quant of 35B-A3B. Not MTP right now because I was having some issues with it?
- gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf - Gemma 4, which I use sometimes via open-webui instead of Qwen, but I generally think Qwen does a better job
Flags (specific for Qwen 27b, since that's primary model):
- `-ngl 99` offload all layers to GPU
- `-c 80000` 80K context window. I'd like this to be higher, but since my GPU also has to run the desktop session for the machine, I need to leave some VRAM overhead to keep the desktop from OOM-ing
- `-np 1` single slot (no parallel request handling)
- `--no-context-shift` error instead of silently sliding the context window when full
- `--cache-reuse 256` reuse cached prefix in chunks of 256 tokens (prompt cache)
- `-b 2048` logical batch size (tokens per submission)
- `-ub 1024` physical micro-batch (per GPU pass)
- `--cache-type-k q8_0 --cache-type-v q8_0` symmetric 8-bit K/V cache. Q8 is as low as I've been able to go without getting some issues with tool calling
- `-fa on` flash attention
- `--spec-type draft-mtp` use the model's built-in MTP as the draft model
- `--spec-draft-n-max 3` propose up to 3 draft tokens per step
- `--spec-draft-n-min 0` allow zero drafts if confidence is low
- `--spec-draft-type-k q8_0 --spec-draft-type-v q8_0` KV quant for the draft path
- `--reasoning-format deepseek` parse <think> blocks in proper format
- `--chat-template-kwargs '{"enable_thinking": true}'` turns Qwen's thinking mode on by default (clients can override)
- `--jinja` use the GGUF's Jinja chat template
- `--temp 0.6` moderate randomness (Qwen recommended value for coding)
- `--top-p 0.95` nucleus sampling (Qwen recommended value for coding)
- `--top-k 20` top-20 candidates (Qwen recommended value for coding)
- `--min-p 0.0` disabled (Qwen recommended value for coding)
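Put together, the flags above amount to one long command line. The sketch below only assembles and prints it. The `llama-server` binary name and the `-m` model flag are assumptions (the list above does not show them), and the model path is just the filename from the Models list.

```python
import json
import shlex

MODEL = "Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf"  # filename from the Models list; real path will differ

argv = [
    "llama-server",          # assumption: llama.cpp's server binary
    "-m", MODEL,             # assumption: model path flag, not shown in the list above
    "-ngl", "99",
    "-c", "80000",
    "-np", "1",
    "--no-context-shift",
    "--cache-reuse", "256",
    "-b", "2048",
    "-ub", "1024",
    "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
    "-fa", "on",
    "--spec-type", "draft-mtp",
    "--spec-draft-n-max", "3",
    "--spec-draft-n-min", "0",
    "--spec-draft-type-k", "q8_0", "--spec-draft-type-v", "q8_0",
    "--reasoning-format", "deepseek",
    "--chat-template-kwargs", json.dumps({"enable_thinking": True}),
    "--jinja",
    "--temp", "0.6",
    "--top-p", "0.95",
    "--top-k", "20",
    "--min-p", "0.0",
]

# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Keeping the flags as a list avoids shell-quoting mistakes around the JSON argument and makes it easy to vary one value per model.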
Performance (27b, primary model):
- ~65t/s for token generation
- ~600 t/s for prompt processing.
- If these numbers don't mean much to you, perceptually this feels about on-par with cloud model speed, maybe slightly faster.
- ~30s cold start when swapping from a different model or starting up session from idle via llama-swap.
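To make those rates concrete, here is a back-of-the-envelope estimate of turn latency. It assumes the rates stay constant regardless of context depth (they typically drop as context grows), and the example turn sizes are hypothetical.

```python
PROMPT_TPS = 600      # prompt processing rate quoted above
GEN_TPS = 65          # token generation rate quoted above
CONTEXT = 80_000      # context window from the flags above


def turn_seconds(prompt_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate seconds for one turn; cached prompt tokens are not reprocessed."""
    fresh = max(prompt_tokens - cached_tokens, 0)
    return fresh / PROMPT_TPS + output_tokens / GEN_TPS


# Worst case: processing a completely full context with nothing cached.
print(f"full context, cold: {turn_seconds(CONTEXT, 0):.0f} s")

# Hypothetical agent turn: 20k-token prompt, 500-token reply.
print(f"20k prompt, cold:   {turn_seconds(20_000, 500):.0f} s")
print(f"20k prompt, 18k cached: {turn_seconds(20_000, 500, cached_tokens=18_000):.0f} s")
```

The gap between the cold and cached cases is why prompt caching matters so much for agentic sessions that resend a long prefix on every turn.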
I have llama-swap set up to unload the model after 10 min of idle, because I sometimes use this machine for gaming as well. A little annoying, but a small price to pay to be able to use the machine for other stuff (gaming) when I'm not using it with coding tasks.
CLI/Harness:
- Crush harness (https://github.com/charmbracelet/crush) less feature rich than Claude Code, but with a smaller system prompt and better built-in LSP support. I point it at the tailnet DNS (https://llama.<tailnet>:<port>)
- Headroom (https://github.com/chopratejas/headroom) to maximize the 80k context window
- Exa MCP for web search (https://exa.ai/) this alone makes the model far more useable. It's shocking how often the official claude code or codex harness get botblocked on web fetches, and the results of a good web fetch can be the difference between a good turn and a bad turn.
A lot of people get hung up on whether Qwen 3.x models are "as smart as" some parallel Anthropic model. Most people seem to agree it's somewhere between Haiku 4.5 and Sonnet 4.5. Personally, I think the biggest thing that makes the Qwen 3.x series of models _feel_ good to use for coding workflows is that it's the first time that tool calling actually works consistently on local models. If tool calling is busted even 5% of the time, it can totally ruin the flow. I think that's also why people tend to say the "harness is more important than the model" or whatever. I have a few other models set up but 27B with MTP is the best compromise of speed and quality that I've found.
This setup works well enough for me that I dropped my personal Claude Code subscription. At work I'm still using frontier models, but personally I don't feel like I need that much power for anything I work on in my personal life. I'm "lucky" that I made the random financially unwise choice to buy a 7900XTX in late 2022 for $1k as a gaming card. I had no clue it would actually be a pretty decent LLM card 3-4 years later.
Edit: sorry for the horrible formatting, I always forget that HN doesn't actually do markdown :(
Comment by ryandrake 1 day ago
Comment by nake89 1 day ago
Comment by anubhav200 1 day ago
Comment by anubhavgupta 1 day ago
Comment by anubhavgupta 1 day ago
Comment by anubhavgupta 1 day ago
Comment by milchek 1 day ago
Comment by BiraIgnacio 1 day ago
I considered investing in better hardware but doing the math, it is cheaper for me to pay for DeepSeek (yeah, I know not everyone can do that).
Comment by wuschel 1 day ago
Comment by boringg 1 day ago
Comment by snoman 1 day ago
Comment by NetOpWibby 1 day ago
Comment by trueno 1 day ago
Comment by NetOpWibby 17 hours ago
Comment by zaptheimpaler 1 day ago
Comment by anana_ 1 day ago
Some of the benchmarks appear to back this up [0]
Of course, a lot depends how you are using it (inference parameters, harness, prompting, etc.), but the model is quite important too.
[0]: https://artificialanalysis.ai/models/open-source/small?model...
Comment by Rzor 1 day ago
Comment by etoxin 1 day ago
Most small local models don't get tool calling right, however the larger models are now doing this correctly.
One thing local has not accounted for is that most productive engineers are running multiple cli chats at a time with git worktrees. I normally hover around 3 worktrees + cli-chats.
Comment by dabinat 1 day ago
Comment by rvnx 1 day ago
Comment by utopiah 1 day ago
It seems pretty intuitive that pouring more resources into a problem (more GPU, bigger GPUs with more VRAM, bigger datasets, better curated datasets, more efficient ways to train, more efficient way to run inference, etc) then running the result for a longer time, with more layers of verification (running in VMs, model fusion comparing multiple models, having harnesses with testing) will at least lead to marginally better results.
Is it worth it and at what pace will it keep on improving are different questions, but I have little doubt that if the industry keeps on pouring in resources, sure, more "works".
Comment by yalogin 19 hours ago
Comment by shironnnn_ 1 day ago
Then I give it to a local LLM (eg: Qwen / Gemma 4) via CLI. This is possible through usage of llm-mlx on Mac (or ollama on any machine given sufficient hardware), which serve OpenAI-compatible endpoints for Aider (CLI) or Visual Studio Code to vibe along with the agentic coding assistant.
The paid products have an advantage but are not necessary if you don't mind being more involved with the process and have low expectations.
Comment by michaelhoney 1 day ago
https://vickiboykis.com/2026/06/15/running-local-models-is-g...
Comment by tumetab1 1 day ago
Also, the lack of enterprise tooling to help select an appropriate model and tooling to run a local LLM does not help.
Comment by adam_patarino 20 hours ago
We are post training qwen 3.6 and combining it with a custom inference engine and harness to get the most out of a smaller model.
Comment by cloudengineer94 22 hours ago
I do think we are slowly getting there. Gemma 4 was a big jump
Comment by ndom91 1 day ago
I've been working on an ops style tool for local LLM inference. Proxying, api keys, request logging, model rewriting and much much more.
Comment by anonymousiam 1 day ago
My Homelab AI Dev Platform
Comment by jderekw 1 day ago
Comment by trilogic 22 hours ago
Comment by bArray 1 day ago
Comment by patates 1 day ago
Comment by mv4 1 day ago
Comment by cmrdporcupine 1 day ago
Comment by daniban 17 hours ago
Comment by xhinker2 1 day ago
Comment by Departed7405 1 day ago
Plus, you now have zero-data retention models, so the privacy argument has kind of faded.
Comment by sj_tech 1 day ago
Comment by thesuperbigfrog 1 day ago
https://discourse.ubuntu.com/t/use-workshop-to-run-opencode-...
Comment by whartung 1 day ago
I'm waiting to swap out my last gen Intel iMac with a new M5 mini of some kind, with the eye to hopefully be able to run some models locally. I envision a mini (heh) arms race to simply swapping out an M(X-1) for an M(X) annually as this field shakes out.
Comment by fransje26 11 hours ago
No. Apple is also running out of RAM, so you will not have the RAM you need.
Comment by ozten 1 day ago
Open code against Infomaniak hosted OSS models: Qwen3.5-122B-A10B-FP8, Kimi-K2.6.
I use API keys for billing. It performs like Dec 2025 in terms of my productivity back then.
Comment by Lwerewolf 1 day ago
Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup: RO bind everything other than ~/.pi, cwd and a separate tmpfs, and unshare almost everything other than the network. For the network I use a network namespace that only allows TCP connections to a specific IP and port (i.e. the inference Mac), so the whole thing is a netns exec into bwrap.
Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.
Performance as described in the tables in the readme here: https://github.com/antirez/ds4 ...with a bit less than half that at "low power" (30w). Both are usable.
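The sandbox described above can be sketched as a command line. This is an illustration only, not the commenter's actual script: the namespace name `llm-only`, the paths, and the `pi` command are hypothetical, and the firewall rules that limit the namespace to one IP and port are not shown. The flags are standard bubblewrap options.

```python
import shlex


def sandbox_argv(netns, workdir, state_dir, command):
    """Compose `ip netns exec <netns> bwrap ... <command>` as an argv list."""
    bwrap = [
        "bwrap",
        "--ro-bind", "/", "/",            # whole filesystem visible, read-only
        "--dev", "/dev",                  # minimal /dev
        "--proc", "/proc",                # fresh /proc for the new PID namespace
        "--tmpfs", "/tmp",                # separate scratch tmpfs
        "--bind", state_dir, state_dir,   # writable agent state (e.g. ~/.pi)
        "--bind", workdir, workdir,       # writable project directory
        "--chdir", workdir,
        "--unshare-all",                  # new user, pid, ipc, uts, cgroup, net namespaces
        "--share-net",                    # ...but keep the network namespace we were started in
        "--die-with-parent",
    ]
    # `ip netns exec` needs root; the restricted namespace is set up elsewhere.
    return ["ip", "netns", "exec", netns, *bwrap, *command]


example = sandbox_argv("llm-only", "/home/me/project", "/home/me/.pi", ["pi"])
print(shlex.join(example))
```

The read-only bind comes first because later bind mounts override earlier ones for the paths they cover, which is how the two writable directories are carved out of an otherwise read-only view.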
Comment by derekered 1 day ago
Comment by russelg 1 day ago
Comment by jrflo 1 day ago
Comment by cahaya 23 hours ago
Sorry for hijacking the convo, but you (with local models) are my target audience in terms of hardware.
Is anybody willing to test my new app https://document.bot? It is like Cursor IDE but with a custom harness for knowledge work (PDFs, MS Office files, etc.).
You can connect your existing offline LLM models through LMStudio, Ollama, or app managed LLM models (Qwen3.5, Gemma 4, etc)
Might have to make a new Ask HN post for this, but again, you are users with good hardware setups.
Comment by 627467 1 day ago
How much does this wear out the hardware?
Also, if privacy is the main reason for running local models, why not use venice.ai and equivalent?
Comment by jmichaelson 1 day ago
I'm still optimizing it (with claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.
Comment by maelito 22 hours ago
Comment by drnick1 1 day ago
- What "stack" do you recommend? Llama.cpp + OpenCode?
Comment by ecshafer 1 day ago
Comment by v3ss0n 1 day ago
Comment by carlossouza 1 day ago
Comment by deepvibrations 22 hours ago
These models are still very capable with good hardware, but they do lack the deep reasoning of major models and require more precise prompting.
So unless you really need the privacy, or have a lot of excess cash, it is not recommended: compared with the price of the major hosted models, it's just extremely cost-inefficient!
Comment by jmward01 1 day ago
Comment by overgard 1 day ago
Comment by qu0b 1 day ago
Comment by fortyseven 1 day ago
Comment by kristianpaul 1 day ago
Nemotron super 3 110B works well for 1M context long vibecoding sessions
I also use Pi harness with no extension
Comment by c16 23 hours ago
On a 64GB M4, I find it's able to do things fairly well. The few times I run out of tokens, I hop over to that and I'm mostly unimpeded. I compare it to the Haiku models, where you have to go in and be surgical about your changes, or like others have said, guide a junior.
On a 32GB M5, I find that it works, but around the 30% ctx threshold it slows down quite substantially, so you need to be more surgical in your requests. I'll often just have my IDE open and Claude. But maybe I've been too comfortable talking to Sonnet/Opus and so forget I need to be more deliberate in my requests.
My finding here is that the harness is a big part of the problem. CC seems to be very good with Qwen in my experience. Better than OpenCode.
I also run DeepSeek for some other non-structured data tasks and to generate a to-do out of that. That's not coding, so won't go into that, other than to say it's very competent as a small model left to run in the background and automate small parts of my life and process.
tl;dr it's totally doable on a 32gb mbp using ollama, but be precise in your requests and guidance.
Comment by mark_l_watson 1 day ago
I did just publish a free to read online book "The Rise of Local Coding Agents" [1] where I document my setup that I enjoy using. I use little-coder (built on pi) and have good results for small Python and TypeScript applications. I struggle getting good results with Common Lisp and Clojure.
For me, the problem with all local LLM-based coding agents is slow runtime.
Comment by _davide_ 1 day ago
Comment by sosodev 1 day ago
I think it also helps that I'm using my machine to do home server stuff. It excels at all of the traditional workloads. Then I can lean on the AI to help with automation here and there. I find it deeply satisfying.
Comment by _davide_ 1 day ago
Comment by sosodev 5 hours ago
I've also been using Deepseek V4 pro/flash for some work stuff and I do find them to be much closer to frontier capability. I may try running flash at home soon for very patient edits. :)
Comment by hegdeezy 1 day ago
Comment by redox99 1 day ago
It's kind of like driving a shitbox. It can often drive you from A to B, and some people will try to convince you it's fine. It's not.
There's no logical reason other than absolutely requiring the privacy, doing it for fun, or niche use cases like airplanes and so on. If you can't spend the insanely subsidized $20 for codex, you can use an API for chinese models which will run circles around these tiny models.
Comment by pbasista 1 day ago
Is that characterization based on some objective facts or benchmarks?
Comment by kube-system 1 day ago
Comment by redox99 1 day ago
Comment by xgulfie 1 day ago
Comment by orangeisthe 1 day ago
Comment by cayley_graph 1 day ago
I suspect many will realize millions more dollars are being spent than needed to achieve the highest marginal productivity gains, and reallocate accordingly. Who wants more of their money going to developer tooling, rather than bonuses?
Comment by orangeisthe 1 day ago
That's way more economical and produces far better results than any self-hosted model today.
Comment by pdyc 1 day ago
harness - pi+custom extension for subagents
model - qwen3.6 35ba3b q4km
hardware - intel arrow lake with 32gb ram
server - llama.cpp vulkan
performance - 15-18t/s generation 50-150t/s pp
Planning and task creation still use claude/gpt, but they don't touch the code. All coding is done using this setup.
Example of a project made using this setup: easyanalytica.com. It's of medium complexity.
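To get a feel for what those throughput numbers mean per agent turn, here is a rough back-of-the-envelope sketch. The token rates are the ones quoted above; the 8,000-token prompt and 500-token reply are assumed sizes for illustration, and the model ignores prompt caching, which would shorten repeated turns.

```python
def turn_seconds(prompt_tokens, output_tokens, pp_tps, gen_tps):
    """Estimated wall time for one turn: prompt processing plus generation."""
    return prompt_tokens / pp_tps + output_tokens / gen_tps


# Assumed workload: 8,000 tokens of context in, 500 tokens of code out.
slow = turn_seconds(8000, 500, pp_tps=50, gen_tps=15)    # low end of the quoted rates
fast = turn_seconds(8000, 500, pp_tps=150, gen_tps=18)   # high end of the quoted rates
print(f"{fast:.0f}s to {slow:.0f}s per turn")
```

Under these assumptions prompt processing, not generation, accounts for most of the wait, which fits the advice elsewhere in the thread to keep the context small.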
Comment by anuramat 1 day ago
Comment by julianlam 1 day ago
Qwen 3.6 35B-A3B on a Framework 13 with 32GB of memory.
Running llama.cpp, 15 tokens per second. Outputs code and text faster than I can parse.
Comment by bagol 1 day ago
Comment by sukuva 1 day ago
Comment by AH4oFVbPT4f8 1 day ago
Comment by xeonax 1 day ago
Comment by AH4oFVbPT4f8 1 day ago
Comment by SkitterKherpi 1 day ago
Comment by sermakarevich 1 day ago
- smarter models to create tasks
- local qwen3.6:36B for tasks execution
Here is how, in detail: https://news.ycombinator.com/item?id=48520757
Comment by ElenaDaibunny 22 hours ago
Comment by catapart 1 day ago
Comment by Rzor 1 day ago
It's faster than I can read, but it feels slow as hell. I think 40-50 tks is probably much more comfortable and I hope I can reach that when trying this on llamacpp soon enough.
[0] - https://pastes.io/9gaARxE8
[1] - https://jsfiddle.net/pou4nbh9/1/
Model: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gg...
Comment by agentbc9000 1 day ago
Comment by bentt 1 day ago
Comment by codelion 1 day ago
Comment by jwr 1 day ago
Sure, you can get the local models to generate plausible-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.
Comment by euroderf 1 day ago
Comment by alimbada 20 hours ago
Comment by euroderf 18 hours ago
Comment by SugarReflex 1 day ago
Comment by chungus 1 day ago
I think it's so good that I now scour the local marketplaces for good buys on 24GB cards that don't seem run through by miners and the likes, to build an even bigger rig for parallel execution.
Power usage is also totally not an issue, AI workload is very different from gaming.
tldr llama.cpp-vulkan with opencode on total 48GB VRAM AMD cards on arch btw.
Comment by system2 1 day ago
Comment by ColonelPhantom 1 day ago
You can just about reach the lower end of the latter category with a 128GB machine like a DGX Spark, Framework Desktop, or M5 Max, though those are usually not super fast. For the former category, you can easily run them fast with something like a 3090 or 5090, hell, probably even a 5060 Ti.
Comment by system2 1 day ago
Comment by CamperBob2 1 day ago
Comment by Razengan 1 day ago
Like how we've had SETI at Home, Folding at Home, BitTorrent etc. People are clearly willing to donate their computer resources to distributed projects.
Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.
Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.
Comment by joshuamoyers 1 day ago
Comment by SimianSci 1 day ago
(TLDR; Distributed compute for models will require hardware at a level only really possible with data-centers at the moment.)
Token generation demands so much from a single GPU that it will often saturate the bandwidth of consumer-grade interconnects like PCIe. This fundamentally implies that distributing a model's compute across vast distances is too much of a challenge without significant infrastructure.
To give an example: when we split a model's compute between two separate cards in a single workstation, we don't end up with 2x the compute bandwidth for the model. Instead the increase is something small, like 20% depending on the model, because the interconnect (PCIe on consumer hardware) quickly becomes so saturated with data being copied between the two GPUs that it becomes a bottleneck. And remember that this happens locally over PCIe, which caps out at around 20-35 GB/s depending on the generation of the motherboard.
Model performance is very much tied to having the fastest, highest-bandwidth single card available, so as to keep data transfer operations to a minimum, because the sheer volume of data needed for the model to run is immense. I simply can't imagine how slow and unusable a model would be if the copy operations needed for its compute had to be performed over the internet, where speeds vary widely across the globe and the unreliable links would demand extra overhead for data verification.
The dream of distributed AI is a ways off.
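The bandwidth gap can be put into numbers with a small sketch. The PCIe figure is the low end of the range quoted above; the 100 Mbit/s home uplink and the 1 GB payload are assumptions chosen only to show the scale, since real per-token traffic depends on how the model is split.

```python
def transfer_seconds(gigabytes, gb_per_s):
    """Time to move a payload at a given sustained bandwidth."""
    return gigabytes / gb_per_s


def mbit_to_gb_per_s(mbit_per_s):
    """Convert megabits per second to gigabytes per second (decimal units)."""
    return mbit_per_s / 8 / 1000


pcie = transfer_seconds(1.0, 20.0)                     # local PCIe, low end of the quoted range
home = transfer_seconds(1.0, mbit_to_gb_per_s(100))    # assumed 100 Mbit/s uplink
print(f"PCIe: {pcie:.2f}s, home uplink: {home:.0f}s, ratio: {home / pcie:.0f}x")
```

With these assumed numbers the same copy is about three orders of magnitude slower over a home connection, before latency and packet loss are counted.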
Comment by anubhav200 1 day ago
Comment by nynrathod 23 hours ago
Comment by wmedrano 1 day ago
Comment by drnick1 1 day ago
Comment by jboss10 1 day ago
Comment by shironnnn_ 1 day ago
Comment by devin 1 day ago
Comment by christkv 1 day ago
Comment by w10-1 1 day ago
For months I spent time curating the AI+harness+skills+MCP servers, but now mainly just code with it. I find myself not bothering to use Claude (but keep paying "just in case").
That's feasible in part because my prompts have very specific objectives, constraints, and suggested staging, because I want the code to be exactly as I would write it, and I want to weigh in at specific moments. I would say the speed-up is 2-4X instead of the 10X of vibe-coding greenfield projects. The problem is not the coding speed, but building something complicated that's also correct and flexible (i.e., a directional accuracy). E.g., the agents help with abandoning a less-fruitful API shape instead of sticking with what works at a local maximum.
One flaw there is that I'm still writing code that feels clean to humans, which now is probably a waste. LLMs might be happier with 10+ parameters on one API instead of a plethora of configuration objects and convenience wrappers.
Comment by sometimelurker 1 day ago
Comment by major505 1 day ago
Comment by Der_Einzige 1 day ago
The secret to actually good agentic outputs even with small models? Llamacpp has support for this little known sampler called "top-n sigma". You should use that, set it to 1 and set temperature to literally whatever you want (it could be infinity) and your model will just magically work to your maximum context window. That's because long context generation is a sampling problem.
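As a rough illustration of the idea behind top-n sigma (not llama.cpp's actual implementation), the sampler keeps only tokens whose logit lies within n standard deviations of the top logit, then applies softmax to the survivors. The function name is hypothetical, and using the population standard deviation is an assumption of this sketch.

```python
import math


def top_n_sigma_probs(logits, n=1.0, temperature=1.0):
    """Return {token_index: probability} after top-n-sigma filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    mean = sum(scaled) / len(scaled)
    sigma = math.sqrt(sum((x - mean) ** 2 for x in scaled) / len(scaled))
    threshold = peak - n * sigma          # cutoff measured from the best token
    kept = [i for i, x in enumerate(scaled) if x >= threshold]
    weights = [math.exp(scaled[i] - peak) for i in kept]  # numerically stable softmax
    total = sum(weights)
    return {i: w / total for i, w in zip(kept, weights)}


logits = [10.0, 9.5, 2.0, 1.0, 0.0, -1.0]
cool = top_n_sigma_probs(logits, n=1.0, temperature=1.0)
hot = top_n_sigma_probs(logits, n=1.0, temperature=100.0)
print(sorted(cool), sorted(hot))
```

Because the peak and the standard deviation scale together, the surviving set is the same for any finite positive temperature; a higher temperature only flattens the probabilities among the survivors. That is the property the comment relies on when it says the temperature can be set to almost anything.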
Comment by devmor 1 day ago
The only reason it’s economical is because it’s massively discounted if you’re not paying API rates.
Comment by hacker_homie 1 day ago
Comment by lowbloodsugar 1 day ago
Comment by platevoltage 1 day ago
Comment by epolanski 1 day ago
That said, I plan to move to local ones when I get my hands on a 256+ GB MacBook.
Local inference is good enough to help me with my daily job, and doesn't turn me into an assistant to the LLM.
Comment by jay_kyburz 1 day ago
If I give it a page of context, can it write a linked list or identify a bad line of CSS?
Is there anywhere online I can chat with a model I could be running at home to see how good it is?
Comment by thrownaway561 1 day ago
67M output, 51M input
Total: $0.83.
I honestly don't understand why people just don't use DeepSeek.
Comment by ThomasGlanzmann 1 day ago
Comment by codemk8 1 day ago
Comment by ThomasGlanzmann 1 day ago
Comment by slvnx 20 hours ago
Comment by thrownaway561 12 hours ago
Comment by jeffrallen 1 day ago
Comment by gigatexal 1 day ago
Comment by dude250711 1 day ago
Recommended setup: plenty of nutrients, some caffeine and a quiet environment.
Performance - not currently measured in tokens: roughly average.
Comment by jasongill 1 day ago
Comment by bananadonkey 1 day ago
Comment by HPsquared 1 day ago
Comment by deployementeng 1 day ago
Comment by queeshonda 19 hours ago
Comment by DetroitThrow 19 hours ago
Comment by syngrog66 1 day ago
Comment by salutonmundo 1 day ago
Comment by cyanydeez 1 day ago
If you're shopping for a new PC, it's very easy to justify 128GB of VRAM.
Comment by 3vo-ai 52 minutes ago
Comment by sanchitmonga22 5 hours ago
Comment by hectortemich 6 hours ago
Comment by fouadlvlup 17 hours ago
Comment by Littice 1 day ago
Comment by kordlessagain 1 day ago
Comment by o2zer0cool 17 hours ago
Comment by HardAnchor 1 day ago
Comment by advertum 17 hours ago
Comment by echoforgex 9 hours ago
Comment by thousandflowers 20 hours ago
Comment by KaiShips 1 day ago
Comment by huangchengsir 17 hours ago
Comment by daischsensor 1 day ago
Comment by aplomb1026 1 day ago
Comment by arggjarvs 1 day ago
Comment by impara 19 hours ago
Comment by hottrends 1 day ago
Comment by mehdibmm 18 hours ago
Comment by pjrog 18 hours ago
Comment by mantlemd 1 day ago
Comment by Pranavsingh431 20 hours ago
Comment by phlhar 1 day ago
Comment by temilson 1 day ago
Comment by eugmai86 1 day ago
Comment by adam_patarino 20 hours ago
Comment by startuphakk 21 hours ago
Comment by adam_patarino 20 hours ago
Comment by ericmaciver 1 day ago
Comment by adam_patarino 18 hours ago
Comment by codelong888 1 day ago
Comment by nicechianti 17 hours ago
Comment by iluvcommunism 1 day ago
Comment by shell0x 18 hours ago
Comment by aiexpo_app 1 day ago
Comment by lasky 1 day ago
Comment by frabcus 23 hours ago
Back in the 1990s the good C++ compilers were proprietary, eventually GCC and LLVM caught up, and now dominate. The pattern repeats in software development, and there's no reason to believe it won't continue.
Yes, right now it makes sense to use Opus 4.8, but it is good that a significant number of people are using other options, and making sure they work and are ready for when you need them.
Plus it is extremely fun and connecting and hackerish to do local coding with a local model. Try it.
Comment by fouadlvlup 12 hours ago
Comment by tyingq 1 day ago
Comment by dada216 1 day ago
Disclaimer: I am a Linux infra/k8s guy, I write production code but it's mainly glue code and mainly in golang.
Addendum: most value we get is from "document intelligence" and that's all Gemma and Qwen on H100/H200
Comment by kertoip_1 1 day ago