How to setup a local coding agent on macOS
Posted by kkm 5 days ago
Comments
Comment by Aurornis 5 days ago
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short testing can give false positive speedups.
llama.cpp includes a tool specifically for benchmarking that will sweep the arguments for you so you don't have to restart the server and send it prompts:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
EDIT: Also the section about downloading the models should have mentioned that llama.cpp has a "-hf" argument that will download the models for you. I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.
Comment by freerunnering 4 days ago
Yeah, I didn't write this as a proper developer guide. My screen recording started getting loads of favourites and I started getting messages asking about how I set it up, so just through up a quick rundown of how I setup this test.
I little just saw the Unclothe announcement about "Double the speed" and thought "Ha. I wonder if that will get it fast enough I'd actually be prepared to use it" and had a go at setting it up.
I'd done tests before last year with things like Devstral, but they were always both so slow and dumb, I didn't want to bother.
This finally hit the "wow, this is useable" level of both speed and intelligence.
Comment by Phemist 4 days ago
Are you sure you did not mean Unsloth?
Comment by threecheese 4 days ago
Comment by Phemist 4 days ago
Comment by freerunnering 3 days ago
Comment by liuliu 5 days ago
llama.cpp includes tools for that, what you are looking at is to have a prefill before token generation to measure it properly. Increasingly also, measuring token generation speed at longer context (32k or 64k) is important too.
Comment by reactordev 4 days ago
Comment by willXare 4 days ago
Comment by lloyd-christmas 4 days ago
Comment by ig0r0 5 days ago
Comment by krzyk 4 days ago
As for oprncode, doesn't the system prompt eat too much of the context? Local models are really constraint in regards contex, and opencode AFAIR uses a 10k of it or some thing close.
Comment by naikrovek 2 days ago
I wish people would stop wasting their outrage budget on things like this and pay more attention to politics.
Comment by krzyk 2 days ago
Technical people are rather good at learning new things, and ollama situation is a good learning experience.
llama.cpp gets you more tokens/s even if you ignore ollama team bad behavior.
Comment by naikrovek 8 hours ago
Politicians count on your apathy so they can get away with their horseshit. You paying attention is kryptonite to crooked politicians.
Comment by ingvay7 4 days ago
Comment by amrtn 3 days ago
Comment by ig0r0 3 days ago
Comment by sleepybrett 5 days ago
Comment by takethebus 5 days ago
Comment by mark_l_watson 4 days ago
Comment by ig0r0 3 days ago
Comment by carter2099 2 days ago
Comment by ig0r0 2 days ago
Comment by carter2099 1 day ago
FWIW, I took the dive on Pi today and I’m really happy with my decision so far
Comment by ig0r0 13 hours ago
Comment by c-hendricks 5 days ago
LLAMA_CACHE="models" ./llama-server \
-hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
...Comment by dofm 5 days ago
-hfd for the draft model.
Comment by c-hendricks 5 days ago
Not knocking huggingface-cli, just find it's much easier for people to try out this stuff when they can just
mise use --global github:ggml-org/llama.cpp
LLAMA_CACHE="models" llama-server \
-hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \
--host 0.0.0.0 \
--port 11434 \
...Comment by dofm 4 days ago
—no-mmproj
is also pretty useful if you're doing this just to try agentic coding and you're not processing images/voice. Stops it downloading the multimodal projector.Comment by jumploops 4 days ago
It feels like a GPT-4 class model in terms of "stored knowledge" but is better at long-horizon tool calling than any of the GPT-4 class models.
Running on a 128GB MBP M4 Max, I'm getting ~24 t/s on generation and ~200 t/s on prefill. I was expecting it to feel slow, and it certainly does when e.g. generating code, but it's surprisingly useful as a "machine orchestrator" for simple tasks.
For non-agentic usecases, it's a decent enough model to converse with, and has the benefit of being entirely self-contained/private.
Comment by vladgur 5 days ago
You would not need to follow a blog post with omlx IMHO
Comment by Dotnaught 5 days ago
Comment by dofm 5 days ago
Comment by zmmmmm 4 days ago
sbx policy set-default open
just so the single pi sandbox can talk to localhost? ... this gives me some grave doubts about the rest of it being set up well.Comment by dofm 5 days ago
The Gemma 4 MLX builds I have found so far have been slower at the same quantisation and much slower with MTP.
The built-in web UI for llama.cpp is really quite good once you have chosen your model. Otherwise I quite like LM Studio for tinkering.
One thing I would say is that both Gemma-4 and Qwen 3.6 simply do not need a large chunk of the typical opencode system prompt. Better off without it.
Comment by fouc 4 days ago
Comment by amboo7 4 days ago
Comment by dofm 4 days ago
I'm using the GGUF too; it appears slightly faster in llama.cpp now than current LM Studio but it's not clear to me if that is down to LM Studio having a little more code overhead, older llama.cpp under the hood, or just parameter differences.
Comment by amboo7 4 days ago
Comment by dofm 4 days ago
I have not done in-depth, really controlled testing and there is much about performance tuning I don't understand, but it's fairly clear to me that on an M1 Max, MLX does not have the massive advantage it may have on other machines or other models.
It is wholly possible that MLX is _much_ better on the M3 and up, because the neural engine is that much better.
Frankly I think llama.cpp may simply have caught up quite a lot.
MTP is the same issue. There is always a chance that adding a separate MTP draft model has more compute overhead than it brings in terms of speedup, and since I am using an older machine and the MoE models, I am not actually in a zone where MTP can actually add much. What happens is that there's an enormous advantage in speed handling while the prompt and the early reasoning and it then tails off dramatically to be worse, on average, than non MTP.
(Qwen 3.5 35B shows, possibly, a small advantage if its internal MTP is enabled. But it is small — 10% maybe.)
For the 26B Gemma 4, MLX and MTP combined were noticeably slower than the GGUF is with llama.cpp.
If it were a newer machine with a larger, dense model, I'd definitely expect to see an advantage from MTP, and it is possible that there are some parameters I can tweak (duplicate token penalty, temperature, shared cache stuff) that give MTP more of an edge (keep its successful prediction rate higher).
Either way, it feels like the smallish gain I will see on this particular bit of kit might not be worth the long, long journey down that rabbit hole right now.
Comment by fridder 5 days ago
Comment by benbojangles 5 days ago
Comment by jmkni 5 days ago
Comment by echelon 5 days ago
I'm not Googling much of anything anymore. 9/10 times the information is awful, it's hard to parse out of whatever other spam it's surrounded by. Meanwhile, Claude will just do the thing one-shot or with a tiny bit of refinement.
The gateway to knowledge and getting stuff done is the LLM.
Google Search is a dinosaur.
It feels like we're living a century into the future. Not even smartphones were this cool.
Comment by kingofthehill98 5 days ago
Comment by echelon 5 days ago
https://newsletter.pessimistsarchive.org/p/when-educators-mo...
New decade, same old argument.
It's not
> "Claude, think for me"
It's
> "Claude, be my subordinate and get this done for me"
Instead of complaining on the sidelines, I'm getting a shit ton of work done.
Comment by ultrarunner 5 days ago
Yeah, good ol' present for me too then, thanks.
Comment by wwweston 4 days ago
An argument can be as old as the search engine and hold real value. There are ways in which unreflective search engine use has misled and mistrained people.
There’s always been argument to be had about how we manage and offload attention, what we gain and what we lose when resistance is reduced. It’s part of reflection that’s been necessary in order to make progress solid ground, and is more necessary with non-deterministic tech.
The phrase “Tactical tornados” may be older than web search and describes people who also got a lot done.
Models can be incredibly helpful boosters and situationally effective subordinates… and also patchy as a real engineering IC or org.
Comment by this_user 5 days ago
Nah, you are just producing a bunch of slop and hope that nobody notices.
Comment by sdevonoes 5 days ago
It’s weird when people are proud of doing ton of work. Im the opposite, Im proud that Im doing minimal stuff without llms.
Comment by dominotw 4 days ago
maybe you stopped thinking too much that you dont regonize that you are just producing slop that no one cares about.
AI is now getting humans to produce slop
Comment by coldtea 4 days ago
>"Claude, be my subordinate and get this done for me"
Since "this" is thinking, then the two formulations are equivalent.
>Instead of complaining on the sidelines, I'm getting a shit ton of work done.
Until you no longer have a job and are drowned in slop.
Comment by tobyhinloopen 5 days ago
Comment by iammrpayments 4 days ago
Comment by coldtea 4 days ago
The WALL-E chair-people future.
Comment by dofm 5 days ago
I am not convinced that the MTP setup for the QAT model adds very much in terms of speed on my M1 Max, but it is definitely worth experimenting with.
Fiddling about with local models has done so much for my conceptual understanding of what is going on.
FWIW and YMMV but I also found the Gemma 4 MTP head was occasionally breaking markup in Opencode, causing the thinking to display untidily and ultimately in some cases missing the stop token. So I've stopped using MTP there for now.
Recent Qwen 3.6 models have developer role support so it will occasionally surprise you with a structured multiple choice questionnaire.
Comment by mft_ 5 days ago
Comment by freehorse 5 days ago
Comment by dofm 5 days ago
Comment by smcleod 4 days ago
Comment by mft_ 4 days ago
(I generally find standard 27B too slow to enjoy using, whereas 35B-A3B is pretty snappy.)
Comment by dofm 5 days ago
I do enjoy their different personalities when they are tackling "explain this" type puzzles, though.
Gemma writes so well — like a concise code blogger. It makes you understand that the thing we hate about AI slop writing is specifically the cheesy, marketingese sycophantic ChatGPT tone. It's a choice to sound that way.
Qwen writes more tersely by default, like much english language documentation in Chinese open source projects. A couple of lines, code example, fact, code example, line of blurb.
I use this prompt every now and then with a new model. It's obviously a classic SQL puzzle but I've asked new web developers this in the past (prompted by discovering that a client's subcontractor didn't understand it and was therefore unable to migrate some code from relying on dodgy pre-MySQL 5.x behaviours)
—
I have a MySQL 5 table like this: [id, label, category, score]. It contains a list of items in different categories (text names like cat1, cat2, cat3) with a numerical score. Is there a way I can write a SQL query to find the item in each category that has the highest score, without using a subquery? No two entries in any category share a score.
—I enjoy seeing what it deduces from the subtext.
Without "thinking" mode on, they always initially fail and you need to prompt them to find the answer. With thinking mode, they both produce really nice explanations.
For me, as an old freelancer who is pretty cynical about vibe coding or "agentic engineering", what I really want is an AI tool that can help me start to solve problems and help me find the right terminology or generate some boilerplate I can tinker with. Both of these models do fine at the kind of "starter" writing that I want when I am trying to untangle an idea.
Comment by mark_l_watson 4 days ago
Comment by ljosifov 4 days ago
https://github.com/ljubomirj/ds4/tree/reap-compact-support
DS4F dropping to unusable <10 tok/s only at 784K context (!!) makes a big difference.
Comment by reddit_clone 5 days ago
Thats the rub. I have an M4 with 48G. I wonder if it is worth testing this out.
My past attempts (with Ollama and various LLMs) were too slow to use.
Comment by hkchad 5 days ago
Comment by dofm 4 days ago
If you're seeking the kind of hands-off claude experience, obviously not. They are slow.
If you want to learn how these things work, train them locally, tinker, play with the code, grasp the fundamentals, or just out of sheer bloody-mindedness and principle refuse to tether the functioning of your application to a cloud API...
Comment by leemoore 4 days ago
Comment by jillesvangurp 4 days ago
I've dabbled with Qwen 3.x and Gemma 4 models a bit. They are alright but not that impressive. And my mac gets super hot if I use them for extended periods of time. It's just not very nice to use locally.
Comment by iluvcommunism 5 days ago
Comment by dofm 5 days ago
But if you just want to play around rather than code, you really might find the Gemma 4 12B model worth mucking about with just so you've gone through the steps. Especially if you want to muck about with image analysis or audio transcription.
If you're writing PHP I think you could even find it good enough. I've been modestly surprised. You can do that basic fiddling with the Edge AI Gallery app, which can enable thinking and has a customisable system prompt and some agent support.
You could also try the 14B Deepseek R1.
Honestly even if it is not good enough, if you are anything like me, I think you'll find that going through this process is really quite educational — it has made a lot of things more concrete for me in a way that I have found reassuring and valuable.
Comment by contingencies 5 days ago
Comment by spike021 4 days ago
Comment by contingencies 4 days ago
Comment by codazoda 5 days ago
Comment by mark_l_watson 4 days ago
Comment by codazoda 4 days ago
> NoSuchKeyThe specified key does not exist…
Comment by c0rruptbytes 4 days ago
omlx + gemma 12b 6 bit + pi
it’s feasible for sure
MoEs for speed (qwen 35b, cohere 30b, gemma 26b)
Dense for more methodical work (qwen 27b [reigning champ], gemma 31b, gemma 12b)
MoE i recommend 5bit+
Dense i think 4 bit is okay
Play with your context size, you don’t really need that much, have lazy loading for tools and mcps
my pi extensions for anyone looking for a skinny quick setup, i have use `--no-skills` right now too:
"npm:pi-codex-goal",
"npm:pi-simplify",
"npm:pi-mcp-adapter",
"git:github.com/elpapi42/pi-minimal-subagent",
"npm:@wierdbytes/pi-statusline",
"npm:@aliou/pi-guardrails",
"npm:pi-lens",
"npm:@juicesharp/rpiv-todo",
"npm:pi-hashline-readmap",
"npm:@mrclrchtr/supi-review",
"npm:pi-cmux",
"npm:@mrclrchtr/supi-context",
"npm:pi-tool-search"
think of local models as "zero sugar" models and that's where we're at right now. I think it's crazy how good these models are compared to last year's frontier modelsComment by krzyk 4 days ago
Also there are smaller, still usefull models that can run on 8GB or less.
Comment by jmkni 4 days ago
Comment by LoganDark 5 days ago
Comment by hanifbbz 5 days ago
One way or another local AI is the future. I actually find weaker models more interesting because it keeps me sharp (at the cost of velocity of course).
Comment by mark_l_watson 4 days ago
I run something very similar except for directly using pi as the agentic harness I use little-coder that wraps pi with reasonable defaults for running local models. Even though my local setup is a bit slow, it is a thrill to do real work completely locally.
Comment by bluerooibos 4 days ago
Comment by d4rkp4ttern 4 days ago
https://pchalasani.github.io/claude-code-tools/integrations/...
Comment by ricardobeat 4 days ago
Comment by d4rkp4ttern 4 days ago
Comment by alexwwang 4 days ago
Comment by hmontazeri 4 days ago
Comment by godfathermway 4 days ago
Comment by reenorap 5 days ago
Comment by frollogaston 4 days ago
Comment by ozim 4 days ago
So there is no value in testing quality of answers, but there is value in testing token speed.
You just have to have correct expectations.
Comment by krzyk 4 days ago
For me local models is all about quality, and how to achieve that - e.g. by providing guardrails that test the job done.
Comment by jmkni 4 days ago
It solve the "I'm coding on the plane and need to look up this thing I've forgotten" problem, for me at least
Comment by akman 4 days ago
Comment by namnnumbr 5 days ago
Comment by w10-1 5 days ago
oMLX does the caching I need to fit models that are near gross memory, and it handles most of the work in finding usable models. After cobbling together various solutions over months, I now just use oMLX, often from Xcode. I can tell the difference between Gemma-4 (local/free) and Claude (paid) only on the largest tasks.
Comment by amboo7 4 days ago
Comment by anigbrowl 4 days ago
Alas, this video appears not have been linked to the text that describes it. Perhaps I should ask an AI to generate an artistic rendering of the author's description.
Comment by freerunnering 4 days ago
Comment by anigbrowl 4 days ago
Comment by smetannik 4 days ago
Comment by b3ing 4 days ago
Comment by freerunnering 4 days ago
Comment by cdolan 5 days ago
Comment by dewey 5 days ago
Comment by c-hendricks 5 days ago
Comment by freerunnering 4 days ago
Plus a followup one where you see me type the question in and press enter (though that video is with Qwen 3.6, not Gemma 4) https://x.com/Freerunnering/status/2065354101878055038
Comment by bicepjai 4 days ago
Comment by dofm 4 days ago
But there is an incentive not to use it if you want to write an article that uses only open-source tools, because it isn't.
Comment by krzyk 4 days ago
Basically one has two real choices for local LLMs: llama.cpp (if single user) or vLLM (if multi-user/enterprise).
Comment by stingraycharles 4 days ago
Comment by attogram 5 days ago
Comment by benbojangles 5 days ago
Comment by Obscurity4340 4 days ago
Comment by rectang 5 days ago
Comment by benbojangles 5 days ago
Comment by metadaemon 5 days ago
Comment by CharlesW 5 days ago
Comment by everlier 4 days ago
harbor up omlx opencode
Comment by yesitcan 4 days ago
Comment by k2enemy 4 days ago
When used as a verb, it should be "set up," and when used as a noun, "setup."
Other examples (verb, noun):
log in, login
back up, backup
shut down, shutdown
break down, breakdown
warm up, warmup
Comment by sleepybrett 5 days ago
is this article old? It's not. I'm not sure why he went through all the bother of llama.cpp
Comment by malkosta 5 days ago
Comment by sleepybrett 5 days ago
Comment by freerunnering 4 days ago
Original video: https://x.com/Freerunnering/status/2065275403548168398
And in the blog post there is a table showing the different speeds I got from different engines.
Slowest combo was 38.1 tk/s, and the fastest was 72.2 tk/s. All from "the same" model.
Comment by malkosta 1 day ago
Comment by krzyk 4 days ago
Also Ollama has other issues (like forgetting what it really is - a wrapper).
Comment by koliber 4 days ago
Comment by zftnb666 2 days ago
Comment by flowbarai 5 days ago
Comment by knightops_dev 4 days ago
Comment by aplomb1026 5 days ago
Comment by jlintc 4 days ago
Comment by teiji-tango 4 days ago
Comment by new_usemame 4 days ago
Comment by datadrivenangel 4 days ago
Comment by jkwang 4 days ago
Comment by tosief 4 days ago