How to setup a local coding agent on macOS

Posted by kkm 5 days ago

Comments

Comment by Aurornis 5 days ago

> The benchmark prompt was:

> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.

> Each benchmark generated about 128 tokens.

Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short testing can give false positive speedups.

llama.cpp includes a tool specifically for benchmarking that will sweep the arguments for you so you don't have to restart the server and send it prompts:

https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...

EDIT: Also the section about downloading the models should have mentioned that llama.cpp has a "-hf" argument that will download the models for you. I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.

Comment by freerunnering 4 days ago

> I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.

Yeah, I didn't write this as a proper developer guide. My screen recording started getting loads of favourites and I started getting messages asking about how I set it up, so just through up a quick rundown of how I setup this test.

I little just saw the Unclothe announcement about "Double the speed" and thought "Ha. I wonder if that will get it fast enough I'd actually be prepared to use it" and had a go at setting it up.

I'd done tests before last year with things like Devstral, but they were always both so slow and dumb, I didn't want to bother.

This finally hit the "wow, this is useable" level of both speed and intelligence.

Comment by Phemist 4 days ago

I wasn't familiar with Unclothe, so I had to look it up..

Are you sure you did not mean Unsloth?

Comment by threecheese 4 days ago

They likely did, and this autocorrect slip might suggest why OP is using local models :)

Comment by Phemist 4 days ago

Indeed, a clear Freudian slip. The one where you say one thing, but you mean your mother.

Comment by freerunnering 3 days ago

For some reason every time I type "Unsloth" macOS auto corrects it to "unclothe". It did it now, writing this reply. It's really annoying!

Comment by liuliu 5 days ago

Realistically, you need to experiment with any user prompt + a good amount of system prompt (at least > 1000 tokens, but realistically, in the range of 3000 tokens probably good).

llama.cpp includes tools for that, what you are looking at is to have a prefill before token generation to measure it properly. Increasingly also, measuring token generation speed at longer context (32k or 64k) is important too.

Comment by reactordev 4 days ago

This is akin to saying “it runs on my machine” without actually examining the problem. Sad. You’re absolutely right that 128 tokens is nothing, it’s a little more than a hello response.

Comment by willXare 4 days ago

[flagged]

Comment by lloyd-christmas 4 days ago

I thought the same thing when I started using locals, but the reality is that - for a given context depth - the token generation speed doesn't change whether it's 128 or 8000, it just lengthens the benchmark run time.

Comment by ig0r0 5 days ago

I wrote a similar post some time ago just used ollama and opencode https://blog.kulman.sk/running-local-llm-coding-server/

Comment by krzyk 4 days ago

Ollama is not a good choice - https://sleepingrobots.com/dreams/stop-using-ollama/

As for oprncode, doesn't the system prompt eat too much of the context? Local models are really constraint in regards contex, and opencode AFAIR uses a 10k of it or some thing close.

Comment by naikrovek 2 days ago

Ollama seems fine to me, technically. it works. why wouldn't someone use it? Because someone else doesn't like it? if it cost money, I'd pay more attention to the people behind it, but I don't, so I don't.

I wish people would stop wasting their outrage budget on things like this and pay more attention to politics.

Comment by krzyk 2 days ago

Focusing outrage on politics is pointless.

Technical people are rather good at learning new things, and ollama situation is a good learning experience.

llama.cpp gets you more tokens/s even if you ignore ollama team bad behavior.

Comment by naikrovek 8 hours ago

Political outrage is only pointless (in the cases when it is actually pointless), because not enough people get outraged.

Politicians count on your apathy so they can get away with their horseshit. You paying attention is kryptonite to crooked politicians.

Comment by ingvay7 4 days ago

[dead]

Comment by amrtn 3 days ago

Did you have any issue with tool calling inside opencode? I tried the same approach, but my models don't see any tool.

Comment by ig0r0 3 days ago

No issues with OpenCode and the Qwen models. Some issues with Pi because it uses different tool calling format, but I solved that with an extension.

Comment by sleepybrett 5 days ago

actually useful and the ollama gui could probably even simplify this more.

Comment by takethebus 5 days ago

this is the way, given anyone could swap for oh my pi / pi / etc

Comment by mark_l_watson 4 days ago

yes, whether for home experiments or at work, it is good practice (good hygiene) to be able to swap out both agentic harnesses and models. It is important to have a good strategy for exporting skills, etc.

Comment by ig0r0 3 days ago

yeah, I am using Pi right now, I switched from OpenCode

Comment by carter2099 2 days ago

I'm considering this right now. Is it very difficult to adapt to the new philosophy? Are you running mostly interactive or programmatic?

Comment by ig0r0 2 days ago

What do you mean by new philosophy? I use Pi the same way I used gpt or sonnet, just for simpler tasks.

Comment by carter2099 1 day ago

Pi is an agent, gpt and sonnet are LLMs. Opencode is an agent, and has a drastically different philosophy than Pi.

FWIW, I took the dive on Pi today and I’m really happy with my decision so far

Comment by ig0r0 13 hours ago

Yeah, I meant using Pi with the local qwen model vs when I was using claude with sonnet or codex with gpt

Comment by c-hendricks 5 days ago

Not sure you really need huggingface-cli to download anything if you're just using llama.cpp. You can pass `-hf ...` and it will download the models for you. Set `LLAMA_CACHE` to change where the downloads go:

  LLAMA_CACHE="models" ./llama-server \
    -hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
    ...

Comment by dofm 5 days ago

Yes.

-hfd for the draft model.

Comment by c-hendricks 5 days ago

Nice, was wondering if there was a flag for the draft as well.

Not knocking huggingface-cli, just find it's much easier for people to try out this stuff when they can just

  mise use --global github:ggml-org/llama.cpp
  LLAMA_CACHE="models" llama-server \
    -hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \
    --host 0.0.0.0 \
    --port 11434 \
    ...

Comment by dofm 4 days ago

  —no-mmproj

is also pretty useful if you're doing this just to try agentic coding and you're not processing images/voice. Stops it downloading the multimodal projector.

Comment by jumploops 4 days ago

I've been quite impressed with DeepSeek v4 Flash running via antirez's ds4[0].

It feels like a GPT-4 class model in terms of "stored knowledge" but is better at long-horizon tool calling than any of the GPT-4 class models.

Running on a 128GB MBP M4 Max, I'm getting ~24 t/s on generation and ~200 t/s on prefill. I was expecting it to feel slow, and it certainly does when e.g. generating code, but it's surprisingly useful as a "machine orchestrator" for simple tasks.

For non-agentic usecases, it's a decent enough model to converse with, and has the benefit of being entirely self-contained/private.

[0]https://github.com/antirez/ds4

Comment by vladgur 5 days ago

I have used omlx.ai with great success to both download multiple mlx models (including gemma and qwen) suited for my hardware AND to be able to automagically launch both open-source and close-source (claude code, codex) harnesses using these models. All from a web or desktop UI

You would not need to follow a blog post with omlx IMHO

Comment by Dotnaught 5 days ago

In case anyone is looking for a sandbox to go with oMLX and Pi: https://github.com/Dotnaught/pi-sandbox

Comment by dofm 5 days ago

This is useful. I'm still tinkering with Multipass VMs because I need the whole VM environment anyway and I'm on Sequoia. But I'd be interested if you did anything like that with Apple's container CLI instead; sooner or later I will have to upgrade to Tahoe because I want to play with the container CLI (and apfel).

Comment by zmmmmm 4 days ago

it looks handy but ...

    sbx policy set-default open

just so the single pi sandbox can talk to localhost? ... this gives me some grave doubts about the rest of it being set up well.

Comment by dofm 5 days ago

FWIW I have not, on a 64GB M1 Max, seen any advantage from oMLX specifically or MLX generally over GGUF with llama.cpp.

The Gemma 4 MLX builds I have found so far have been slower at the same quantisation and much slower with MTP.

The built-in web UI for llama.cpp is really quite good once you have chosen your model. Otherwise I quite like LM Studio for tinkering.

One thing I would say is that both Gemma-4 and Qwen 3.6 simply do not need a large chunk of the typical opencode system prompt. Better off without it.

Comment by fouc 4 days ago

what? you're saying both MLX and MTP have been slower for your mac?

Comment by amboo7 4 days ago

I also have an M1 Max 64GB: Qwen 3.6 benefits from MTP (after rounds of parameter optimization). MLX was unstable (haven't tried it recently), faster at TG but slower at PP, so inconclusive.

Comment by dofm 4 days ago

Yeah. I have not really tinkered much with parameter optimisation for the 35B model with MTP. Would be interested to see what you've found.

I'm using the GGUF too; it appears slightly faster in llama.cpp now than current LM Studio but it's not clear to me if that is down to LM Studio having a little more code overhead, older llama.cpp under the hood, or just parameter differences.

Comment by amboo7 4 days ago

[dead]

Comment by dofm 4 days ago

MLX in the forms I have tried (LM Studio, oMLX) with the models I have tried (Gemma and Qwen) have both been apparently slower, yes.

I have not done in-depth, really controlled testing and there is much about performance tuning I don't understand, but it's fairly clear to me that on an M1 Max, MLX does not have the massive advantage it may have on other machines or other models.

It is wholly possible that MLX is _much_ better on the M3 and up, because the neural engine is that much better.

Frankly I think llama.cpp may simply have caught up quite a lot.

MTP is the same issue. There is always a chance that adding a separate MTP draft model has more compute overhead than it brings in terms of speedup, and since I am using an older machine and the MoE models, I am not actually in a zone where MTP can actually add much. What happens is that there's an enormous advantage in speed handling while the prompt and the early reasoning and it then tails off dramatically to be worse, on average, than non MTP.

(Qwen 3.5 35B shows, possibly, a small advantage if its internal MTP is enabled. But it is small — 10% maybe.)

For the 26B Gemma 4, MLX and MTP combined were noticeably slower than the GGUF is with llama.cpp.

If it were a newer machine with a larger, dense model, I'd definitely expect to see an advantage from MTP, and it is possible that there are some parameters I can tweak (duplicate token penalty, temperature, shared cache stuff) that give MTP more of an edge (keep its successful prediction rate higher).

Either way, it feels like the smallish gain I will see on this particular bit of kit might not be worth the long, long journey down that rabbit hole right now.

Comment by fridder 5 days ago

It truly is the SOTA for local inference on mac. Even when there are regressions the dev(s) are insanely responsive. It is the most impressive opensource project I've seen in a awhile

Comment by benbojangles 5 days ago

Omlx needs to incorporate macos native shortcuts use - macos can almost instantly extract text from pdfs and a bunch of other things using it's ane neural engine keeping unified ram for llm use. The two together would be awesome

Comment by jmkni 5 days ago

FYI you can open Claude code in the terminal, point it at this article and just tell it to "do it", if you're feeling extra lazy

Comment by echelon 5 days ago

This is the way.

I'm not Googling much of anything anymore. 9/10 times the information is awful, it's hard to parse out of whatever other spam it's surrounded by. Meanwhile, Claude will just do the thing one-shot or with a tiny bit of refinement.

The gateway to knowledge and getting stuff done is the LLM.

Google Search is a dinosaur.

It feels like we're living a century into the future. Not even smartphones were this cool.

Comment by kingofthehill98 5 days ago

Yeah, if the future is "Claude, think for me" I'm happy to stay at the good old present.

Comment by echelon 5 days ago

https://en.wikipedia.org/wiki/Is_Google_Making_Us_Stupid%3F

https://newsletter.pessimistsarchive.org/p/when-educators-mo...

New decade, same old argument.

It's not

> "Claude, think for me"

It's

> "Claude, be my subordinate and get this done for me"

Instead of complaining on the sidelines, I'm getting a shit ton of work done.

Comment by ultrarunner 5 days ago

For what it's worth, even this reply reads like LLM output. It's not "quote describing the scenario", it's "some other linked-in-coded plot twist". If you're the average of the people you spend the most time around, and you spend the most time around a chatbot, do you start to absorb its speech patterns and logic structures?

Yeah, good ol' present for me too then, thanks.

Comment by wwweston 4 days ago

As one famous agent said: “I say your civilization because as soon as we started thinking for you it really became our civilization which is of course what this is all about.”

An argument can be as old as the search engine and hold real value. There are ways in which unreflective search engine use has misled and mistrained people.

There’s always been argument to be had about how we manage and offload attention, what we gain and what we lose when resistance is reduced. It’s part of reflection that’s been necessary in order to make progress solid ground, and is more necessary with non-deterministic tech.

The phrase “Tactical tornados” may be older than web search and describes people who also got a lot done.

Models can be incredibly helpful boosters and situationally effective subordinates… and also patchy as a real engineering IC or org.

Comment by this_user 5 days ago

> Instead of complaining on the sidelines, I'm getting a shit ton of work done.

Nah, you are just producing a bunch of slop and hope that nobody notices.

Comment by 4 days ago

Comment by sdevonoes 5 days ago

> I'm getting a shit ton of work done.

It’s weird when people are proud of doing ton of work. Im the opposite, Im proud that Im doing minimal stuff without llms.

Comment by dominotw 4 days ago

> I'm getting a shit ton of work done.

maybe you stopped thinking too much that you dont regonize that you are just producing slop that no one cares about.

AI is now getting humans to produce slop

Comment by coldtea 4 days ago

The argument was correct then (Google/social did make us more stupid) and correct now regarding AI. So not sure why pointing out it was said before is relevant. Except as an example of its prescience.

>"Claude, be my subordinate and get this done for me"

Since "this" is thinking, then the two formulations are equivalent.

>Instead of complaining on the sidelines, I'm getting a shit ton of work done.

Until you no longer have a job and are drowned in slop.

Comment by 5 days ago

Comment by tobyhinloopen 5 days ago

Claude “respond in a friendly way that I agree with this comment”

Comment by iammrpayments 4 days ago

There’s no way this is not a paid comment, I see stuff like this everywhere in HN nowadays

Comment by coldtea 4 days ago

>It feels like we're living a century into the future.

The WALL-E chair-people future.

Comment by dofm 5 days ago

Useful stuff in here that I wish I'd seen a few days ago :-)

I am not convinced that the MTP setup for the QAT model adds very much in terms of speed on my M1 Max, but it is definitely worth experimenting with.

Fiddling about with local models has done so much for my conceptual understanding of what is going on.

FWIW and YMMV but I also found the Gemma 4 MTP head was occasionally breaking markup in Opencode, causing the thinking to display untidily and ultimately in some cases missing the stop token. So I've stopped using MTP there for now.

Recent Qwen 3.6 models have developer role support so it will occasionally surprise you with a structured multiple choice questionnaire.

Comment by mft_ 5 days ago

I found a marginal downside to Qwen3.6-35B-A3B-MTP vs. the non-MTP equivalent on an M1 Max. I’ll maybe experiment with settings further though.

Comment by freehorse 5 days ago

And the upsides of using draft models for MOE models with so low number of active parameters (as here or as in the article) are quite low, compared to dense models where you can get enormous speedups. I would prefer running the dense 27b models with speculative decoding instead.

Comment by dofm 5 days ago

That is what I have learned, yes. Not tested the dense Qwen yet. IIRC the 31B Gemma was slow enough that I doubt MTP will help me much.

Comment by smcleod 4 days ago

Use the 27b, it's better in every way once you add MTP (which speeds up dense models but often doesn't add any performance to MoE models like the 35b-a3b). I get around 100TK/s on my 2x 3090 machine and 85 on my M5 Max.

Comment by mft_ 4 days ago

Thanks, I'll give it a go.

(I generally find standard 27B too slow to enjoy using, whereas 35B-A3B is pretty snappy.)

Comment by dofm 5 days ago

Yeah. I think it might speed up time to first token but I am not sure how much that matters.

I do enjoy their different personalities when they are tackling "explain this" type puzzles, though.

Gemma writes so well — like a concise code blogger. It makes you understand that the thing we hate about AI slop writing is specifically the cheesy, marketingese sycophantic ChatGPT tone. It's a choice to sound that way.

Qwen writes more tersely by default, like much english language documentation in Chinese open source projects. A couple of lines, code example, fact, code example, line of blurb.

I use this prompt every now and then with a new model. It's obviously a classic SQL puzzle but I've asked new web developers this in the past (prompted by discovering that a client's subcontractor didn't understand it and was therefore unable to migrate some code from relying on dodgy pre-MySQL 5.x behaviours)

—

  I have a MySQL 5 table like this: [id, label, category, score].   It contains a list of items in different categories (text names like cat1, cat2, cat3) with a numerical score. Is there a way I can write a SQL query to find the item in each category that has the highest score, without using a subquery? No two entries in any category share a score.

—

I enjoy seeing what it deduces from the subtext.

Without "thinking" mode on, they always initially fail and you need to prompt them to find the answer. With thinking mode, they both produce really nice explanations.

For me, as an old freelancer who is pretty cynical about vibe coding or "agentic engineering", what I really want is an AI tool that can help me start to solve problems and help me find the right terminology or generate some boilerplate I can tinker with. Both of these models do fine at the kind of "starter" writing that I want when I am trying to untangle an idea.

Comment by mark_l_watson 4 days ago

when I started using QAT recently, I stopped trying to improve my configuration after that. I will try tuning my local environment again in a few months, but with QAT things are good enough for now.

Comment by ljosifov 4 days ago

For high Ram (unified), and relatively middling to lowish Tflops and bandwidth GB/s, usually MoEs are most hopeful. The current top-1 in the (iq, tok/s, @ context depth) ranks for me (M2 Max, 96gb) is DeepSeek-V4-Flash REAP25 <65gb gguf + ds4-server + pi agent. Not better than cloud API ofc, but useful enough to endure if I need to. E.g on a non-Internet 4h flight the battery (local llm draws 60w) held long enough. REAP supporting ds4 branch here

https://github.com/ljubomirj/ds4/tree/reap-compact-support

DS4F dropping to unusable <10 tok/s only at 784K context (!!) makes a big difference.

Comment by reddit_clone 5 days ago

>64 GB

Thats the rub. I have an M4 with 48G. I wonder if it is worth testing this out.

My past attempts (with Ollama and various LLMs) were too slow to use.

Comment by hkchad 5 days ago

I have a M5 MAX with 128, local models are toys compared to hosted ones. I've spent a lot of time and money trying to make it work even 1/2 as well.

Comment by dofm 4 days ago

It all depends on what you want to do, I guess.

If you're seeking the kind of hands-off claude experience, obviously not. They are slow.

If you want to learn how these things work, train them locally, tinker, play with the code, grasp the fundamentals, or just out of sheer bloody-mindedness and principle refuse to tether the functioning of your application to a cloud API...

Comment by leemoore 4 days ago

I have the same processor and ram. The dense 30b ish Gemma/Qwen really don't break 10 TPS with or without MTP. MOE's in this range feel more usable if they are smart enough for your work. Probably would still use hosted versions of these over local unless. MOE's feel somewhere between sonnet 3.5 and 3.7 to me. Dense feels between sonnet 3.7 and 4 in basic coding or local agentic capabilities (not close to those in chat or world knowledge)

Comment by jillesvangurp 4 days ago

From an economical point of view, there's almost no point to using these locally running models. The only things they are good for would be dirt cheap using the smaller/older models via some API as well. Recovering the investment for the hundreds/thousands you spend extra on hardware easily funds a lot of that. Unless you are using this stuff at scale, it's probably not going to be worth it.

I've dabbled with Qwen 3.x and Gemma 4 models a bit. They are alright but not that impressive. And my mac gets super hot if I use them for extended periods of time. It's just not very nice to use locally.

Comment by 4 days ago

Comment by iluvcommunism 5 days ago

[dead]

Comment by dofm 5 days ago

Some of these models will be a bit of a squeeze at Q4_0 I suspect; almost certainly they will be using CPU. Probably the 31B Gemma will be too much. Maybe not the Gemma-4 26B QAT.

But if you just want to play around rather than code, you really might find the Gemma 4 12B model worth mucking about with just so you've gone through the steps. Especially if you want to muck about with image analysis or audio transcription.

If you're writing PHP I think you could even find it good enough. I've been modestly surprised. You can do that basic fiddling with the Edge AI Gallery app, which can enable thinking and has a customisable system prompt and some agent support.

You could also try the 14B Deepseek R1.

Honestly even if it is not good enough, if you are anything like me, I think you'll find that going through this process is really quite educational — it has made a lot of things more concrete for me in a way that I have found reassuring and valuable.

Comment by contingencies 5 days ago

M4 24GB here. You'll be fine, if you're anything like me minor latency is acceptable to obtain (a) privacy (b) reliability (c) CI/CD/guardrails (d) network independence (e) future-proofing vs. AIaaS. https://omlx.ai/ gives you intelligent local hardware based model download recommendations. That said it probably depends heavily on your workload, process and polish expectations. See also https://news.ycombinator.com/item?id=48089091

Comment by spike021 4 days ago

what are you using on yours? I've got a M4 Pro 24GB also. tried the open source gpt one. it's alright but I found it can get stuck at times. maybe just my config in LM Studio.

Comment by contingencies 4 days ago

pi + Qwen3-4B-Instruct-2507 / Qwen3.6-35B-A3B-4bit / Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-4.5bit-msq depending how seat-of-pants I want to fly on memory.

Comment by codazoda 5 days ago

I'm running an M3 on an Air with just 16GB. I can still get useful results without an internet connection in "chat mode". It's a different experience than using Claude, for sure, but it's workable. I typically use the Qwen variants these days.

Comment by mark_l_watson 4 days ago

This might be useful when ‘coding in chat mode’: I have a few scripts that I run in a project directory that takes a prompt from me, and creates a single long one-shot prompt that I can paste into a chat window and ask that any generating code is inside markdown code blocks for easier copy/pasting. Also, pardon the plug, but you can read my new tiny book free online that documents my experiences using agentic coding on my 16G Mac and my 32G Mac: https://leanpub.com/read/local-coding-agents

Comment by codazoda 4 days ago

Looks cool, I’ll checkout the book. Your download links (PDF and EPUB) are down for me.

> NoSuchKeyThe specified key does not exist…

Comment by c0rruptbytes 4 days ago

i’m running m4 pro 48gb right now

omlx + gemma 12b 6 bit + pi

it’s feasible for sure

MoEs for speed (qwen 35b, cohere 30b, gemma 26b)

Dense for more methodical work (qwen 27b [reigning champ], gemma 31b, gemma 12b)

MoE i recommend 5bit+

Dense i think 4 bit is okay

Play with your context size, you don’t really need that much, have lazy loading for tools and mcps

my pi extensions for anyone looking for a skinny quick setup, i have use `--no-skills` right now too:

    "npm:pi-codex-goal",
    "npm:pi-simplify",
    "npm:pi-mcp-adapter",
    "git:github.com/elpapi42/pi-minimal-subagent",
    "npm:@wierdbytes/pi-statusline",
    "npm:@aliou/pi-guardrails",
    "npm:pi-lens",
    "npm:@juicesharp/rpiv-todo",
    "npm:pi-hashline-readmap",
    "npm:@mrclrchtr/supi-review",
    "npm:pi-cmux",
    "npm:@mrclrchtr/supi-context",
    "npm:pi-tool-search"

think of local models as "zero sugar" models and that's where we're at right now. I think it's crazy how good these models are compared to last year's frontier models

Comment by krzyk 4 days ago

People are using 3090 (24GB) to run models, and it is the most cost effective way to run the. Yes, it is 2x faster, but memory wise you surely can spend 24gb on llm.

Also there are smaller, still usefull models that can run on 8GB or less.

Comment by jmkni 4 days ago

I've an M1 Pro with 32GB ram and it's running pretty well

Comment by LoganDark 5 days ago

I poured a couple days into custom Burn inference for Qwen3-Coder-Next only to find it doesn't come with a speculative decoder, so on my M4 Max I can't push it much further than 120t/s. That's still kinda slow, though still faster than llama.cpp's 70.9t/s and MLX's 80.6t/s with the same model. Claude Fable 5 is recommending I use the Qwen3 MTP -- I worry that will compromise the quality somewhat, but might give it a try to see if I can get more usable speeds.

Comment by hanifbbz 5 days ago

Here's a visual post for using LM Studio and VS Code (and Pi): https://blog.alexewerlof.com/p/local-llms-for-agentic-coding

One way or another local AI is the future. I actually find weaker models more interesting because it keeps me sharp (at the cost of velocity of course).

Comment by mark_l_watson 4 days ago

Nice writeup, thanks.

I run something very similar except for directly using pi as the agentic harness I use little-coder that wraps pi with reasonable defaults for running local models. Even though my local setup is a bit slow, it is a thrill to do real work completely locally.

Comment by bluerooibos 4 days ago

I cannot wait until a time in the future when we have local models that are Opus 4.6+ level, and capable of running on inexpensive hardware like a 16Gb Mac. Hopefully that's only a few years away.

Comment by d4rkp4ttern 4 days ago

It’s relatively simple to use llama.cpp/server to spin up a local LLM to work with Claude Code or Codex-CLI. The required llama server settings are often scattered all over so I maintain a set of instructions here for several popular open LLMs:

https://pchalasani.github.io/claude-code-tools/integrations/...

Comment by ricardobeat 4 days ago

Do you use that as a daily driver? Claude Code' prompt is huge and causes you to spend a long, long time on prompt processing for local models, then running out of context shortly after.

Comment by d4rkp4ttern 4 days ago

Yes CC prompt can be ~30K tokens. I definitely do not use this as a daily driver. I did use it a few times for sensitive document work with Qwen3.6 MOE.

Comment by alexwwang 4 days ago

I wonder if these local model could really solve problems especially for users that aren’t experts on a given coding language. I am not sure that, more than inline auto completion and unit implementation, are these model capable of designing and composing tech specs that really work.

Comment by hmontazeri 4 days ago

I use LM Studio with the local server it ships and connect it to opencode. Takes 2 min to setup

Comment by godfathermway 4 days ago

[dead]

Comment by reenorap 5 days ago

My biggest pet peeve with all these articles on local AI is the only thing they talk about is tokens per second. No one mentions the quality of the answers. No one. I don't mind waiting a little longer if the quality is better. Quickly serving me slop doesn't make it more useful. Are people really only looking at tokens per second?

Comment by frollogaston 4 days ago

The model already has its own quality benchmarks elsewhere. The article is just about running the model on X hardware, so the remaining question is then how fast it is. Or does the output quality somehow depend on the hardware too?

Comment by ozim 4 days ago

Local model as such will give you "autocomplete on steroids" but it is not going to run away and implement cross project feature like frontier model in let's say Cursor.

So there is no value in testing quality of answers, but there is value in testing token speed.

You just have to have correct expectations.

Comment by krzyk 4 days ago

Is autocomplete using LLMs really useful? Even with frontier models I found it to be about 50% right, I turned it of and prefer to use IntelliJ built-in, it is way more reliable.

For me local models is all about quality, and how to achieve that - e.g. by providing guardrails that test the job done.

Comment by jmkni 4 days ago

The quality is obviously much worse, but still useful as a reference if you generally know what you are doing

It solve the "I'm coding on the plane and need to look up this thing I've forgotten" problem, for me at least

Comment by akman 4 days ago

That's fair. There are even many dimensions to define 'quality' which include use case (coding? writing? multimedia?) and prompt. I suppose if you ask testers to provide benchmarks with their analysis, that might hamper their desire to share.

Comment by namnnumbr 5 days ago

oMLX (https://github.com/jundot/omlx) makes running the mlx inference server quite easy for those interested in UI-based hosting. oMLX also supports mtp or dflash drafting.

Comment by w10-1 5 days ago

Agreed (not sure what you mean by UI-based hosting).

oMLX does the caching I need to fit models that are near gross memory, and it handles most of the work in finding usable models. After cobbling together various solutions over months, I now just use oMLX, often from Xcode. I can tell the difference between Gemma-4 (local/free) and Claude (paid) only on the largest tasks.

Comment by amboo7 4 days ago

Whay about of the tons of caches that just pile up until you notice that you must delete them manually?

Comment by anigbrowl 4 days ago

This video is realtime. And shows the agent responding at a perfectly usable speed.

Alas, this video appears not have been linked to the text that describes it. Perhaps I should ask an AI to generate an artistic rendering of the author's description.

Comment by freerunnering 4 days ago

The video is stuck in an `<img>` tag so you need to wait for it to load. On a slow connection it might just not show for a while. Though the video is only 1MB so should load in if you wait.

Comment by anigbrowl 4 days ago

I have >400MiB/s on this machine and had already spent several minutes reading through the explanation/instructions before scrolling back to the top; it just never loads for me. I had to manually open the link in another tab, for whatever reason.

Comment by smetannik 4 days ago

I wonder why something like LM Studio didn't work for the author?

Comment by b3ing 4 days ago

That’s what I was wondering, lm studio and draw things are easy to use apps that handle much of the cruft for you

Comment by freerunnering 4 days ago

I do a lot of fine tuning and development with small models themselves (not just using an LLM over a HTTP API). So downloading the models directly and running them from the CLI was natural for me, so that's what I reached for when I wanted to play around with this.

Comment by 4 days ago

Comment by cdolan 5 days ago

Is there a link to the video? It did not render when I went to the page. Curious about the real-time feel of this

Comment by 5 days ago

Comment by dewey 5 days ago

That's the direct link: https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...

Comment by c-hendricks 5 days ago

Note this is cut to just before the model responds, so not a great way for people to judge the real-time feel of this.

Comment by freerunnering 4 days ago

The full video is on Twitter: https://x.com/Freerunnering/status/2065275403548168398

Plus a followup one where you see me type the question in and press enter (though that video is with Qwen 3.6, not Gemma 4) https://x.com/Freerunnering/status/2065354101878055038

Comment by bicepjai 4 days ago

I assumed lmstudio is the obvious choice after ollama. Is there a reason lmstudio is not used widely ?

Comment by dofm 4 days ago

LM Studio is fine. Gorgeous actually. I've found it really helpful for understanding parameters, settings, general figuring out.

But there is an incentive not to use it if you want to write an article that uses only open-source tools, because it isn't.

Comment by krzyk 4 days ago

Why would anyone use Ollama at all (aside from obvious reasons one can look up online) - llama.cpp used directly, without this wrapper is faster.

Basically one has two real choices for local LLMs: llama.cpp (if single user) or vLLM (if multi-user/enterprise).

Comment by stingraycharles 4 days ago

Yeah I’ve also been using it on macOS, my experience is that it works better with the metal API and has better performance.

Comment by 5 days ago

Comment by attogram 5 days ago

8b max on a std 16gb macbook. Anything more and your mac is toast

Comment by benbojangles 5 days ago

70b on my M1 max 64gb

Comment by Obscurity4340 4 days ago

How much did tha thing cost?

Comment by rectang 5 days ago

Does anybody run a local agent on a Mac using an outboard GPU?

Comment by benbojangles 5 days ago

I run a second Mac for local llm use and access it remotely using ssh from the first mac

Comment by metadaemon 5 days ago

Has anyone compared a setup like this to just using LM Studio?

Comment by CharlesW 5 days ago

Yes, I can confirm that LM Studio works great for this.

Comment by everlier 4 days ago

You can also install Harbor and then it's:

harbor up omlx opencode

Comment by yesitcan 4 days ago

Why not just get Claude to set it up?

Comment by k2enemy 4 days ago

Grammar note:

When used as a verb, it should be "set up," and when used as a noun, "setup."

Other examples (verb, noun):

back up, backup

shut down, shutdown

break down, breakdown

warm up, warmup

Comment by sleepybrett 5 days ago

or you can just load up ollama, have it load a local model and point claude or opencode at it...

is this article old? It's not. I'm not sure why he went through all the bother of llama.cpp

Comment by malkosta 5 days ago

That was exactly my same question. Then I finished reading the post. The reason is pretty clear, and written in the post: it is faster than ollama+mlx.

Comment by sleepybrett 5 days ago

how much faster?

Comment by freerunnering 4 days ago

I was benchmarking different models, different engines, and different draft models, I posted a video on twitter, and people started asking about the setup in the final screen recording. So the blog post isn't so much "how a beginner should setup something" it's "here's the setup I posted in the video".

Original video: https://x.com/Freerunnering/status/2065275403548168398

And in the blog post there is a table showing the different speeds I got from different engines.

Slowest combo was 38.1 tk/s, and the fastest was 72.2 tk/s. All from "the same" model.

Comment by malkosta 1 day ago

All I remember is that it's pretty clear written in the post...

Comment by krzyk 4 days ago

ollama is a wrapper on top of llama.cpp, and it makes llama.cpp slower, why use it?