Show HN: Run 500B+ Parameter LLMs Locally on a Mac Mini

Posted by fatihturker 1 day ago


Hi HN, I built OpenGraviton, an open-source AI inference engine that pushes the limits of running extremely large LLMs on consumer hardware. By combining 1.58-bit ternary quantization, dynamic sparsity with Top-K pruning and MoE routing, and mmap-based layer streaming, OpenGraviton can run models far larger than your system RAM, even on a Mac Mini.

Early benchmarks: TinyLlama-1.1B drops from ~2GB (FP16) to ~0.24GB with ternary quantization. At 140B scale, models that normally require ~280GB fit within ~35GB packed. Optimized for Apple Silicon with Metal + C++ tensor unpacking, plus speculative decoding for faster generation.

Check benchmarks, architecture, and details here: https://opengraviton.github.io

GitHub: https://github.com/opengraviton

This project isn't just about squeezing massive models onto tiny hardware: it's about democratizing access to giant LLMs without cloud costs. Feedback, forks, and ideas are very welcome!
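For readers curious where the quoted numbers come from: 280GB (FP16) down to 35GB is exactly 8x, which corresponds to 2 bits per weight, i.e. each ternary value {-1, 0, +1} stored as a 2-bit code, four per byte. A minimal NumPy sketch of that packing arithmetic (this is an illustration, not OpenGraviton's actual on-disk format):

```python
import numpy as np

def pack_ternary(weights: np.ndarray) -> np.ndarray:
    """Pack ternary values {-1, 0, +1} as 2-bit codes, 4 per byte."""
    # Map -1 -> 0, 0 -> 1, +1 -> 2 (code 3 is unused)
    codes = (weights + 1).astype(np.uint8)
    # Pad the length up to a multiple of 4 so bytes divide evenly
    codes = np.pad(codes, (0, (-len(codes)) % 4)).reshape(-1, 4)
    packed = (codes[:, 0]
              | (codes[:, 1] << 2)
              | (codes[:, 2] << 4)
              | (codes[:, 3] << 6))
    return packed.astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Invert pack_ternary, recovering the first n ternary weights."""
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return codes.reshape(-1).astype(np.int8)[:n] - 1
```

At 2 bits per weight, 140B parameters pack into 140e9 / 4 bytes = 35GB, matching the post; the TinyLlama figure (~2GB to ~0.24GB) is the same ratio plus a little metadata overhead.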

Comments

Comment by zhangchen 5 hours ago

The mmap layer streaming approach is smart for working around memory limits. In practice though, 1.58-bit ternary quantization tends to degrade quality noticeably on reasoning-heavy tasks compared to 4-bit — curious if you've measured perplexity deltas at the 140B scale.

Comment by bbtc3453 4 hours ago

This is impressive. I've been experimenting with Gemini API for a side project and the latency difference between local and cloud inference is something I keep thinking about. How does memory usage scale with the 500B models?

Comment by swq115 15 hours ago

Interesting approach. The mmap streaming idea is clever, but I'd love to see real-world benchmarks beyond TinyLlama — especially for the 140B claim. Running that on a Mac Mini with 16GB would be the real proof point.

For context, I run a Mac Mini M4 as a homelab server and the memory pressure from even 7B models is noticeable. Curious how this handles sustained inference without thermal throttling.

Comment by ryanholtdev 1 day ago

Running a Mac Mini M4 as a home server for a bunch of automation scripts right now. The mmap-based layer streaming is the part I'm most curious about -- how does latency look when you're streaming layers from disk mid-inference? I'd expect throughput to degrade sharply once you exceed unified memory, but maybe the Top-K sparsity masks enough of the weight accesses that it's not as bad as sequential streaming would be. What's the actual tokens/sec at 140B scale on the base Mac Mini config?
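For context on the mechanism being asked about: mmap-based layer streaming generally means mapping the weight file into the address space and letting the OS page layers in on demand, so resident memory stays near one layer's worth rather than the full model. A minimal sketch with np.memmap (file name and layer shapes here are hypothetical, not OpenGraviton's format):

```python
import os
import tempfile
import numpy as np

N_LAYERS, D = 4, 256  # hypothetical layer count and width for illustration
path = os.path.join(tempfile.mkdtemp(), "layers.bin")

# Write a dummy weight file: N_LAYERS contiguous D x D float32 matrices.
np.zeros(N_LAYERS * D * D, dtype=np.float32).tofile(path)

# Map the file read-only; nothing is loaded until its pages are touched.
weights = np.memmap(path, dtype=np.float32, mode="r",
                    shape=(N_LAYERS, D, D))

def forward(x: np.ndarray) -> np.ndarray:
    # Stream one layer at a time: the OS pages in weights[i] on access,
    # and pages for earlier layers can be evicted under memory pressure.
    for i in range(N_LAYERS):
        x = x @ weights[i]
    return x

y = forward(np.ones(D, dtype=np.float32))
```

The throughput question stands, though: once the working set exceeds unified memory, each token's forward pass re-faults layers in from SSD, so tokens/sec is bounded by disk bandwidth unless sparsity skips most of the weights.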

Comment by anentropic 23 hours ago

Yeah...

https://github.com/opengraviton/graviton?tab=readme-ov-file#...

the benchmarks don't show any results for using these larger-than-memory models, only the size difference

it all smells quite sloppy

Comment by hu3 19 hours ago

What I could find in the readme shows:

~19 tok/s for Apple M1 Max (64 GB) with TinyLlama-1.1B-Chat-v1.0

Comment by pcf 18 hours ago

Hi @fatihturker – exciting project if it works!

I have a MacBook Pro M1 Max w/64 GB RAM, and a Mac Studio M3 Ultra w/96 GB RAM. What do you think is possible to run on these? Just curious before I really try it out.

Comment by deflator 20 hours ago

Fascinating. I don't understand the technical terms, but running a big coding agent locally is a dream of mine, so I thank you for your efforts!