Show HN: Run 500B+ Parameter LLMs Locally on a Mac Mini
Posted by fatihturker 1 day ago
Hi HN, I built OpenGraviton, an open-source AI inference engine that pushes the limits of running extremely large LLMs on consumer hardware. By combining 1.58-bit ternary quantization, dynamic sparsity with Top-K pruning and MoE routing, and mmap-based layer streaming, OpenGraviton can run models far larger than your system RAM, even on a Mac Mini.

Early benchmarks: TinyLlama-1.1B drops from ~2GB (FP16) to ~0.24GB with ternary quantization. At 140B scale, models that normally require ~280GB fit within ~35GB packed. Optimized for Apple Silicon with Metal + C++ tensor unpacking, plus speculative decoding for faster generation.

Check benchmarks, architecture, and details here: https://opengraviton.github.io
GitHub: https://github.com/opengraviton

This project isn't just about squeezing massive models onto tiny hardware; it's about democratizing access to giant LLMs without cloud costs. Feedback, forks, and ideas are very welcome!
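The 140B figures quoted in the post can be reproduced with simple arithmetic, assuming 2 bytes per weight for FP16 and 2 bits per weight for the packed ternary format. A ternary value carries log2(3) ≈ 1.58 bits of information, but a straightforward byte-aligned layout stores four values per byte. The sketch below is illustrative only and is not taken from the OpenGraviton source: the function names, the absmean-style scaling, and the 2-bit layout are all assumptions.

```python
def fp16_bytes(n_params: int) -> int:
    """FP16 stores each weight in 2 bytes."""
    return n_params * 2


def packed_bytes(n_params: int, bits_per_weight: float = 2.0) -> float:
    """Size of the weights alone at a given bit width (assumed 2 bits here)."""
    return n_params * bits_per_weight / 8


def quantize_ternary(weights):
    """Map floats to {-1, 0, +1} using an absmean-style scale (an assumption)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def pack_ternary(q):
    """Pack four ternary values per byte, 2 bits each (-1 -> 0, 0 -> 1, +1 -> 2)."""
    out = bytearray((len(q) + 3) // 4)
    for i, v in enumerate(q):
        out[i // 4] |= (v + 1) << (2 * (i % 4))
    return bytes(out)


def unpack_ternary(data, n):
    """Inverse of pack_ternary for the first n values."""
    return [((data[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(n)]


if __name__ == "__main__":
    n = 140_000_000_000
    print(f"FP16:   {fp16_bytes(n) / 1e9:.0f} GB")
    print(f"Packed: {packed_bytes(n) / 1e9:.0f} GB")

    q, scale = quantize_ternary([0.9, -0.05, -1.2, 0.4, 0.0])
    packed = pack_ternary(q)
    print(q, len(packed), unpack_ternary(packed, len(q)))
```

These numbers cover the weights only. Per-tensor scales, the KV cache, and activations add to the footprint, so real memory use would be higher than the packed size.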
Comments
Comment by zhangchen 5 hours ago
Comment by bbtc3453 4 hours ago
Comment by swq115 15 hours ago
For context, I run a Mac Mini M4 as a homelab server and the memory pressure from even 7B models is noticeable. Curious how this handles sustained inference without thermal throttling.
Comment by ryanholtdev 1 day ago
Comment by anentropic 23 hours ago
https://github.com/opengraviton/graviton?tab=readme-ov-file#...
the benchmarks don't show any results for using these larger-than-memory models, only the size difference
it all smells quite sloppy
Comment by hu3 19 hours ago
~19 tok/s for Apple M1 Max (64 GB) with TinyLlama-1.1B-Chat-v1.0
Comment by pcf 18 hours ago
I have a MacBook Pro M1 Max w/64 GB RAM, and a Mac Studio M3 Ultra w/96 GB RAM. What do you think is possible to run on these? Just curious before I really try it out.
Comment by deflator 20 hours ago