Show HN: OpenGraviton – Run 500B+ parameter models on a consumer Mac Mini
Posted by fatihturker 3 days ago
Hi HN,
I built OpenGraviton, an open-source AI inference engine designed to push the limits of running extremely large models on consumer hardware.
The system combines several techniques to drastically reduce memory and compute requirements:
• 1.58-bit ternary quantization ({-1, 0, +1}) for ~10x compression
• dynamic sparsity with Top-K pruning and MoE routing
• mmap-based layer streaming to load weights directly from NVMe SSDs
• speculative decoding to improve generation throughput
These allow models far larger than system RAM to run locally.
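To illustrate the dynamic-sparsity idea, here is a minimal Top-K pruning sketch (my own toy code, not the project's implementation): keep only the k largest-magnitude entries of an activation vector and zero the rest, so downstream work on the zeroed entries can be skipped.

```python
# Toy Top-K activation pruning (illustrative only, not OpenGraviton's code):
# retain the k entries with the largest absolute value, zero everything else.
def topk_prune(values, k):
    """Zero out all but the k largest-magnitude entries of a list."""
    if k >= len(values):
        return list(values)
    # indices of the k entries with largest |v|
    keep = set(sorted(range(len(values)),
                      key=lambda i: abs(values[i]),
                      reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

print(topk_prune([0.1, -2.0, 0.05, 1.5], 2))  # -> [0.0, -2.0, 0.0, 1.5]
```

In a real engine the pruning would run on GPU and the skipped entries would never be computed at all; this sketch only shows the selection rule.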
In early benchmarks, OpenGraviton reduced TinyLlama-1.1B from ~2.05GB (FP16) to ~0.24GB using ternary quantization. Synthetic stress tests at the 140B scale show that models which would normally require ~280GB FP16 can fit within ~35GB when packed with the ternary format.
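The 140B arithmetic checks out if weights are packed at 2 bits each (140e9 × 2 bits ≈ 35 GB, an 8x reduction from 16-bit FP16; the ~10x headline figure presumably includes other savings). Here is a minimal sketch of 2-bit ternary packing, my own illustration rather than OpenGraviton's actual on-disk format:

```python
# Toy 2-bit ternary packing (illustrative, not the project's real format):
# each weight in {-1, 0, +1} is stored in 2 bits, 4 weights per byte.

def pack_ternary(weights):
    """Pack a list of {-1, 0, +1} weights into bytes, 4 per byte."""
    out = bytearray()
    for i in range(0, len(weights), 4):
        b = 0
        for j, w in enumerate(weights[i:i + 4]):
            b |= (w + 1) << (2 * j)  # map -1/0/+1 -> codes 0/1/2
        out.append(b)
    return bytes(out)

def unpack_ternary(data, n):
    """Inverse of pack_ternary: recover n ternary weights."""
    weights = []
    for b in data:
        for j in range(4):
            if len(weights) == n:
                return weights
            weights.append(((b >> (2 * j)) & 0b11) - 1)
    return weights

w = [-1, 0, 1, 1, 0, -1]
packed = pack_ternary(w)
assert unpack_ternary(packed, len(w)) == w  # lossless round-trip
```

A production format would pack more tightly (log2(3) ≈ 1.58 bits per weight, e.g. 5 ternary values per byte) and unpack on the GPU, which is presumably what the Metal/C++ kernels handle.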
The project is optimized for Apple Silicon and currently uses custom Metal + C++ tensor unpacking.
Benchmarks, architecture, and details: https://opengraviton.github.io
GitHub: https://github.com/opengraviton
Comments
Comment by fatihturker 2 days ago
I'm currently working on further speed improvements — it's already around 8× faster in some cases, but there’s still potential for more optimization.
Since this is an open-source project, community support is very important. I believe AI shouldn’t be controlled or driven by only a few companies, so contributions, feedback, and ideas are always very welcome. Feel free to open an issue or PR if you'd like to help.
Comment by LukeB42 2 days ago
Maybe the author could get a large param model to help him get this done though.
Comment by fatihturker 2 days ago
Comment by fatihturker 3 days ago
The architecture page explains how ternary quantization, dynamic sparsity, and mmap layer streaming work together to push models far beyond normal RAM limits.
Happy to answer questions about the implementation or benchmarks.
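For readers unfamiliar with the mmap-streaming approach, a minimal sketch (my illustration, not the project's loader): the weight file is memory-mapped, so the OS pages each layer in from NVMe on first access and can evict cold pages under memory pressure, letting the file exceed physical RAM.

```python
# Toy mmap-based layer streaming (illustrative, not OpenGraviton's loader):
# map the whole weight file, then slice out one layer at a time. The OS
# faults pages in from disk on access and may evict them later, so the
# mapped file can be much larger than RAM.
import mmap

def stream_layers(path, layer_size, n_layers):
    """Yield each layer's raw bytes from a flat weight file via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for i in range(n_layers):
                off = i * layer_size
                # Slicing touches only this layer's pages.
                yield bytes(mm[off:off + layer_size])
```

A real implementation would also prefetch the next layer while the current one computes (e.g. via madvise hints) and decode the ternary format on the fly; this sketch only shows the paging mechanism.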
Comment by MrLey 3 days ago