Show HN: OpenGraviton – Run 500B+ parameter models on a consumer Mac Mini

Posted by fatihturker 3 days ago


Hi HN,

I built OpenGraviton, an open-source AI inference engine designed to push the limits of running extremely large models on consumer hardware.

The system combines several techniques to drastically reduce memory and compute requirements:

• 1.58-bit ternary quantization ({-1, 0, +1}) for ~10x compression
• dynamic sparsity with Top-K pruning and MoE routing
• mmap-based layer streaming to load weights directly from NVMe SSDs
• speculative decoding to improve generation throughput

These allow models far larger than system RAM to run locally.
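To make the ternary idea concrete, here is a minimal sketch of one common packing scheme: storing each {-1, 0, +1} weight in 2 bits, four weights per byte. This is an illustration of the general technique, not OpenGraviton's actual on-disk format (2 bits per weight is slightly above the log2(3) ≈ 1.58-bit bound the name refers to; tighter base-3 packings get closer to it).

```python
def pack_ternary(weights):
    """Pack ternary weights {-1, 0, +1} at 2 bits each, 4 per byte."""
    codes = {-1: 0b00, 0: 0b01, 1: 0b10}
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= codes[w] << (2 * j)  # weight j occupies bits 2j..2j+1
        out.append(byte)
    return bytes(out)

def unpack_ternary(packed, n):
    """Recover the first n ternary weights from packed bytes."""
    decode = {0b00: -1, 0b01: 0, 0b10: 1}
    weights = []
    for byte in packed:
        for j in range(4):
            if len(weights) == n:
                break
            weights.append(decode[(byte >> (2 * j)) & 0b11])
    return weights

w = [-1, 0, 1, 1, 0, -1]
packed = pack_ternary(w)
assert unpack_ternary(packed, len(w)) == w
assert len(packed) == 2  # 6 weights fit in 2 bytes vs 12 bytes in FP16
```

Versus FP16's 16 bits per weight, even this simple 2-bit packing yields an 8x reduction before sparsity is applied.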

In early benchmarks, OpenGraviton reduced TinyLlama-1.1B from ~2.05 GB (FP16) to ~0.24 GB using ternary quantization. Synthetic stress tests at the 140B scale show that a model which would normally require ~280 GB in FP16 fits within ~35 GB in the packed ternary format.
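The 140B-scale figures follow directly from bits-per-weight arithmetic; a quick sanity check (at 2 bits per packed weight, and ignoring per-group scale factors and any unquantized layers, which push real sizes somewhat higher):

```python
def fp16_gb(params):
    """Weight storage in GB at FP16 (2 bytes per parameter)."""
    return params * 2 / 1e9

def ternary_gb(params, bits_per_weight=2.0):
    """Weight storage in GB at a given packed bit width."""
    return params * bits_per_weight / 8 / 1e9

params = 140e9
print(f"FP16:    {fp16_gb(params):.0f} GB")     # 280 GB
print(f"ternary: {ternary_gb(params):.0f} GB")  # 35 GB
```

The same formula applied to TinyLlama-1.1B gives ~2.2 GB at FP16 and ~0.28 GB at 2 bits, in the same ballpark as the reported ~2.05 GB and ~0.24 GB.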

The project is optimized for Apple Silicon and currently uses custom Metal + C++ tensor unpacking.

Benchmarks, architecture, and details: https://opengraviton.github.io

GitHub: https://github.com/opengraviton

Comments

Comment by fatihturker 2 days ago

Author here.

I'm currently working on further speed improvements. It's already around 8x faster in some cases, but there's still room for more optimization.

Since this is an open-source project, community support is very important. I believe AI shouldn’t be controlled or driven by only a few companies, so contributions, feedback, and ideas are always very welcome. Feel free to open an issue or PR if you'd like to help.

Comment by LukeB42 2 days ago

Had to fix hardware detection myself, only to find engine.generate() isn't implemented and yields "".

Maybe the author could get a large param model to help him get this done though.

Comment by fatihturker 2 days ago

Happy to help if needed. The project is already tested and benchmarked with several models and everything is working as expected. If you run into any specific issues, feel free to open an issue or PR.

Comment by fatihturker 3 days ago

Author here.

The architecture page explains how ternary quantization, dynamic sparsity, and mmap layer streaming work together to push models far beyond normal RAM limits.
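For readers unfamiliar with the streaming piece: the point of mmap is that the OS pages weight regions in from disk only when they are touched, so the mapped file can be far larger than RAM. A minimal sketch of the idea (hypothetical layout: layers stored back to back, with a known (offset, size) index; not OpenGraviton's actual format):

```python
import mmap
import tempfile

def stream_layers(path, index):
    """Yield (name, bytes) for each layer from a memory-mapped weights file.

    Only the regions actually sliced are paged in by the OS, so total
    file size can exceed available RAM.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for name, (offset, size) in index.items():
                yield name, mm[offset:offset + size]

# demo: two fake "layers" stored back to back in one file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"AAAA" + b"BBBBBB")
    path = tmp.name

index = {"layer0": (0, 4), "layer1": (4, 6)}
layers = dict(stream_layers(path, index))
assert layers["layer0"] == b"AAAA"
assert layers["layer1"] == b"BBBBBB"
```

In a real engine the slices would be unpacked (here, from the ternary format) into the compute buffers for one layer at a time while the next layer streams in from NVMe.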

Happy to answer questions about the implementation or benchmarks.

Comment by MrLey 3 days ago

This is a cool project.