I eliminated matrix multiplication from transformers using 1965 Soviet research
Posted by ZaneHam 7 hours ago
Comments
Comment by ZaneHam 7 hours ago
Author here. I've been collecting historical computing documentation for a few years and came across Nikolai Brusentsov's balanced ternary research from Moscow State University (1958-1965). I applied it to modern transformers.
Some interesting results:
- 93.8% energy reduction per inference
- 16x memory compression (7B model: 28GB → 1.75GB)
- Zero floating-point multiplications
- Runs on CPUs, no GPU required
- Architectural epistemic uncertainty (it won't hallucinate what it doesn't know)
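The zero-multiplication property is easiest to see in code: with weights restricted to {-1, 0, +1}, a matrix-vector product reduces to additions, subtractions, and skips. A minimal numpy sketch of the idea (simplified, not the actual repo code):

```python
import numpy as np

def ternary_matvec(W_t, x):
    """Matrix-vector product where W_t has entries in {-1, 0, +1}.
    No multiplications: each output is a signed sum of selected inputs."""
    y = np.zeros(W_t.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_t):
        y[i] = x[row == 1].sum() - x[row == -1].sum()  # add, subtract, or skip
    return y

W = np.array([[1, 0, -1], [0, 1, 1]])
x = np.array([2.0, 3.0, 4.0])
print(ternary_matvec(W, x))  # matches W @ x exactly
```

A real implementation would vectorise this, but the point is the inner loop never calls a multiply.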
Repo: https://github.com/Zaneham/Ternary_inference
Happy to answer questions :-) Happy holidays and merry Christmas!
Comment by mika6996 5 hours ago
Did you try this method on any model? What do benchmarks say?
Comment by ZaneHam 5 hours ago
Honest answer: I tested it on GPT-2 (124M) and the results are mixed.
The mathematical claims hold up. I ran 58 tests covering ternary matmul correctness, memory compression, and numerical stability. The 16x compression works, the zero-multiplication property is verified, and the epistemic layer correctly abstains on high-entropy distributions.
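For context on the 16x figure: a trit fits in 2 bits versus 32 bits for a float32 weight. A sketch of one possible packing, four trits per byte (an illustration of the arithmetic, not necessarily the repo's actual layout):

```python
import numpy as np

def pack_ternary(W_t):
    """Pack trits {-1, 0, +1} into 2 bits each (4 per byte): 16x vs float32."""
    flat = (W_t.flatten() + 1).astype(np.uint8)       # map {-1,0,1} -> {0,1,2}
    pad = (-len(flat)) % 4                            # pad to a multiple of 4
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.uint8)])
    packed = flat[0::4] | (flat[1::4] << 2) | (flat[2::4] << 4) | (flat[3::4] << 6)
    return packed, W_t.shape

def unpack_ternary(packed, shape):
    """Inverse of pack_ternary: recover the int8 trit matrix."""
    n = int(np.prod(shape))
    trits = np.empty(len(packed) * 4, dtype=np.int8)
    for k in range(4):
        trits[k::4] = (packed >> (2 * k)) & 0b11
    return (trits[:n] - 1).reshape(shape)

W_t = np.array([[1, 0, -1, 1], [0, -1, 1, 0]], dtype=np.int8)
packed, shape = pack_ternary(W_t)
# packed.nbytes == 2, versus 32 bytes for the same 8 weights as float32
```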
What does not work is post-training quantization. When I quantized GPT-2's weights to ternary and ran generation, the output was garbage. This is expected because the model was never trained with ternary constraints. BitNet gets coherent output because they train from scratch with ternary baked in. I did not do that.
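For anyone unfamiliar with what "quantized to ternary" means here: the standard post-training recipe is roughly absmean rounding, similar to what the BitNet b1.58 paper describes. A toy sketch (my simplification, not the repo's exact code), which also shows why it is lossy:

```python
import numpy as np

def quantize_ternary(W, eps=1e-8):
    """Round weights to {-1, 0, +1} with a per-matrix absmean scale gamma.
    W is approximated by W_t * gamma, which is lossy for untrained weights."""
    gamma = np.abs(W).mean() + eps
    W_t = np.clip(np.round(W / gamma), -1, 1).astype(np.int8)
    return W_t, gamma

W = np.array([[0.8, -0.05, -1.2], [0.02, 0.5, -0.6]])
W_t, gamma = quantize_ternary(W)
print(W_t)            # [[ 1  0 -1] [ 0  1 -1]]
W_approx = W_t * gamma  # crude approximation of W; the error compounds per layer
```

Applied to a model that was never trained under this constraint, the per-layer error compounds, which is consistent with the garbage output I saw on GPT-2.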
The actual novelty here is not the quantization itself but the epistemic output layer that treats the ternary zero as "I do not know" rather than just sparsity. My tests show it correctly abstains on future predictions and impossible knowledge while answering factual queries confidently. But I should be clear that these tests use designed distributions, not outputs from a trained model.
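The abstention idea can be sketched in a few lines: treat a near-uniform output distribution as the ternary zero and refuse to answer. The normalized-entropy threshold below is illustrative, not a tuned value:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def epistemic_output(probs, threshold=0.9):
    """Return the argmax token, or None (the ternary 'zero': I do not know)
    when entropy is close to its maximum, i.e. the model has no real signal."""
    h_max = np.log(len(probs))
    if entropy(probs) / h_max > threshold:  # threshold is an illustrative choice
        return None
    return int(np.argmax(probs))

print(epistemic_output([0.9, 0.05, 0.03, 0.02]))   # confident -> 0
print(epistemic_output([0.25, 0.25, 0.25, 0.25]))  # uniform -> None
```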
I do not have the compute to train a ternary model from scratch, so coherent generation remains theoretical. The code is at github.com/Zaneham/Ternary_inference if you want to poke at it. Happy to be proven wrong on any of this. tl;dr: the mechanics work, but current models aren't trained for ternary. The most interesting part is that the model can say when it doesn't know.