I eliminated matrix multiplication from transformers using 1965 Soviet research
Posted by ZaneHam 7 hours ago
Comments
Comment by ZaneHam 7 hours ago
Author here. I've been collecting historical computing documentation for a few years and came across Nikolai Brusentsov's balanced ternary research from Moscow State University (1958-1965). I applied it to modern transformers.
Some interesting results:
- 93.8% energy reduction per inference
- 16x memory compression (7B model: 28GB → 1.75GB)
- Zero floating-point multiplications
- Runs on CPUs, no GPU required
- Architectural epistemic uncertainty (it won't hallucinate what it doesn't know)
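The zero-multiplication property is easiest to see in code: with weights restricted to {-1, 0, +1}, a matrix-vector product reduces to additions, subtractions, and skips. A minimal numpy sketch of the idea (simplified, not the actual repo code):

```python
import numpy as np

def ternary_matvec(W_t, x):
    """Matrix-vector product where W_t has entries in {-1, 0, +1}.
    No multiplications: each output is a signed sum of selected inputs."""
    y = np.zeros(W_t.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_t):
        y[i] = x[row == 1].sum() - x[row == -1].sum()  # add, subtract, or skip
    return y

W = np.array([[1, 0, -1], [0, 1, 1]])
x = np.array([2.0, 3.0, 4.0])
print(ternary_matvec(W, x))  # matches W @ x exactly
```

A real implementation would vectorise this, but the point is the inner loop never calls a multiply.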
Repo: https://github.com/Zaneham/Ternary_inference
Happy to answer questions :-) Happy holidays and merry Christmas!
Comment by mika6996 5 hours ago
Did you try this method on any model? What do benchmarks say?
Comment by ZaneHam 5 hours ago
Honest answer: I tested it on GPT-2 (124M) and the results are mixed.
The mathematical claims hold up. I ran 58 tests covering ternary matmul correctness, memory compression, and numerical stability. The 16x compression works, the zero-multiplication property is verified, and the epistemic layer correctly abstains on high-entropy distributions.
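For context on the 16x figure: a trit fits in 2 bits versus 32 bits for a float32 weight. A sketch of one possible packing, four trits per byte (an illustration of the arithmetic, not necessarily the repo's actual layout):

```python
import numpy as np

def pack_ternary(W_t):
    """Pack trits {-1, 0, +1} into 2 bits each (4 per byte): 16x vs float32."""
    flat = (W_t.flatten() + 1).astype(np.uint8)       # map {-1,0,1} -> {0,1,2}
    pad = (-len(flat)) % 4                            # pad to a multiple of 4
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.uint8)])
    packed = flat[0::4] | (flat[1::4] << 2) | (flat[2::4] << 4) | (flat[3::4] << 6)
    return packed, W_t.shape

def unpack_ternary(packed, shape):
    """Inverse of pack_ternary: recover the int8 trit matrix."""
    n = int(np.prod(shape))
    trits = np.empty(len(packed) * 4, dtype=np.int8)
    for k in range(4):
        trits[k::4] = (packed >> (2 * k)) & 0b11
    return (trits[:n] - 1).reshape(shape)

W_t = np.array([[1, 0, -1, 1], [0, -1, 1, 0]], dtype=np.int8)
packed, shape = pack_ternary(W_t)
# packed.nbytes == 2, versus 32 bytes for the same 8 weights as float32
```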
What does not work is post-training quantization. When I quantized GPT-2's weights to ternary and ran generation, the output was garbage. This is expected because the model was never trained with ternary constraints. BitNet gets coherent output because they train from scratch with ternary baked in. I did not do that.
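For anyone unfamiliar with what "quantized to ternary" means here: the standard post-training recipe is roughly absmean rounding, similar to what the BitNet b1.58 paper describes. A toy sketch (my simplification, not the repo's exact code), which also shows why it is lossy:

```python
import numpy as np

def quantize_ternary(W, eps=1e-8):
    """Round weights to {-1, 0, +1} with a per-matrix absmean scale gamma.
    W is approximated by W_t * gamma, which is lossy for untrained weights."""
    gamma = np.abs(W).mean() + eps
    W_t = np.clip(np.round(W / gamma), -1, 1).astype(np.int8)
    return W_t, gamma

W = np.array([[0.8, -0.05, -1.2], [0.02, 0.5, -0.6]])
W_t, gamma = quantize_ternary(W)
print(W_t)            # [[ 1  0 -1] [ 0  1 -1]]
W_approx = W_t * gamma  # crude approximation of W; the error compounds per layer
```

Applied to a model that was never trained under this constraint, the per-layer error compounds, which is consistent with the garbage output I saw on GPT-2.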
The actual novelty here is not the quantization itself but the epistemic output layer that treats the ternary zero as "I do not know" rather than just sparsity. My tests show it correctly abstains on future predictions and impossible knowledge while answering factual queries confidently. But I should be clear that these tests use designed distributions, not outputs from a trained model.
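The abstention idea can be sketched in a few lines: treat a near-uniform output distribution as the ternary zero and refuse to answer. The normalized-entropy threshold below is illustrative, not a tuned value:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def epistemic_output(probs, threshold=0.9):
    """Return the argmax token, or None (the ternary 'zero': I do not know)
    when entropy is close to its maximum, i.e. the model has no real signal."""
    h_max = np.log(len(probs))
    if entropy(probs) / h_max > threshold:  # threshold is an illustrative choice
        return None
    return int(np.argmax(probs))

print(epistemic_output([0.9, 0.05, 0.03, 0.02]))   # confident -> 0
print(epistemic_output([0.25, 0.25, 0.25, 0.25]))  # uniform -> None
```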
I do not have the compute to train a ternary model from scratch, so coherent generation remains theoretical. The code is at github.com/Zaneham/Ternary_inference if you want to poke at it. Happy to be proven wrong on any of this. tl;dr: the mechanics work, but current models aren't trained for ternary. The most interesting part is that the model can say when it doesn't know.