GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz
Posted by laxmena 19 hours ago
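For scale, the two figures in the title can be turned into a rough cycle budget per token. This is only a back-of-the-envelope sketch: it assumes "56k" means 56,000 tokens per second and that the 80 MHz clock is the only clock that matters. The function name is made up for illustration.

```python
# Rough cycle budget per generated token, using only the title's figures.
# Assumptions: "56k" = 56,000 tokens/s, and a single 80 MHz clock domain.
CLOCK_HZ = 80_000_000
TOKENS_PER_SEC = 56_000


def cycles_per_token(clock_hz: float, tokens_per_sec: float) -> float:
    """Clock cycles elapsed per token at the given throughput."""
    return clock_hz / tokens_per_sec


cpt = cycles_per_token(CLOCK_HZ, TOKENS_PER_SEC)
print(f"about {cpt:.0f} cycles per token")
```

So, under those assumptions, each token takes on the order of 1,400 clock cycles end to end.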
Comments
Comment by cadamsdotcom 18 hours ago
Which means really impressive when those N’s are small!
I’m but a pundit in this area so don’t know much. But one wonders if there’s a future in burning larger models to FPGAs - whether big enough FPGAs exist (or can be built), and whether locating specialized compute right with the memory it needs can speed things up.
Likely would need a lot of algorithm parallelism work that’d translate back to CPUs/GPUs.
Comment by T-A 17 hours ago
Comment by naikrovek 10 hours ago
They will never be as fast as an NPU designed to run large models, though. GPUs are extremely general purpose in comparison, and FPGAs are about as general purpose as one can get.
Comment by genxy 18 hours ago
Comment by amelius 19 hours ago
https://rits.shanghai.nyu.edu/ai/karpathys-microgpt-on-fpga-...
TL;DR: The CPU implementation was 71x faster than the FPGA.
Note: model has only 4192 parameters.
Comment by hedgehog 18 hours ago
Now, there are a bunch of interesting things about this project. Seeing a tiny transformer running on an FPGA is informative, as is the fact that it was apparently a pretty quick project for one person plus robot assistance. There are probably some transferable lessons for anyone else doing robo-FPGA development.
Comment by cyanydeez 18 hours ago
but anyone who can fit QWEN-3.6 35B with a sustained ~30 tokens/s and ~100k context with cache could print money as a hardware vendor.
Comment by upboundspiral 16 hours ago
You can scroll through r/localllama and find tons of people getting useable speeds out of Qwen 35B.
24 tok / second on an ancient 1080 Ti
https://old.reddit.com/r/LocalLLaMA/comments/1tcc7h5/24_toks...
100 tok / second on a 4070
https://old.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_tok...