GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

Posted by laxmena 19 hours ago


Comments

Comment by cadamsdotcom 18 hours ago

Transformers scale poorly vs. context window size and parameter count.

Which means performance can be really impressive when those N’s are small!

I’m but a pundit in this area so don’t know much. But one wonders if there’s a future in burning larger models to FPGAs - whether big enough FPGAs exist (or can be built), and whether locating specialized compute right with the memory it needs can speed things up.

Likely would need a lot of algorithm parallelism work that’d translate back to CPUs/GPUs.

Comment by naikrovek 10 hours ago

Huge FPGAs don’t really exist, but you can couple many together with high-speed interconnects.

They will never be as fast as an NPU designed to run large models, though. GPUs are extremely general purpose in comparison, and FPGAs are about as general purpose as one can get.

Comment by genxy 18 hours ago

The context window is 16 characters. Talking about tokens per second is meaningless.

Comment by dominotw 18 hours ago

It's not meaningless. There could be use cases like spell correction.

Comment by genxy 17 hours ago

It is only interesting as an academic exercise in EDA design. Just like microGPT. For something with n^2 complexity, advertising perf is clickbait.
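
To put the n^2 remark in concrete terms, here is a rough sketch that counts query-key dot products (per attention head, per layer) needed to generate n tokens, with and without a KV cache. The function names are made up for illustration, and the count ignores everything except the attention scores, so treat it as a back-of-envelope model and not a description of GateGPT's actual datapath.

```python
def attention_scores_with_cache(n: int) -> int:
    """Query-key dot products to generate n tokens when keys are cached.

    Step t scores one new query against t cached keys, so the total is
    1 + 2 + ... + n = n(n + 1) / 2.
    """
    return n * (n + 1) // 2


def attention_scores_without_cache(n: int) -> int:
    """Same count when each step recomputes a full t x t score matrix.

    Assumption: no causal skipping, so step t costs t * t dot products.
    """
    return sum(t * t for t in range(1, n + 1))


if __name__ == "__main__":
    n = 16  # the context window size quoted in this thread
    print(attention_scores_with_cache(n))     # 136
    print(attention_scores_without_cache(n))  # 1496
```

Both totals grow faster than linearly in n, but at a 16-token window they stay tiny, which is why throughput numbers at this scale say little about larger contexts.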

Comment by amelius 19 hours ago

See also:

https://rits.shanghai.nyu.edu/ai/karpathys-microgpt-on-fpga-...

TL;DR: The CPU implementation was 71x faster than the FPGA.

Note: model has only 4192 parameters.

Comment by hedgehog 18 hours ago

That post is uninteresting, both because it misses the point and because it's not clear a human was even involved to perceive a point to miss. Sure, with an unlimited transistor budget, power budget, and a design clocked at 4GHz fabbed on 5nm, one of the best CPU design teams in the world can make a thing that is straight-line faster than a one-person project running at 80MHz on a 20-year-old 65nm FPGA. Any other answer would be extremely surprising.

Now, there are a bunch of interesting things about this project. Seeing a tiny transformer running on an FPGA is informative, as is the fact that it was apparently a pretty quick project for one person + robot assistance. Probably some transferable lessons for anyone else doing robo-FPGA development.

https://github.com/fguzman82/gateGPT/tree/main/
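
The clock-rate comparison above can be turned into quick arithmetic. The sketch below uses only numbers quoted in this thread: 56k tokens/s at 80 MHz from the title, the 71x CPU speedup from the linked post, and the 4 GHz figure for the hypothetical CPU. The variable names are illustrative, and it is not clear from the thread that the 71x result and the 80 MHz clock come from the same experiment, so combining them is an assumption.

```python
FPGA_CLOCK_HZ = 80e6        # from the thread title
TOKENS_PER_SECOND = 56_000  # from the thread title
CPU_CLOCK_HZ = 4e9          # hypothetical CPU clock named in the comment
CPU_SPEEDUP = 71            # "71x faster" from the linked post (assumed comparable)

# Clock cycles the FPGA design spends per generated token.
cycles_per_token = FPGA_CLOCK_HZ / TOKENS_PER_SECOND

# How much of the CPU's lead is explained by clock frequency alone.
clock_ratio = CPU_CLOCK_HZ / FPGA_CLOCK_HZ

# What remains once the clock gap is factored out.
per_clock_advantage = CPU_SPEEDUP / clock_ratio

if __name__ == "__main__":
    print(round(cycles_per_token))        # 1429
    print(clock_ratio)                    # 50.0
    print(round(per_clock_advantage, 2))  # 1.42
```

On these numbers the design spends roughly 1,400 clock cycles per token, and the 50x clock gap alone accounts for most of a 71x difference.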

Comment by cyanydeez 18 hours ago

Yeah, then there's prompt loading too.

But anyone who can fit QWEN-3.6 35B with a sustained ~30 tokens/s and ~100k context with cache could print money as a hardware vendor.

Comment by upboundspiral 16 hours ago

With llama.cpp and offloading non-active experts (from the MoE architecture) to CPU RAM, you can easily run QWEN-3.6 35B at 50 tok / s on 8-12 GB of VRAM. KV cache is a few GB, experts are ~3-5 GB (assuming q8 quant from Unsloth for example).

You can scroll through r/localllama and find tons of people getting useable speeds out of Qwen 35B.

24 tok / second on an ancient 1080ti

https://old.reddit.com/r/LocalLLaMA/comments/1tcc7h5/24_toks...

100 tok / second on a 4070

https://old.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_tok...
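
For a rough sense of why expert offloading matters, here is a minimal sketch of the weight-memory arithmetic. The helper name is made up, and ACTIVE_PARAMS is a placeholder chosen to land inside the ~3-5 GB range mentioned above, not a published figure for this model. It also ignores KV cache, activations, and quantization overhead.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 35e9   # "35B" from the model name
ACTIVE_PARAMS = 3e9   # hypothetical: weights actually used per token
BITS = 8              # q8 quantization, as assumed in the comment

if __name__ == "__main__":
    print(weight_gb(TOTAL_PARAMS, BITS))   # 35.0 -> too big for 8-12 GB of VRAM
    print(weight_gb(ACTIVE_PARAMS, BITS))  # 3.0  -> fits alongside a few GB of KV cache
```

Under these assumptions the full model needs about 35 GB at 8 bits per weight, while the per-token working set is an order of magnitude smaller, which is the gap that offloading to CPU RAM exploits.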

Comment by wmf 18 hours ago

That just sounds like a 3090.

Comment by cyanydeez 16 hours ago

Not at the VRAM sizes that control how much context you can load; also, GPUs aren't as efficient as direct inference.

Comment by wmf 14 hours ago

OK, B70.