Surpassing vLLM with a Generated Inference Stack
Posted by lukebechtel 10 hours ago
Comments
Comment by ntonozzi 6 hours ago
It is also a bit weird that they are not incorporating speculative decoding; that seems like a critical performance optimization, especially for decode-heavy workloads.
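For readers unfamiliar with the technique: speculative decoding uses a cheap draft model to propose several tokens, then verifies them with the expensive target model in a single pass, accepting the longest correct prefix. The sketch below is a toy illustration of that control flow only; the "models" and all names are hypothetical stand-ins, not anyone's real implementation.

```python
import random

random.seed(0)

VOCAB = [0, 1, 2, 3]  # tiny toy vocabulary

def draft_next(ctx):
    # stand-in for a cheap draft model: guesses uniformly
    return random.choice(VOCAB)

def target_accepts(ctx, token):
    # stand-in for the expensive target model: "accepts" a token
    # when it matches a simple deterministic rule
    return token == (len(ctx) % 4)

def speculative_step(ctx, k=4):
    """Draft k tokens cheaply, then verify them against the target.
    Accept the longest verified prefix; on the first rejection, emit
    the target's own token and stop. One target pass can thus yield
    several tokens instead of one."""
    drafted, c = [], list(ctx)
    for _ in range(k):
        t = draft_next(c)
        drafted.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in drafted:
        if target_accepts(c, t):
            accepted.append(t)
            c.append(t)
        else:
            accepted.append(len(c) % 4)  # target's correction
            break
    return accepted

out = speculative_step([0, 1], k=4)
print(out)
```

The win is that verification is one batched forward pass over all drafted positions, so decode-heavy workloads amortize the big model's cost across multiple emitted tokens.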
Comment by lukebechtel 6 hours ago
Comment by cermicelli 1 hour ago
This is plain bullshit.
Comment by rfw300 6 hours ago
Comment by lukebechtel 6 hours ago
We have considered open-sourcing some of our optimized inference libraries in the future, but have not yet come to a decision on this.
Also, if you need a rough intuition for why this is possible: the entire inference stack was built for exactly one model, so we can tune the whole framework accordingly.
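To make that intuition concrete, here is a hypothetical illustration (all names and numbers are invented, not from the actual stack): a multi-model framework has to resolve shapes and features at runtime, while a single-model stack can bake them in as constants, eliminating dispatch and branching from the hot path.

```python
# Generic path: a multi-model framework derives its configuration
# from the model description at runtime, with lookups and branches.
def generic_attention_config(model):
    cfg = {}
    cfg["head_dim"] = model["hidden"] // model["heads"]
    cfg["use_gqa"] = model.get("kv_heads", model["heads"]) != model["heads"]
    cfg["rope"] = model.get("rope", True)
    return cfg

# Specialized path: with exactly one model, every one of these is a
# compile-time constant (hypothetical values), so the whole code path
# is known ahead of time and can be tuned aggressively.
HEAD_DIM = 128   # e.g. hidden=4096, heads=32, fixed forever
USE_GQA = True   # e.g. kv_heads=8
ROPE = True

def specialized_config():
    # no dict lookups, no branching
    return (HEAD_DIM, USE_GQA, ROPE)
```

The same principle extends down to kernels: fixed head dims, fixed dtypes, and fixed fusion choices let a generator specialize code that a general framework must keep generic.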
Comment by rfw300 1 hour ago
Comment by storus 3 hours ago
Comment by lukebechtel 3 hours ago
The system started without paged attention, and automatically recreated its own paged-attention implementation once it identified the missing optimization as a bottleneck.
Pretty cool!
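For context, the core idea of paged attention is small enough to sketch: rather than reserving one contiguous KV-cache slab per sequence, the cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks. This is a minimal toy version of that bookkeeping, assuming invented names throughout; it is not the generated stack's actual code.

```python
BLOCK_SIZE = 16  # tokens per physical KV block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical blocks
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> token count

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            # a fresh block is needed only at block boundaries
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # freed blocks are immediately reusable by other sequences,
        # which is where the fragmentation win comes from
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):  # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token("seq0")
print(len(cache.tables["seq0"]))  # 2
```

The payoff is that memory is allocated in small blocks on demand instead of worst-case contiguous reservations, so many more sequences fit in the same cache.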
Comment by acuozzo 5 hours ago
Comment by lukebechtel 4 hours ago
We believe our improvements would hold on BF16, but let me check.