Surpassing vLLM with a Generated Inference Stack
Posted by lukebechtel 10 hours ago
Comments
Comment by ntonozzi 6 hours ago
It is also a bit weird that they are not incorporating speculative decoding; that seems like a critical performance optimization, especially for decode-heavy workloads.
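For readers unfamiliar with the technique: speculative decoding uses a cheap draft model to propose several tokens, then verifies them with the expensive target model in a single pass, accepting the longest correct prefix. The sketch below is a toy illustration of that control flow only; the "models" and all names are hypothetical stand-ins, not anyone's real implementation.

```python
import random

random.seed(0)

VOCAB = [0, 1, 2, 3]  # tiny toy vocabulary

def draft_next(ctx):
    # stand-in for a cheap draft model: guesses uniformly
    return random.choice(VOCAB)

def target_accepts(ctx, token):
    # stand-in for the expensive target model: "accepts" a token
    # when it matches a simple deterministic rule
    return token == (len(ctx) % 4)

def speculative_step(ctx, k=4):
    """Draft k tokens cheaply, then verify them against the target.
    Accept the longest verified prefix; on the first rejection, emit
    the target's own token and stop. One target pass can thus yield
    several tokens instead of one."""
    drafted, c = [], list(ctx)
    for _ in range(k):
        t = draft_next(c)
        drafted.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in drafted:
        if target_accepts(c, t):
            accepted.append(t)
            c.append(t)
        else:
            accepted.append(len(c) % 4)  # target's correction
            break
    return accepted

out = speculative_step([0, 1], k=4)
print(out)
```

The win is that verification is one batched forward pass over all drafted positions, so decode-heavy workloads amortize the big model's cost across multiple emitted tokens.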
Comment by lukebechtel 6 hours ago
Comment by cermicelli 1 hour ago
This is plain bullshit.
Comment by rfw300 6 hours ago
Comment by lukebechtel 6 hours ago
We have considered open-sourcing some of our optimized inference libraries in the future, but have not yet come to a decision on this.
Also, if you need a rough intuition for why this is possible: the entire inference stack was built for exactly one model, so we can tune the whole framework accordingly.
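To make that intuition concrete, here is a hypothetical illustration (all names and numbers are invented, not from the actual stack): a multi-model framework has to resolve shapes and features at runtime, while a single-model stack can bake them in as constants, eliminating dispatch and branching from the hot path.

```python
# Generic path: a multi-model framework derives its configuration
# from the model description at runtime, with lookups and branches.
def generic_attention_config(model):
    cfg = {}
    cfg["head_dim"] = model["hidden"] // model["heads"]
    cfg["use_gqa"] = model.get("kv_heads", model["heads"]) != model["heads"]
    cfg["rope"] = model.get("rope", True)
    return cfg

# Specialized path: with exactly one model, every one of these is a
# compile-time constant (hypothetical values), so the whole code path
# is known ahead of time and can be tuned aggressively.
HEAD_DIM = 128   # e.g. hidden=4096, heads=32, fixed forever
USE_GQA = True   # e.g. kv_heads=8
ROPE = True

def specialized_config():
    # no dict lookups, no branching
    return (HEAD_DIM, USE_GQA, ROPE)
```

The same principle extends down to kernels: fixed head dims, fixed dtypes, and fixed fusion choices let a generator specialize code that a general framework must keep generic.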
Comment by rfw300 1 hour ago
Comment by storus 3 hours ago
Comment by lukebechtel 3 hours ago
The system started without paged attention, and automatically recreated its own paged-attention implementation once it identified the missing optimization as a bottleneck.
Pretty cool!
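For context, the core idea of paged attention is small enough to sketch: rather than reserving one contiguous KV-cache slab per sequence, the cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks. This is a minimal toy version of that bookkeeping, assuming invented names throughout; it is not the generated stack's actual code.

```python
BLOCK_SIZE = 16  # tokens per physical KV block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical blocks
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> token count

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            # a fresh block is needed only at block boundaries
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # freed blocks are immediately reusable by other sequences,
        # which is where the fragmentation win comes from
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):  # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token("seq0")
print(len(cache.tables["seq0"]))  # 2
```

The payoff is that memory is allocated in small blocks on demand instead of worst-case contiguous reservations, so many more sequences fit in the same cache.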
Comment by acuozzo 5 hours ago
Comment by lukebechtel 4 hours ago
We believe our improvements would hold on BF16, but let me check.