CUDA-L2: Surpassing cuBLAS performance for matrix multiplication through RL
Posted by dzign 7 days ago
Comments
Comment by josephg 6 days ago
I think about this regularly when I compile C++ or Rust using LLVM. It's an excellent compiler backend. It produces really good code. But it is incredibly slow, and for no good technical reason. Plenty of other similar compilers run circles around it.
Imagine an LLVM rewrite by the people who made V8, or Chrome, or the Unreal Engine. Or the guy who made LuaJIT, or the Go compiler team. I'd be shocked if we didn't see an order-of-magnitude speedup overnight. They'd need some leeway to redesign LLVM IR of course. And it would take years to port all of LLVM's existing optimisations. But my computer can retire billions of operations per second. And render Cyberpunk at 60fps. It shouldn't take seconds of CPU time to compile a small program.
Comment by CapsAdmin 7 days ago
The way I see it, mathematicians have been trying (and somewhat succeeding every ~5 years) to prove faster ways to do matrix multiplication since the 1970s. But this is only in theory.
If you want to implement the theory, you suddenly have many variables to take care of, such as memory speed, CPU instructions, bit precision, etc. So in practice, an actual implementation of some theory likely has more room to improve. It is also likely that LLMs can help figure out how to write a more optimal implementation.
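A minimal sketch of what those practical variables look like in code (not from the linked project; TILE and the kernel names are invented for illustration). Both kernels do the same O(N^3) work; the second usually runs several times faster purely because it reuses data from shared memory instead of re-reading global memory, which is exactly the kind of detail the asymptotic results say nothing about:

    #include <cuda_runtime.h>

    #define TILE 16  // both kernels launched with a TILE x TILE thread block

    // Naive: each thread re-reads a full row of A and column of B from global memory.
    __global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= N || col >= N) return;
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }

    // Tiled: stage TILE x TILE blocks in shared memory so each global load is reused TILE times.
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < N; t += TILE) {
            As[threadIdx.y][threadIdx.x] =
                (row < N && t + threadIdx.x < N) ? A[row * N + t + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (t + threadIdx.y < N && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < N && col < N) C[row * N + col] = acc;
    }

Tile size, precision, and memory layout are all knobs of this kind, and the best settings depend on the specific GPU.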
Comment by Bulat_Ziganshin 5 days ago
HGEMM means half-precision (i.e., FP16) general matrix multiplication.
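For concreteness, a minimal sketch of what an HGEMM call looks like through cuBLAS, the baseline the post is competing against. The buffer names and sizes are placeholders, and cuBLAS expects column-major storage:

    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // C = alpha * A * B + beta * C, everything in FP16 (half precision).
    // d_A, d_B, d_C are assumed to already be FP16 buffers on the device.
    void hgemm_example(const __half* d_A, const __half* d_B, __half* d_C,
                       int m, int n, int k) {
        cublasHandle_t handle;
        cublasCreate(&handle);

        __half alpha = __float2half(1.0f);
        __half beta  = __float2half(0.0f);

        cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k,
                    &alpha, d_A, m,   // A is m x k, leading dimension m
                            d_B, k,   // B is k x n, leading dimension k
                    &beta,  d_C, m);  // C is m x n, leading dimension m

        cublasDestroy(handle);
    }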
Comment by roflmaostc 6 days ago
Lol, this will potentially be much slower than using the general matmul kernel.
However, I like this kind of research because it really exploits specific hardware configurations and makes things measurably faster (unlike some theoretical matmul improvements). Code specialization is cheap, and if it saves on the order of a few percent, it quickly pays for itself, especially for important things like matmul.
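A rough sketch of what "code specialization" can mean here (kernel and function names are made up, not from the post): the same matmul compiled for a fixed inner dimension so the compiler can fully unroll the loop, with every other shape falling back to the general cuBLAS path:

    #include <cublas_v2.h>

    template <int K>                  // inner dimension baked in at compile time
    __global__ void matmul_fixed_k(const float* A, const float* B, float* C,
                                    int M, int N) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= M || col >= N) return;
        float acc = 0.0f;
        #pragma unroll                // safe to unroll fully: K is a compile-time constant
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }

    // Row-major (M x K) * (K x N). Specialize the shapes that actually occur;
    // everything else goes to the general kernel.
    void launch_matmul(cublasHandle_t handle, const float* A, const float* B,
                       float* C, int M, int N, int K) {
        dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
        float one = 1.0f, zero = 0.0f;
        switch (K) {
            case 64:  matmul_fixed_k<64> <<<grid, block>>>(A, B, C, M, N); break;
            case 128: matmul_fixed_k<128><<<grid, block>>>(A, B, C, M, N); break;
            default:  // general path: cuBLAS SGEMM (row-major via the transpose trick)
                cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
                            &one, B, N, A, K, &zero, C, N);
        }
    }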