AutoKernel: Autoresearch for GPU Kernels
Posted by frozenseven 9 hours ago
Comments
Comment by ademeure 6 hours ago
I've been working on something somewhat similar over the last few weeks, but trying to be much more general and arguably over-engineered! I like the scope of this project: keeping it limited to Triton and specific kinds of kernels makes it quite simple and efficient.
I'm confused by the progress graph though; it looks like it's benchmarking a 4096x4096x4096 fp16 matmul rather than a full repo, and it claims a 1.31x improvement vs cuBLAS... while running at 187 TFLOPS, which is only 18.9% of peak utilization? cuBLAS definitely gets much closer to peak than that - most likely it's limited by CPU overhead or something else in the measurement. Benchmarking is hard!
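For concreteness, the arithmetic behind those numbers works out like this; a minimal sketch, where the 989 TFLOPS peak figure is my assumption (roughly an H100's dense fp16 throughput), not something stated in the post:

```python
# Utilization math for a 4096x4096x4096 fp16 matmul benchmark.
# peak_tflops is an assumed figure (~H100 dense fp16), not from the post.
M = N = K = 4096
flops = 2 * M * N * K              # each multiply-add counts as 2 FLOPs
measured_tflops = 187.0            # figure quoted above
peak_tflops = 989.0                # assumed hardware peak
elapsed_s = flops / (measured_tflops * 1e12)
utilization = measured_tflops / peak_tflops
print(f"{flops / 1e9:.1f} GFLOP per matmul")
print(f"{elapsed_s * 1e6:.0f} us per iteration")
print(f"{utilization:.1%} of assumed peak")
```

At ~735 µs per iteration, a large chunk of fixed launch or CPU overhead per call would indeed drag the measured rate well below peak.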
Either way I'm excited to see other people working on this, I think it's an extremely promising area over the next 6 months.
Comment by veselin 6 hours ago
I guess they can be a contributor there.
For a bit of context, goog already did something like this two generations of models ago, as announced in this blog post[1] from May '25:
> AlphaEvolve is accelerating AI performance and research velocity. By finding smarter ways to divide a large matrix multiplication operation into more manageable subproblems, it sped up this vital kernel in Gemini’s architecture by 23%, leading to a 1% reduction in Gemini's training time.
We are now seeing the same thing "at home", for any model. And with how RL-heavy the new training runs have become, inference speedups will directly translate into faster training as well.
[1] - https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...
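The quoted "divide a large matrix multiplication into more manageable subproblems" is, at its core, blocked/tiled matmul. A minimal NumPy sketch of that idea (the tile size and structure here are my illustration, not AlphaEvolve's actual discovered scheme):

```python
# Splitting one large matmul into independent tile-sized subproblems.
# This is generic blocking, not AlphaEvolve's specific decomposition.
import numpy as np

def blocked_matmul(A, B, tile=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # each (i, j, k) step is a small matmul that fits in fast memory
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

A = np.random.rand(128, 128).astype(np.float32)
B = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(blocked_matmul(A, B), A @ B, atol=1e-3)
```

The win on real hardware comes from choosing tile shapes so each subproblem stays resident in shared memory/registers; finding better splits than the standard ones is what the search automates.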