Show HN: How I Topped the HuggingFace Open LLM Leaderboard on Two Gaming GPUs
Posted by dnhkng 2 hours ago
Comments
Comment by dnhkng 2 hours ago
The weird finding: single-layer duplication does nothing. Too few layers, nothing. Too many, it gets worse. Only circuit-sized blocks of ~7 layers work. This suggests pretraining carves out discrete functional circuits in the layer stack that only work when preserved whole.
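A minimal sketch of the mechanics being described (duplicating a contiguous ~7-layer block in place). The function name and the toy integer stack are mine, not the author's code; in a real model the list entries would be decoder-layer modules rather than integers.

```python
# Hypothetical sketch of block duplication ("passthrough"-style merging).
# A real pass would splice nn.Module layers from a loaded model; here the
# stack is just a list of layer indices so the splice is easy to inspect.

def duplicate_block(layers, start, length):
    """Return a new layer stack with layers[start:start+length] repeated
    once, immediately after the original block (weights shared, not copied)."""
    block = layers[start:start + length]
    return layers[:start + length] + block + layers[start + length:]

# Toy 32-layer stack; duplicate a 7-layer "circuit" starting at layer 12.
stack = list(range(32))
expanded = duplicate_block(stack, start=12, length=7)
print(len(expanded))     # 39
print(expanded[12:26])   # layers 12..18, then 12..18 again
```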
The whole thing was developed on 2x RTX 4090s in my basement. I'm now running current models (GLM-4.7, Qwen3.5, MiniMax M2.5) on a dual GH200 rig (see my other post). Code and new models coming soon.
Happy to answer questions.
Comment by rapatel0 48 minutes ago
Have you tried a simple inline loop over the duplicated layers? It would be interesting to see the performance. It would also be interesting to compare with an MoE model, to see if these layers are acting like different agreeing "experts" or if there is reasoning happening in the latent space.
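The inline-loop idea above can be sketched as follows. This is my own illustration, not code from the post: instead of duplicating a block's weights, the forward pass just runs the same block multiple times, keeping the parameter count of the base model.

```python
# Hypothetical sketch: loop a block of layers at inference time rather than
# duplicating its weights. All names and the toy layers are illustrative.

def forward_with_loop(layers, x, start, length, n_loops=2):
    """Apply layers[:start], then layers[start:start+length] n_loops times,
    then the remaining layers. Parameter count matches the base model."""
    for layer in layers[:start]:
        x = layer(x)
    for _ in range(n_loops):
        for layer in layers[start:start + length]:
            x = layer(x)
    for layer in layers[start + length:]:
        x = layer(x)
    return x

# Toy stand-in layers: each just increments its input, so the output
# counts how many layer applications the input passed through.
inc = lambda x: x + 1
layers = [inc] * 10
out = forward_with_loop(layers, 0, start=4, length=3, n_loops=2)
print(out)  # 13: 4 prefix + 2*3 looped + 3 suffix applications
```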
Comment by digdugdirk 16 minutes ago
Comment by dnhkng 5 minutes ago
It's less a 'tool' than an assorted set of scripts tailored to my unusual hardware setup. But it should be easy to extend. I would have released this earlier, but I had the (stupid) idea to 'write a paper' on this; aiming for that delayed things a year. Blogs are the way to go (for me).
Comment by naasking 42 minutes ago
Pretty cool though. LLM brain surgery.
Comment by dnhkng 1 minute ago
I really think, from the experiments, that 'organs' (not sure what to term them) develop during massive pretraining. This also means that looping the entire model may not be efficient. Maybe a better way is [linear input section -> loop 1 -> linear section -> loop 2 -> linear section -> ... -> loop n -> linear output]?
This would give 'organs' space to develop.
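The sandwich layout proposed above can be sketched generically. This is a hedged illustration under my own naming, not the author's implementation: linear sections run once, and each loop section runs a configurable number of times in between.

```python
# Hypothetical sketch of [linear -> loop 1 -> linear -> ... -> loop n -> linear].
# Sections are lists of layer callables; names and toy layers are mine.

def forward_sectioned(x, linear_sections, loop_sections, loop_counts):
    """Alternate linear sections (run once) with looped sections:
    lin_0 -> loop_0 x n_0 -> lin_1 -> loop_1 x n_1 -> ... -> lin_k."""
    assert len(linear_sections) == len(loop_sections) + 1
    assert len(loop_sections) == len(loop_counts)
    for lin, loop, n in zip(linear_sections, loop_sections, loop_counts):
        for layer in lin:
            x = layer(x)
        for _ in range(n):
            for layer in loop:
                x = layer(x)
    for layer in linear_sections[-1]:
        x = layer(x)
    return x

# Toy layers that increment their input, so the output counts applications.
inc = lambda x: x + 1
out = forward_sectioned(0,
                        linear_sections=[[inc] * 2, [inc] * 2, [inc] * 2],
                        loop_sections=[[inc] * 3, [inc] * 3],
                        loop_counts=[2, 2])
print(out)  # 18: 2 + (2*3) + 2 + (2*3) + 2 applications
```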
Comment by WithinReason 36 minutes ago
Comment by dnhkng 9 minutes ago
I think these models have to learn to use their parameters efficiently, and the best way to do that is to 'evolve' (yes, a bad word for it) structures over pretraining time. Unfortunately, they don't have a way to access these structures 'from the inside'. I hope this new approach lets us boost performance in a more experimentally rigorous way.
Comment by seeknotfind 17 minutes ago
Comment by dnhkng 12 minutes ago
I'll make another post if the topic is popular; it's pretty geeky though, even more so than my usual blog posts...
Comment by blourvim 1 hour ago
This sounds similar to Kimi's mixture-of-experts architecture, if I understood it correctly (likely I have not). Can you comment on this?